Usage on Various Compute Clusters
This document presents step-by-step instructions for installing and training Saber on various compute clusters.
These instructions are written for the Béluga cluster in particular, but usage across all Compute Canada (CC) clusters should be nearly identical.
Start by SSH'ing into a login node, e.g.
$ ssh <username>@beluga.computecanada.ca
Then clone the repo to your PROJECT folder:
# "def-someuser" will be the group you belong to
$ PROJECT_DIR=~/projects/<def-someuser>/<username>
$ cd $PROJECT_DIR
$ git clone https://github.com/BaderLab/saber.git
$ cd saber
Next, we will create a virtual environment and install the package and all of its dependencies. Note that you only need to do this once.
# Path to where the environment will be created
ENV_DIR=~/saber
# Create a virtual environment
module load python/3.7 cuda/10.0
virtualenv --no-download --python=python3.7 $ENV_DIR
source $ENV_DIR/bin/activate
pip install --upgrade pip
# Packages available in the CC wheelhouse
pip install scikit-learn torch pytorch_transformers Keras-Preprocessing spacy nltk neuralcoref --no-index
# Install Saber
git checkout development
pip install -e .
# Install seqeval fork (TEMPORARY)
git clone https://github.com/JohnGiorgi/seqeval.git
cd seqeval
pip install .
cd ..
rm -rf seqeval
# Install Apex (OPTIONAL)
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext
cd ..
rm -rf apex
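At this point it is worth a quick sanity check that the installation succeeded. This is a minimal check (it assumes the virtual environment created above is still active):
# Confirm Saber and its core dependencies import without error
python -c "import saber, torch, pytorch_transformers; print('OK')"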
Make a directory to store your datasets, e.g.
mkdir $PROJECT_DIR/saber/datasets
Place any datasets you would like to train on in this folder.
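As a rough sketch, a dataset folder might look like the following, assuming a CoNLL-formatted corpus (the corpus and file names here are hypothetical; check Saber's documentation for the exact file names it expects):
datasets/
└── my_corpus/          # hypothetical corpus name
    ├── train.tsv       # training partition, CoNLL format
    ├── valid.tsv       # validation partition
    └── test.tsv        # test partition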
Because the compute nodes have no internet access, you will need to download a BERT model on the login node. Note that you only have to do this once.
- If you want to use the default BERT model (BioBERT v1.1; recommended), simply start a training session and cancel it as soon as training begins (a quick way to confirm the download worked is shown after this list):

python -m saber.cli.train --dataset_folder path/to/dataset

- If you want to use one of the BERT models from pytorch-pretrained-bert (see here for a list of pre-trained BERT models), first set saber.constants.PRETRAINED_BERT_MODEL to your model name, then run a training session and cancel it as soon as training begins (as above).
- If you want to supply your own model, simply set saber.constants.PRETRAINED_BERT_MODEL to your model's path on disk. There is no need to run a training session.
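If you downloaded a model (the first two options), you can confirm that the weights were cached by listing the cache directory on the login node (a quick check; ~/.cache/torch/pytorch_transformers is the default cache location for pytorch_transformers, but yours may differ):
# Downloaded BERT weights and config files are cached here by default
ls ~/.cache/torch/pytorch_transformers/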
To train the model, you will need to create a train.sh script. For example:
#!/bin/bash
#SBATCH --account=def-someuser
# Requested resources
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
# Wall time and job details
#SBATCH --time=1:00:00
#SBATCH --job-name=example
# Slurm does not expand environment variables in #SBATCH directives;
# use a literal path and make sure the directory exists before submitting
#SBATCH --output=/scratch/<username>/output/%j.txt
# Email me when the job starts, ends, or fails
#SBATCH --mail-user=example@gmail.com
#SBATCH --mail-type=ALL
# Use this command to run the same job interactively
# salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
# Load required modules and activate the environment
ENV_DIR=~/saber
WORKDIR=~/projects/<def-someuser>/<username>/saber
module load python/3.7 cuda/10.0
source $ENV_DIR/bin/activate
cd $WORKDIR
# Train the model
python -m saber.cli.train --dataset_folder path/to/dataset
Submit this job with sbatch train.sh. To run the same job interactively, use:
salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
Note that on Béluga you should request a maximum of 10 CPUs per GPU.
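Once the job is submitted, it can be tracked with standard Slurm commands. A typical workflow looks like the following (the job ID shown is a placeholder):
$ sbatch train.sh
Submitted batch job 12345678
# List your queued and running jobs
$ squeue -u $USER
# Follow the job's output as it runs (the path matches the --output directive above)
$ tail -f /scratch/<username>/output/12345678.txt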