Usage on Various Compute Clusters
This document presents step-by-step instructions for installing and training Saber on various compute clusters.
These instructions are written for the Béluga cluster in particular, but usage across all Compute Canada (CC) clusters should be nearly identical.
Start by SSH'ing into a login node, e.g.
$ ssh <username>@beluga.computecanada.ca
Then clone the repo to your PROJECT folder:
# "def-someuser" will be the group you belong to
$ PROJECT_DIR=~/projects/<def-someuser>/<username>
$ cd $PROJECT_DIR
$ git clone https://github.com/BaderLab/saber.git
$ cd saber
Next, we will create a virtual environment and install the package and all of its dependencies. Note that you only need to do this once.
# Path to where the environment will be created
ENV_DIR=~/saber
# Create a virtual environment
module load python/3.7 cuda/10.0
virtualenv --no-download --python=python3.7 $ENV_DIR
source $ENV_DIR/bin/activate
pip install --upgrade pip
# Packages available in the CC wheelhouse
pip install scikit-learn torch pytorch_transformers Keras-Preprocessing spacy nltk neuralcoref --no-index
# Install Saber
git checkout development
pip install -e .
# Install seqeval fork (TEMPORARY)
git clone https://github.com/JohnGiorgi/seqeval.git
cd seqeval
pip install .
cd ..
rm -rf seqeval
# Install Apex (OPTIONAL)
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext
cd ..
rm -rf apex
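At this point it is worth a quick sanity check that the installation succeeded. This is a minimal check (it assumes the virtual environment created above is still active):
# Confirm Saber and its core dependencies import without error
python -c "import saber, torch, pytorch_transformers; print('OK')"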
Make a directory to store your datasets, e.g.
mkdir $PROJECT_DIR/saber/datasets
Place any datasets you would like to train on in this folder.
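As a rough sketch, a dataset folder might look like the following, assuming a CoNLL-formatted corpus (the corpus and file names here are hypothetical; check Saber's documentation for the exact file names it expects):
datasets/
└── my_corpus/          # hypothetical corpus name
    ├── train.tsv       # training partition, CoNLL format
    ├── valid.tsv       # validation partition
    └── test.tsv        # test partition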
Because the compute nodes have no internet access, you will need to download a BERT model on the login node. Note that you only have to do this once.
- If you want to use the default BERT model (BioBERT v1.1; recommended), simply start a training session and cancel it as soon as training begins (a quick way to confirm the download worked is shown after this list):

python -m saber.cli.train --dataset_folder path/to/dataset

- If you want to use one of the BERT models from pytorch-pretrained-bert (see here for a list of pre-trained BERT models), first set saber.constants.PRETRAINED_BERT_MODEL to your model name, then run a training session and cancel it as soon as training begins (as above).
- If you want to supply your own model, simply set saber.constants.PRETRAINED_BERT_MODEL to your model's path on disk. There is no need to run a training session.
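If you downloaded a model (the first two options), you can confirm that the weights were cached by listing the cache directory on the login node (a quick check; ~/.cache/torch/pytorch_transformers is the default cache location for pytorch_transformers, but yours may differ):
# Downloaded BERT weights and config files are cached here by default
ls ~/.cache/torch/pytorch_transformers/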
To train the model, you will need to create a train.sh script. For example:
#!/bin/bash
#SBATCH --account=def-someuser
# Requested resources
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
# Wall time and job details
#SBATCH --time=1:00:00
#SBATCH --job-name=example
# Slurm does not expand environment variables in #SBATCH directives;
# use a literal path and make sure the directory exists before submitting
#SBATCH --output=/scratch/<username>/output/%j.txt
# Email me when the job starts, ends, or fails
#SBATCH --mail-user=example@gmail.com
#SBATCH --mail-type=ALL
# Use this command to run the same job interactively
# salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
# Load required modules and activate the environment
ENV_DIR=~/saber
WORKDIR=~/projects/<def-someuser>/<username>/saber
module load python/3.7 cuda/10.0
source $ENV_DIR/bin/activate
cd $WORKDIR
# Train the model
python -m saber.cli.train --dataset_folder path/to/dataset
Submit this job with sbatch train.sh. To run the same job interactively, use:
salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
Note that on Béluga you should request a maximum of 10 CPUs per GPU.
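Once the job is submitted, it can be tracked with standard Slurm commands. A typical workflow looks like the following (the job ID shown is a placeholder):
$ sbatch train.sh
Submitted batch job 12345678
# List your queued and running jobs
$ squeue -u $USER
# Follow the job's output as it runs (the path matches the --output directive above)
$ tail -f /scratch/<username>/output/12345678.txt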