Welcome to our wiki!
This page describes the phylogenetic pipeline we set up to analyze COVID data in our publication (https://academic.oup.com/mbe/article/38/5/1777/6030946).
The main elements in the git repository are:
- The installation script `setup.sh`: automatically clones and installs all the dependencies. This step can take quite some time because we need to compile a lot of tools.
- The `scripts` directory: contains all the helpers used in the pipeline. The most general ones are `common.py` and `util.py`. The scripts typically call the different software (raxml-ng, epa-ng, etc.) and manipulate their inputs and outputs.
- The `pipeline` directory: contains each step of the pipeline. A step is a short Python executable that calls helpers from the `scripts` directory.
The working directory contains all the data, runs, and results of the pipeline. This directory is NOT versioned with git. The pipeline automatically creates the working directory (under `GIT_REPO_ROOT/work_dir`).
A dataset is a snapshot of the available sequences. A dataset directory name contains the date of the snapshot and an index (in case we do several snapshots on the same day). For instance: `GIT_REPO_ROOT/work_dir/2020-05-05_00`.
A version corresponds to one alignment generated from a dataset. The different types of alignments are described in the paper. For instance: `GIT_REPO_ROOT/work_dir/2020-05-05_00/FMSAO`.
A version directory contains the following subdirectories (see the layout sketch after this list):
- `data`: the input data, created by the first step of the pipeline (typically: the initial alignment, and some information about the duplicated sequences we removed)
- `results`: all the files generated by the pipeline that should be uploaded to the Google Drive
- `runs`: all pipeline outputs that do not belong to `results` (logs, temporary files, big files that we don't want to upload, etc.)
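Putting the naming conventions together, the on-disk layout looks roughly like this (assembled from the examples above, not an exhaustive listing):

```
work_dir/
└── 2020-05-05_00/        # dataset: snapshot date plus index
    └── FMSAO/            # version: one alignment generated from the dataset
        ├── data/         # input data, created by the first pipeline step
        ├── results/      # files to upload to the Google Drive
        └── runs/         # logs, temporary files, big files, etc.
```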
The `Paths` class defined in `scripts/common.py` contains the paths of all important input and output files of the pipeline, given a dataset and a version. For instance, to access the set of plausible trees of the May 5th snapshot generated from the "full MSA with outgroup" (FMSAO) alignment, one needs to create an instance `paths` of `Paths` with the dataset `2020-05-05_00` and the version `FMSAO`, and then access the file path through the property `paths.raxml_credible_ml_trees`.
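A minimal sketch of that lookup (assuming `Paths` takes the dataset and the version as constructor arguments, in that order, and that the code is run from the repository root; check `scripts/common.py` for the exact signature):

```python
from scripts.common import Paths

# Dataset = snapshot "2020-05-05_00", version = "FMSAO" alignment.
paths = Paths("2020-05-05_00", "FMSAO")

# raxml_credible_ml_trees is the property mentioned above: the path to
# the set of plausible trees for this dataset/version pair.
print(paths.raxml_credible_ml_trees)
```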
The first step is to run `./pipeline/0_get_data.py`, which takes as its argument either a local path to your raw sequences (for example, as downloaded from GISAID) or a Google Drive link to a shared file containing the same. This input file can be either a FASTA or a gzipped FASTA file. This first script will set up a folder structure under `work_dir/`, named by the dataset string (the date on which the script is run, plus a unique index), and place the input file there.
All subsequent steps in the pipeline are Python executables that take as arguments the dataset identifier and a version identifier. For instance: `./pipeline/9_mptp_on_all_trees.py 2020-05-05_00 smsan` (unofficial alternative syntax: `./pipeline/9_mptp_on_all_trees.py work_dir/2020-05-05_00/smsan/`).
To add a new step, just copy any existing step and call your own scripts (see the skeleton below). We might want to refactor the step names if we need more steps, because they are indexed with `1_`, `2_`, etc.
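As a rough sketch (the step name and everything except `common.Paths` is hypothetical; mirror whatever the existing steps in `pipeline/` actually do), a new step might look like:

```python
#!/usr/bin/env python3
# Hypothetical new step, e.g. pipeline/10_my_analysis.py.
import os
import sys

# Assumption: make the repository root importable so that the helpers in
# scripts/ can be found, wherever the step is launched from.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
from scripts.common import Paths

def main():
    if len(sys.argv) != 3:
        sys.exit("usage: ./pipeline/10_my_analysis.py <dataset> <version>")
    dataset, version = sys.argv[1], sys.argv[2]
    paths = Paths(dataset, version)
    # ... call your own helpers from the scripts directory here ...

if __name__ == "__main__":
    main()
```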
Please keep in mind that we might need to call a step of the pipeline several times with the same parameters (for instance, if we want to restart the pipeline)! Some of the tools we run fail if their output directory already exists, so please make sure that your script takes care of this!
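One simple way to make a step safely re-runnable is to wipe and recreate its output directory before calling the tool. A minimal sketch (the helper name is hypothetical):

```python
import shutil
from pathlib import Path

def fresh_dir(path):
    """Delete the directory if it already exists, then recreate it, so that
    re-running a step with the same parameters never fails and stale outputs
    from a previous run cannot leak into the new one."""
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
    p.mkdir(parents=True)
```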
Remember that the files that should be uploaded to the Google Drive, or that are important results for the paper (and only those files!), should be written or copied into the `results` sub-directory of the current version directory.
Make sure to update the `setup.sh` script to automatically install your tool. Also, add a variable in `scripts/common.py` with the path to your tool's executable (for instance, as sketched below).
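A hedged example of such a variable (the `tools/` layout under the repository root is an assumption; follow the pattern of the existing tool variables in `common.py`):

```python
import os

# Hypothetical: resolve the repository root relative to scripts/common.py.
REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Hypothetical: path to the new tool's executable, assuming setup.sh
# clones and builds it under a tools/ directory.
MY_TOOL = os.path.join(REPO_ROOT, "tools", "my_tool", "bin", "my_tool")
```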