# Home
Welcome to our wiki!
This page describes the phylogenetic pipeline we set up to analyze COVID data in our publication (https://academic.oup.com/mbe/article/38/5/1777/6030946).
Please note that this is not an officially released tool and that it is not particularly user-friendly. We developed it to show that COVID phylogenetic analysis is difficult and to experiment with potential solutions (such as tree thinning), and we have made it available for transparency. If you want to use it, please make sure that it is really relevant for your purpose.
The main elements in the git repository are:
- The installation script `setup.sh`: automatically clones and installs all the dependencies. This step can take quite some time because we need to compile a lot of tools.
- The `scripts` directory: contains all helpers used in the pipeline. The most general ones are `common.py` and `util.py`. The scripts typically call the different software (raxml-ng, epa-ng, etc.) and manipulate their inputs and outputs.
- The `pipeline` directory: contains each step of the pipeline. A step is a short Python executable that calls helpers from the `scripts` directory.
The working directory contains all the data, runs, and results of the pipeline. This directory is NOT versioned with git. The pipeline automatically creates the working directory (under `GIT_REPO_ROOT/work_dir`).
A snapshot is a directory that contains all analyses run on the same input data. A snapshot directory name contains the date of the snapshot and an index (in case we do several snapshots the same day). For instance: `GIT_REPO_ROOT/work_dir/2020-05-05_00`.
A version corresponds to one alignment generated from a snapshot. The different types of alignment strategies are described in the paper. For instance: `GIT_REPO_ROOT/work_dir/2020-05-05_00/FMSAO`. The following versions are possible:
- `FMSAO`
- `FMSAN`
- `SMSAO`
- `SMSAN`
- any version generated from another version via tree thinning
A version directory contains the following subdirectories:
- `data`: the input data, created by the first step of the pipeline (typically the initial alignment, and some information about the duplicated sequences we removed)
- `results`: all the files generated by the pipeline that should be uploaded to the Google Drive
- `runs`: all pipeline outputs that do not belong to `results` (logs, temporary files, big files that we don't want to upload, etc.)
The `Paths` class defined in `scripts/common.py` contains the paths of all important input and output files of the pipeline, given a snapshot and a version. For instance, to access the set of plausible trees of the 5th May snapshot, generated from the "full MSA with outgroup" (FMSAO) alignment, one creates an instance `paths` of `Paths` with the snapshot `2020-05-05_00` and the version `FMSAO`, and then accesses the file path through the property `paths.raxml_credible_ml_trees`.
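For illustration, such a lookup might look like the following sketch. The exact constructor signature of `Paths` is an assumption based on the description above, so check `scripts/common.py` for the real one:

```python
# Sketch only: assumes Paths takes the snapshot and the version as
# constructor arguments, as suggested by the description above.
from scripts.common import Paths

paths = Paths("2020-05-05_00", "FMSAO")
print(paths.raxml_credible_ml_trees)  # path to the set of plausible ML trees
```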
The first step is to run `./pipeline/0_get_data.py`, which takes as its argument either a local path to your raw sequences (for example, as downloaded from GISAID) or a Google Drive link to a shared file containing the same. The input file can be either a FASTA file or a gzipped FASTA file. This first script sets up a folder structure under `work_dir/`, named by the snapshot string (the date on which the script is run, plus a unique index), and places the input file there.
Note that the sequences should come from GISAID or have the same format: for instance, the taxon names should look like `hCoV-19/Australia/NSW14/2021|EPI_ISL_413500|2020-02-01`. In addition, the file `config/outgroups.txt` should contain at least one taxon that is in the input dataset.
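For reference, the three pipe-separated fields of such a taxon name are the strain name, the GISAID accession, and the collection date; an illustrative snippet (not a helper from the repository) to take one apart:

```python
# Illustrative only: split a GISAID-style taxon name into its three fields.
taxon = "hCoV-19/Australia/NSW14/2021|EPI_ISL_413500|2020-02-01"
strain, accession, date = taxon.split("|")
# strain    -> "hCoV-19/Australia/NSW14/2021"
# accession -> "EPI_ISL_413500"
# date      -> "2020-02-01"
```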
All subsequent steps in the pipeline are Python executables that take as arguments the snapshot identifier and the version string. For instance: `./pipeline/9_mptp_on_all_trees.py 2020-05-05_00 smsan`. Alternative syntax: `./pipeline/9_mptp_on_all_trees.py work_dir/2020-05-05_00/smsan/`.
- **1_preprocess_data.py**: Prepares the data for a given snapshot and a given version.
- **2_pargenes.py**: Runs ParGenes, which generates maximum likelihood and bootstrap trees with RAxML-NG. Depends on step 1 (unless this version was generated via tree thinning).
- **3_export_pargenes_results.py**: Exports the results of the ParGenes run (e.g., the maximum likelihood and bootstrap trees). Depends on step 2.
- **4_mptp.py**: Performs species delimitation on the maximum likelihood tree with mPTP. Depends on step 3.
- **5_epa_outgroup_rooting.py**: Performs phylogenetic placement of the outgroups into the maximum likelihood tree with EPA-NG. Depends on step 3.
- **6_root_digger_rooting.py**: Infers the root of the phylogenetic tree without outgroups with RootDigger. Depends on step 3.
- **7_iqtree_tests.py**: Performs statistical tests to extract a set of plausible trees from the set of maximum likelihood trees inferred with ParGenes. Depends on step 3.
- **8_tree_thinning.py**: Applies tree thinning, that is, generates a new alignment with fewer sequences in order to obtain a more reliable tree. Warning: you then need to run `pipeline/extract_thinned_dataset.py` to extract two new versions of this snapshot (one for each tree thinning technique, see our paper). These new versions have fewer sequences than the original version. You can then restart the pipeline on such a new (and lighter) version of the snapshot, starting from step 2 (skip step 1).
- **9_mptp_on_all_trees.py**: Runs species delimitation on all plausible trees. Depends on step 7.
Just copy any existing step and call your own scripts. We might want to refactor the step names if we need more steps, because they are indexed with `1_`, `2_`, etc.
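A new step could start from a skeleton like the one below. This is only a sketch: the argument handling mirrors the convention described above (snapshot first, then version), but the `Paths` constructor signature is an assumption, so copy a real step for the exact details.

```python
#!/usr/bin/env python
# Hypothetical skeleton for a new step, e.g. pipeline/10_my_step.py.
import os
import sys

# make the scripts directory importable, assuming the step lives in pipeline/
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "scripts"))
import common

# by convention, each step takes the snapshot identifier and the version string
snapshot, version = sys.argv[1], sys.argv[2]
paths = common.Paths(snapshot, version)  # assumed constructor signature

# ... call your own helpers here, writing important outputs under `results` ...
```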
Please keep in mind that we might need to call a step of the pipeline several times with the same parameters (for instance, when we restart the pipeline)! Some of the tools we run fail if their output directory already exists, so please make sure that your scripts take care of this!
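One simple way to handle this (a hypothetical helper, not one shipped in `scripts/`) is to wipe and recreate the tool's output directory before each run:

```python
import os
import shutil

def fresh_dir(path):
    """Delete a stale output directory left over from a previous run, then
    recreate it, so that tools refusing to overwrite existing output can
    run again when the pipeline is restarted."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
```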
Remember that the files that should be uploaded to the Google Drive or that are important results for the paper (and only those files!) should be written or copied into the `results` subdirectory of the current version.
Make sure to update the `setup.sh` script to automatically install your tool. Also, add a variable in `scripts/common.py` with the path to your tool executable.
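For example, the variable could be defined along these lines (a sketch; the names below are illustrative and the actual conventions in `scripts/common.py` may differ):

```python
# In scripts/common.py (hypothetical names): point to the executable that
# setup.sh installed, relative to the repository root.
import os

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
MYTOOL_EXEC = os.path.join(REPO_ROOT, "tools", "mytool", "bin", "mytool")
```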