This repository contains the benchmarking framework, built upon Snakemake, for SIEVE.
We have provided a Dockerfile containing everything needed to run the benchmarking pipeline; using docker is recommended when running on a server or an HPC cluster.
To acquire the docker image, pull it from Docker Hub with

```shell
$ docker pull senbaikang/sieve_benchmark:0.2
```
or build it from the Dockerfile in the root of this repository with

```shell
$ docker build -t sieve_benchmark:0.2 .
```
This framework contains scripts written in Python 3 and R. We have provided a conda environment file (`environment.yml`), which specifies a list of conda channels and Python packages with specific versions. By default, this environment is named `snake`. To create and activate it, run the commands below:
```shell
$ git clone https://github.com/szczurek-lab/SIEVE_benchmark_pipeline.git
$ cd SIEVE_benchmark_pipeline
$ mamba env create -f environment.yml
$ mamba activate snake
```
The following packages should additionally be installed before running the pipeline:
- For R:
  - base >= 4.0
  - stringr
  - scales
  - dplyr
  - optparse
  - ape
  - phangorn
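One way to install these R packages in one shot is via `Rscript` from the shell. This is a sketch, not part of the pipeline itself; it assumes R (>= 4.0) is already installed with `Rscript` on the PATH, and installs from CRAN:

```shell
# Additional R packages required by the pipeline (list from above).
r_pkgs='"stringr","scales","dplyr","optparse","ape","phangorn"'
if command -v Rscript >/dev/null 2>&1; then
    # Install from the main CRAN mirror; adjust repos if you use a local mirror.
    Rscript -e "install.packages(c(${r_pkgs}), repos = \"https://cloud.r-project.org\")"
else
    echo "Rscript not found; install R (>= 4.0) first" >&2
fi
```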
A few files should be configured before running:

- The pipeline is mainly configured in `config.yaml`. Some entries need to be filled in by users; they are marked by the phrase `# TO BE SET`, followed either by `[MANDATORY]` (must be set) or by `[OPTIONAL]` (can be ignored). Users can search for the phrase to set everything up efficiently.
- The template configuration files for SIEVE are under `templates/`.
- For the key `[configFiles][SIEVE_simulator]`, a configuration file for the data simulator SIEVE_simulator should be specified. The simulated scenarios used in the SIEVE paper are listed in `simulation_configs/`. For details, check the paper.
- In `run/run.sh`, users can set a few things, e.g., the name of the conda environment containing snakemake (`snake` by default), the number of cores to use and their ranges, etc. If you plan to use docker, please skip this step.
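For orientation, the entries mentioned above look roughly like this in `config.yaml` (the values below are placeholders and the surrounding layout is illustrative; always search the real file for `# TO BE SET`):

```yaml
configFiles:
  # TO BE SET [MANDATORY]: configuration file for the data simulator
  SIEVE_simulator: simulation_configs/...
servers:
  # TO BE SET [OPTIONAL]: gate server with a public IP in front of the remote server
  jumpServerOfRemote: ...
```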
Since SiFit requires a large amount of memory even when working on a small dataset, the framework supports running SiFit alone on another server (referred to as the remote server) with the help of git. For this to work, a few things should be noted and configured:
- The machine on which you plan to run the pipeline (referred to as the local machine) and the remote server should meet one of the following conditions:
  - They are in the same local network.
  - If they are not in the same local network, the local machine must have a public IP address so that the remote server can clone the git repository. However, the remote server can be behind a gate server with a public IP address, specified by the key `[servers][jumpServerOfRemote]` in `config.yaml`.
- In `Snakefile`, comment out `include: "sifit_local.snake"` and uncomment `# include: "sifit_remote.snake"`.
- Set up `run/run_remote_sifit_true_monovar.sh` similarly to `run/run.sh`.
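The `Snakefile` edit described above amounts to toggling two `include` lines, so that the relevant block ends up looking like this:

```
# include: "sifit_local.snake"
include: "sifit_remote.snake"
```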
The docker image only contains the executables of all the benchmarked tools. To run the pipeline, you need to mount the local directory of this repository (which contains the snakemake rules and supporting scripts) to the docker container under `/root/data`:
```shell
$ docker run --name sieve_benchmark -v /local/path/to/SIEVE_benchmark_pipeline:/root/data senbaikang/sieve_benchmark:0.2
```
The console output of snakemake will appear in the terminal. To run the pipeline in the background, add `-d` to the command above before the image name, and access the console output through:
```shell
$ docker logs sieve_benchmark
```
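For example, a detached run followed by streaming the log output could look like this (same image tag and mount path as in the command above; `-f` keeps following new output):

```shell
$ docker run -d --name sieve_benchmark -v /local/path/to/SIEVE_benchmark_pipeline:/root/data senbaikang/sieve_benchmark:0.2
$ docker logs -f sieve_benchmark
```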
With everything set up, users can run the pipeline, by default with the rules defined in `Snakefile` under the root of this repository, simply with
```shell
$ source run/run.sh
```
or with
```shell
$ conda activate snake
$ snakemake --use-conda --cores {NUM} -kp
```
If benchmarking efficiency is of concern, the snakemake file containing the corresponding rules should be used; hence, the default command specified in the docker image must be overridden. To do so, run the docker container with the following command:
```shell
$ docker run --name sieve_benchmark -v /local/path/to/SIEVE_benchmark_pipeline:/root/data senbaikang/sieve_benchmark:0.2 snakemake --use-conda --cores all -s efficiency_benchmark.snake --rerun-triggers mtime -kp
```
Alternatively, the rules defined in `efficiency_benchmark.snake` can be run outside docker with
```shell
$ source run/run.sh efficiency_benchmark.snake
```
or manually with
```shell
$ conda activate snake
$ snakemake -s efficiency_benchmark.snake --use-conda --cores {NUM} -kp
```