Reference- and alignment-free method for ctDNA detection from whole-genome sequencing (WGS) data.
ctDNAmer is a reference-free approach for ctDNA detection that finds tumor-specific somatic variation directly from unaligned sequencing data by identifying k-mers unique to the primary tumor sample. These k-mers are then used to detect ctDNA within raw cfDNA sequencing reads.
ctDNAmer leverages genome-wide information and is not limited to SNVs. Probabilistic modeling is used to estimate the circulating tumor fraction.
The method is built as a customizable Snakemake workflow [1]. K-mer counting and k-mer set operations are performed with KMC 3 [2], probabilistic models are implemented in Stan [3], and sampling is performed with the rstan package.
A detailed description of ctDNAmer can be found here.
ctDNAmer requires Snakemake 8.0.0 or above and uses conda for package management.
Detailed description of the required input data can be found here.
From each patient, primary tumor WGS data (~30x), matched germline WGS data (~30x) and cfDNA WGS data (~30x) are required. Each patient should have at least a pre-treatment/baseline cfDNA sample available. Sample paths can be specified in the samples.tsv and units.tsv configuration files.
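The exact sample sheet schema is described in the linked documentation. Purely as a hypothetical illustration of the kind of information these sheets carry (all column names, sample names and paths below are assumptions, not the actual ctDNAmer schema), they might look roughly like this:

```
# hypothetical samples.tsv-style layout (tab-separated); columns are illustrative
patient    sample_type    sample_name
P01        tumor          P01_T
P01        germline       P01_N
P01        cfDNA          P01_cf_baseline

# hypothetical units.tsv-style layout (tab-separated); columns are illustrative
sample_name        fq1                          fq2
P01_T              data/P01_T_R1.fastq.gz       data/P01_T_R2.fastq.gz
P01_cf_baseline    data/P01_cf_R1.fastq.gz      data/P01_cf_R2.fastq.gz
```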
ctDNAmer uses a union of germline k-mer sets for enhanced separation of tumor and germline sequences.
Germline samples that will be combined into a representative union can be specified in the samples_glu.tsv configuration file. This can include germline samples of the current target patient cohort; optionally, additional germline WGS data can be included for better representation of the germline information.
A set of unmatched cfDNA samples is required for the estimation of the empirical noise distribution. These cfDNA samples can be specified in the donors.tsv configuration file. They can be cfDNA samples from healthy individuals or cfDNA samples from other patients not included in the current target patient cohort.
Snakemake is best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, Mamba can be installed via Mambaforge. For other options, see here.
Given that Mamba is installed, run:
mamba create -c conda-forge -c bioconda --name snakemake 'snakemake>=8'
to install Snakemake in an isolated environment. If you need to use conda instead of mamba, the --conda-frontend conda flag needs to be added to the snakemake commands given below.
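For illustration, the flag is simply appended to the snakemake invocations shown later, e.g. for a dry-run:
snakemake -n --conda-frontend conda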
Activate the environment via:
conda activate snakemake
Download and extract the repository:
git clone https://github.com/BesenbacherLab/ctDNAmer.git && cd ctDNAmer
To specify the parameters for running ctDNAmer and the sample paths, modify the configuration files config.yaml, samples.tsv, units.tsv, samples_glu.tsv and donors.tsv according to your needs, following the explanations provided here.
For cluster execution of the workflow, the Snakemake Slurm executor plugin needs to be installed with pip install snakemake-executor-plugin-slurm. If the Slurm plugin is not installed, the -e flag needs to be specified for the snakemake commands listed below.
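For example, to run locally instead of submitting to a cluster, the executor can be set explicitly (a sketch; the executor name local and the core count are standard Snakemake options, adjust them to your setup):
snakemake -e local --cores 4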
The specifics for cluster execution should be defined in the workflow profile configuration file. An example workflow profile for Slurm is provided here. To use the example profile, adjust the snakemake command line parameters to your needs. Importantly, a cluster account is specified in the example profile as an environment variable. To set the account name as an environment variable, run export ACCOUNT_NAME=<your_account_name> or modify the profile config file to include your account name directly.
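Purely as an illustration of the structure of such a profile (the example shipped in workflow/profiles/default/config.yaml should be your starting point; the keys follow the snakemake-executor-plugin-slurm conventions, while the job limit and resource values below are assumptions):

```
# hypothetical minimal Slurm workflow profile (adjust values to your cluster)
executor: slurm              # requires snakemake-executor-plugin-slurm
jobs: 50                     # maximum number of concurrently submitted jobs
default-resources:
  slurm_account: "<your_account_name>"  # the shipped example profile takes this from $ACCOUNT_NAME
  runtime: 120               # walltime per job in minutes (illustrative)
  mem_mb: 8000               # memory per job in MB (illustrative)
```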
After you have activated the conda environment with Snakemake, installed the Slurm executor plugin and set the account name as an environment variable, you can test the remote execution of the workflow by performing a dry-run:
snakemake -n
To run the workflow for a new data set, use the --directory flag, which specifies the path to the directory where the pipeline will be executed. The target directory needs to include a config folder with the config.yaml and samples.tsv files, which specify the ctDNAmer parameters and the paths to the sample files that will be used during execution. You can execute the workflow with:
snakemake --directory "path/to/new/directory/"
The workflow profile that specifies the details of the cluster execution will still be automatically detected from the pipeline directory (workflow/profiles/default/config.yaml) even when the execution directory is changed. If you want to specify a new cluster execution profile as well, use the --workflow-profile flag:
snakemake --workflow-profile "path/to/workflow_profile/config.yaml"
For further options for local, cluster and cloud execution, see the snakemake docs.
We have found that a minimum unique tumor set size of 20,000 is needed for reliable TF estimation from samples with ~30x coverage, and this is set as the default parameter value in ctDNAmer. Optionally, a count filter test can be performed to confirm how large a unique tumor set is required for reliable TF estimation in the user-defined patient cohort.
An additional set of patients, or a subset of the patients in the target cohort, can be used to test the optimal size of the unique tumor set. The configuration file config_count_filter_test.yaml can be used to specify the parameters of the test. The patient data used for testing can be specified in the samples_count_filter_test.tsv and units_count_filter_test.tsv configuration files.
At least two cfDNA samples are required per patient for testing: a ctDNA-positive pre-treatment/baseline cfDNA sample and a ctDNA-negative/post-treatment cfDNA sample. The count filter test runs TF estimation for unique tumor sets of different sizes, and the minimum required unique tumor set size can be determined from the difference between the TF estimates of the ctDNA-positive and ctDNA-negative samples.
To run the count filter test, indicate the Snakefile that implements it with the -s flag on the command line:
snakemake -s workflow/Snakefile_count_filter_test -n
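The count filter test can also be executed in a separate directory for a dedicated test cohort by combining the -s flag with the --directory flag introduced above (the path is illustrative):
snakemake -s workflow/Snakefile_count_filter_test --directory "path/to/test/directory/"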
Comparison with alignment-based tumor fraction estimates: calculating the mean allele frequency of clonal SNVs
The subworkflow clonalSNVs_tracking implements ctDNA detection and TF estimation based on aligned WGS data. The tumor fraction is estimated as the mean cfDNA allele frequency of clonal SNVs identified from aligned primary tumor data. See more here.
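In other words, with AF_i denoting the cfDNA allele frequency (alternate read count divided by total read count) of the i-th of N clonal SNVs, the alignment-based estimate is simply the mean:
TF = (1/N) * (AF_1 + AF_2 + ... + AF_N)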
[1] F. Mölder et al., “Sustainable data analysis with Snakemake,” F1000Research, vol. 10, p. 33, Apr. 2021, doi: 10.12688/f1000research.29032.2.
[2] M. Kokot, M. Długosz, and S. Deorowicz, “KMC 3: counting and manipulating k-mer statistics,” Bioinformatics, vol. 33, no. 17, pp. 2759–2761, Sep. 2017, doi: 10.1093/bioinformatics/btx304.
[3] Stan Development Team, “Stan Modeling Language Users Guide and Reference Manual.” Accessed: Oct. 24, 2024. [Online]. Available: https://mc-stan.org