cfDNA-Flow facilitates the accurate and reproducible analysis of cfDNA WGS data. It offers various preprocessing options to accommodate different experimental setups and research needs in the field of liquid biopsies.
cfDNA-Flow lets users choose whether or not to trim the input reads. Trimming removes low-quality bases that could otherwise affect downstream analyses.
Users can choose from the following genome builds: hg19, hs37d5 (hg19decoy), hg38 and hg38 without alternative contigs (hg38noalt). For download links, please refer to the Data Availability section in the accompanying paper.
The pipeline uses BWA for alignment, followed by extensive post-alignment filtering steps to ensure reliable alignments. Users can define specific filtering criteria to remove low-quality or ambiguous reads, such as secondary alignments, reads with insertions or deletions, and reads with low mapping quality.
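The filtering criteria above can be illustrated with a minimal sketch that operates on plain SAM text records. This is not the pipeline's own implementation: the function name `keep_read`, the default threshold of 30, and the set of rejected CIGAR operations are assumptions chosen for illustration only.

```python
import re

SECONDARY_FLAG = 0x100  # SAM flag bit that marks a secondary alignment


def keep_read(sam_line, min_mapq=30, banned_cigar_ops="ID"):
    """Return True if a SAM record passes the illustrative filters.

    Hypothetical helper: thresholds and banned operations are assumptions.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    cigar = fields[5]       # column 6: CIGAR string
    if flag & SECONDARY_FLAG:
        return False        # drop secondary alignments
    if mapq < min_mapq:
        return False        # drop reads with low mapping quality
    # Collect the operation letters of the CIGAR string, e.g. 50M2D48M -> {M, D}
    ops = set(re.findall(r"[MIDNSHP=X]", cigar))
    if ops & set(banned_cigar_ops):
        return False        # drop reads containing insertions or deletions
    return True


records = [
    "r1\t99\tchr1\t100\t60\t100M\t=\t250\t250\tACGT\tFFFF",
    "r2\t355\tchr1\t100\t60\t100M\t=\t250\t250\tACGT\tFFFF",     # secondary
    "r3\t99\tchr1\t100\t12\t100M\t=\t250\t250\tACGT\tFFFF",      # low MAPQ
    "r4\t99\tchr1\t100\t60\t50M2D48M\t=\t250\t250\tACGT\tFFFF",  # deletion
]
kept = [line.split("\t")[0] for line in records if keep_read(line)]
print(kept)  # ['r1']
```

In practice such filters are usually applied to BAM files with dedicated tools; the sketch only makes the three criteria concrete.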
cfDNA-Flow offers fragment length analysis, calculating the mean, median, and standard deviation of fragment lengths between 100 and 220 base pairs (bp), which corresponds to the mononucleosomal size range. Additionally, cfDNA-Flow calculates the frequencies of cfDNA fragment sizes ranging from 70 bp to 1000 bp in 10 bp bins.
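As a rough illustration of these two feature types, the sketch below computes them from a plain list of fragment lengths. The function names are hypothetical, and several details are assumptions rather than the pipeline's documented behaviour: both ends of the 100 to 220 bp range are treated as inclusive, the sample standard deviation is used, and bins are half-open intervals such as [70, 80).

```python
import statistics
from collections import Counter


def mononucleosomal_stats(lengths, low=100, high=220):
    """Mean, median, and sample SD of lengths within [low, high] (assumed inclusive)."""
    selected = [n for n in lengths if low <= n <= high]
    if len(selected) < 2:
        raise ValueError("need at least two fragments in the size range")
    return {
        "mean": statistics.mean(selected),
        "median": statistics.median(selected),
        "sd": statistics.stdev(selected),
    }


def binned_frequencies(lengths, low=70, high=1000, width=10):
    """Relative frequency per bin; keys are bin start positions in bp."""
    selected = [n for n in lengths if low <= n < high]
    if not selected:
        raise ValueError("no fragments in the requested range")
    counts = Counter((n - low) // width for n in selected)
    n_bins = (high - low) // width
    return {low + i * width: counts.get(i, 0) / len(selected) for i in range(n_bins)}


lengths = [95, 150, 160, 170, 167, 320, 75]
stats = mononucleosomal_stats(lengths)
freqs = binned_frequencies(lengths)
print(stats["mean"], stats["median"])  # 161.75 163.5
print(len(freqs))                      # 93
```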
cfDNA-Flow uses two copy number analysis tools, ichorCNA (v0.2.0) and tMAD, to estimate copy number changes and tumor fraction.
cfDNA-Flow runs FrEIA to identify differences in fragment end motifs and their diversity between groups.
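To make the idea of end motif frequencies and their diversity concrete, here is a generic sketch. It is not FrEIA's algorithm, which is not described in this document: the motif length of 3, the use of the first bases of each fragment sequence, and the normalized Shannon entropy as a diversity measure are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def end_motif_frequencies(fragment_sequences, k=3):
    """Relative frequency of the first k bases of each fragment sequence."""
    motifs = Counter(
        seq[:k].upper() for seq in fragment_sequences if len(seq) >= k
    )
    total = sum(motifs.values())
    return {motif: count / total for motif, count in motifs.items()}


def normalized_shannon_diversity(frequencies, k=3):
    """Shannon entropy divided by its maximum, log(4**k); result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in frequencies.values() if p > 0)
    return entropy / math.log(4 ** k)


sequences = ["CCCAGT", "CCCTTA", "AAATTT", "CCAGGG"]
freqs = end_motif_frequencies(sequences)
diversity = normalized_shannon_diversity(freqs)
print(freqs)
print(round(diversity, 6))  # 0.25
```

A value near 1 would mean that all possible motifs occur about equally often, while a value near 0 would mean that a few motifs dominate.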
These features are calculated using LIQUORICE. You can find more information and access LIQUORICE through the following link.
To use cfDNA-Flow, follow these steps:
Clone the cfDNA-Flow repository from GitHub to your local machine.
git clone https://github.com/uzh-dqbm-cmi/cfDNA-Flow.git
cd cfDNA-Flow
It is recommended to create a virtual environment to manage the project's dependencies. This ensures that the dependencies do not interfere with other Python projects on your machine. See how to create a Python environment here.
Once the virtual environment is activated, install the required Python dependencies using the requirements.txt file.
pip install -r requirements.txt
Additionally, some R packages are required for the project. Make sure you have R installed (version 4.3). You can install these packages by running the script below:
R -f install_packages.R
After following these steps, your environment should be set up with all the necessary dependencies for both Python and R. You are now ready to proceed with using the cfDNA-Flow pipeline. See section 4. Usage.
Once you are finished using cfDNA-Flow, deactivate the virtual environment by running:
deactivate
The configuration file, test_cfDNA_pipeline.yaml, is used to specify the input files, reference genome, and desired preprocessing options. Users can customize the trimming, alignment, and filtering settings as needed.
Settings for this demo are as follows: reads are trimmed, reference genome is hg38, mapping quality is 30, SAM flag is 40, CIGAR string is D.
To start preprocessing, execute the following command. Use the -np flag for a dry run to verify everything works correctly.
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_preprocess -np
If successful, rerun the command without the -np flag.
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_preprocess
Next, to process BED files, use:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_bedprocess
Do global length:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_global_length
Do tMAD:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_cal_blacklist
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_cal_RefSample
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_cal_t_MAD_forall
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_visualising_t_MAD_forall
Do ichorCNA:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_createPoN
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_ichorCNA
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_ichorCNA_results
Do LIQUORICE:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_LIQUORICE
Do FrEIA:
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_FrEIA_preprocessing
snakemake -s Snakefile --configfile test/test_cfDNA_pipeline.yaml -j 2 do_FrEIA
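The commands above can also be assembled programmatically. The following sketch builds the same command lines for each target in the order listed in this section; the script itself, the helper names `build_command` and `run_all`, and the assumption that the targets should run strictly in this order are not part of cfDNA-Flow. As written, it only prints the dry-run commands and does not start Snakemake.

```python
import subprocess

# Common part of every command shown in this section
BASE = ["snakemake", "-s", "Snakefile",
        "--configfile", "test/test_cfDNA_pipeline.yaml", "-j", "2"]

# Targets in the order they appear above (ordering is an assumption)
TARGETS = [
    "do_preprocess", "do_bedprocess", "do_global_length",
    "do_cal_blacklist", "do_cal_RefSample",
    "do_cal_t_MAD_forall", "do_visualising_t_MAD_forall",
    "do_createPoN", "do_ichorCNA", "do_ichorCNA_results",
    "do_LIQUORICE", "do_FrEIA_preprocessing", "do_FrEIA",
]


def build_command(target, dry_run=False):
    """Return the argument list for one target; -np requests a dry run."""
    cmd = BASE + [target]
    if dry_run:
        cmd.append("-np")
    return cmd


def run_all(dry_run=True):
    """Run every target in order and stop at the first failure."""
    for target in TARGETS:
        subprocess.run(build_command(target, dry_run), check=True)


# Print the dry-run commands without executing anything
for target in TARGETS:
    print(" ".join(build_command(target, dry_run=True)))
```

Calling `run_all(dry_run=False)` from the repository root would execute the targets one after another, assuming Snakemake is available in the active environment.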
The pipeline outputs alignment files (BAM and BED files) and quality control reports. Additionally, it outputs the features described above.
Processed BAM files of the studied samples, accompanied by their .bai index files, are stored in the results/BAM/0memhg19False40 folder and have the .sortByCoord.bam suffix.
BED files are stored in the results/BED/0memhg19False40 folder. These files store the chromosome, start position, and end position of each cfDNA fragment.
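Because each BED record describes one fragment, fragment lengths can be derived directly from the coordinates. The sketch below assumes tab-separated records whose first three columns are chromosome, start, and end, with the standard BED convention of a 0-based start and an exclusive end, so that length equals end minus start. The function name is hypothetical, and the exact column layout of the pipeline's BED files is not specified here.

```python
def fragment_lengths_from_bed(lines):
    """Return fragment lengths (end - start) for tab-separated BED records."""
    lengths = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        lengths.append(int(end) - int(start))
    return lengths


example = [
    "chr1\t10000\t10167",
    "chr1\t20500\t20645",
    "chr2\t300\t620",
]
print(fragment_lengths_from_bed(example))  # [167, 145, 320]
```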
The QC output is stored in the results/QC/0memhg19False40/multiqc_data folder. Specifically, the multiqc_report.html file contains multiple QC metrics: general statistics, Picard metrics (alignment summary, mean read length, mark duplicates, WGS coverage, WGS filtered bases), and FastQC metrics (sequence counts, sequence quality histograms and quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented sequences, adapter content, status checks).
The fragment length features are stored in the results/feature/0memhg19False40/global_length.tsv file. Each row corresponds to a studied sample, and each column stores one fragment length feature.
Outputs of features in 1 Mbp genomic bins can be found in the results/BED/0memhg19False40 folder. Values for all samples are stored in the mergeddf.csv file. Values for each individual sample are stored in files with the suffix binned.csv.
Additional length features for every sample are stored in the results/BED/0memhg19False40 folder and have the following suffixes:
binned_lengths.csv - each row contains the chromosome number, the genomic bin number (1 Mbp wide), and the lengths of all cfDNA fragments corresponding to that bin
len.csv - contains a single column listing the lengths of all cfDNA fragments derived from a sample
lenuniqcount.csv - a two-column format representing the histogram of cfDNA fragment lengths along with their frequencies
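The relationship between a list of fragment lengths and a two-column length histogram can be sketched as follows. This is an illustration of the concept only: the function name is hypothetical, and the exact layout of the pipeline's files (header row, delimiter, column order) is an assumption that should be checked against real output.

```python
from collections import Counter


def length_histogram(lengths):
    """Return (length, count) pairs sorted by fragment length."""
    return sorted(Counter(lengths).items())


lengths = [167, 145, 167, 320, 145, 167]
print(length_histogram(lengths))  # [(145, 2), (167, 3), (320, 1)]
```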
Results of the ichorCNA analysis can be found in the results/feature/0memhg19False40/ichorCNA folder. For a detailed description of the ichorCNA output, see this link. In short, ichorCNA outputs tumor fraction estimates based on CNA analysis. Additionally, it outputs CNA plots showing the log2 ratio copy number for each bin in the genome.
The outputs of tMAD are stored in the results/BED/0memhg19False40/tMAD/tMAD_results.tsv file. This file contains sample names and their corresponding tMAD values.
The outputs of LIQUORICE are stored in the results/BED/0memhg19False40/LIQUORICE/summary_across_samples_and_ROIS.csv file. This file contains sample names and their corresponding dip depth and dip area values after z-scaling. We recommend using the dip depth values, as we found them most informative.
The outputs of FrEIA are stored in the results/BED/0memhg19False40/FrEIA/0memhg19False40_FrEIA_score.csv file. This file contains sample names and their corresponding FrEIA scores.
If you have issues or questions, please contact the maintainers.