CNV calls for Pf8: amplifications in MDR1, CRT, GCH1, and plasmepsin2/3, and deletions in HRP2 and HRP3, across 24,409 samples.
This workflow integrates both coverage-based approaches (also referred to as HMM, germline CNV, or gCNV) and breakpoint evidence (described primarily as the "faceaway" method) to generate high-confidence copy number variation (CNV) calls. The complete process consists of six sequential modules that process data from raw BAM files to final CNV calls.
After cloning the repo, the user must run each of the six stages of the CNV calling process from within that stage's dedicated subfolder.
Visual summary of the two approaches used for copy number variation (CNV) calling for Pf8. Left: paired-end reads spanning the breakpoint of a tandem duplication map in a reversed, "face away" orientation. Right: in the event of an amplification, reads mapped to the reference genome show more coverage than expected.

For detailed information about each component, refer to the README files in the respective subdirectories:
Prepares genomic data by segmenting the reference genome into 500bp bins, filtering regions of interest, and extracting read counts from BAM files.
Categorizes samples into appropriate cohorts (gDNA/sWGA) for optimal HMM training in the next step.
Performs the core CNV detection by training HMM models on appropriate sample cohorts and inferring copy number states across all samples.
Converts raw coverage outputs to genotype calls, generates diagnostic plots, and facilitates manual curation of potentially low-confidence calls. Deletions were confirmed using IGV.
Identifies and characterizes breakpoint evidence for structural variants to complement coverage-based calls.
Integrates coverage-based and breakpoint-based evidence to produce final CNV calls and generates figures for publication.
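The "face away" signal described in the visual summary can be expressed as a simple orientation test on a read pair. The sketch below is illustrative only: the `Read` class and `is_face_away` function are hypothetical names, not part of this codebase, and the actual breakpoint module may apply further filters (for example on mapping quality or insert size) that are not shown here. It assumes both reads of the pair map to the same contig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Minimal stand-in for one aligned read (hypothetical, for illustration)."""
    start: int        # leftmost aligned position
    end: int          # rightmost aligned position (inclusive)
    is_reverse: bool  # True if the read aligns to the reverse strand


def is_face_away(a: Read, b: Read) -> bool:
    """Return True if the two mates point away from each other.

    In a normal pair the forward-strand read is leftmost and the
    reverse-strand read lies to its right, so the mates face inward.
    Across a tandem duplication junction the reverse-strand read maps
    entirely to the left of the forward-strand read instead.
    """
    if a.is_reverse == b.is_reverse:
        return False  # same-strand pairs are a different signature
    fwd, rev = (b, a) if a.is_reverse else (a, b)
    return rev.end < fwd.start


# A normal inward-facing pair, then a pair spanning a duplication junction.
print(is_face_away(Read(100, 249, False), Read(400, 549, True)))
print(is_face_away(Read(9000, 9149, False), Read(1000, 1149, True)))
```

The test is deliberately conservative: it requires the reverse-strand read to end before the forward-strand read starts, so overlapping mates from short fragments are not flagged.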
- This codebase is very complex. Please read the Pf8 Supplementary Materials and the `README.md` files in each subfolder before using the codebase.
- Files and directories in this codebase relating to the Nextflow pipelines typically carry a numerical prefix corresponding to the stage of the coverage-based CNV calling process where they are ingested. For example, `03_annotated.interval_list` is ingested at step 03 (the name does not indicate where the file is produced). This has been done to help with navigating the codebase.
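As a small illustration of this naming convention, the leading number in names such as `03_annotated.interval_list` can be read off programmatically. This is a sketch under the assumption that stage numbers are always two digits followed by an underscore; `ingestion_stage` is a hypothetical helper, not a function provided by the codebase.

```python
import re
from typing import Optional

# Assumed convention: two leading digits followed by an underscore.
_STAGE_PREFIX = re.compile(r"^(\d{2})_")


def ingestion_stage(name: str) -> Optional[int]:
    """Return the stage at which a file or directory is ingested, if encoded."""
    match = _STAGE_PREFIX.match(name)
    return int(match.group(1)) if match else None


for name in ["03_annotated.interval_list", "01_execution/", "base.config"]:
    print(name, ingestion_stage(name))
```

Names without a stage number, such as `base.config`, return `None`.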
In addition to the subdirectories above, you will see:
- `base.config`: The base Nextflow config file from which the Nextflow pipelines in steps 01 and 03 load.
- `containers/`: For managing Singularity containers for GATK.
- `requirements.txt`: For all Python tasks.
- `visual_summary/`: Holds the `.gif` shown in this `README.md`.
- `assets_pf8/`: The collection of input and output files (both those provided by the user and those created by the Nextflow pipelines) that all belong to a single experimental setup. The intention is that the user can run one experiment in this `assets_pf8` folder, then run a second experiment with slightly different parameters or samples in another folder (e.g., `assets_pf8_experimental`):
- Files for stage 01: Input Generation Pipeline
  - `01_execution/`: Output directory for the stage 01 Nextflow pipeline. Should contain one `.counts.tsv` per sample.
  - `01_intervals_of_interest.interval_list`: User-defined regions for analysis. We defined six 500-kilobase regions.
  - `01_paths_to_bams.tsv`: User-provided list of sample IDs and paths to the corresponding `.bam` files.
  - `regions-20130225.onebased.txt`: Definition of the core genome, used to help decide `01_intervals_of_interest.interval_list` and `03_blacklist.interval_list`.
  - `PlasmoDB-55_Pfalciparum3D7.gff`: GFF used to help decide `01_intervals_of_interest.interval_list`, and for plotting in stage 04.
- Files for stage 02: Sample Clustering
  - `02_paths_to_readcounts.tsv`: Generated by stage 02 pipeline. Sample IDs and paths to the corresponding `.count.tsv` files. Used in stage 02 for generating the manifest file for stage 03.
  - `Pf_8_samples_20241212.txt`: Sample metadata file for deciding train/test cohorts when generating the manifest file for stage 03.
- Files for stage 03: Coverage-based Pipeline
  - `03_sample_cluster_assignment.tsv`: Manifest file generated by stage 02.
  - `03_contig_ploidy_priors.tsv`: User-defined contig ploidy priors.
  - `03_blacklist.interval_list`: User-defined regions that are excluded from HMM training.
  - `03_restricted.interval_list`: List of 500-bp bins as defined by `01_intervals_of_interest.interval_list`, produced by stage 01.
  - `03_annotated.interval_list`: `03_restricted.interval_list` but annotated with GC-content, produced by stage 01.
  - `03_filtered.interval_list`: `03_annotated.interval_list` but filtered in stage 03 to only include useful bins. Note that although 6000 bins were requested, only 4850 were used for training the HMMs.
  - `03_execution/`: Output directory of stage 03. Contains lists of samples and contigs with inappropriate ploidy numbers and a subfolder (in `postprocessing/`) for each sample containing 5 raw pipeline outputs (described in more detail in `04_gcnv_calls_validation/README.md`).
  - `models/`: Trained contig ploidy models and gCNV models generated from Pf8. Ready for use.
- Files for stage 04: gCNV Calls Validation
  - `04_call_regions.tsv`: User-defined regions for making coverage-based CNV genotype calls (typically single genes, except for plasmepsin2/3).
  - `chromosomes_lengths.tsv`: Lengths of each Pf3D7 reference chromosome, for plotting.
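
The 6000 requested bins mentioned for stage 03 follow directly from the interval definitions: six 500-kilobase regions cut into 500-bp bins give 6 × 1000 = 6000 bins. A minimal sketch of that arithmetic is shown below. The contig names and coordinates are hypothetical placeholders, `make_bins` is an illustrative helper rather than a function from this codebase, and 1-based inclusive coordinates are assumed, as in Picard-style `.interval_list` files.

```python
from typing import Iterator, List, Tuple

BIN_SIZE = 500  # bin width in bp, as described for stage 01


def make_bins(contig: str, start: int, end: int,
              size: int = BIN_SIZE) -> Iterator[Tuple[str, int, int]]:
    """Yield consecutive bins covering [start, end], 1-based inclusive."""
    for bin_start in range(start, end + 1, size):
        yield contig, bin_start, min(bin_start + size - 1, end)


# Six hypothetical 500-kb regions standing in for the user-defined intervals.
regions: List[Tuple[str, int, int]] = [
    (f"contig_{i}", 1, 500_000) for i in range(1, 7)
]
bins = [b for region in regions for b in make_bins(*region)]
print(len(bins))
```

The gap between the requested bins and the 4850 used for training reflects the filtering applied in stage 03, which this sketch does not attempt to reproduce.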