malariagen-pf8-cnv-calling

The code in this repo was used for making Pf8's copy number variation (CNV) calls.

The calls cover amplifications in MDR1, CRT, GCH1 and plasmepsin2/3, and deletions in HRP2 and HRP3, across 24,409 samples.


Overview

This workflow integrates both coverage-based approaches (also referred to as HMM, germline CNV, or gCNV) and breakpoint evidence (described primarily as the "faceaway" method) to generate high-confidence copy number variation (CNV) calls. The complete process consists of six sequential modules that process data from raw BAM files to final CNV calls.

After cloning the repo, the user must execute each of the six stages of the CNV calling process from within its dedicated subfolder.

CNV visual summary for Pf8

Visual summary of the two approaches used for copy number variation (CNV) calling for Pf8. Left: paired-end reads spanning the breakpoint of a tandem duplication map to the reference in a reversed, "face away" orientation. Right: an amplified region shows more read coverage than expected when its reads are mapped to the reference genome.
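The two signals in the figure can be illustrated with a minimal sketch. This is not the code used in this repo: the function names, the argument layout and the naive coverage ratio are assumptions made for illustration only. The actual pipeline uses HMM-based gCNV models and the faceaway scripts in stages 03 and 05.

```python
from statistics import median


def is_face_away(pos1: int, reverse1: bool, pos2: int, reverse2: bool) -> bool:
    """Return True if two mates on the same chromosome point away from each other.

    pos1/pos2 are leftmost mapping positions; reverse1/reverse2 say whether
    each mate maps to the reverse strand. (Hypothetical helper.)
    """
    # Order the mates by position so that "left" is the upstream mate.
    (_, left_rev), (_, right_rev) = sorted([(pos1, reverse1), (pos2, reverse2)])
    # A normal pair faces inwards: left mate forward, right mate reverse.
    # A face-away pair is the opposite: left mate reverse, right mate forward.
    return left_rev and not right_rev


def coverage_ratio(region_counts, background_counts) -> float:
    """Mean read count in a region divided by the median background count.

    A ratio near 2 hints at a duplication; this is a naive stand-in for
    the HMM, not a replacement for it. (Hypothetical helper.)
    """
    background = median(background_counts)
    if background == 0:
        raise ValueError("background coverage is zero")
    return (sum(region_counts) / len(region_counts)) / background
```

For example, a pair whose upstream mate is on the reverse strand and whose downstream mate is on the forward strand is classified as face-away, while the usual inward-facing pair is not.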

Workflow Structure

For detailed information about each component, refer to the README files in the respective subdirectories:

1. Input Generation Pipeline - 01_input_generation_pipeline/

Prepares genomic data by segmenting the reference genome into 500-bp bins, filtering regions of interest, and extracting read counts from BAM files.

2. Sample Clustering - 02_sample_clustering/

Categorizes samples into appropriate cohorts (gDNA/sWGA) for optimal HMM training in the next step.

3. Coverage-based Pipeline - 03_coverage_based_pipeline/

Performs the core CNV detection by training HMM models on appropriate sample cohorts and inferring copy number states across all samples.

4. gCNV Calls Validation - 04_gcnv_calls_validation/

Converts raw coverage outputs to genotype calls, generates diagnostic plots, and facilitates manual curation of potentially low-confidence calls. Deletions were confirmed using IGV.

5. Faceaway Data Generation - 05_faceaway_data_generation/

Identifies and characterizes breakpoint evidence for structural variants to complement coverage-based calls.

6. Final Calls and Publication Figures - 06_final_calls_and_pf8_figures/

Integrates coverage-based and breakpoint-based evidence to produce final CNV calls and generates figures for publication.
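As a quick sanity check after cloning, the six stage folders listed above can be verified in order. The sketch below is an illustration only: the helper name `missing_stage_dirs` is an assumption and is not part of this repo.

```python
from pathlib import Path

# Stage subfolders, in the order they should be run (names taken from this README).
STAGE_DIRS = [
    "01_input_generation_pipeline",
    "02_sample_clustering",
    "03_coverage_based_pipeline",
    "04_gcnv_calls_validation",
    "05_faceaway_data_generation",
    "06_final_calls_and_pf8_figures",
]


def missing_stage_dirs(repo_root) -> list:
    """Return the stage folders that are absent under repo_root (hypothetical helper)."""
    root = Path(repo_root)
    return [name for name in STAGE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_stage_dirs(".")
    if missing:
        print("Missing stage folders:", ", ".join(missing))
    else:
        print("All six stage folders found.")
```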


Additional comments

Before using this repo, please read the following:

  • This codebase is very complex. Please read the Pf8 Supplementary Materials and the README.md file in each subfolder before using the codebase.
  • Files and directories in this codebase relating to the Nextflow pipelines typically carry a numerical prefix corresponding to the stage of the coverage-based CNV calling process at which they are ingested. For example, 03_annotated.interval_list is ingested at stage 03 (the name does not indicate where the file is produced). This has been done to help with navigating the codebase.
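The naming convention can be read programmatically from the leading two-digit number of a file or directory name. The following is a minimal sketch, assuming the number is always two digits followed by an underscore; the helper `ingest_stage` is hypothetical and not part of this repo.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Two leading digits followed by an underscore, e.g. "03_" (assumed pattern).
_STAGE_NUMBER = re.compile(r"^(\d{2})_")


def ingest_stage(path: str) -> Optional[int]:
    """Return the stage at which a file is ingested, or None if it has no stage number."""
    name = PurePosixPath(path).name  # drop parent folders and any trailing slash
    match = _STAGE_NUMBER.match(name)
    return int(match.group(1)) if match else None


print(ingest_stage("03_annotated.interval_list"))        # 3
print(ingest_stage("assets_pf8/01_paths_to_bams.tsv"))   # 1
print(ingest_stage("chromosomes_lengths.tsv"))           # None
```

Files without a stage number, such as chromosomes_lengths.tsv, return None, so they can be handled separately.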

Additional files

In addition to the subdirectories above, you will see:

  • base.config : The base Nextflow config file loaded by the Nextflow pipelines in stages 01 and 03.
  • containers/ : For managing Singularity containers for GATK.
  • requirements.txt: For all Python tasks.
  • visual_summary/: Holds the .gif shown in this README.md.
  • assets_pf8/ : The collection of input and output files (both to be provided by the user and also created by the Nextflow pipelines) that all belong to a single experimental setup. The intention was that the user can run one experiment in this assets_pf8 folder, then run a second experiment with slightly different parameters or samples in another folder (e.g., assets_pf8_experimental):
    • Files for stage 01: Input Generation Pipeline
      • 01_execution/: Output directory for stage 01 Nextflow pipeline. Should contain one .counts.tsv per sample.
      • 01_intervals_of_interest.interval_list: User-defined regions for analysis. We defined six 500-kilobase regions.
      • 01_paths_to_bams.tsv: User-provided list of sample IDs and paths to corresponding .bam files.
      • regions-20130225.onebased.txt: Definition of the core genome, used to help decide 01_intervals_of_interest.interval_list and 03_blacklist.interval_list.
      • PlasmoDB-55_Pfalciparum3D7.gff: GFF used to help decide 01_intervals_of_interest.interval_list, and for plotting in stage 04.
    • Files for stage 02: Sample Clustering
      • 02_paths_to_readcounts.tsv: Generated by stage 02 pipeline. Sample IDs and paths to corresponding .counts.tsv files. Used in stage 02 for generating the manifest file for stage 03.
      • Pf_8_samples_20241212.txt: Sample metadata file for deciding train/test cohorts when generating manifest file for stage 03.
    • Files for stage 03: Coverage-based Pipeline
      • 03_sample_cluster_assignment.tsv: Manifest file generated by stage 02.
      • 03_contig_ploidy_priors.tsv: User-defined contig ploidy priors.
      • 03_blacklist.interval_list: User-defined regions to exclude from HMM training.
      • 03_restricted.interval_list: List of 500-bp bins as defined by 01_intervals_of_interest.interval_list produced by stage 01.
      • 03_annotated.interval_list: 03_restricted.interval_list but annotated with GC-content, produced by stage 01.
      • 03_filtered.interval_list: 03_annotated.interval_list but filtered in stage 03 to only include useful bins. Note that although 6000 bins were requested, only 4850 were used for training the HMMs.
      • 03_execution/: Output directory of stage 03. Contains lists of samples and contigs with inappropriate ploidy numbers, plus one subfolder per sample (in postprocessing/) containing five raw pipeline outputs (described in more detail in 04_gcnv_calls_validation/README.md).
      • models/: Trained contig ploidy models and gCNV models generated from Pf8. Ready for use.
    • Files for stage 04: gCNV Calls Validation
      • 04_call_regions.tsv: User-defined regions for making coverage-based CNV genotype calls (typically single genes, except for plasmepsin2/3).
      • chromosomes_lengths.tsv: Lengths of each Pf3D7 reference chromosome for plotting.
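The bin counts mentioned above follow from simple arithmetic: six 500-kilobase regions split into 500-bp bins give 6000 bins, of which 4850 were kept after filtering. The sketch below reproduces that calculation. It is an illustration only: `make_bins` and the example coordinates are assumptions, and the real bins are produced by the stage 01 Nextflow pipeline.

```python
def make_bins(start: int, end: int, bin_size: int = 500) -> list:
    """Split a 1-based, inclusive interval into consecutive bins (hypothetical helper)."""
    bins = []
    for bin_start in range(start, end + 1, bin_size):
        # The last bin is truncated if the interval is not a multiple of bin_size.
        bins.append((bin_start, min(bin_start + bin_size - 1, end)))
    return bins


# One 500-kb region with made-up coordinates, for illustration only.
region_bins = make_bins(100_001, 600_000)
requested = 6 * len(region_bins)   # six regions of the same size
used = 4850                        # bins kept for HMM training, as stated above

print(len(region_bins))            # 1000
print(requested)                   # 6000
print(f"{used / requested:.1%}")   # 80.8%
```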
