This repository contains custom code (Jupyter notebooks, small Snakemake workflows and Python scripts) implementing various analyses for the HGSVC phase 3 manuscript:
Logsdon, Ebert, Audano, Loftus et al.,
Complex genetic variation in nearly complete human genomes
https://doi.org/10.1101/2024.09.24.614721 bioRxiv
The code was written specifically for the purposes of this project and, thus, does not address generalized use cases.
The code in this repository covers, for example, internal data management and post-processing of workflow results (de novo assembly and evaluation; see the workflow info in pyproject.toml), mostly for the purpose of plotting and creating summary tables.
The Snakemake workflows implement small-scale processes such as extracting the MHC/HLA region or chromosome Y from each assembly.
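As an illustration of what such a small-scale extraction step amounts to, the following is a minimal Python sketch, not the code used in the workflows of this repository. The helper names (`read_fasta`, `extract_region`), the toy contigs and the coordinates are all hypothetical; the sketch assumes 0-based, half-open (BED-style) coordinates and an assembly small enough to hold in memory.

```python
from typing import Dict, List, Optional


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {contig_name: sequence} mapping."""
    records: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token as the contig name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {contig: "".join(parts) for contig, parts in records.items()}


def extract_region(sequences: Dict[str, str], contig: str, start: int, end: int) -> str:
    """Return the subsequence [start, end) of a contig (0-based, half-open)."""
    seq = sequences[contig]
    if not 0 <= start < end <= len(seq):
        raise ValueError(f"invalid interval {start}-{end} for {contig} (length {len(seq)})")
    return seq[start:end]


# Toy assembly; contig names and coordinates are made up for illustration.
toy_fasta = """>contig_A example description
ACGTACGTAC
GTTTGACCAA
>contig_B
TTTTGGGGCC
"""

sequences = read_fasta(toy_fasta)
region = extract_region(sequences, "contig_A", 8, 14)
print(region)
```

For real assemblies, an indexed FASTA reader would be the more practical choice, because it avoids loading every contig into memory.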
Following the update of the HG00514 Verkko assembly, only a subset of customized workflows was executed.
Updated:
# update for SIG-MHC, extracted MHC region from new assembly
workflow::modules::regions::hla::extract.smk
# update of rDNA / ribotin runs, forwarded to Mir Henglin
workflow::modules::rdna::ribotin.smk
# update of alignment support for CHS trio (child HG00514)
workflow::modules::asm_compare::trio_align.smk
# update of HPRC gap evaluation
workflow::modules::regions::gaps::annotate_gaps.smk
The integrative QC analysis ("assembly label QC") was not updated because the complete run requires the annotation of centromeres, which is not available for HG00514 v2.
Last commit before updating MBG and GraphAligner to get bug fixes for the last set of samples:
commit: #dbd3f7c88d9e28b052164c650e6ed56b7ba837de
- graphaligner=1.0.17
- mbg=1.0.15
Updated to
commit: #2fe632d38ea615fb93d1af63c462078490e285f9
- graphaligner=1.0.18
- mbg=1.0.16
for samples:
- YRI trio: NA19238, NA19239, NA19240
- CHS trio: HG00512, HG00513, HG00514
- HG00096
- HG00732 / PUR mother
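The version split above can be written down as a small lookup, which may help when checking which tool versions apply to a given assembly. This is a minimal sketch and not part of the repository: the function name `tool_versions` is hypothetical, and it assumes that every sample not listed above was processed with the earlier versions.

```python
# Tool versions and sample IDs are copied from the lists above.
# Assumption: all samples not listed there used the earlier versions.
EARLIER_VERSIONS = {"graphaligner": "1.0.17", "mbg": "1.0.15"}
UPDATED_VERSIONS = {"graphaligner": "1.0.18", "mbg": "1.0.16"}

UPDATED_SAMPLES = frozenset(
    {
        "NA19238", "NA19239", "NA19240",  # YRI trio
        "HG00512", "HG00513", "HG00514",  # CHS trio
        "HG00096",
        "HG00732",  # PUR mother
    }
)


def tool_versions(sample: str) -> dict:
    """Return a copy of the tool versions that apply to the given sample."""
    if sample in UPDATED_SAMPLES:
        return dict(UPDATED_VERSIONS)
    return dict(EARLIER_VERSIONS)


print(tool_versions("HG00514"))
```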
NA19320 - cell line does not grow, insufficient ONT
NA18939 HiFi - resequencing
HG04036 HiFi - v1.4.1+dirty assembly completed
NA21487 HiFi - v1.4.1+dirty assembly completed
Production version currently is v1.4+dirty [added commits #3119b39 and #4f6a54e]
- Trio kmer DBs
- HXT - Illumina HiSeq X Ten
- NVS - Illumina NovaSeq (6000)