GAM phased pipeline from the publication Irastorza-Azcarate I, Kukalev A, Kempfer R et al "Extensive folding variability between homologous chromosomes in mammalian cells", bioRxiv, 2024
01_SNP_calling
Here you can find the collection of scripts used to call SNPs from raw genomic sequencing data. We used publicly available data. The genome sequence of Mus musculus castaneus was downloaded from the European Nucleotide Archive (accession number ERP000042). Mus musculus musculus S129/SvJae genome sequence data was downloaded from the Sequence Read Archive (accession number SRX037820). It includes read quality trimming, mapping with WGA, PCR duplicate removal, and SNP calling using bcftools mpileup
. For the downstream analysis, we kept SNPs that had >= 5 reads per SNP and quality >=30.
02_genome_preparation
For the next step, we used the SNPsplit package
(Krueger & Andrews, 2016) to mask high-quality paternal and maternal SNPs with N-characters in the mm10 mouse genome assembly.
This folder also contains a gzip-archived, formatted VCF file with all good-quality F123 SNPs, which is fully compatible with the SNPsplit package. The SNPsplit_genome_preparation.sh
script will create multiple folders, including N-masked sequences for each haplotype separately, and for hybrid cells, as described in the SNPsplit_User_Guide.pdf
.
The dual hybrid N-masked genome in FASTA format should be used to generate mapping indexes for your mapper of choice (e.g. Bowtie2, BWA, etc.) for the subsequent mapping of your data from hybrid cells.
03_reads_phasing
SNPsplit
and tag2sort
use the list of SNPs (with genomic coordinates) to effectively phase reads from two haplotypes. This list is provided as a ZIP archive and is also generated in the previous step by SNPsplit_genome_preparation.sh
.
These tools take the mapped reads (in BAM format) aligned to the previously created N-masked genome and sort them into four separate BAM files: genome1, genome2, unassigned, or conflicted reads.
The script 03.reads.coverage.sh
helps generate genome coverage tables for the next step of the pipeline using these BAM files.
04_GAM_phasing
The Python script GAM_phasing.Segregation_tables.py
takes three genome coverage files as input: nonphased_coverage
, genome1_coverage
, genome2_coverage
. They can be generated by 03.reads.coverage.sh
in the previous step.
The script calculates the optimal threshold to distinguish sequencing noise from signal, based on the nonphased genome coverage data, separately for each sample. Windows will be phased to a specific haplotype when the number of nucleotides covered by reads phased to that haplotype is greater than or equal to the threshold computed from the nonphased data.