Skip to content

pombo-lab/GAM_phaser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GAM phased pipeline from the publication Irastorza-Azcarate I, Kukalev A, Kempfer R et al "Extensive folding variability between homologous chromosomes in mammalian cells", bioRxiv, 2024

01_SNP_calling

Here you can find the collection of scripts used to call SNPs from raw genomic sequencing data. We used publicly available data. The genome sequence of Mus musculus castaneus was downloaded from the European Nucleotide Archive (accession number ERP000042). Mus musculus musculus S129/SvJae genome sequence data was downloaded from the Sequence Read Archive (accession number SRX037820). It includes read quality trimming, mapping with WGA, PCR duplicate removal, and SNP calling using bcftools mpileup. For the downstream analysis, we kept SNPs that had >= 5 reads per SNP and quality >=30.

02_genome_preparation

For the next step, we used the SNPsplit package (Krueger & Andrews, 2016) to mask high-quality paternal and maternal SNPs with N-characters in the mm10 mouse genome assembly.

This folder also contains a gzip-archived, formatted VCF file with all good-quality F123 SNPs, which is fully compatible with the SNPsplit package. The SNPsplit_genome_preparation.sh script will create multiple folders, including N-masked sequences for each haplotype separately, and for hybrid cells, as described in the SNPsplit_User_Guide.pdf.

The dual hybrid N-masked genome in FASTA format should be used to generate mapping indexes for your mapper of choice (e.g. Bowtie2, BWA, etc.) for the subsequent mapping of your data from hybrid cells.

03_reads_phasing

SNPsplit and tag2sort use the list of SNPs (with genomic coordinates) to effectively phase reads from two haplotypes. This list is provided as a ZIP archive and is also generated in the previous step by SNPsplit_genome_preparation.sh.

These tools take the mapped reads (in BAM format) aligned to the previously created N-masked genome and sort them into four separate BAM files: genome1, genome2, unassigned, or conflicted reads.
The script 03.reads.coverage.sh helps generate genome coverage tables for the next step of the pipeline using these BAM files.

04_GAM_phasing

The Python script GAM_phasing.Segregation_tables.py takes three genome coverage files as input: nonphased_coverage, genome1_coverage, genome2_coverage. They can be generated by 03.reads.coverage.sh in the previous step.

The script calculates the optimal threshold to distinguish sequencing noise from signal, based on the nonphased genome coverage data, separately for each sample. Windows will be phased to a specific haplotype when the number of nucleotides covered by reads phased to that haplotype is greater than or equal to the threshold computed from the nonphased data.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published