GitHub - pombo-lab/GAM_phaser

GAM phased pipeline from the publication Irastorza-Azcarate I, Kukalev A, Kempfer R et al "Extensive folding variability between homologous chromosomes in mammalian cells", bioRxiv, 2024

01_SNP_calling

Here you can find the collection of scripts used to call SNPs from raw genomic sequencing data. We used publicly available data. The genome sequence of Mus musculus castaneus was downloaded from the European Nucleotide Archive (accession number ERP000042). Mus musculus musculus S129/SvJae genome sequence data was downloaded from the Sequence Read Archive (accession number SRX037820). It includes read quality trimming, mapping with WGA, PCR duplicate removal, and SNP calling using bcftools mpileup. For the downstream analysis, we kept SNPs that had >= 5 reads per SNP and quality >=30.

02_genome_preparation

For the next step, we used the SNPsplit package (Krueger & Andrews, 2016) to mask high-quality paternal and maternal SNPs with N-characters in the mm10 mouse genome assembly.

This folder also contains a gzip-archived, formatted VCF file with all good-quality F123 SNPs, which is fully compatible with the SNPsplit package. The SNPsplit_genome_preparation.sh script will create multiple folders, including N-masked sequences for each haplotype separately, and for hybrid cells, as described in the SNPsplit_User_Guide.pdf.

The dual hybrid N-masked genome in FASTA format should be used to generate mapping indexes for your mapper of choice (e.g. Bowtie2, BWA, etc.) for the subsequent mapping of your data from hybrid cells.

03_reads_phasing

SNPsplit and tag2sort use the list of SNPs (with genomic coordinates) to effectively phase reads from two haplotypes. This list is provided as a ZIP archive and is also generated in the previous step by SNPsplit_genome_preparation.sh.

These tools take the mapped reads (in BAM format) aligned to the previously created N-masked genome and sort them into four separate BAM files: genome1, genome2, unassigned, or conflicted reads.
The script 03.reads.coverage.sh helps generate genome coverage tables for the next step of the pipeline using these BAM files.

04_GAM_phasing

The Python script GAM_phasing.Segregation_tables.py takes three genome coverage files as input: nonphased_coverage, genome1_coverage, genome2_coverage. They can be generated by 03.reads.coverage.sh in the previous step.

The script calculates the optimal threshold to distinguish sequencing noise from signal, based on the nonphased genome coverage data, separately for each sample. Windows will be phased to a specific haplotype when the number of nucleotides covered by reads phased to that haplotype is greater than or equal to the threshold computed from the nonphased data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
01_SNP_calling		01_SNP_calling
02_genome_preparation		02_genome_preparation
03_reads_phasing		03_reads_phasing
04_GAM_phasing		04_GAM_phasing
README.md		README.md

pombo-lab/GAM_phaser

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages