Accompanying code for Cyril J. Versoza*, Audald Lloret-Villas*, Jeffrey D. Jensen, Susanne P. Pfeifer. 2025. A pedigree-based map of crossovers and non-crossovers in aye-ayes (Daubentonia madagascariensis). Genome Biology and Evolution:evaf072.

Snakemake workflows are provided for the three processes described in the manuscript and in the sections below: variant filtering, pedigree-based approach and family-based approach.

The step-by-step document contains the relevant information to call and implement the Snakemake pipelines with standard input files.

The variant filtering workflow defines how the autosomal variants were processed and filtered to obtain a high-confidence SNP set. The rules included in this workflow are:

- raw_indel_VCFs, retrieves the normalized INDELs (that is, left-aligns the insertions and deletions for a consistent representation, and splits multiallelic sites into biallelic records) from the raw genotyped VCF with bcftools norm and bcftools view.
- bam_coverage, calculates the sequencing coverage of the BAM files with samtools coverage.
- vcf_prep, concatenates the genotyped per-chromosome autosomal VCF files with bcftools concat and calls segregating sites with bcftools view.
  - The custom script vcf_autosomes.sh is used to create the list of autosomes required by bcftools concat.
- mask_filter, given a .bed file with the coordinates of repetitive regions, masks the VCF file with bcftools view.
- dp_table, extracts the read depth (DP) at each position and formats the output as a table with bcftools query.
- dp_filter, filters out the variants that have a DP less than half or more than twice the average DP for that particular sample.
  - The custom script DP.sh uses bash code and bcftools view to perform this filtering.
- gq_filter, filters out the variants with a genotype quality (GQ) lower than 30, with bcftools view.
- het_filter, filters out the variants with an excess of heterozygosity (using a p-value threshold of 0.01) with the bcftools plugin +fill-tags and bcftools view.
- men_filter, given a .ped file with pedigree information, filters out the variants that violate the patterns expected under Mendelian inheritance, with the bcftools plugin +mendelian2.
- auto_split, splits the VCF files into autosome-specific VCF files, with bcftools view.
- snp_cluster_filter, removes clusters of variants (defined as ≥ 3 SNPs within a 10 bp window) from the autosomal VCF files.
  - The custom script SNP_clusters.py, developed with pysam, is used for this purpose.
- indel_filter, removes variants located within 10 bp of an insertion/deletion (INDEL).
  - The custom script Indels.py, developed with pysam, is used for this purpose.
- chop_ends, removes variants located within 2 Mb of the autosome ends.
  - The custom script Chop_ends.py, developed with pysam, is used for this purpose.
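
The position-based filters listed above (depth bounds, SNP clusters, INDEL proximity and chromosome ends) can be illustrated with a short, self-contained Python sketch. It is not the code in DP.sh, SNP_clusters.py, Indels.py or Chop_ends.py, which operate on VCF files through bcftools and pysam. The function names, the plain-list inputs and the exact boundary handling (for example, whether a distance of exactly 10 bp counts as "within") are assumptions made for illustration only.

```python
def dp_in_range(dp, mean_dp):
    """Keep a genotype whose depth lies between half and twice the sample mean."""
    return 0.5 * mean_dp <= dp <= 2.0 * mean_dp


def remove_snp_clusters(positions, max_snps=3, window=10):
    """Drop SNPs that belong to a cluster of >= max_snps SNPs within `window` bp.

    `positions` is a sorted list of 1-based SNP coordinates on one chromosome.
    Assumption: a window covers `window` consecutive bases (p .. p + window - 1).
    """
    clustered = set()
    for i in range(len(positions) - max_snps + 1):
        j = i + max_snps - 1
        if positions[j] - positions[i] < window:
            clustered.update(positions[i:j + 1])
    return [p for p in positions if p not in clustered]


def remove_near_indels(snp_positions, indel_intervals, distance=10):
    """Drop SNPs within `distance` bp of any INDEL interval (start, end), inclusive."""
    kept = []
    for pos in snp_positions:
        near = any(start - distance <= pos <= end + distance
                   for start, end in indel_intervals)
        if not near:
            kept.append(pos)
    return kept


def chop_ends(positions, chrom_length, margin=2_000_000):
    """Drop variants located within `margin` bp of either chromosome end."""
    return [p for p in positions if margin < p <= chrom_length - margin]


if __name__ == "__main__":
    # Toy coordinates and a toy chromosome length, purely for demonstration.
    snps = [100, 104, 108, 250, 400, 405, 900]
    snps = remove_snp_clusters(snps)
    snps = remove_near_indels(snps, indel_intervals=[(395, 397)])
    snps = chop_ends(snps, chrom_length=1000, margin=200)
    print(snps)
```

The filters are written as independent functions so that they can be chained in the same order as the workflow rules. A real implementation would read and write VCF records rather than bare coordinates.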

Phase-informative markers are identified from the high-confidence SNPs detected above by applying a pedigree-based approach (for a schematic of the workflow, see the Figure below, included in Versoza, Weiss et al. (2024)) with the following rules:

- ped_split, generates the six three-generation pedigree-specific sets of segregating SNPs with bcftools view.
- supreads_filter, keeps only the positions supported by more than 25% but less than 75% of the mapped reads with bcftools view.
- ped_F1_het, keeps only the positions where the F1 individual is heterozygous, with bcftools view.
- ped_F0_diff, keeps only the positions where the F0 individuals (parents) exhibit non-identical genotypes, with a combination of bcftools view and bcftools sort.
- partner_F2_hom_ped, keeps only the positions where either the F1's partner or their joint F2 offspring is homozygous. The variants resulting from this step are considered phase-informative markers (see Figure underneath this list).
- phase_script, phases the phase-informative markers. That is, it indicates whether the variants have a grandpaternal (gpat) or grandmaternal (gmat) origin. A combination of bcftools query and bash commands identifies the origin and formats the calls.
- ph_events, simplifies and formats the output indicating the phase (gpat or gmat), the coordinates, the REF/ALT alleles and the genotype of each member of the pedigree for the phase-informative markers.
  - The custom script pedigrees.py, developed with pandas, is used for this purpose.
- clean_blocks, detects breakpoints (changes of phase) in the phase-informative markers, removes short regions (5 kb) with multiple changes of phase, identifies breakpoints encompassing more than one phase-informative marker, re-phases the markers, and determines the midpoint values and resolution tracts of the events.
  - The custom script Clean_blocks.py, developed with pandas, is used for this purpose.
  - The resulting filtered files are then subject to a visual exploration to identify crossover (CO) and non-crossover (NCO) events.
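
The selection and phasing logic of the pedigree-based rules above can be sketched in plain Python for a single biallelic site. This is an illustrative reconstruction of the criteria as described, not the bcftools/bash code used by the workflow. Genotypes are assumed to be unphased pairs of allele indices, the function names are hypothetical, and the label gmat for the grandmaternal origin is assumed here by analogy with gpat.

```python
def is_het(gt):
    """True if the two alleles of an unphased genotype differ."""
    return gt[0] != gt[1]


def f1_allele_origins(gpa, gma, f1):
    """Map each allele of a heterozygous F1 to 'gpat' or 'gmat'.

    gpa / gma are the genotypes of the F1's father and mother (the F0s).
    Returns None if the F1 is homozygous, if the F0 genotypes are identical,
    or if the assignment is ambiguous or violates Mendelian inheritance.
    """
    if not is_het(f1) or sorted(gpa) == sorted(gma):
        return None
    a, b = f1
    # Try both ways of assigning the F1 alleles to the two grandparents.
    options = [(p, m) for p, m in ((a, b), (b, a)) if p in gpa and m in gma]
    if len(options) != 1:
        return None
    pat, mat = options[0]
    return {pat: "gpat", mat: "gmat"}


def transmitted_f1_allele(f1, partner, f2):
    """Allele the F1 passed on to the F2, or None if it cannot be resolved."""
    if not is_het(f2):
        # A homozygous F2 received the same allele from both parents.
        return f2[0] if f2[0] in f1 else None
    if not is_het(partner):
        # A homozygous partner can only contribute one allele.
        from_partner = partner[0]
        if from_partner not in f2:
            return None
        other = f2[0] if f2[1] == from_partner else f2[1]
        return other if other in f1 else None
    return None  # partner and F2 both heterozygous: not informative


def phase_marker(gpa, gma, f1, partner, f2):
    """Return 'gpat', 'gmat', or None for a non-informative site."""
    origins = f1_allele_origins(gpa, gma, f1)
    if origins is None:
        return None
    allele = transmitted_f1_allele(f1, partner, f2)
    return None if allele is None else origins[allele]


if __name__ == "__main__":
    # Grandfather 0/0, grandmother 1/1, F1 0/1, partner 0/0, F2 0/1.
    print(phase_marker((0, 0), (1, 1), (0, 1), (0, 0), (0, 1)))
```

Trying both allele assignments, rather than hard-coding which grandparent is homozygous, keeps the sketch short and also rejects sites that are inconsistent with Mendelian inheritance.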

Phase-informative markers are identified from the high-confidence SNPs detected above by applying a family-based approach (for a schematic of the workflow, see the Figure below, included in Versoza, Lloret-Villas et al. (2025), following the methodology outlined in Coop et al. (2008)) with the following rules:

- fam_split, generates the three two-generation nuclear family-specific sets of segregating SNPs with bcftools view.
- supreads_filter, keeps only the positions supported by more than 25% but less than 75% of the mapped reads with bcftools view.
- inf_markers, identifies and separates maternally phase-informative markers (at which the dam is heterozygous and the sire homozygous) and paternally phase-informative markers (at which the sire is heterozygous and the dam homozygous) with a combination of bcftools view and bcftools query.
- ph_events, simplifies and conditionally formats the output depending on the number of offspring. It indicates the origin of the transmitted alleles (maternal or paternal), the coordinates, the REF/ALT alleles and the genotype of each member of the family for the phase-informative markers, as well as the phase of each offspring relative to the template offspring.
  - The custom script families.py, developed with pandas, is used for this purpose.
- clean_blocks, detects breakpoints (changes of phase) in the phase-informative markers, removes short regions (5 kb) with multiple changes of phase, identifies breakpoints encompassing more than one phase-informative marker, re-phases the markers, and determines the midpoint values and resolution tracts of the events.
  - The custom script Clean_blocks.py, developed with pandas, is used for this purpose.
  - The resulting filtered files are then subject to a visual exploration to identify crossover (CO) and non-crossover (NCO) events.
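
As a rough illustration of the family-based rules above, the following Python sketch classifies a site as maternally or paternally informative, compares each offspring with a template offspring, and then locates changes of phase along a chromosome. It is a simplified stand-in, not the code in families.py or Clean_blocks.py. All function names and the "same"/"diff" labels are hypothetical, genotypes are assumed to be unphased pairs of allele indices, and the removal of short regions and the re-phasing step are deliberately left out.

```python
def is_het(gt):
    """True if the two alleles of an unphased genotype differ."""
    return gt[0] != gt[1]


def classify_marker(dam, sire):
    """Return 'maternal', 'paternal', or None for a non-informative site."""
    if is_het(dam) and not is_het(sire):
        return "maternal"
    if is_het(sire) and not is_het(dam):
        return "paternal"
    return None


def allele_from_het_parent(child, hom_parent):
    """Allele the child received from the heterozygous parent."""
    from_hom = hom_parent[0]
    if from_hom not in child:
        return None  # inconsistent with Mendelian inheritance
    return child[1] if child[0] == from_hom else child[0]


def relative_phases(dam, sire, template, offspring):
    """Phase of each offspring relative to the template offspring."""
    kind = classify_marker(dam, sire)
    if kind is None:
        return None
    hom_parent = sire if kind == "maternal" else dam
    ref = allele_from_het_parent(template, hom_parent)
    if ref is None:
        return None
    phases = []
    for child in offspring:
        allele = allele_from_het_parent(child, hom_parent)
        phases.append(None if allele is None
                      else "same" if allele == ref else "diff")
    return kind, phases


def find_breakpoints(markers):
    """Locate changes of phase between consecutive informative markers.

    `markers` is a position-sorted list of (position, phase) tuples for one
    offspring. Each event reports its flanking markers, their midpoint and
    the size of the interval between them (taken here as the resolution).
    """
    events = []
    for (p1, ph1), (p2, ph2) in zip(markers, markers[1:]):
        if ph1 != ph2:
            events.append({"left": p1, "right": p2,
                           "midpoint": (p1 + p2) / 2,
                           "resolution": p2 - p1})
    return events


if __name__ == "__main__":
    # Dam 0/1, sire 0/0, template 0/1, two further offspring.
    print(relative_phases((0, 1), (0, 0), (0, 1), [(0, 0), (0, 1)]))
    print(find_breakpoints([(1000, "same"), (2000, "same"),
                            (5000, "diff"), (9000, "diff")]))
```

In this representation a crossover would appear as a single change of phase that persists along the chromosome, whereas a short tract flanked by two nearby changes would be a candidate non-crossover. The workflow leaves that final call to visual inspection.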

All the software and tools used for the development of the Snakemake workflow and accompanying scripts can be downloaded and installed via conda/mamba. These are the links to the package recipes and versions used: