A tool for analyzing Short Tandem Repeats (STRs) from Long-Read Sequencing data
Abacus is a tool for analyzing STR (Short Tandem Repeat) data from Long-Read Sequencing technologies. It is designed to work with data from the Oxford Nanopore Technologies (ONT) platform, but has also been tested with data from the Pacific Biosciences (PacBio) platform. The main goal of Abacus is to provide a user-friendly interface for analyzing STR data and to provide a comprehensive report of the analysis results.
Abacus works by first converting the entries of an STR catalogue (JSON) into graphs, which are then used to analyze the reads from an aligned BAM file. Each read in the BAM file is first mapped to the graph using minigraph, and the number of repeats of each satellite is calculated based on the resulting path. The reads are then grouped according to the satellite repeat counts, and the STR alleles (haplotypes) are called based on these groups. The analysis results are then saved in an HTML report file, which contains information about the STR loci, the called STR alleles, and visualizations of the data.
To set up the environment for this project, follow these steps:
-
Clone the repository:
git clone https://github.com/MOMA-AUH/abacus cd abacus
-
Create and activate the conda environment using the provided
env.yml
file:conda env create -f env.yml -n abacus conda activate abacus
-
Install Abacus using
pip
:pip install .
To run Abacus, you need to provide the following arguments:
--bam
: The path to the BAM file that contains aligned reads from the Long-Read Sequencing data.--ref
: The path to the reference FASTA file that was used to align the reads in the BAM file.--str-catalouge
: The path to the STR catalogue (JSON) that contains the information about the STR loci that you want to analyze. See the provided examples in the repository.--report
: The path to the HTML file where the analysis results will be saved.--vcf
: The path to the VCF file where the STR genotyping results will be saved.--sample-id
: The identifier of the sample that you are analyzing.--sex
: The sex of the sample (default: XX). UseXX
for female andXY
for male.--loci-subset
: A subset of loci to process. Use multiple times to specify multiple loci.
abacus \
--bam input.bam \
--ref reference.fa \
--str-catalouge str_catalogue.json \
--report output.html \
--vcf output.vcf \
--sample-id my_sample
abacus \
--bam input.bam \
--ref reference.fa \
--str-catalouge str_catalogue.json \
--report output.html \
--vcf output.vcf \
--sample-id my_sample \
--sex XY \
--loci-subset FGF14 \
--loci-subset RFC1
The following configuration parameters allow fine-tuning of the analysis:
--anchor-length
: Length of the left and right anchor sequences. Default:500
.--min-anchor-overlap
: Minimum overlap between the read and the anchor sequence. Default:200
.--min-end-qual
: Minimum base quality at read ends. Used for trimming. Default:17
.--trim-window-size
: Window size for trimming low-quality bases. Default:10
.--max-trim
: Maximum number of bases to trim from the ends of reads. Default:50
.
--min-str-qual
: Minimum mean base quality in the STR region. Reads with lower quality will be filtered out. Default:20
.--tol-str-qual
: Tolerance for mean base quality in the STR region. Default:30
.--min-q10-str-quality
: Minimum Q10 base quality in STR region. Default:15
.--tol-q10-str-quality
: Tolerance for Q10 base quality in STR region. Default:30
.--max-error-rate
: Maximum allowed error rate in the STR region. Reads with higher error rates will be filtered out. Default:0.01
.--tol-error-rate
: Tolerance for error rate in the STR region. Default:0.005
.--max-ref-divergence
: Maximum allowed reference divergence in the STR region. Default:0.34
.--min-n-outlier-detection
: Minimum number of reads required for outlier detection. Default:10
.
--min-haplotyping-depth
: Minimum allowed depth for each called haplotype. If the depth is lower, the locus will be called as homozygous. Default:10
.--heterozygozity-alpha
: Sensitivity cutoff for the heterozygosity test. This test focuses on differences in length between haplotypes. Default:0.05
.
--log-file
: Path to the log file. Default:abacus.log
.--keep-temp-files
: Keep temporary files generated during analysis. Default:False
.--add-consensus-to-vcf
: Add consensus calls to the VCF output. Default:False
.--add-contracted-consensus-to-vcf
: Add contracted consensus calls to the VCF output. Default:False
.
The STR catalogue is a JSON file that contains information about the STR loci that you want to analyze. Each entry in the catalogue should contain the following information:
LocusId
: The identifier of the STR locus. This can be any string that uniquely identifies the locus. It is used to refer to the locus in the analysis results.LocusStructure
: The structure of the STR locus, where each repeat unit is enclosed in parentheses and followed by an asterisk. For example, the structure of the ATXN1 locus:(CTG)*
. The structure can contain any number of repeat units of any length and can contain IUPAC base symbols, such asN
orY
. The structure can also contain non-repeating sequences, such as flanking regions or interruptions. For example, the structure of the HTT locus:(CAG)*CAACAG(CCG)*
, whereCAACAG
is a non-repeating sequence.ReferenceRegion
: The genomic region of the STR locus in the reference genome. This can be a single region of the entire structure of the STR locus, or a list of regions that cover the entire structure of the STR locus. The regions should be in the formatchr:start-end
, wherechr
is the chromosome name andstart
andend
are the start and end positions of the region, respectively.
In str_catalouges/moma_repeat_variants_catalogue_240521.json
you can find a comprehensive list of STRs that are known to be variable in the human genome, which can be used as a starting point for your analysis. You can also create your own STR catalogue by following the format described above. Underneath is an example of the structure of the STR catalogue:
[
{
"LocusId": "ATXN1",
"LocusStructure": "(CTG)*",
"ReferenceRegion": "chr6:16327635-16327722"
},
{
"LocusId": "HTT",
"LocusStructure": "(CAG)*CAACAG(CCG)*",
"ReferenceRegion": [
"chr4:3074876-3074933",
"chr4:3074939-3074966"
]
}
]
FGF14 is a complex locus with multiple haplotypes and a large number of variants. In the provided catalouge provided examples we have included a FGF14_complex
entry that contains the following information:
[
{
"LocusId": "FGF14_complex",
"LocusStructure": "(TAGTCATAGTACCCCAA)*(GAA)*",
"ReferenceRegion": "chr13:102161565-102161726"
},
]
The insertion (TAGTCATAGTACCCCAA)*
is described in the following paper:
https://www.nature.com/articles/s41588-024-01808-5/figures/1
A lot of additional variation around the FGF14 locus is discussed in the following paper: https://www.nature.com/articles/s41467-024-52148-1