StrainR is a Python-based toolkit for k-mer based metagenomic classification. It aims to assign strain labels to DNA sequences by analyzing k-mer distributions against a custom-built database. StrainR is designed for efficiency, enabling database creation and sample classification to be performed with modest computational resources.
The toolkit has recently undergone significant refactoring to improve code quality, maintainability, and user experience.
The StrainR toolkit is built around several key Python classes that manage different parts of the workflow:
StrainKmerDatabase
(src/strainr/database.py
): Manages the k-mer presence/absence data. This class loads and provides access to the database generated bysrc/strainr/build_db.py
.DatabaseBuilder
(insrc/strainr/build_db.py
): Orchestrates the creation of the k-mer database. It handles genome acquisition (from NCBI or custom files) and the subsequent k-mer extraction and matrix generation.KmerClassificationWorkflow
(insrc/strainr/classify.py
): Manages the primary workflow for classifying sequence reads. It uses aStrainKmerDatabase
and theClassificationAnalyzer
.PangenomeAnalysisWorkflow
(instrainr/strainr-pangenome.py
): Provides another workflow for k-mer based analysis, potentially suited for pangenome-scale investigations or alternative analysis strategies.ClassificationAnalyzer
(strainr/analyze.py
): Handles post-classification analysis, including categorizing k-mer hits (clear, ambiguous, none), calculating strain priors, and resolving ambiguous assignments using various statistical methods.AbundanceCalculator
(strainr/output.py
): Intended for calculating strain abundances, normalization, thresholding, and generating output reports (Note: This component's refactoring faced tool-related issues, its full integration might be pending). The workflow scripts currently handle abundance calculations internally.
Recent development has focused on:
- Class-Based Workflows: Key processes like database creation and classification are now encapsulated in classes (e.g.,
DatabaseBuilder
,KmerClassificationWorkflow
) for better organization and reusability. - Improved Argument Parsing: Command-line scripts utilize
argparse
(e.g.,strainr-db.py
) or Pydantic (strainr/classify.py
) for robust and user-friendly argument handling and validation. - Type Hinting: Comprehensive type hints have been added throughout the codebase, enhancing code clarity and enabling static analysis.
- Docstrings: Google-style docstrings have been implemented for modules, classes, and functions to improve documentation and maintainability.
The k-mer database, a crucial component for StrainR, is built using the src/strainr/build_db.py
script. This script leverages the DatabaseBuilder
class.
Key Features:
- Genome Acquisition: Downloads bacterial genomes from NCBI (RefSeq or GenBank) based on taxonomic ID, a list of assembly accessions, or genus name. It can also use a custom-provided folder of genome FASTA files.
- Filtering: Optionally filters downloaded genomes, for instance, to include only those with unique strain-level taxonomic IDs.
- K-mer Extraction: Processes FASTA files to extract k-mers of a specified length.
- Database Construction: Creates a presence/absence matrix where rows are unique k-mers and columns represent different strains/genomes.
- Output: Saves the final k-mer matrix as a Parquet file (e.g.,
mydatabase.db.parquet
).
Example Usage:
# Using a taxonomic ID
python src/strainr/build_db.py --taxid 562 --out Escherichia_coli_db --procs 8
# Using a custom folder of genomes
python src/strainr/build_db.py --custom ./my_custom_genomes/ --out custom_genome_db --kmerlen 25
# Using a list of assembly accessions from a file
python src/strainr/build_db.py --assembly_accessions accessions.txt --out my_accession_db
Sequence classification is primarily handled by the strainr/classify.py
script, which uses the KmerClassificationWorkflow
class. The strainr-pangenome.py
script offers a similar workflow.
General Workflow:
- Load Database: The specified k-mer database (created by
src/strainr/build_db.py
) is loaded usingStrainKmerDatabase
. - Parse Input Reads: Input sequence files (FASTA/FASTQ, single or paired-end, possibly gzipped) are parsed.
- K-mer Counting: For each read, k-mers are extracted, and their presence in the database is checked to generate a hit profile (CountVector) against all strains. This step is parallelized.
- Analysis (via
ClassificationAnalyzer
):- Raw hit profiles are categorized into clear, ambiguous, or no-hit assignments.
- Prior probabilities for strains are calculated based on clear hits.
- Ambiguous hits are resolved using a selected disambiguation strategy (e.g., 'max', 'random').
- Final assignments (strain index or "NA") are compiled for each read.
- Abundance Calculation & Reporting:
- Read assignments are translated to strain names.
- Raw hit counts per strain are calculated.
- Relative abundances are computed (both sample-level and intra-sample relative to strains passing a threshold).
- Results are saved to a TSV file (e.g.,
samplename_abundance.tsv
) and displayed on the console.
Example Usage (strainr/classify.py
):
# Classifying single-end reads
python src/strainr/classify.py \
--db my_database.db.parquet \
--input_forward reads.fastq.gz \
--out classification_output_dir \
--procs 8 \
--mode max \
--thresh 0.005
# Classifying paired-end reads (example, assuming two lists of files)
# python src/strainr/classify.py \
# --db my_database.db.parquet \
# --input_forward r1_file1.fastq.gz r1_file2.fastq.gz \
# --input_reverse r2_file1.fastq.gz r2_file2.fastq.gz \
# --out paired_classification_out \
# --procs 8
(Note: For multiple paired-end files with strainr/classify.py
, ensure the order of forward and reverse files matches in the respective lists.)
- Python 3.8+
- NumPy
- Pandas
- BioPython
- Pydantic (for
src/strainr/classify.py
argument parsing) ncbi-genome-download
(forsrc/strainr/build_db.py
NCBI download functionality)tqdm
(for progress bars, e.g. insrc/strainr/build_db.py
)pyarrow
orfastparquet
(for reading/writing Parquet files with pandas).argparse
(standard library, but good to note its use for scripts likesrc/strainr/build_db.py
).