StrainR: Strain-Resolved Metagenomic Classification Toolkit

Introduction

StrainR is a Python-based toolkit for k-mer based metagenomic classification. It aims to assign strain labels to DNA sequences by analyzing k-mer distributions against a custom-built database. StrainR is designed for efficiency, enabling database creation and sample classification to be performed with modest computational resources.

The toolkit has recently undergone significant refactoring to improve code quality, maintainability, and user experience.

Core Components

The StrainR toolkit is built around several key Python classes that manage different parts of the workflow:

StrainKmerDatabase (src/strainr/database.py): Manages the k-mer presence/absence data. This class loads and provides access to the database generated by src/strainr/build_db.py.
DatabaseBuilder (in src/strainr/build_db.py): Orchestrates the creation of the k-mer database. It handles genome acquisition (from NCBI or custom files) and the subsequent k-mer extraction and matrix generation.
KmerClassificationWorkflow (in src/strainr/classify.py): Manages the primary workflow for classifying sequence reads. It uses a StrainKmerDatabase and the ClassificationAnalyzer.
PangenomeAnalysisWorkflow (in strainr/strainr-pangenome.py): Provides another workflow for k-mer based analysis, potentially suited for pangenome-scale investigations or alternative analysis strategies.
ClassificationAnalyzer (strainr/analyze.py): Handles post-classification analysis, including categorizing k-mer hits (clear, ambiguous, none), calculating strain priors, and resolving ambiguous assignments using various statistical methods.
AbundanceCalculator (strainr/output.py): Intended for calculating strain abundances, normalization, thresholding, and generating output reports (Note: This component's refactoring faced tool-related issues, its full integration might be pending). The workflow scripts currently handle abundance calculations internally.

Code Quality and Design

Recent development has focused on:

Class-Based Workflows: Key processes like database creation and classification are now encapsulated in classes (e.g., DatabaseBuilder, KmerClassificationWorkflow) for better organization and reusability.
Improved Argument Parsing: Command-line scripts utilize argparse (e.g., strainr-db.py) or Pydantic (strainr/classify.py) for robust and user-friendly argument handling and validation.
Type Hinting: Comprehensive type hints have been added throughout the codebase, enhancing code clarity and enabling static analysis.
Docstrings: Google-style docstrings have been implemented for modules, classes, and functions to improve documentation and maintainability.

1. Building a Database

The k-mer database, a crucial component for StrainR, is built using the src/strainr/build_db.py script. This script leverages the DatabaseBuilder class.

Key Features:

Genome Acquisition: Downloads bacterial genomes from NCBI (RefSeq or GenBank) based on taxonomic ID, a list of assembly accessions, or genus name. It can also use a custom-provided folder of genome FASTA files.
Filtering: Optionally filters downloaded genomes, for instance, to include only those with unique strain-level taxonomic IDs.
K-mer Extraction: Processes FASTA files to extract k-mers of a specified length.
Database Construction: Creates a presence/absence matrix where rows are unique k-mers and columns represent different strains/genomes.
Output: Saves the final k-mer matrix as a Parquet file (e.g., mydatabase.db.parquet).

Example Usage:

# Using a taxonomic ID
python src/strainr/build_db.py --taxid 562 --out Escherichia_coli_db --procs 8

# Using a custom folder of genomes
python src/strainr/build_db.py --custom ./my_custom_genomes/ --out custom_genome_db --kmerlen 25

# Using a list of assembly accessions from a file
python src/strainr/build_db.py --assembly_accessions accessions.txt --out my_accession_db

2. Classification and Analysis

Sequence classification is primarily handled by the strainr/classify.py script, which uses the KmerClassificationWorkflow class. The strainr-pangenome.py script offers a similar workflow.

General Workflow:

Load Database: The specified k-mer database (created by src/strainr/build_db.py) is loaded using StrainKmerDatabase.
Parse Input Reads: Input sequence files (FASTA/FASTQ, single or paired-end, possibly gzipped) are parsed.
K-mer Counting: For each read, k-mers are extracted, and their presence in the database is checked to generate a hit profile (CountVector) against all strains. This step is parallelized.
Analysis (via ClassificationAnalyzer):
- Raw hit profiles are categorized into clear, ambiguous, or no-hit assignments.
- Prior probabilities for strains are calculated based on clear hits.
- Ambiguous hits are resolved using a selected disambiguation strategy (e.g., 'max', 'random').
- Final assignments (strain index or "NA") are compiled for each read.
Abundance Calculation & Reporting:
- Read assignments are translated to strain names.
- Raw hit counts per strain are calculated.
- Relative abundances are computed (both sample-level and intra-sample relative to strains passing a threshold).
- Results are saved to a TSV file (e.g., samplename_abundance.tsv) and displayed on the console.

Example Usage (strainr/classify.py):

# Classifying single-end reads
python src/strainr/classify.py \
    --db my_database.db.parquet \
    --input_forward reads.fastq.gz \
    --out classification_output_dir \
    --procs 8 \
    --mode max \
    --thresh 0.005

# Classifying paired-end reads (example, assuming two lists of files)
# python src/strainr/classify.py \
#    --db my_database.db.parquet \
#    --input_forward r1_file1.fastq.gz r1_file2.fastq.gz \
#    --input_reverse r2_file1.fastq.gz r2_file2.fastq.gz \
#    --out paired_classification_out \
#    --procs 8

(Note: For multiple paired-end files with strainr/classify.py, ensure the order of forward and reverse files matches in the respective lists.)

Dependencies

Python 3.8+
NumPy
Pandas
BioPython
Pydantic (for src/strainr/classify.py argument parsing)
ncbi-genome-download (for src/strainr/build_db.py NCBI download functionality)
tqdm (for progress bars, e.g. in src/strainr/build_db.py)
pyarrow or fastparquet (for reading/writing Parquet files with pandas).
argparse (standard library, but good to note its use for scripts like src/strainr/build_db.py).

Name		Name	Last commit message	Last commit date
Latest commit History 303 Commits
.github		.github
.vscode		.vscode
dist		dist
docs		docs
kmer_counter_rs		kmer_counter_rs
notebooks		notebooks
profiling_genomes		profiling_genomes
scripts		scripts
strainr		strainr
strainr_kmer_database_files/ncbi_genomes_s290		strainr_kmer_database_files/ncbi_genomes_s290
test_custom_genomes		test_custom_genomes
tests		tests
.DS_Store		.DS_Store
.coverage		.coverage
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build_db.prof		build_db.prof
environment.yml		environment.yml
inspect_parquet.py		inspect_parquet.py
mp_test_db.db.parquet		mp_test_db.db.parquet
mp_test_db_p1.db.parquet		mp_test_db_p1.db.parquet
profiled_db_k21.db.parquet		profiled_db_k21.db.parquet
pyproject.toml		pyproject.toml
python-package-conda.yml		python-package-conda.yml
run_strainr_pipeline.py		run_strainr_pipeline.py
strainr_plus.code-workspace		strainr_plus.code-workspace
test_db_disk.db.parquet		test_db_disk.db.parquet
test_db_inmem.db.parquet		test_db_inmem.db.parquet
test_db_skip_n_false.db.parquet		test_db_skip_n_false.db.parquet
test_db_skip_n_true.db.parquet		test_db_skip_n_true.db.parquet
test_rust_build.py		test_rust_build.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

StrainR: Strain-Resolved Metagenomic Classification Toolkit

Introduction

Core Components

Code Quality and Design

1. Building a Database

2. Classification and Analysis

Dependencies

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

mladen5000/strainr

Folders and files

Latest commit

History

Repository files navigation

StrainR: Strain-Resolved Metagenomic Classification Toolkit

Introduction

Core Components

Code Quality and Design

1. Building a Database

2. Classification and Analysis

Dependencies

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages