Skip to content

mladen5000/strainr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

StrainR: Strain-Resolved Metagenomic Classification Toolkit

Introduction

StrainR is a Python-based toolkit for k-mer based metagenomic classification. It aims to assign strain labels to DNA sequences by analyzing k-mer distributions against a custom-built database. StrainR is designed for efficiency, enabling database creation and sample classification to be performed with modest computational resources.

The toolkit has recently undergone significant refactoring to improve code quality, maintainability, and user experience.

Core Components

The StrainR toolkit is built around several key Python classes that manage different parts of the workflow:

  • StrainKmerDatabase (src/strainr/database.py): Manages the k-mer presence/absence data. This class loads and provides access to the database generated by src/strainr/build_db.py.
  • DatabaseBuilder (in src/strainr/build_db.py): Orchestrates the creation of the k-mer database. It handles genome acquisition (from NCBI or custom files) and the subsequent k-mer extraction and matrix generation.
  • KmerClassificationWorkflow (in src/strainr/classify.py): Manages the primary workflow for classifying sequence reads. It uses a StrainKmerDatabase and the ClassificationAnalyzer.
  • PangenomeAnalysisWorkflow (in strainr/strainr-pangenome.py): Provides another workflow for k-mer based analysis, potentially suited for pangenome-scale investigations or alternative analysis strategies.
  • ClassificationAnalyzer (strainr/analyze.py): Handles post-classification analysis, including categorizing k-mer hits (clear, ambiguous, none), calculating strain priors, and resolving ambiguous assignments using various statistical methods.
  • AbundanceCalculator (strainr/output.py): Intended for calculating strain abundances, normalization, thresholding, and generating output reports (Note: This component's refactoring faced tool-related issues, its full integration might be pending). The workflow scripts currently handle abundance calculations internally.

Code Quality and Design

Recent development has focused on:

  • Class-Based Workflows: Key processes like database creation and classification are now encapsulated in classes (e.g., DatabaseBuilder, KmerClassificationWorkflow) for better organization and reusability.
  • Improved Argument Parsing: Command-line scripts utilize argparse (e.g., strainr-db.py) or Pydantic (strainr/classify.py) for robust and user-friendly argument handling and validation.
  • Type Hinting: Comprehensive type hints have been added throughout the codebase, enhancing code clarity and enabling static analysis.
  • Docstrings: Google-style docstrings have been implemented for modules, classes, and functions to improve documentation and maintainability.

1. Building a Database

The k-mer database, a crucial component for StrainR, is built using the src/strainr/build_db.py script. This script leverages the DatabaseBuilder class.

Key Features:

  • Genome Acquisition: Downloads bacterial genomes from NCBI (RefSeq or GenBank) based on taxonomic ID, a list of assembly accessions, or genus name. It can also use a custom-provided folder of genome FASTA files.
  • Filtering: Optionally filters downloaded genomes, for instance, to include only those with unique strain-level taxonomic IDs.
  • K-mer Extraction: Processes FASTA files to extract k-mers of a specified length.
  • Database Construction: Creates a presence/absence matrix where rows are unique k-mers and columns represent different strains/genomes.
  • Output: Saves the final k-mer matrix as a Parquet file (e.g., mydatabase.db.parquet).

Example Usage:

# Using a taxonomic ID
python src/strainr/build_db.py --taxid 562 --out Escherichia_coli_db --procs 8

# Using a custom folder of genomes
python src/strainr/build_db.py --custom ./my_custom_genomes/ --out custom_genome_db --kmerlen 25

# Using a list of assembly accessions from a file
python src/strainr/build_db.py --assembly_accessions accessions.txt --out my_accession_db

2. Classification and Analysis

Sequence classification is primarily handled by the strainr/classify.py script, which uses the KmerClassificationWorkflow class. The strainr-pangenome.py script offers a similar workflow.

General Workflow:

  1. Load Database: The specified k-mer database (created by src/strainr/build_db.py) is loaded using StrainKmerDatabase.
  2. Parse Input Reads: Input sequence files (FASTA/FASTQ, single or paired-end, possibly gzipped) are parsed.
  3. K-mer Counting: For each read, k-mers are extracted, and their presence in the database is checked to generate a hit profile (CountVector) against all strains. This step is parallelized.
  4. Analysis (via ClassificationAnalyzer):
    • Raw hit profiles are categorized into clear, ambiguous, or no-hit assignments.
    • Prior probabilities for strains are calculated based on clear hits.
    • Ambiguous hits are resolved using a selected disambiguation strategy (e.g., 'max', 'random').
    • Final assignments (strain index or "NA") are compiled for each read.
  5. Abundance Calculation & Reporting:
    • Read assignments are translated to strain names.
    • Raw hit counts per strain are calculated.
    • Relative abundances are computed (both sample-level and intra-sample relative to strains passing a threshold).
    • Results are saved to a TSV file (e.g., samplename_abundance.tsv) and displayed on the console.

Example Usage (strainr/classify.py):

# Classifying single-end reads
python src/strainr/classify.py \
    --db my_database.db.parquet \
    --input_forward reads.fastq.gz \
    --out classification_output_dir \
    --procs 8 \
    --mode max \
    --thresh 0.005

# Classifying paired-end reads (example, assuming two lists of files)
# python src/strainr/classify.py \
#    --db my_database.db.parquet \
#    --input_forward r1_file1.fastq.gz r1_file2.fastq.gz \
#    --input_reverse r2_file1.fastq.gz r2_file2.fastq.gz \
#    --out paired_classification_out \
#    --procs 8

(Note: For multiple paired-end files with strainr/classify.py, ensure the order of forward and reverse files matches in the respective lists.)

Dependencies

  • Python 3.8+
  • NumPy
  • Pandas
  • BioPython
  • Pydantic (for src/strainr/classify.py argument parsing)
  • ncbi-genome-download (for src/strainr/build_db.py NCBI download functionality)
  • tqdm (for progress bars, e.g. in src/strainr/build_db.py)
  • pyarrow or fastparquet (for reading/writing Parquet files with pandas).
  • argparse (standard library, but good to note its use for scripts like src/strainr/build_db.py).

About

Strain-resolved classification and assembly primer

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •