Strainify is an accurate strain-level abundance analysis tool for short-read metagenomics.
Strainify uses Git LFS to manage large files (e.g., genomes, test data). Please ensure that it is set up before cloning the repository.
git clone https://github.com/treangenlab/Strainify.git
cd Strainify
Use the following commands to initialize Git LFS and download the example files:
git lfs install
git lfs checkout
# For Linux:
conda env create -f environment.yaml
# For Mac:
CONDA_SUBDIR=osx-64 conda env create -f environment_mac.yaml
# Activate conda environment
conda activate strainify
Strainify uses a config.yaml file to manage input files and parameters.
These fields can be set in config.yaml:
Parameter | Description | Default | Accepted Values |
---|---|---|---|
genome_folder | (Required) Path to the folder containing reference genome files (FASTA format). | — | A valid directory path |
fastq_folder | (Required) Path to the folder containing input FASTQ files (must be unzipped and named using *_r1.fq, *_r2.fq). Multiple samples can be added to this folder (for paired-end reads, make sure the two FASTQ files for each sample have matching names). | — | A valid directory path |
output_dir | (Required) Directory where all output files will be saved. | — | A valid directory path |
read_type | Type of sequencing reads. | paired | paired or single |
window_size | Size of the genomic window for variant grouping. Can also be set to average_LCB_length to use the average length of local colinear blocks from Parsnp. | 500 | Any positive integer or average_LCB_length |
window_overlap | Proportion of overlap between consecutive windows. | 0 | Float between 0 and 1 (e.g., 0.5) |
weight_by_entropy | Whether to weight variants by their Shannon entropy when estimating strain abundances. | false | true or false |
use_precomputed_variants | Use existing filtered variant matrix instead of recomputing from scratch. | false | true or false |
precomputed_output_dir | Path to directory where new output will be saved. | output_dir/precomputed_results | A valid directory path |
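As a rough illustration of the accepted values listed in the table, the following Python sketch checks a config mapping before a run. It is not part of Strainify: the function name `validate_config` is hypothetical, the bounds on `window_overlap` are assumed to be inclusive, and the sketch only looks at the field names from the table (it does not parse options passed through `modify_windows`).

```python
def validate_config(cfg):
    """Return a list of problems found in a Strainify-style config mapping.

    Hypothetical helper: it mirrors the parameter table, not Strainify's code.
    """
    problems = []

    # The three required paths must be present and non-empty.
    for key in ("genome_folder", "fastq_folder", "output_dir"):
        if not cfg.get(key):
            problems.append(f"{key} is required")

    if cfg.get("read_type", "paired") not in ("paired", "single"):
        problems.append("read_type must be 'paired' or 'single'")

    # window_size: a positive integer or the literal 'average_LCB_length'.
    window_size = cfg.get("window_size", 500)
    is_int = isinstance(window_size, int) and not isinstance(window_size, bool)
    if not ((is_int and window_size > 0) or window_size == "average_LCB_length"):
        problems.append("window_size must be a positive integer or 'average_LCB_length'")

    # window_overlap: a number between 0 and 1 (bounds assumed inclusive).
    overlap = cfg.get("window_overlap", 0)
    is_num = isinstance(overlap, (int, float)) and not isinstance(overlap, bool)
    if not (is_num and 0 <= overlap <= 1):
        problems.append("window_overlap must be a number between 0 and 1")

    # Boolean switches.
    for key in ("weight_by_entropy", "use_precomputed_variants"):
        if not isinstance(cfg.get(key, False), bool):
            problems.append(f"{key} must be true or false")

    return problems
```

An empty list means the mapping is consistent with the table; any other result lists the fields to fix before starting the pipeline.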
Open config.yaml in a text editor. Modify the fields to match your input and desired options. Example:
genome_folder: path/to/genomes
output_dir: path/to/output
fastq_folder: path/to/fastqs
read_type: paired
modify_windows: --window_size 500 --window_overlap 0
weight_by_entropy: false
use_precomputed_variants: false
precomputed_output_dir: path/to/new_output
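Because paired-end input relies on the `*_r1.fq` / `*_r2.fq` naming convention, it can help to confirm that every sample has both mates before running the pipeline. The sketch below is one way to do that in Python; `group_fastq_pairs` is a hypothetical helper, not a Strainify function, and it only inspects file names.

```python
def group_fastq_pairs(filenames):
    """Split FASTQ file names into complete pairs and samples missing a mate.

    Hypothetical helper: assumes the *_r1.fq / *_r2.fq naming convention.
    Returns (paired_samples, unpaired_samples), both sorted.
    """
    r1_suffix, r2_suffix = "_r1.fq", "_r2.fq"
    # Strip the suffix to recover the sample name shared by both mates.
    r1 = {name[: -len(r1_suffix)] for name in filenames if name.endswith(r1_suffix)}
    r2 = {name[: -len(r2_suffix)] for name in filenames if name.endswith(r2_suffix)}
    paired = sorted(r1 & r2)      # samples with both files present
    unpaired = sorted(r1 ^ r2)    # samples with only one of the two files
    return paired, unpaired
```

In practice the file names could come from `os.listdir("path/to/fastqs")`; a non-empty second list points to samples whose mate file is missing or misnamed.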
Use Snakemake to run the pipeline:
snakemake --cores 12 --configfile config.yaml
Tip: Replace 12 with the number of CPU cores you want to allocate. You can set --cores to any positive integer, depending on your system's available resources.
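If the pipeline is launched from a script, the core count can be derived from the machine instead of being hard-coded. The following Python sketch builds the same command shown above; `snakemake_command` is a hypothetical wrapper, and defaulting to all detected cores is an assumption rather than a Strainify recommendation.

```python
import os


def snakemake_command(cores=None, configfile="config.yaml"):
    """Build the Snakemake argument list used to run the pipeline.

    Hypothetical wrapper: falls back to the number of detected CPU cores
    (or 1 if that cannot be determined) when `cores` is not given.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if not isinstance(cores, int) or isinstance(cores, bool) or cores < 1:
        raise ValueError("cores must be a positive integer")
    return ["snakemake", "--cores", str(cores), "--configfile", configfile]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root with the conda environment active.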
Strainify includes example input data to help you get started quickly.
To run the pipeline on the example data:
- Unzip the compressed FASTQ files. The following command line can be used for Linux systems.
gunzip example/fastq/paired/*.gz
gunzip example/fastq/single/*.gz
- Make sure your config.yaml is set like this:
genome_folder: example/genomes
fastq_folder: example/fastq/paired
output_dir: example/results
read_type: paired
modify_windows: --window_size 500 --window_overlap 0
weight_by_entropy: false
use_precomputed_variants: false
- Run Strainify with Snakemake:
snakemake --cores 12 --configfile config.yaml
The output will be written to the example/results directory.
If you are running Strainify again on the same set of genomes, you can use the precomputed variant matrix by doing the following:
- In the config.yaml file, set use_precomputed_variants to true and precomputed_output_dir to the directory where the new output should be stored. If you do not provide a path for precomputed_output_dir, the output directory will automatically be set to output_dir/precomputed_results. Make sure output_dir points to the directory that contains the precomputed variant matrix. For example, to reuse the variant matrix from the previous example, the config.yaml file would look like this:
genome_folder: example/genomes
fastq_folder: example/fastq/single
output_dir: example/results
read_type: single
modify_windows: --window_size 500 --window_overlap 0
weight_by_entropy: false
use_precomputed_variants: true
precomputed_output_dir: example/new_output
- Run Strainify with Snakemake:
snakemake --cores 12 --configfile config.yaml
The output will be written to the example/new_output directory.
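The rule for where results end up in a precomputed run can be summarized in a few lines of Python. This is a sketch of the behavior described above, not Strainify's own code; `resolve_output_dir` is a hypothetical name, and paths are joined with forward slashes for readability.

```python
import posixpath


def resolve_output_dir(cfg):
    """Return the directory that receives new output for a given config.

    Hypothetical helper reflecting the documented behavior:
    - normal run: output_dir
    - precomputed run: precomputed_output_dir if set,
      otherwise output_dir/precomputed_results
    """
    if not cfg.get("use_precomputed_variants", False):
        return cfg["output_dir"]
    explicit = cfg.get("precomputed_output_dir")
    if explicit:
        return explicit
    return posixpath.join(cfg["output_dir"], "precomputed_results")
```

In the precomputed case, `output_dir` still has to point at the earlier run, since that is where the existing variant matrix is read from.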
The abundance estimates are stored in a CSV file named abundance_estimates_combined.csv.
- Each row corresponds to a strain.
- Each column (after the first) corresponds to a sample.
- Values represent the estimated relative abundances.
Example CSV output:
strain name,10x_ratio_2
E24377A,23.7987
H10407,24.8376
Sakai,24.9406
UTI89,26.423
Note: these numbers are percentages.
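For downstream analysis, the CSV can be read with the Python standard library. The sketch below assumes the layout shown in the example (first column holds strain names, each further column is one sample); `load_abundances` and `EXAMPLE_CSV` are illustrative names, and the embedded text simply repeats the example output above.

```python
import csv
import io

# The example output shown above, embedded so the sketch is self-contained.
EXAMPLE_CSV = """strain name,10x_ratio_2
E24377A,23.7987
H10407,24.8376
Sakai,24.9406
UTI89,26.423
"""


def load_abundances(text):
    """Parse abundance CSV text into {sample: {strain: percentage}}.

    Assumes the first column is the strain name and every remaining
    column is a sample, as in the example output.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    samples = header[1:]
    table = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        strain = row[0]
        for sample, value in zip(samples, row[1:]):
            table[sample][strain] = float(value)
    return table


abundances = load_abundances(EXAMPLE_CSV)
# Percentages within a sample should add up to roughly 100.
total = sum(abundances["10x_ratio_2"].values())
```

For a real run, the text would come from reading abundance_estimates_combined.csv in the output directory; dividing each value by 100 converts the percentages to fractions.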
Other important output files:
- sites.txt contains a list of variant positions that passed the filter. Read counts supporting the allele and reference base at these positions are then obtained and used as input to the MLE model.
- filtered_variant_matrix.csv contains the filtered variant matrix. Confounding variants (potential recombination sites) have been removed. For metagenomic samples that share the same set of strains (i.e., query genomes), this file can be reused to avoid rerunning the genome alignment and variant filtering steps. For more details, see the instructions above for running Strainify with a precomputed variant matrix.
- significantly_enriched_windows.tsv contains the start and end coordinates of windows that are flagged as potential recombination sites. The z-score and p-value of each window are also reported. Variants in these windows are removed from downstream analysis (i.e., excluded from the filtered variant matrix).
Strainify: Strain-Level Microbiome Profiling for Low-Coverage Short-Read Metagenomic Datasets https://www.biorxiv.org/content/10.1101/2025.10.10.681738v1
For questions or suggestions, open an issue or contact Rossie Luo at rl152@rice.edu.