██ ██ █████ ██████ ████████ ██████ █████ ██████ ██ ██ ███████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ██ ███████ ██████ ██ ██████ ███████ ██ █████ █████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
████ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██████ ██ ██ ███████ ██ ██
A bioinformatics pipeline to summarise variants called against a reference in a longitudinal study design. Written to investigate longitudinal sequencing data from long-term passaging of SARS-CoV-2. In theory, it could be expanded for other organisms too.
Author: Dr Charles Foster
- Track mutation persistence across longitudinal samples
- Comprehensive variant analysis including amino acid consequences
- Built-in SARS-CoV-2 reference data and annotations
- Integration with functional mutation databases (pokay)
- Automated plotting and statistical analysis
- Support for both SNPs and indels
- Quality control metrics for variants
pip install vartracker
vartracker requires the following external tools to be installed and available in your PATH:
- bcftools (>=1.10): For VCF file processing and variant calling consequences
- tabix: For indexing compressed files
On macOS:
# Using Homebrew
brew install bcftools htslib
# Using MacPorts
sudo port install bcftools htslib
On Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install bcftools tabix
On Linux (CentOS/RHEL/Fedora):
# CentOS/RHEL with EPEL
sudo yum install epel-release
sudo yum install bcftools htslib
# Fedora
sudo dnf install bcftools htslib
Using conda:
conda install -c bioconda bcftools htslib
For development or to get the latest version:
git clone https://github.com/charlesfoster/vartracker.git
cd vartracker
pip install -e .[dev]
pre-commit install
After installation, vartracker
will be available as a command-line tool:
vartracker --help
vartracker input_data.csv -o output_directory
The input CSV file should have four columns:
- vcf: Full paths to VCF files (bgzipped or uncompressed) containing variants for each sample called against the NC_045512.2 reference genome. The VCF must have depth ("DP") and variant allele frequency tags in the INFO field. The variant allele frequency tag name can be specified via the
vartracker
CLI (see below). - coverage: Full paths to coverage files. These files are expected to be in TSV format
with three columns (no header):
reference\t1-based position\tdepth
. For example:
NC_045512.2 1 0
NC_045512.2 2 0
NC_045512.2 3 0
NC_045512.2 4 0
NC_045512.2 5 0
NC_045512.2 6 0
NC_045512.2 7 0
NC_045512.2 8 0
NC_045512.2 9 0
NC_045512.2 10 0
...
NC_045512.2 250 932
NC_045512.2 251 929
NC_045512.2 252 931
NC_045512.2 253 900
NC_045512.2 254 900
NC_045512.2 255 897
NC_045512.2 256 891
NC_045512.2 257 895
NC_045512.2 258 903
NC_045512.2 259 898
NC_045512.2 260 895
Note: this is the format produced by bedtools genomecov
via the command bedtools genomecov -ibam input.bam -d > coverage.txt
or samtools depth
via the command samtools depth -a input.bam
.
- sample_name: Name of the sample in the VCF file
- sample_number: Sample number (e.g., 0, 1, 2, ..., 15 for passages)
Example input CSV:
vcf,coverage,sample_name,sample_number
/path/to/sample0.vcf.gz,/path/to/sample0.coverage,passage_0,0
/path/to/sample1.vcf.gz,/path/to/sample1.coverage,passage_1,1
/path/to/sample2.vcf.gz,/path/to/sample2.coverage,passage_2,2
To search mutations against functional databases:
- Set up pokay database (optional):
parse_pokay pokay_database.csv
This command automatically downloads the required literature files from the
pokay repository into pokay_literature/NC_045512
(override with
--download-dir
) and writes the processed CSV for downstream analysis. You
can also let vartracker
download and parse the database on demand with the
--download-pokay
flag.
- Run vartracker with pokay search:
vartracker input_data.csv --search-pokay --pokay-csv pokay_database.csv -o results/
Alternatively, omit --pokay-csv
and pass --download-pokay
to fetch the
database automatically during execution.
usage: vartracker [options] <input_csv>
vartracker: track the persistence (or not) of mutations during long-term passaging
positional arguments:
input_csv Input CSV file. See below.
options:
-h, --help show this help message and exit
-a, --annotation ANNOTATION
Annotations to use in GFF3 format (default: uses packaged SARS-CoV-2 annotations)
-m, --min-snv-freq MIN_SNV_FREQ
Minimum allele frequency of SNV variants to keep (default: 0.03)
-M, --min-indel-freq MIN_INDEL_FREQ
Minimum allele frequency of indel variants to keep (default: 0.1)
-n, --name NAME Optional: add a column to results with the name specified here
-o, --outdir OUTDIR Output directory (default: current directory)
--pokay-csv POKAY_CSV
Path to a pre-parsed pokay database CSV file
--search-pokay Run literature lookups against the pokay database
--download-pokay Automatically download and parse the pokay database
--test Run vartracker against the bundled demonstration dataset
-f, --filename FILENAME
Output file name (default: results.csv)
-r, --reference REFERENCE
Reference genome (default: uses packaged SARS-CoV-2 reference)
--passage-cap PASSAGE_CAP
Cap the number of passages at this number
--debug Print commands being run for debugging
--keep-temp Keep temporary files for debugging
--allele-frequency-tag ALLELE_FREQUENCY_TAG
INFO tag name for allele frequency (default: AF)
-V, --version show program's version number and exit
After installation you can verify the full pipeline using the bundled demonstration dataset:
vartracker --test --outdir vartracker_test_results
This command runs vartracker end-to-end with packaged inputs, producing outputs in the specified directory and confirming that external dependencies (such as bcftools) are available.
vartracker produces several output files:
- results.csv: Comprehensive variant analysis with all metrics
- new_mutations.csv: Mutations not present in the first sample
- persistent_new_mutations.csv: New mutations that persist to the final sample
- cumulative_mutations.pdf: Plot showing mutation accumulation over time
- mutations_per_gene.pdf: Gene-wise mutation statistics
- variant_allele_frequency_heatmap.pdf: Heatmap of variant allele frequencies across passages
- pokay_database_hits.*.csv: Functional annotation results (if pokay used)
The pipeline performs the following analysis:
-
VCF Standardization: Normalizes and standardizes input VCF files
-
Annotation: Adds amino acid consequences using
bcftools csq
-
Variant Merging: Combines all longitudinal samples
-
Comprehensive Analysis: For each variant, determines:
- Gene location and amino acid consequences
- Variant type (SNP/indel) and change type (synonymous/missense/etc.)
- Persistence across samples (new/original, persistent/transient)
- Quality control metrics
- Amino acid property changes
- Allele frequency dynamics
-
Visualization: Generates plots for mutation accumulation and gene-wise statistics
-
Functional Annotation: (optional) Searches against literature databases for known functional impacts
When using vartracker, please cite:
- The vartracker tool:
https://github.com/charlesfoster/vartracker
- bcftools: Li H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987-2993.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any issues or have questions:
- Check the documentation
- Search existing issues
- Create a new issue with detailed information about your problem