██    ██  █████  ██████  ████████ ██████   █████   ██████ ██   ██ ███████ ██████
██    ██ ██   ██ ██   ██    ██    ██   ██ ██   ██ ██      ██  ██  ██      ██   ██
██    ██ ███████ ██████     ██    ██████  ███████ ██      █████   █████   ██████
 ██  ██  ██   ██ ██   ██    ██    ██   ██ ██   ██ ██      ██  ██  ██      ██   ██
  ████   ██   ██ ██   ██    ██    ██   ██ ██   ██  ██████ ██   ██ ███████ ██   ██


vartracker

A bioinformatics pipeline to summarise variants called against a reference in a longitudinal study design. It was written to investigate longitudinal sequencing data from long-term passaging of SARS-CoV-2 but could, in theory, be extended to other organisms.

Author: Dr Charles Foster

Features

  • Track mutation persistence across longitudinal samples
  • Comprehensive variant analysis including amino acid consequences
  • Built-in SARS-CoV-2 reference data and annotations
  • Integration with functional mutation databases (pokay)
  • Automated plotting and statistical analysis
  • Support for both SNPs and indels
  • Quality control metrics for variants

Installation

From PyPI (Recommended)

pip install vartracker

External Dependencies

vartracker requires the following external tools to be installed and available in your PATH:

  • bcftools (>=1.10): For VCF file processing and variant consequence calling
  • tabix: For indexing compressed files
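
As a quick pre-flight check, the sketch below looks for both tools in PATH and compares the bcftools version against the 1.10 minimum stated above. It assumes the first line of bcftools --version looks like "bcftools 1.17". The helper names are illustrative and are not part of vartracker.

```python
import re
import shutil
import subprocess

MIN_BCFTOOLS = (1, 10)  # minimum version stated in this README


def parse_bcftools_version(version_output):
    """Extract (major, minor) from the first line of `bcftools --version`.

    Assumes a first line such as 'bcftools 1.17' or 'bcftools 1.10.2'.
    Returns None if the line does not match that shape.
    """
    lines = version_output.splitlines()
    if not lines:
        return None
    match = re.match(r"bcftools\s+(\d+)\.(\d+)", lines[0])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def check_dependencies():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for tool in ("bcftools", "tabix"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in PATH")
    if shutil.which("bcftools") is not None:
        result = subprocess.run(
            ["bcftools", "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
        version = parse_bcftools_version(result.stdout)
        if version is None:
            problems.append("could not parse bcftools version")
        elif version < MIN_BCFTOOLS:
            problems.append(
                f"bcftools {version[0]}.{version[1]} is older than 1.10"
            )
    return problems
```

Calling check_dependencies() before a run gives an early, readable list of what is missing instead of a failure partway through the pipeline.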

Installing bcftools and tabix

On macOS:

# Using Homebrew
brew install bcftools htslib

# Using MacPorts
sudo port install bcftools htslib

On Linux (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install bcftools tabix

On Linux (CentOS/RHEL/Fedora):

# CentOS/RHEL with EPEL
sudo yum install epel-release
sudo yum install bcftools htslib

# Fedora
sudo dnf install bcftools htslib

Using conda:

conda install -c bioconda bcftools htslib

Development Installation

For development or to get the latest version:

git clone https://github.com/charlesfoster/vartracker.git
cd vartracker
pip install -e ".[dev]"
pre-commit install

Quick Start

After installation, vartracker will be available as a command-line tool:

vartracker --help

Basic Usage

vartracker input_data.csv -o output_directory

Input Format

The input CSV file should have four columns:

  1. vcf: Full paths to VCF files (bgzipped or uncompressed) containing variants for each sample called against the NC_045512.2 reference genome. The VCF must have depth ("DP") and variant allele frequency tags in the INFO field. The variant allele frequency tag name can be specified via the vartracker CLI (see below).
  2. coverage: Full paths to coverage files. These files are expected to be in TSV format with three columns (no header): reference\t1-based position\tdepth. For example:
NC_045512.2	1	0
NC_045512.2	2	0
NC_045512.2	3	0
NC_045512.2	4	0
NC_045512.2	5	0
NC_045512.2	6	0
NC_045512.2	7	0
NC_045512.2	8	0
NC_045512.2	9	0
NC_045512.2	10	0
...
NC_045512.2	250	932
NC_045512.2	251	929
NC_045512.2	252	931
NC_045512.2	253	900
NC_045512.2	254	900
NC_045512.2	255	897
NC_045512.2	256	891
NC_045512.2	257	895
NC_045512.2	258	903
NC_045512.2	259	898
NC_045512.2	260	895

Note: this is the format produced by bedtools genomecov (bedtools genomecov -ibam input.bam -d > coverage.txt) or by samtools depth (samtools depth -a input.bam > coverage.txt).

  3. sample_name: Name of the sample in the VCF file
  4. sample_number: Sample number (e.g., 0, 1, 2, ..., 15 for passages)
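
The coverage file format described above (three tab-separated columns, no header, 1-based positions) can be sanity-checked before a run. The following is a minimal sketch, not vartracker's own parser. The function name and the min_depth threshold are illustrative assumptions.

```python
import csv


def summarise_coverage(path, min_depth=10):
    """Summarise a three-column, headerless coverage TSV.

    min_depth is an illustrative threshold, not a vartracker setting.
    """
    n_positions = 0
    total_depth = 0
    n_covered = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # tolerate blank lines
            if len(row) != 3:
                raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
            _reference, position, depth = row
            if int(position) < 1:
                raise ValueError(f"positions must be 1-based, got {position}")
            depth = int(depth)
            n_positions += 1
            total_depth += depth
            if depth >= min_depth:
                n_covered += 1
    if n_positions == 0:
        raise ValueError("coverage file is empty")
    return {
        "positions": n_positions,
        "mean_depth": total_depth / n_positions,
        "fraction_at_min_depth": n_covered / n_positions,
    }
```

A file that raises ValueError here (for example, one with a header row or with two columns) does not match the layout shown in the example above.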

Example input CSV:

vcf,coverage,sample_name,sample_number
/path/to/sample0.vcf.gz,/path/to/sample0.coverage,passage_0,0
/path/to/sample1.vcf.gz,/path/to/sample1.coverage,passage_1,1
/path/to/sample2.vcf.gz,/path/to/sample2.coverage,passage_2,2
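
Before launching a run, it can help to confirm that the input CSV has the four expected columns and that the listed files exist. The sketch below is one way to do that with the standard library. It is not part of vartracker, and it assumes that sample_number values are non-negative integers, as in the example above.

```python
import csv
import os

EXPECTED_COLUMNS = ["vcf", "coverage", "sample_name", "sample_number"]


def validate_input_csv(path, check_paths=True):
    """Return a list of problems found in a vartracker-style input CSV."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in EXPECTED_COLUMNS if col not in header]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            number = (row["sample_number"] or "").strip()
            if not number.isdigit():
                # Assumption: sample numbers are non-negative integers.
                problems.append(
                    f"line {line_no}: sample_number {number!r} is not "
                    "a non-negative integer"
                )
            if not (row["sample_name"] or "").strip():
                problems.append(f"line {line_no}: sample_name is empty")
            if check_paths:
                for column in ("vcf", "coverage"):
                    file_path = (row[column] or "").strip()
                    if not os.path.isfile(file_path):
                        problems.append(
                            f"line {line_no}: {column} file not found: "
                            f"{file_path}"
                        )
    return problems
```

An empty list means the file passed these basic checks. It does not confirm that the VCF files carry the required INFO tags.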

Using with pokay Database

To search mutations against functional databases:

  1. Set up pokay database (optional):
parse_pokay pokay_database.csv

This command automatically downloads the required literature files from the pokay repository into pokay_literature/NC_045512 (override with --download-dir) and writes the processed CSV for downstream analysis. You can also let vartracker download and parse the database on demand with the --download-pokay flag.

  2. Run vartracker with pokay search:
vartracker input_data.csv --search-pokay --pokay-csv pokay_database.csv -o results/

Alternatively, omit --pokay-csv and pass --download-pokay to fetch the database automatically during execution.

Command Line Options

usage: vartracker [options] <input_csv>

vartracker: track the persistence (or not) of mutations during long-term passaging

positional arguments:
  input_csv             Input CSV file. See below.

options:
  -h, --help            show this help message and exit
  -a, --annotation ANNOTATION
                        Annotations to use in GFF3 format (default: uses packaged SARS-CoV-2 annotations)
  -m, --min-snv-freq MIN_SNV_FREQ
                        Minimum allele frequency of SNV variants to keep (default: 0.03)
  -M, --min-indel-freq MIN_INDEL_FREQ
                        Minimum allele frequency of indel variants to keep (default: 0.1)
  -n, --name NAME       Optional: add a column to results with the name specified here
  -o, --outdir OUTDIR   Output directory (default: current directory)
  --pokay-csv POKAY_CSV
                        Path to a pre-parsed pokay database CSV file
  --search-pokay        Run literature lookups against the pokay database
  --download-pokay      Automatically download and parse the pokay database
  --test                Run vartracker against the bundled demonstration dataset
  -f, --filename FILENAME
                        Output file name (default: results.csv)
  -r, --reference REFERENCE
                        Reference genome (default: uses packaged SARS-CoV-2 reference)
  --passage-cap PASSAGE_CAP
                        Cap the number of passages at this number
  --debug               Print commands being run for debugging
  --keep-temp           Keep temporary files for debugging
  --allele-frequency-tag ALLELE_FREQUENCY_TAG
                        INFO tag name for allele frequency (default: AF)
  -V, --version         show program's version number and exit
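
When vartracker is driven from a script, the options listed above can be assembled into an argument list and passed to subprocess. The sketch below uses only flags that appear in the help text. The function names and the choice of which options to expose are illustrative assumptions.

```python
import subprocess


def build_vartracker_command(
    input_csv,
    outdir,
    min_snv_freq=None,
    min_indel_freq=None,
    allele_frequency_tag=None,
    pokay_csv=None,
    download_pokay=False,
):
    """Assemble an argument list using only flags from the help text."""
    command = ["vartracker", "--outdir", str(outdir)]
    if min_snv_freq is not None:
        command += ["--min-snv-freq", str(min_snv_freq)]
    if min_indel_freq is not None:
        command += ["--min-indel-freq", str(min_indel_freq)]
    if allele_frequency_tag is not None:
        command += ["--allele-frequency-tag", allele_frequency_tag]
    if pokay_csv is not None:
        # Use a pre-parsed pokay CSV.
        command += ["--search-pokay", "--pokay-csv", str(pokay_csv)]
    elif download_pokay:
        # Let vartracker fetch and parse the database itself.
        command += ["--search-pokay", "--download-pokay"]
    # The usage line places the positional input CSV after the options.
    command.append(str(input_csv))
    return command


def run_vartracker(command):
    """Run the assembled command; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.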

Installation Test

After installation you can verify the full pipeline using the bundled demonstration dataset:

vartracker --test --outdir vartracker_test_results

This command runs vartracker end-to-end with packaged inputs, producing outputs in the specified directory and confirming that external dependencies (such as bcftools) are available.

Output

vartracker produces several output files:

  • results.csv: Comprehensive variant analysis with all metrics
  • new_mutations.csv: Mutations not present in the first sample
  • persistent_new_mutations.csv: New mutations that persist to the final sample
  • cumulative_mutations.pdf: Plot showing mutation accumulation over time
  • mutations_per_gene.pdf: Gene-wise mutation statistics
  • variant_allele_frequency_heatmap.pdf: Heatmap of variant allele frequencies across passages
  • pokay_database_hits.*.csv: Functional annotation results (if pokay used)
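
After a run, a short script can confirm that the files listed above were written. The sketch below assumes that all of them land directly in the output directory, which this README does not state explicitly. The function names are illustrative.

```python
from pathlib import Path

# File names as listed in this README. The main results file can be
# renamed with -f/--filename, so it is passed in separately.
FIXED_OUTPUTS = [
    "new_mutations.csv",
    "persistent_new_mutations.csv",
    "cumulative_mutations.pdf",
    "mutations_per_gene.pdf",
    "variant_allele_frequency_heatmap.pdf",
]


def missing_outputs(outdir, results_name="results.csv"):
    """Return the expected output file names not found in outdir."""
    outdir = Path(outdir)
    expected = [results_name] + FIXED_OUTPUTS
    return [name for name in expected if not (outdir / name).is_file()]


def pokay_hit_files(outdir):
    """Return the names of any pokay hit files (only present if pokay was used)."""
    return sorted(p.name for p in Path(outdir).glob("pokay_database_hits.*.csv"))
```

For example, missing_outputs("vartracker_test_results") applied to the directory from the installation test should return an empty list if every file was produced where this sketch expects it.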

What does vartracker do?

The pipeline performs the following steps:

  1. VCF Standardization: Normalizes and standardizes input VCF files

  2. Annotation: Adds amino acid consequences using bcftools csq

  3. Variant Merging: Combines all longitudinal samples

  4. Comprehensive Analysis: For each variant, determines:

    • Gene location and amino acid consequences
    • Variant type (SNP/indel) and change type (synonymous/missense/etc.)
    • Persistence across samples (new/original, persistent/transient)
    • Quality control metrics
    • Amino acid property changes
    • Allele frequency dynamics
  5. Visualization: Generates plots for mutation accumulation and gene-wise statistics

  6. Functional Annotation: (optional) Searches against literature databases for known functional impacts
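
To make the persistence categories concrete, here is a simplified sketch of how a variant could be labelled from its allele frequencies across ordered samples. It is not vartracker's implementation. It assumes that a variant counts as present when its frequency is at or above the relevant minimum (0.03 for SNVs, 0.1 for indels, the documented defaults), that "new" means absent from the first sample, and that "persistent" means present in the final sample. The real rules may differ.

```python
DEFAULT_MIN_SNV_FREQ = 0.03    # documented default for --min-snv-freq
DEFAULT_MIN_INDEL_FREQ = 0.1   # documented default for --min-indel-freq


def presence_from_frequencies(frequencies, is_indel=False):
    """Convert per-sample allele frequencies into presence flags.

    frequencies is ordered by sample_number; None means not called.
    """
    threshold = DEFAULT_MIN_INDEL_FREQ if is_indel else DEFAULT_MIN_SNV_FREQ
    return [freq is not None and freq >= threshold for freq in frequencies]


def classify_variant(presence):
    """Label a variant as (origin, persistence) from presence flags.

    Assumed definitions: 'original' if present in the first sample,
    otherwise 'new'; 'persistent' if present in the final sample,
    otherwise 'transient'.
    """
    if not any(presence):
        raise ValueError("variant is not present in any sample")
    origin = "original" if presence[0] else "new"
    persistence = "persistent" if presence[-1] else "transient"
    return origin, persistence
```

Under these assumptions, a variant with frequencies None, 0.02, 0.5 across three passages is labelled new and persistent, because only the last call passes the SNV threshold.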

Citation

When using vartracker, please cite:

  • The vartracker tool: https://github.com/charlesfoster/vartracker
  • bcftools: Li H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987-2993.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any issues or have questions:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue with detailed information about your problem
