This pipeline performs comprehensive analysis of VCF (Variant Call Format) files, generating detailed reports and statistics for variant calling quality assessment. It's particularly designed for analyzing Plasmodium falciparum whole genome sequencing data.
The following software and tools are required:
-
BCFtools (version 1.9 or higher)
- Required for VCF file manipulation and basic statistics
- Download and Installation Guide
-
VCFtools (version 0.1.16 or higher)
- Required for advanced VCF analysis
- Download and Installation Guide
-
R (version 4.0.0 or higher) with the following packages:
- ggplot2
- dplyr
- tidyr
- knitr
- viridis
- gridExtra
-
Install BCFtools:
# For Ubuntu/Debian sudo apt-get install bcftools # For CentOS/RHEL sudo yum install bcftools # For macOS brew install bcftools
-
Install VCFtools:
# For Ubuntu/Debian sudo apt-get install vcftools # For CentOS/RHEL sudo yum install vcftools # For macOS brew install vcftools
-
Install R and Required Packages:
# Install R packages R -e "install.packages(c('ggplot2', 'dplyr', 'tidyr', 'knitr', 'viridis', 'gridExtra'), repos='https://cran.rstudio.com/')"
- Install Miniconda if not already installed
- Create and activate environment:
conda create -n vcf_analysis -c bioconda -c conda-forge bcftools vcftools r-essentials r-ggplot2 r-dplyr r-tidyr r-knitr r-viridis r-gridextra conda activate vcf_analysis
-
Prepare Your Data:
- Ensure your VCF file is compressed (
.vcf.gz
) and indexed (.vcf.gz.tbi
) - Place your VCF file in the working directory
- Ensure your VCF file is compressed (
-
Run the Analysis:
# Make the script executable chmod +x run_vcf_analysis.sh # Run the analysis ./run_vcf_analysis.sh
-
View Results:
- All outputs will be stored in the
VCF_Analysis_Outputs
directory - The main report will be available as
vcf_analysis_report.html
- All outputs will be stored in the
The pipeline generates several output files in the VCF_Analysis_Outputs
directory:
-
Basic Statistics:
snps_only.vcf.gz
: Filtered VCF containing only SNPsgenotype_table.csv
: Genotype information for all variantsallelic_depth.csv
: Allele depth informationdepth_per_site.csv
: Read depth per site
-
Quality Metrics:
site_qualities.csv
: Quality scores for each variantallele_frequencies.frq
: Allele frequency statisticsmean_depth_individual.idepth
: Mean depth per individualmean_depth_site.ldepth.mean
: Mean depth per site
-
Missing Data Analysis:
missing_individuals.imiss
: Missing data per individualmissing_sites.lmiss
: Missing data per site
-
Heterozygosity Analysis:
heterozygosity.het
: Heterozygosity and inbreeding coefficients
-
Report:
vcf_analysis_report.html
: Comprehensive HTML report with visualizations
-
BCFtools/VCFtools not found:
# Check if tools are in PATH which bcftools which vcftools # If not found, add to PATH or use full path export PATH=/path/to/tools:$PATH
-
R package installation fails:
# Try installing packages individually R -e "install.packages('ggplot2', repos='https://cran.rstudio.com/')"
-
Permission denied:
# Make script executable chmod +x run_vcf_analysis.sh
-
Memory issues:
- Ensure sufficient RAM (recommended: 16GB minimum)
- For large VCF files, consider increasing system swap space
- For BCFtools issues: BCFtools GitHub
- For VCFtools issues: VCFtools GitHub
- For R package issues: R Documentation
This documentation explains each plot and statistic in the VCF Analysis Report, providing guidance on interpretation and the significance of each metric in the context of variant calling analysis.
Purpose: This section lists all samples included in the VCF file.
Interpretation: The table displays sample IDs, which are important for tracking which individuals are included in the analysis. The number of samples influences statistical power and representation.
Importance: Sample information helps verify that all expected samples are present and correctly labeled, which is crucial for downstream analyses, especially in population genetics studies.
Purpose: This histogram shows the distribution of QUAL scores across all variants in the VCF file.
Interpretation:
- The x-axis (log scale) shows quality scores, while the y-axis shows the count of variants.
- Higher quality scores indicate greater confidence in variant calls.
- Ideally, you want to see most variants having high quality scores.
- A bimodal distribution may indicate two distinct classes of variants (high confidence vs. low confidence).
Importance: Quality scores help identify potential false positives. Variants with very low quality scores may be filtered out in downstream analyses to improve reliability.
Purpose: This boxplot summarizes the central tendency and spread of quality scores.
Interpretation:
- The box represents the interquartile range (IQR, 25th to 75th percentile).
- The line inside the box is the median.
- Whiskers typically extend to 1.5 times the IQR.
- Outliers are plotted as individual points.
Importance: The boxplot provides a quick summary of quality score distribution, helping identify if most variants are of acceptable quality and detecting outliers.
Purpose: This bar chart shows the average sequencing depth (coverage) for each sample.
Interpretation:
- The x-axis shows sample IDs, while the y-axis shows mean depth.
- Higher values indicate more sequencing data available for that sample.
- Ideally, all samples should have similar depths to ensure comparable variant calling sensitivity.
- Samples with very low depth may have more false negatives due to insufficient coverage.
Importance: Sequencing depth directly impacts the ability to accurately call variants. Samples with unusually low depth may need to be excluded or analyzed with caution.
Purpose: This density plot shows the distribution of sequencing depth across genomic positions.
Interpretation:
- The x-axis shows depth values, while the y-axis shows density.
- The shape indicates how evenly coverage is distributed across the genome.
- Multiple peaks may indicate systematic biases in coverage.
- A long right tail suggests some regions have extremely high coverage.
Importance: Regions with extremely low coverage may have poor variant calling sensitivity, while regions with unusually high coverage could indicate mapping artifacts or copy number variations.
Purpose: This bar chart shows the proportion of missing genotypes for each sample.
Interpretation:
- The x-axis shows sample IDs, while the y-axis shows the fraction of missing genotypes.
- Lower values are better, indicating more complete data.
- Samples with high missingness (e.g., >10-20%) may need to be excluded.
- Uneven missingness across samples can bias population genetic analyses.
Importance: Missing data can impact many analyses, especially those involving population structure or relatedness. Samples with excessive missing data may need to be removed.
Purpose: This histogram shows how missingness is distributed across genomic sites.
Interpretation:
- The x-axis shows the fraction of samples missing data at a given position, while the y-axis shows the count of sites.
- Ideally, most sites should have low missingness.
- Sites with high missingness across many samples may be in difficult-to-sequence regions.
Importance: Sites with high missingness across samples may need to be filtered out before certain analyses, as they can reduce statistical power and introduce biases.
Purpose: This bar chart shows the inbreeding coefficient (F) for each sample.
Interpretation:
- The x-axis shows sample IDs, while the y-axis shows the inbreeding coefficient.
- F values around 0 suggest random mating (Hardy-Weinberg equilibrium).
- Positive F values indicate an excess of homozygosity (inbreeding).
- Negative F values suggest an excess of heterozygosity.
- For haploid organisms like Plasmodium, high heterozygosity may indicate mixed infections.
Importance: The inbreeding coefficient can reveal population structure, inbreeding, sample contamination, or mixed infections in parasites.
Purpose: This scatter plot compares observed versus expected homozygous sites for each sample.
Interpretation:
- Points along the diagonal indicate samples in Hardy-Weinberg equilibrium.
- Points above the diagonal have more homozygosity than expected (potential inbreeding).
- Points below the diagonal have more heterozygosity than expected (potential outbreeding or sample contamination).
Importance: Deviations from expected heterozygosity can indicate sample quality issues, population structure, or selection.
Purpose: This table summarizes key metrics about variant quality scores.
Interpretation:
- Total Sites: The total number of variant sites analyzed.
- Mean and Median Quality: Central tendency measures of quality scores.
- Quality SD: Indicates the spread of quality scores.
- Quality Range: Shows the minimum and maximum quality scores.
Importance: These statistics provide a quick overview of variant quality, helping identify potential issues with the variant calling process.
Purpose: This table summarizes sequencing depth across the dataset.
Interpretation:
- Mean Depth: Average coverage across all samples and sites.
- Depth SD: Variation in depth, with high values indicating uneven coverage.
- Min/Max Depth: Extremes of coverage.
- Sites < 10x: Count of sites with potentially insufficient coverage for reliable variant calling.
Importance: Depth statistics help evaluate the overall quality of sequencing data and identify potential limitations in variant detection.
Purpose: This table summarizes missing data across the dataset.
Interpretation:
- Mean Missing: Average proportion of missing data across all samples.
- Max Missing: Highest proportion of missing data in any sample.
- Samples > 20% Missing: Count of samples with potentially problematic levels of missing data.
- Total Missing Sites: Sum of all missing genotypes across all samples and sites.
Importance: These statistics help evaluate data completeness and identify patterns of missingness that might affect analyses.
Purpose: This histogram shows the distribution of allele frequencies across all variable sites.
Interpretation:
- The x-axis shows allele frequency, while the y-axis shows the count of variants.
- The shape of this distribution can reveal population history and selection.
- For neutral evolution, many rare variants and fewer common variants are expected.
- Peaks at intermediate frequencies may indicate balancing selection.
- For Plasmodium, the shape can indicate drug resistance patterns or adaptation.
Importance: The allele frequency spectrum is fundamental for population genetics analyses and can reveal evolutionary processes and selective pressures affecting the population.
This VCF analysis report provides a comprehensive quality assessment of variant calls, helping identify potential issues with sequencing data, variant calling, or sample quality. Key areas to examine include:
- Sample quality: Check for samples with unusual depth, missing data, or heterozygosity.
- Variant quality: Examine the distribution of quality scores to identify potential false positives.
- Coverage uniformity: Assess whether depth is consistent across samples and genomic regions.
- Completeness: Evaluate the extent of missing data and its distribution.
- Population genetics indicators: Use heterozygosity and allele frequency metrics to understand the genetic structure of the population.
By addressing potential issues identified in this report, you can improve the reliability of downstream analyses and interpretations of genetic variation in your samples.