This document demonstrates a complete variant calling and annotation pipeline that starts from FASTQ data. The pipeline uses open-source bioinformatics tools to process the raw reads, align them to the human reference genome (GRCh38), call genetic variants, and functionally annotate them.
The pipeline imitates a ClinGen workflow and covers every step from raw reads to annotated variants (quality control, adapter trimming, alignment, variant calling, filtering, and annotation) using existing command-line tools.
The objective is to perform quality control, alignment, variant calling, filtering, and annotation on sequencing data from SRR576933 using tools such as FastQC, BWA, SAMtools, bcftools, and SnpEff.
- Data Source: Public FASTQ dataset (SRR576933) from SRA.
- Reference Genome: GRCh38 from GENCODE.
- Tools Used:
  - FastQC: Read quality control
  - Trimmomatic: Adapter and low-quality trimming
  - HISAT2: Read alignment to reference genome
  - SAMtools: Format conversion and sorting
  - bcftools: Variant calling and filtering
  - SnpEff: Variant annotation
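The stages in the tools list can be sketched as one shell script. This is a minimal sketch, not the repository's actual script: it assumes single-end reads, and the reference FASTA name, HISAT2 index prefix, adapter file, trimming thresholds, SnpEff database name, and the `trimmomatic` and `snpEff` wrapper commands (as installed by conda) are all assumptions. By default it only prints each command; set the hypothetical `DRY_RUN=0` variable to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the pipeline stages. Output file names follow the outputs table;
# everything marked "assumed" is illustrative, not taken from the repository.
set -euo pipefail

SAMPLE="SRR576933"
REF_FA="ref/GRCh38.fa"        # assumed reference FASTA name
HISAT2_INDEX="ref/GRCh38"     # assumed HISAT2 index prefix
SNPEFF_DB="GRCh38.99"         # assumed SnpEff database name
ADAPTERS="TruSeq3-SE.fa"      # assumed adapter file (ships with Trimmomatic)
DRY_RUN="${DRY_RUN:-1}"       # 1 = print commands only, 0 = also execute them

# Print a stage's command line; execute it only when DRY_RUN=0.
run() {
  echo "+ $1"
  if [ "$DRY_RUN" = "0" ]; then
    eval "$1"
  fi
}

# 1. Read quality control
run "fastqc raw_data/${SAMPLE}.fastq -o raw_data/"
# 2. Adapter and low-quality trimming (single-end mode, assumed thresholds)
run "trimmomatic SE -phred33 raw_data/${SAMPLE}.fastq trimmed/${SAMPLE}_trimmed.fastq ILLUMINACLIP:${ADAPTERS}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36"
# 3. Alignment to the reference genome
run "hisat2 -x ${HISAT2_INDEX} -U trimmed/${SAMPLE}_trimmed.fastq -S alignment/${SAMPLE}.sam"
# 4. Format conversion, sorting, and indexing
run "samtools sort -o alignment/${SAMPLE}_sorted.bam alignment/${SAMPLE}.sam"
run "samtools index alignment/${SAMPLE}_sorted.bam"
# 5. Variant calling and filtering (DP is assumed to mean INFO/DP)
run "bcftools mpileup -f ${REF_FA} alignment/${SAMPLE}_sorted.bam | bcftools call -mv -Ov -o alignment/variants.vcf"
run "bcftools filter -i 'QUAL>=20 && INFO/DP>=10' -o alignment/filtered_variants.vcf alignment/variants.vcf"
# 6. Functional annotation
run "snpEff ${SNPEFF_DB} alignment/filtered_variants.vcf > snpeff/annotated_filtered_variants.vcf"
```

Printing commands before running them makes it easy to review paths and parameters against your own setup before committing to a long alignment run.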
| File Name | Description |
|---|---|
| `SRR576933_trimmed.fastq` | Trimmed high-quality reads |
| `SRR576933_sorted.bam` | Sorted alignment file |
| `variants.vcf` | Raw variants called |
| `filtered_variants.vcf` | Filtered variants (QUAL ≥ 20, DP ≥ 10) |
| `annotated_filtered_variants.vcf` | Functionally annotated variant set |
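To make the filter thresholds concrete, the sketch below applies QUAL ≥ 20 and DP ≥ 10 to three made-up VCF records using plain `awk`. It is a tool-free illustration of what the thresholds mean, not the pipeline's filtering step (which uses bcftools). It assumes DP refers to the `DP` key in the INFO column; the function names `filter_vcf` and `demo_vcf` and the records themselves are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep header lines plus records with QUAL >= 20 and INFO DP >= 10.
filter_vcf() {
  awk -F'\t' '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      dp = -1                            # stays -1 when the record has no DP key
      n = split($8, info, ";")           # INFO is column 8, keys separated by ";"
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
      if ($6 != "." && $6 + 0 >= 20 && dp >= 10) print   # QUAL is column 6
    }
  '
}

# Three made-up records for illustration only.
demo_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=25\n'   # passes both thresholds
  printf 'chr1\t200\t.\tC\tT\t12\t.\tDP=40\n'   # fails QUAL >= 20
  printf 'chr1\t300\t.\tG\tA\t60\t.\tDP=4\n'    # fails DP >= 10
}

demo_vcf | filter_vcf
```

Only the header and the record at position 100 are printed, since each of the other two records misses one of the thresholds.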
- `raw_data/`: Original FASTQ file
- `ref/`: GRCh38 reference genome and index files
- `trimmed/`: Cleaned reads after Trimmomatic
- `alignment/`: BAM files, variant calls, and sorted data
- `snpeff/`: Annotated VCF output from SnpEff
- `scripts/`: Shell commands and helper scripts used
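If you are reproducing this layout from scratch rather than cloning it, the directories can be created in one step. This is a small sketch; `PROJECT_DIR` is a hypothetical variable that defaults to the current directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variable: root folder for the project (defaults to the current directory).
PROJECT_DIR="${PROJECT_DIR:-.}"

# Create each directory from the layout; -p makes this safe to re-run.
for d in raw_data ref trimmed alignment snpeff scripts; do
  mkdir -p "${PROJECT_DIR}/${d}"
done
```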
- Clone the repository and change into it:

      git clone https://github.com/ParthivRajesh/Clinical-Genomics-VariantPipeline.git
      cd Clinical-Genomics-VariantPipeline
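Before running anything, it can help to confirm that the tools are installed. The sketch below checks the `PATH` for each one. The executable names are assumptions: `trimmomatic` and `snpEff` are the wrapper names provided by conda packages, and a manual install may expose them as `.jar` files instead. `have_tool` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if the named executable is on PATH, non-zero otherwise.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

missing=0
# Executable names are assumed; adjust them to match your installation.
for tool in fastqc trimmomatic hisat2 samtools bcftools snpEff; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "${missing} tool(s) missing"
```

The script reports missing tools without exiting with an error, so it is safe to run on a partially configured machine.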