Fastqc reporter is a Command Line Interface (CLI) tool built to parse fastqc files into sections and generate reports. It also generates graphical representations and a flag file indicating the QC test result (pass, fail, or warn).
To run this program, the following are required:
- Python 3.9 or higher
- Conda or venv (Conda is used in this documentation)
To create a new virtual environment and install dependencies:
conda create -c conda-forge -n name_of_my_env seaborn pandas matplotlib
Activate the virtual environment:
conda activate name_of_my_env
Run the program with its parameters (refer to the Example Usage section).
Fastqc reporter follows an object-oriented approach with two main classes:
- FastQCParser
- Section
The FastQCParser class parses fastqc files into sections and manages optional parameters. The Section class writes reports and flag files for each section.
Each section inherits from the base Section class and defines its own implementation of the plot_section() method to generate the necessary plots.
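The inheritance pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: only the names Section and plot_section() come from the description, while the constructor arguments, the write_flag() helper, the flag file name, and the PerBaseNContent subclass are hypothetical.

```python
import tempfile
from pathlib import Path


class Section:
    """Base class: holds one parsed section and writes its flag file."""

    def __init__(self, title, status, rows, out_dir):
        self.title = title          # e.g. "Per Base N Content"
        self.status = status        # "pass", "fail", or "warn"
        self.rows = rows            # parsed data rows of the section
        self.out_dir = Path(out_dir)

    def write_flag(self):
        # Hypothetical helper and file naming scheme (assumption).
        self.out_dir.mkdir(parents=True, exist_ok=True)
        flag_path = self.out_dir / f"{self.title.replace(' ', '_')}_flag.txt"
        flag_path.write_text(self.status + "\n")
        return flag_path

    def plot_section(self):
        # Each concrete section is expected to override this method.
        raise NotImplementedError("subclasses define their own plot")


class PerBaseNContent(Section):
    """Hypothetical subclass; a real one would draw with matplotlib/seaborn."""

    def plot_section(self):
        return f"line plot of {len(self.rows)} rows for {self.title}"


# Usage: write the flag file into a temporary folder and describe the plot.
with tempfile.TemporaryDirectory() as out_dir:
    section = PerBaseNContent(
        "Per Base N Content", "pass", [("1", 0.0), ("2", 0.0)], out_dir
    )
    flag_text = section.write_flag().read_text()
    description = section.plot_section()
```

Keeping the report and flag logic in the base class means each subclass only has to supply its plot.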
Fastqc reporter uses Python's argparse module to handle command-line arguments. Required parameters:
- Path to the fastqc file
- Output folder for plots, reports, and flag files
Optional parameters are registered via add_argument with the store_true action, so each one is a boolean flag that is False unless it is passed on the command line.
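A minimal sketch of such a parser is shown below. The positional argument names fastqc_file and output_dir and the build_parser() function are assumptions for illustration; the flags are taken from the options table in this README, and only three of them are shown.

```python
import argparse


def build_parser():
    """Build a parser with two required positionals and store_true flags."""
    parser = argparse.ArgumentParser(prog="fastqc_reporter.py")
    # Required parameters (argument names are assumptions).
    parser.add_argument("fastqc_file", help="Path to the fastqc file")
    parser.add_argument(
        "output_dir", help="Output folder for plots, reports, and flag files"
    )
    # Optional flags: absent means False, present means True.
    parser.add_argument("-t", "--per_tile_seq_qual", action="store_true",
                        help="Per Tile Sequence Quality")
    parser.add_argument("-g", "--per_seq_GC_cont", action="store_true",
                        help="Per Sequence GC Content")
    parser.add_argument("-a", "--all", action="store_true",
                        help="Run all sections")
    return parser


args = build_parser().parse_args(
    ["./data/fastqc_data1.txt", "./solution1/", "-t"]
)
print(args.per_tile_seq_qual, args.all)  # True False
```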
- Instantiate FastQCParser with required parameters.
- Parse the fastqc file into a dictionary.
- Handle optional parameters and call appropriate methods.
- Generate reports, plots, and flag files.
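The parsing step in the workflow above can be sketched as follows. FastQC data files open each module with a line of the form `>>Title<TAB>status` and close it with `>>END_MODULE`. The function name parse_fastqc_text() and the shape of the returned dictionary are assumptions, not the project's actual implementation, and the sample text is a shortened stand-in for a real file.

```python
def parse_fastqc_text(text):
    """Split FastQC report text into {title: {"status": ..., "lines": [...]}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            current = None                      # module closed
        elif line.startswith(">>"):
            # Module header: title and status are tab-separated.
            title, _, status = line[2:].partition("\t")
            current = {"status": status.strip(), "lines": []}
            sections[title.strip()] = current
        elif current is not None:
            current["lines"].append(line)       # data row inside a module
    return sections


# Shortened sample input (values taken from the example report in this README).
sample = (
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Total Sequences\t37287903\n"
    "%GC\t55\n"
    ">>END_MODULE\n"
)
sections = parse_fastqc_text(sample)
print(sections["Basic Statistics"]["status"])  # pass
```

Each dictionary entry can then be handed to the matching section class to produce its report, plot, and flag file.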
The script is structured as follows:
fastqc_reporter/
│── fastqc_reporter.py # Entry point, defines parser options
│── constants.py # Defines section titles
│── model/ # Contains all classes used in the script
│ ├── __init__.py
│ ├── section.py
│ ├── fastqc_parser.py
│── data/ # Contains test fastqc files
- Matplotlib & Seaborn - For plotting graphs
- Pandas - To extract and manage section data using pandas.read_csv()
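As an illustration of how pandas.read_csv() can load one section, the sketch below reads the tab-separated body of the Basic Statistics section from an in-memory string. Reading from io.StringIO, stripping the leading "#" from the header, and keeping every value as text with dtype=str are assumptions made for this example; the script itself may read the data differently.

```python
import io

import pandas as pd

# Body of one section, as it appears between the module header and END_MODULE.
section_text = (
    "#Measure\tValue\n"
    "Filename\t4_age21_S12_L001_R2_001_concat.fastq.gz\n"
    "File type\tConventional base calls\n"
    "Encoding\tSanger / Illumina 1.9\n"
    "Total Sequences\t37287903\n"
    "Sequences flagged as poor quality\t0\n"
    "Sequence length\t75\n"
    "%GC\t55\n"
)

# sep="\t" because FastQC data files are tab-separated; dtype=str keeps text.
df = pd.read_csv(io.StringIO(section_text), sep="\t", dtype=str)
# The header row starts with "#", so strip it from the column names.
df.columns = [name.lstrip("#") for name in df.columns]

gc_percent = df.loc[df["Measure"] == "%GC", "Value"].iloc[0]
```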
To run the program in its default form:
python3 fastqc_reporter.py ./data/fastqc_data1.txt ./solution1/
Basic Statistics pass
#Measure Value
Filename 4_age21_S12_L001_R2_001_concat.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 37287903
Sequences flagged as poor quality 0
Sequence length 75
%GC 55
Users can specify sections to run using options:
Option | Description |
---|---|
-t / --per_tile_seq_qual | Per Tile Sequence Quality |
-s / --per_seq_qual_scores | Per Sequence Quality Scores |
-c / --per_base_seq_content | Per Base Sequence Content |
-g / --per_seq_GC_cont | Per Sequence GC Content |
-n / --per_base_N_cont | Per Base N Content |
-l / --seq_len_dist | Sequence Length Distribution |
-d / --seq_dup | Sequence Duplication Levels |
-o / --over_seq | Overrepresented Sequences |
-p / --adap_cont | Adapter Content |
-k / --kmer_count | K-mer Content |
-a / --all | Run all sections |
The script implements error handling using Python's try-except block to manage:
- Invalid user input
- Malformed fastqc files
- File permission errors
- Parsing errors
If an error occurs, the program exits with a non-zero exit code and prints an error message.
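The behaviour described above could look roughly like the sketch below. The function names read_fastqc() and main(), the error messages, the exit code value 1, and the simple ">>" check for a malformed file are all assumptions; the README only states that the program prints a message and exits with a non-zero code.

```python
import sys


def read_fastqc(path):
    """Read a fastqc data file and reject text without module headers."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    if ">>" not in text:
        # Assumed minimal check for a malformed file.
        raise ValueError(f"{path} does not look like a fastqc data file")
    return text


def main(argv):
    """Return 0 on success, or a non-zero exit code after printing an error."""
    try:
        read_fastqc(argv[0])
    except IndexError:
        print("error: missing path to the fastqc file", file=sys.stderr)
        return 1
    except OSError as exc:
        # Covers FileNotFoundError and PermissionError, among others.
        print(f"error: cannot read file: {exc}", file=sys.stderr)
        return 1
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


# Usage: a path that does not exist yields a non-zero exit code.
exit_code = main(["./does_not_exist_fastqc_data.txt"])
```

In a script, the return value would typically be passed to sys.exit() so the shell sees the exit code.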