This repository contains a workflow for analyzing single-cell T-cell receptor sequencing (scTCR-seq) data with MiXCR. It adds preprocessing steps for quality control and read trimming, and a post-processing step that converts MiXCR outputs to CSV. The pipeline is designed to run on a high-performance computing (HPC) cluster with SLURM job scheduling: it processes raw FASTQ files into MiXCR analysis results, detailed quality control (QC) reports, and a summarized CSV of clonotype data.
The workflow includes four main scripts:
- runQC.sh: Performs initial quality control on raw FASTQ files using FastQC and MultiQC.
- runQC2.sh: Trims FASTQ files with fastp, followed by FastQC and MultiQC on trimmed files.
- runMixcr.sh: Runs MiXCR analysis, renames output files, and generates extensive QC reports (align, chainUsage, coverage, tags).
- convert_mixcr_to_csv.py: Converts MiXCR clone.groups_TRAB.tsv files into a summarized CSV containing TRA and TRB clonotype data for the top clone per sample.
- Quality Control: Uses FastQC and MultiQC to assess raw and trimmed FASTQ file quality.
- Read Trimming: Employs fastp for adapter removal and quality-based trimming of paired-end and single-end reads.
- MiXCR Analysis: Processes scTCR-seq data with MiXCR to identify and assemble TCR clonotypes, tailored for human samples.
- Comprehensive QC Reports: Generates detailed visualizations for alignment, chain usage, coverage, and tags using MiXCR’s exportQc functionality.
- Post-Processing: Converts MiXCR output to a CSV summarizing key TRA and TRB clonotype metrics (abundance, CDR3 sequences, V/D/J genes).
- Robust Error Handling: Includes extensive checks for file existence, directory permissions, and job failures.
- SLURM Integration: Optimized for HPC environments with SLURM job scheduling.
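
The post-processing idea (keep the top clone per sample and write one CSV row for it) can be sketched as follows. This is not the repository's convert_mixcr_to_csv.py: it uses the standard-library csv module instead of pandas so that it runs without extra packages, and the column name `abundance` is a hypothetical stand-in for whatever abundance column the real clone.groups_TRAB.tsv files contain.

```python
import csv
import io

# Hypothetical column name; check the header of your own
# clone.groups_TRAB.tsv files before relying on it.
ABUNDANCE_COL = "abundance"


def top_clone(tsv_text):
    """Return the row with the highest abundance, or None if the table is empty."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return None
    return max(rows, key=lambda row: float(row[ABUNDANCE_COL]))


def summarize(samples):
    """Build CSV text with one row (the top clone) per sample.

    `samples` maps a sample name to the TSV text of its clone table.
    All tables are assumed to share the same header.
    """
    out = io.StringIO()
    writer = None
    for name in sorted(samples):
        row = top_clone(samples[name])
        if row is None:
            continue  # skip samples without any clones
        record = {"sample": name, **row}
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(record), lineterminator="\n")
            writer.writeheader()
        writer.writerow(record)
    return out.getvalue()
```

In practice the TSV text would be read from each sample's file; keeping the functions text-based makes them easy to try on a small hand-written table first.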
- HPC Cluster: Access to a SLURM-based HPC system (for shell scripts).
- Modules (for shell scripts):
  - GCC/10.3.0
  - FastQC/0.12.1-Java-11.0.2
  - MultiQC/1.25.1-foss-2021a
  - MiXCR/4.7.0-Java-11.0.2
- Software:
  - fastp (downloaded automatically by runQC2.sh if not present)
  - Python 3.6+ (for convert_mixcr_to_csv.py)
  - Python packages: pandas (install via pip install pandas); argparse and logging are part of the Python standard library and need no installation
- Input: Paired-end or single-end FASTQ files in the specified input directory.
- Storage: Sufficient disk space for input, output, and log directories.
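
Before submitting jobs, it can help to confirm that the prerequisites above are actually visible in the current shell session. The sketch below is one possible preflight check. The executable names (`fastqc`, `multiqc`, `mixcr`, `sbatch`) are assumptions about what the loaded modules put on PATH, and fastp is left out because runQC2.sh downloads it when missing.

```python
import shutil
import sys

# Assumed executable names; your cluster's modules may expose different ones.
REQUIRED_TOOLS = ["fastqc", "multiqc", "mixcr", "sbatch"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Check that the interpreter meets the Python 3.6+ requirement."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    problems = missing_tools(REQUIRED_TOOLS)
    if not python_ok():
        problems.append("python>=3.6")
    print("missing:", ", ".join(problems) if problems else "none")
```

Run it after loading the modules; anything listed as missing usually means a module was not loaded in this session.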
mixcr_workflow/
├── runQC.sh # Initial QC with FastQC and MultiQC
├── runQC2.sh # Trimming with fastp, followed by FastQC and MultiQC
├── runMixcr.sh # MiXCR analysis and QC report generation
├── convert_mixcr_to_csv.py # Convert MiXCR output to CSV
├── logs/ # Log files for job outputs and errors
├── data/ # Input FASTQ and output directories (user-defined)
└── README.md # This file
- Clone the Repository:
    git clone https://github.com/your-username/mixcr_workflow.git
    cd mixcr_workflow
- Set Up Directories:
  - Update the directory paths in each shell script (INPUT_DIR, OUTPUT_DIR, etc.) to match your file system.
  - Ensure input FASTQ files follow the naming convention: Jurkat-P0321-Mart1-{{n}}-{{CELL:a}}_{{SAMPLE:a}}_L001_{{R}}_trimmed.fastq.gz.
  - For the Python script, specify the MiXCR output directory containing clone.groups_TRAB.tsv files.
- Run the Pipeline:
  - Shell Scripts:
      chmod +x runQC.sh runQC2.sh runMixcr.sh
      sbatch runQC.sh
      sbatch runQC2.sh
      sbatch runMixcr.sh
  - Python Script:
      python convert_mixcr_to_csv.py --output-dir /path/to/mixcr_output_scTCR
    Replace /path/to/mixcr_output_scTCR with the path to the directory containing MiXCR clone.groups_TRAB.tsv files.
- Monitor Jobs (for shell scripts):
    squeue -u $USER
- Check Outputs:
  - FastQC and MultiQC reports: $OUTPUT_DIR/FastQC, $OUTPUT_DIR/MultiQC, $OUTPUT_DIR/FastQC_Trimmed, $OUTPUT_DIR/MultiQC_Trimmed
  - Trimmed FASTQ files: $OUTPUT_DIR/Trimmed_Fastq
  - MiXCR results: $OUTPUT_DIR/data/mixcr_output_scTCR
  - QC reports: $OUTPUT_DIR/data/mixcr_output_scTCR/qc_reports
  - Summary CSV: $OUTPUT_DIR/data/mixcr_output_scTCR/mixcr_summary.csv
  - Logs: $LOG_DIR (for shell scripts) and convert_mixcr_to_csv.log (for Python script)
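
The three shell scripts are submitted one after another in the steps above. Whether each script waits for the previous one to finish is not stated here, so the sketch below assumes it does not and chains the jobs explicitly with SLURM's `--dependency=afterok` option, using `--parsable` to read back each job ID. The helper names `sbatch_command` and `submit_chain` are hypothetical and not part of this repository.

```python
import subprocess


def sbatch_command(script, after=None):
    """Build an sbatch command, optionally waiting for job `after` to succeed."""
    cmd = ["sbatch", "--parsable"]
    if after is not None:
        cmd.append("--dependency=afterok:{}".format(after))
    cmd.append(script)
    return cmd


def submit_chain(scripts, run=subprocess.check_output):
    """Submit scripts in order so each starts only after the previous one succeeds."""
    job_id = None
    job_ids = []
    for script in scripts:
        output = run(sbatch_command(script, after=job_id))
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        job_id = output.decode().strip().split(";")[0]
        job_ids.append(job_id)
    return job_ids


if __name__ == "__main__":
    print(submit_chain(["runQC.sh", "runQC2.sh", "runMixcr.sh"]))
```

With `afterok`, a failed step leaves the later jobs pending with an unsatisfied dependency, which shows up in the `squeue -u $USER` output.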
- File Naming: Ensure FASTQ files match the expected pattern for MiXCR processing, and clone.groups_TRAB.tsv files follow the results.*.clone.groups_TRAB.tsv naming convention.
- Resource Allocation: Adjust SLURM parameters (--ntasks, --mem, --time) based on your cluster’s configuration and dataset size.
- Error Logs: Check log files in $LOG_DIR (shell scripts) or convert_mixcr_to_csv.log (Python script) for troubleshooting.
- Expected Samples: The runMixcr.sh script includes a check for 5 samples; modify the EXPECTED_SAMPLES variable if needed.
- Python Environment: Ensure Python and required packages are installed. On an HPC, you may need to load a Python module or use a virtual environment.
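
The file-naming and expected-sample notes can be checked before a run with a few lines of Python. The sketch below rests on stated assumptions: `{{n}}` is treated as digits, `{{CELL:a}}` and `{{SAMPLE:a}}` as alphanumeric strings, `{{R}}` as R1 or R2, and one clone table is assumed per sample. How runMixcr.sh itself counts samples is not described here, so treat the count check as an approximation.

```python
import fnmatch
import re

EXPECTED_SAMPLES = 5  # mirrors the default mentioned for runMixcr.sh

# Regex rendering of the FASTQ naming convention, under the assumptions above.
FASTQ_RE = re.compile(
    r"^Jurkat-P0321-Mart1-(?P<n>\d+)-(?P<cell>[A-Za-z0-9]+)_"
    r"(?P<sample>[A-Za-z0-9]+)_L001_(?P<read>R[12])_trimmed\.fastq\.gz$"
)
CLONE_TABLE_GLOB = "results.*.clone.groups_TRAB.tsv"


def parse_fastq_name(name):
    """Return the captured fields for a conforming FASTQ name, else None."""
    match = FASTQ_RE.match(name)
    return match.groupdict() if match else None


def clone_tables(filenames):
    """Keep only names that follow the clone-table naming convention."""
    return sorted(f for f in filenames if fnmatch.fnmatchcase(f, CLONE_TABLE_GLOB))


def sample_count_ok(filenames, expected=EXPECTED_SAMPLES):
    """Assume one clone table per sample and compare against the expected count."""
    return len(clone_tables(filenames)) == expected
```

Passing `os.listdir(...)` of the input or MiXCR output directory to these functions gives a quick list of files that would be skipped because of a naming mismatch.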
Contributions are welcome! Please submit a pull request or open an issue for bug reports, feature requests, or suggestions.
This project is licensed under the MIT License. See the LICENSE file for details.