High-speed streaming I/O with minimal RAM usage for combining paired-end FASTQ files. Optimized for Cell Ranger compatibility and large-scale sequencing data.
- Streaming I/O: Minimal RAM usage, handles files of any size
- Auto-optimization: Automatically tunes buffer sizes based on system resources
- Memory profiling: Track RAM usage during processing
- Checkpointing: Resume interrupted runs from where they left off
- SSD/HDD detection: Optimizes I/O patterns for different storage types
- Multi-threading: Parallel processing for maximum speed
- Quality score validation: Detects corrupted quality scores
- Read length consistency: Verifies R1/R2 have matching lengths
- Adapter detection: Identifies common adapter contamination
- GC content analysis: Basic sequence quality metrics
- Sample barcode extraction: Parse and validate sample barcodes
- Real-time monitoring: Live updates of processing speed/throughput
- Disk space monitoring: Warns before running out of space
- Backup creation: Auto-backup original files before processing
- Retry logic: Automatically retries failed operations
- Corruption detection: Validates file integrity before processing
- Progress estimation: Shows ETA for large datasets
- Interactive HTML reports: Search, filter, and visualize results
- Configuration files: YAML-based configuration for complex workflows
- System diagnostics: Comprehensive system information
- Performance profiling: Detailed performance analysis
# Clone the repository
git clone https://github.com/yourusername/Fastq-combiner.git
cd Fastq-combiner
# Install dependencies
pip install -r requirements.txt
# Or install directly
pip install tqdm pyyaml psutil
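After installation, the diagnostics mode described later in this README can be used to confirm that the dependencies (tqdm, pyyaml, psutil) are detected:

```bash
# Print Python version, CPU/memory, disk, and dependency status
python3 fastq_combiner.py --diagnostics
```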
python3 fastq_combiner.py mapping.csv -o combined_output
python3 fastq_combiner.py mapping.csv \
--config example_config.yaml \
--auto-optimize \
--memory-profile \
--validate \
--real-time-monitor \
--checkpoint
target_sample,source_file1,source_file2,source_file3
Sample_A,run1/Sample_A,run2/Sample_A,run3/Sample_A
Sample_B,batch1/Sample_B,batch2/Sample_B,batch3/Sample_B
Sample_C,seq1/Sample_C,seq2/Sample_C,local/Sample_C
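Each row maps a target sample name to the source locations whose FASTQ files should be combined. A minimal run with this mapping, reusing the basic command from above plus input validation:

```bash
# Combine the samples listed in mapping.csv, validating inputs first
python3 fastq_combiner.py mapping.csv -o combined_output --validate
```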
# example_config.yaml
output: "combined_output"
threads: 8
auto_optimize: true
validate: true
monitor_disk: true
create_backups: true
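To use the file, pass it with --config; additional flags can still be given on the command line, as in the advanced example above:

```bash
# Run using the settings defined in example_config.yaml
python3 fastq_combiner.py mapping.csv --config example_config.yaml
```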
# Auto-optimize based on system resources
--auto-optimize
# Memory profiling
--memory-profile
# Performance profiling
--profile
# Custom buffer size
--buffer-size 33554432 # 32MB
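For example, the performance flags can be combined in a single run (the buffer size is given in bytes, so 33554432 corresponds to 32 MB):

```bash
# Profile a run that uses a fixed 32 MB I/O buffer
python3 fastq_combiner.py mapping.csv -o combined_output --buffer-size 33554432 --profile
```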
# Validate FASTQ quality
--validate
# Paired-end deduplication (removes duplicate read pairs, not just individual reads)
--deduplicate --paired-end-dedup
# Check sample barcodes
--check-barcodes
# GC content analysis
--gc-analysis
# Adapter detection
--adapter-check
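A typical quality-control run enables several of these checks together, for example:

```bash
# Validate inputs, deduplicate read pairs, and run barcode/GC/adapter checks
python3 fastq_combiner.py mapping.csv -o combined_output \
    --validate --deduplicate --paired-end-dedup \
    --check-barcodes --gc-analysis --adapter-check
```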
# Real-time monitoring
--real-time-monitor
# Disk space monitoring
--monitor-disk
# Create backups
--create-backups
# Retry failed operations
--retry-failed
# Enable checkpointing
--checkpoint
# Resume interrupted run
python3 fastq_combiner.py mapping.csv --checkpoint
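These options can be combined for long-running jobs; if the run is interrupted, re-issuing the same command with --checkpoint resumes it:

```bash
# Long-running job with checkpointing, backups, disk monitoring, and retries
python3 fastq_combiner.py mapping.csv -o combined_output \
    --checkpoint --create-backups --monitor-disk --retry-failed --real-time-monitor
```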
- {sample}_S1_R1_001.fastq.gz - Combined R1 reads
- {sample}_S1_R2_001.fastq.gz - Combined R2 reads
- combination_summary.csv - Processing summary
- combination_report.html - Interactive HTML report
- When using --deduplicate --paired-end-dedup, only unique (R1, R2) sequence pairs are retained. If all input reads are identical, only one read pair will be present in the output.
- Interactive search/filter for large datasets
- Success/failure charts with Chart.js
- Per-sample details with collapsible sections
- Validation results and quality metrics
- System metadata and performance stats
# Build image
docker build -t fastq-combiner .
# Run with volume mount
docker run -v $(pwd):/data fastq-combiner mapping.csv -o /data/output
# Print system information
python3 fastq_combiner.py --diagnostics
Output includes:
- Python version and platform
- CPU cores and memory
- Disk space and storage type
- Dependency status
- Use SSD storage for best performance
- Enable auto-optimization for automatic tuning
- Monitor memory usage with --memory-profile
- Use checkpointing for large datasets
- Enable real-time monitoring for progress tracking
"Paired-end deduplication test: only 1 unique pair output"
- If all input reads are identical, deduplication will keep only one (R1, R2) pair. This is expected behavior for paired-end deduplication.
"No FASTQ files found"
- Check search directories with --search-dirs
- Verify file patterns with --r1-patterns and --r2-patterns
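A sketch of such an invocation follows; the exact argument format expected by --search-dirs, --r1-patterns, and --r2-patterns (space- versus comma-separated values, glob syntax) is an assumption here and may differ in practice:

```bash
# Hypothetical invocation: the argument syntax for the search/pattern flags is assumed
python3 fastq_combiner.py mapping.csv -o combined_output \
    --search-dirs run1 run2 \
    --r1-patterns "*_R1_*.fastq.gz" \
    --r2-patterns "*_R2_*.fastq.gz"
```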
"Low disk space"
- Use --monitor-disk to check space requirements
- Clean up temporary files
"Memory issues"
- Use --memory-profile to monitor usage
- Reduce --buffer-size or --threads
"Validation warnings"
- Use --validate to check file quality
- Review warnings in the HTML report
# For large datasets
--auto-optimize --memory-profile --checkpoint --real-time-monitor
# For validation-heavy workflows
--validate --check-barcodes --gc-analysis --adapter-check
# For safety-critical operations
--create-backups --monitor-disk --retry-failed
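As a complete command line, the large-dataset combination looks like this:

```bash
# Large dataset: auto-tuning, memory profiling, checkpointing, live monitoring
python3 fastq_combiner.py mapping.csv -o combined_output \
    --auto-optimize --memory-profile --checkpoint --real-time-monitor
```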
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
See CONTRIBUTING.md for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for the bioinformatics community
- Optimized for Cell Ranger compatibility
- Inspired by the need for efficient large-scale data processing
- Comprehensive automated tests cover edge cases, paired-end deduplication, quality validation, error handling, and reporting.
- The test suite ensures robust behavior for all major features and CLI options.
- In dry-run mode (--dry-run), the tool logs only that a dry run was completed. It does not log all enabled features or options.
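A dry run can be requested on any of the commands above; per the note, its log output is intentionally minimal:

```bash
# Dry-run example: only a completion message is logged
python3 fastq_combiner.py mapping.csv --dry-run
```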