Skip to content

asomohammed/Fastq-combiner

Repository files navigation

⚑ FASTQ File Combiner - STREAMING OPTIMIZED

High-speed streaming I/O with minimal RAM usage for combining paired-end FASTQ files. Optimized for Cell Ranger compatibility and large-scale sequencing data.

πŸš€ Key Features

Performance & Scalability

  • Streaming I/O: Minimal RAM usage, handles files of any size
  • Auto-optimization: Automatically tunes buffer sizes based on system resources
  • Memory profiling: Track RAM usage during processing
  • Checkpointing: Resume interrupted runs from where they left off
  • SSD/HDD detection: Optimizes I/O patterns for different storage types
  • Multi-threading: Parallel processing for maximum speed

Data Quality & Validation

  • Quality score validation: Detects corrupted quality scores
  • Read length consistency: Verifies R1/R2 have matching lengths
  • Adapter detection: Identifies common adapter contamination
  • GC content analysis: Basic sequence quality metrics
  • Sample barcode extraction: Parse and validate sample barcodes

Monitoring & Safety

  • Real-time monitoring: Live updates of processing speed/throughput
  • Disk space monitoring: Warns before running out of space
  • Backup creation: Auto-backup original files before processing
  • Retry logic: Automatically retries failed operations
  • Corruption detection: Validates file integrity before processing

User Experience

  • Progress estimation: Shows ETA for large datasets
  • Interactive HTML reports: Search, filter, and visualize results
  • Configuration files: YAML-based configuration for complex workflows
  • System diagnostics: Comprehensive system information
  • Performance profiling: Detailed performance analysis

πŸ“¦ Installation

# Clone the repository
git clone https://github.com/yourusername/Fastq-combiner.git
cd Fastq-combiner

# Install dependencies
pip install -r requirements.txt

# Or install directly
pip install tqdm pyyaml psutil

🎯 Quick Start

Basic Usage

python3 fastq_combiner.py mapping.csv -o combined_output

Enhanced Usage with All Features

python3 fastq_combiner.py mapping.csv \
  --config example_config.yaml \
  --auto-optimize \
  --memory-profile \
  --validate \
  --real-time-monitor \
  --checkpoint

πŸ“‹ Input Format

CSV Mapping File

target_sample,source_file1,source_file2,source_file3
Sample_A,run1/Sample_A,run2/Sample_A,run3/Sample_A
Sample_B,batch1/Sample_B,batch2/Sample_B,batch3/Sample_B
Sample_C,seq1/Sample_C,seq2/Sample_C,local/Sample_C

YAML Configuration

# example_config.yaml
output: "combined_output"
threads: 8
auto_optimize: true
validate: true
monitor_disk: true
create_backups: true

πŸ”§ Advanced Features

Performance Optimization

# Auto-optimize based on system resources
--auto-optimize

# Memory profiling
--memory-profile

# Performance profiling
--profile

# Custom buffer size
--buffer-size 33554432  # 32MB

Data Validation

# Validate FASTQ quality
--validate

# Paired-end deduplication (removes duplicate read pairs, not just individual reads)
--deduplicate --paired-end-dedup

# Check sample barcodes
--check-barcodes

# GC content analysis
--gc-analysis

# Adapter detection
--adapter-check

Monitoring & Safety

# Real-time monitoring
--real-time-monitor

# Disk space monitoring
--monitor-disk

# Create backups
--create-backups

# Retry failed operations
--retry-failed

Checkpointing

# Enable checkpointing
--checkpoint

# Resume interrupted run
python3 fastq_combiner.py mapping.csv --checkpoint

πŸ“Š Output

Files Generated

  • {sample}_S1_R1_001.fastq.gz - Combined R1 reads
  • {sample}_S1_R2_001.fastq.gz - Combined R2 reads
  • combination_summary.csv - Processing summary
  • combination_report.html - Interactive HTML report

Paired-End Deduplication Output

  • When using --deduplicate --paired-end-dedup, only unique (R1, R2) sequence pairs are retained. If all input reads are identical, only one read pair will be present in the output.

HTML Report Features

  • Interactive search/filter for large datasets
  • Success/failure charts with Chart.js
  • Per-sample details with collapsible sections
  • Validation results and quality metrics
  • System metadata and performance stats

🐳 Docker Support

# Build image
docker build -t fastq-combiner .

# Run with volume mount
docker run -v $(pwd):/data fastq-combiner mapping.csv -o /data/output

πŸ” System Diagnostics

# Print system information
python3 fastq_combiner.py --diagnostics

Output includes:

  • Python version and platform
  • CPU cores and memory
  • Disk space and storage type
  • Dependency status

πŸ“ˆ Performance Tips

  1. Use SSD storage for best performance
  2. Enable auto-optimization for automatic tuning
  3. Monitor memory usage with --memory-profile
  4. Use checkpointing for large datasets
  5. Enable real-time monitoring for progress tracking

πŸ› οΈ Troubleshooting

Common Issues

Paired-end deduplication test: only 1 unique pair output

  • If all input reads are identical, deduplication will keep only one (R1, R2) pair. This is expected behavior for paired-end deduplication.

"No FASTQ files found"

  • Check search directories with --search-dirs
  • Verify file patterns with --r1-patterns and --r2-patterns

"Low disk space"

  • Use --monitor-disk to check space requirements
  • Clean up temporary files

"Memory issues"

  • Use --memory-profile to monitor usage
  • Reduce --buffer-size or --threads

"Validation warnings"

  • Use --validate to check file quality
  • Review warnings in HTML report

Performance Optimization

# For large datasets
--auto-optimize --memory-profile --checkpoint --real-time-monitor

# For validation-heavy workflows
--validate --check-barcodes --gc-analysis --adapter-check

# For safety-critical operations
--create-backups --monitor-disk --retry-failed

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

See CONTRIBUTING.md for details.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Built for the bioinformatics community
  • Optimized for Cell Ranger compatibility
  • Inspired by the need for efficient large-scale data processing

πŸ§ͺ Test Coverage

  • Comprehensive automated tests cover edge cases, paired-end deduplication, quality validation, error handling, and reporting.
  • The test suite ensures robust behavior for all major features and CLI options.

ℹ️ Notes

  • In dry-run mode (--dry-run), the tool logs only that a dry run was completed. It does not log all enabled features or options.

About

FASTQ file combiner

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •