⚡ FASTQ File Combiner - STREAMING OPTIMIZED

High-speed streaming I/O with minimal RAM usage for combining paired-end FASTQ files. Optimized for Cell Ranger compatibility and large-scale sequencing data.

🚀 Key Features

Performance & Scalability

Streaming I/O: Reads and writes in buffered chunks, so RAM usage stays low regardless of file size
Auto-optimization: Automatically tunes buffer sizes based on system resources
Memory profiling: Track RAM usage during processing
Checkpointing: Resume interrupted runs from where they left off
SSD/HDD detection: Optimizes I/O patterns for different storage types
Multi-threading: Parallel processing for maximum speed
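
As a rough illustration of the streaming idea, the sketch below concatenates gzipped FASTQ files through a fixed-size buffer, so peak memory depends on the buffer rather than on the input size. The function name combine_fastq_gz and its parameters are hypothetical and are not taken from fastq_combiner.py.

```python
import gzip
import shutil
from pathlib import Path


def combine_fastq_gz(sources, destination, buffer_size=32 * 1024 * 1024):
    """Stream each gzipped source file into one gzipped destination file.

    Hypothetical helper for illustration; not the tool's actual code.
    """
    with gzip.open(destination, "wb") as out:
        for src in sources:
            with gzip.open(src, "rb") as inp:
                # copyfileobj reads and writes at most buffer_size bytes
                # per step, so memory use does not grow with file size.
                shutil.copyfileobj(inp, out, buffer_size)
    return Path(destination)
```

Decompressing and recompressing keeps the output a single gzip stream; the sources are processed in the order given.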

Data Quality & Validation

Quality score validation: Detects corrupted quality scores
Read length consistency: Verifies R1/R2 have matching lengths
Adapter detection: Identifies common adapter contamination
GC content analysis: Basic sequence quality metrics
Sample barcode extraction: Parse and validate sample barcodes
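
The checks listed above can be pictured with a minimal sketch that inspects one four-line FASTQ record and computes GC content. It assumes Phred+33 quality encoding; validate_record and gc_fraction are hypothetical names, and the tool's own checks may differ.

```python
def validate_record(header, sequence, separator, quality):
    """Return a list of problems found in one four-line FASTQ record."""
    problems = []
    if not header.startswith("@"):
        problems.append("header does not start with '@'")
    if not separator.startswith("+"):
        problems.append("separator line does not start with '+'")
    if len(sequence) != len(quality):
        problems.append("sequence and quality lengths differ")
    # Assumption: Phred+33, i.e. printable ASCII from '!' (33) to '~' (126).
    if any(not 33 <= ord(ch) <= 126 for ch in quality):
        problems.append("quality characters outside the Phred+33 range")
    return problems


def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence (0.0 if the sequence is empty)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```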

Monitoring & Safety

Real-time monitoring: Live updates of processing speed/throughput
Disk space monitoring: Warns before running out of space
Backup creation: Auto-backup original files before processing
Retry logic: Automatically retries failed operations
Corruption detection: Validates file integrity before processing
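
To make the disk-space and retry ideas concrete, here is a small sketch using only the standard library. The helper names, the choice to retry on OSError, and the fixed delay are assumptions for illustration, not a description of the tool's internals.

```python
import shutil
import time


def has_free_space(path, required_bytes):
    """True if the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def retry(operation, attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or `attempts` tries are used up.

    Only OSError is retried here (an assumption); other errors propagate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)
```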

User Experience

Progress estimation: Shows ETA for large datasets
Interactive HTML reports: Search, filter, and visualize results
Configuration files: YAML-based configuration for complex workflows
System diagnostics: Comprehensive system information
Performance profiling: Detailed performance analysis
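
An ETA of the kind mentioned above is typically derived from the average throughput so far. The following sketch shows that arithmetic; estimate_eta_seconds is a hypothetical function, and the tool may compute its estimate differently.

```python
def estimate_eta_seconds(bytes_done, bytes_total, elapsed_seconds):
    """Estimate remaining seconds from average throughput so far.

    Returns None when there is not enough information for an estimate.
    """
    if bytes_done <= 0 or elapsed_seconds <= 0:
        return None
    rate = bytes_done / elapsed_seconds          # bytes per second so far
    remaining = max(bytes_total - bytes_done, 0)
    return remaining / rate
```

For example, 50 of 100 bytes processed in 10 seconds gives a rate of 5 bytes per second and an estimate of 10 more seconds.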

📦 Installation

# Clone the repository
git clone https://github.com/yourusername/Fastq-combiner.git
cd Fastq-combiner

# Install dependencies
pip install -r requirements.txt

# Or install the dependencies directly
pip install tqdm pyyaml psutil

🎯 Quick Start

Basic Usage

python3 fastq_combiner.py mapping.csv -o combined_output

Enhanced Usage with All Features

python3 fastq_combiner.py mapping.csv \
  --config example_config.yaml \
  --auto-optimize \
  --memory-profile \
  --validate \
  --real-time-monitor \
  --checkpoint

📋 Input Format

CSV Mapping File

target_sample,source_file1,source_file2,source_file3
Sample_A,run1/Sample_A,run2/Sample_A,run3/Sample_A
Sample_B,batch1/Sample_B,batch2/Sample_B,batch3/Sample_B
Sample_C,seq1/Sample_C,seq2/Sample_C,local/Sample_C
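
A mapping file in this shape can be read with Python's csv module. The sketch below assumes the first column is the target sample and every remaining non-empty column is a source; read_mapping is a hypothetical helper, not the parser used by fastq_combiner.py.

```python
import csv


def read_mapping(handle):
    """Parse a mapping CSV into {target_sample: [source, ...]}."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines and rows without a target
        target, *sources = [cell.strip() for cell in row]
        # Assumption: empty trailing cells mean "no further sources".
        mapping[target] = [s for s in sources if s]
    return mapping
```

Usage: open the file with `open("mapping.csv", newline="")` and pass the handle to read_mapping.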

YAML Configuration

# example_config.yaml
output: "combined_output"
threads: 8
auto_optimize: true
validate: true
monitor_disk: true
create_backups: true

🔧 Advanced Features

Performance Optimization

# Auto-optimize based on system resources
--auto-optimize

# Memory profiling
--memory-profile

# Performance profiling
--profile

# Custom buffer size
--buffer-size 33554432  # 32MB
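
The buffer size is given in bytes. A small conversion helper (hypothetical, for illustration only) shows where the value 33554432 comes from:

```python
def mib_to_bytes(mib):
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return int(mib * 1024 * 1024)


print(mib_to_bytes(32))  # 33554432
```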

Data Validation

# Validate FASTQ quality
--validate

# Paired-end deduplication (removes duplicate read pairs, not just individual reads)
--deduplicate --paired-end-dedup

# Check sample barcodes
--check-barcodes

# GC content analysis
--gc-analysis

# Adapter detection
--adapter-check

Monitoring & Safety

# Real-time monitoring
--real-time-monitor

# Disk space monitoring
--monitor-disk

# Create backups
--create-backups

# Retry failed operations
--retry-failed

Checkpointing

# Enable checkpointing
--checkpoint

# Resume an interrupted run by re-running the same command with --checkpoint
python3 fastq_combiner.py mapping.csv --checkpoint
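
One common way to implement resumable runs is to record finished samples in a small file and skip them on the next invocation. The sketch below assumes a JSON checkpoint file and per-sample granularity; the file format and function names are hypothetical and may not match what the tool writes.

```python
import json
import os


def load_checkpoint(path):
    """Return the set of sample names already completed (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_checkpoint(path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_with_checkpoint(samples, process, path):
    """Process each sample once, skipping those recorded in the checkpoint."""
    done = load_checkpoint(path)
    for sample in samples:
        if sample in done:
            continue  # finished in an earlier run
        process(sample)
        done.add(sample)
        save_checkpoint(path, done)
    return done
```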

📊 Output

Files Generated

{sample}_S1_R1_001.fastq.gz - Combined R1 reads
{sample}_S1_R2_001.fastq.gz - Combined R2 reads
combination_summary.csv - Processing summary
combination_report.html - Interactive HTML report
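
The FASTQ naming pattern above can be expressed as a short helper. output_names is a hypothetical function that only mirrors the pattern shown in this list.

```python
def output_names(sample):
    """Return the (R1, R2) output file names for a sample."""
    return (
        f"{sample}_S1_R1_001.fastq.gz",
        f"{sample}_S1_R2_001.fastq.gz",
    )


r1_name, r2_name = output_names("Sample_A")
print(r1_name)  # Sample_A_S1_R1_001.fastq.gz
print(r2_name)  # Sample_A_S1_R2_001.fastq.gz
```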

Paired-End Deduplication Output

When using --deduplicate --paired-end-dedup, only unique (R1, R2) sequence pairs are retained. If all input reads are identical, only one read pair will be present in the output.
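
The behavior described above corresponds to keying each read pair on its two sequences together. A minimal sketch follows, assuming each record is a (header, sequence, quality) tuple and that the first occurrence of a pair is the one kept; both are assumptions, and dedup_pairs is a hypothetical name.

```python
def dedup_pairs(pairs):
    """Yield only the first occurrence of each (R1 sequence, R2 sequence) pair.

    `pairs` is an iterable of (r1_record, r2_record), where each record is
    a (header, sequence, quality) tuple.
    """
    seen = set()
    for r1, r2 in pairs:
        key = (r1[1], r2[1])  # both mates must match to count as a duplicate
        if key in seen:
            continue
        seen.add(key)
        yield r1, r2
```

Two pairs that share an R1 sequence but differ in R2 are both kept, which is what distinguishes pair-level deduplication from deduplicating each file independently. Note that the set of seen keys grows with the number of unique pairs.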

HTML Report Features

Interactive search/filter for large datasets
Success/failure charts with Chart.js
Per-sample details with collapsible sections
Validation results and quality metrics
System metadata and performance stats

🐳 Docker Support

# Build image
docker build -t fastq-combiner .

# Run with the current directory mounted at /data
docker run -v "$(pwd)":/data fastq-combiner mapping.csv -o /data/output

🔍 System Diagnostics

# Print system information
python3 fastq_combiner.py --diagnostics

Output includes:

Python version and platform
CPU cores and memory
Disk space and storage type
Dependency status
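
Several of these items can be gathered with the standard library alone. The sketch below is a hypothetical collect_diagnostics helper, not the tool's --diagnostics implementation; it leaves out memory, storage type, and dependency status, which need extra libraries or platform-specific checks.

```python
import os
import platform
import shutil


def collect_diagnostics(path="."):
    """Collect basic system information using only the standard library."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "cpu_cores": os.cpu_count(),
        "disk_free_gib": round(usage.free / gib, 2),
        "disk_total_gib": round(usage.total / gib, 2),
    }


for key, value in collect_diagnostics().items():
    print(f"{key}: {value}")
```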

📈 Performance Tips

Use SSD storage for best performance
Enable auto-optimization for automatic tuning
Monitor memory usage with --memory-profile
Use checkpointing for large datasets
Enable real-time monitoring for progress tracking

🛠️ Troubleshooting

Common Issues

Paired-end deduplication test: only 1 unique pair output

If all input reads are identical, deduplication will keep only one (R1, R2) pair. This is expected behavior for paired-end deduplication.

"No FASTQ files found"

Check search directories with --search-dirs
Verify file patterns with --r1-patterns and --r2-patterns
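
When debugging this error, it can help to list which files in a directory would match a given prefix and read pattern. The sketch below is a standalone diagnostic aid; find_fastqs, the "_R1"/"_R2" pattern values, and the glob shape are assumptions and do not describe how the tool itself matches files.

```python
from pathlib import Path


def find_fastqs(search_dirs, prefix, patterns):
    """List .fastq.gz files under each directory matching prefix and pattern."""
    hits = []
    for directory in search_dirs:
        for pattern in patterns:
            # e.g. prefix "run1/Sample_A" and pattern "_R1" becomes the glob
            # "run1/Sample_A*_R1*.fastq.gz" relative to the search directory.
            glob = f"{prefix}*{pattern}*.fastq.gz"
            hits.extend(sorted(Path(directory).glob(glob)))
    return hits
```

Usage: `find_fastqs(["."], "run1/Sample_A", ["_R1"])` returns the matching paths, or an empty list when nothing matches.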

"Low disk space"

Use --monitor-disk to check space requirements
Clean up temporary files

"Memory issues"

Use --memory-profile to monitor usage
Reduce --buffer-size or --threads

"Validation warnings"

Use --validate to check file quality
Review warnings in HTML report

Performance Optimization

# For large datasets
--auto-optimize --memory-profile --checkpoint --real-time-monitor

# For validation-heavy workflows
--validate --check-barcodes --gc-analysis --adapter-check

# For safety-critical operations
--create-backups --monitor-disk --retry-failed

🤝 Contributing

Fork the repository
Create a feature branch
Add tests for new features
Submit a pull request

See CONTRIBUTING.md for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built for the bioinformatics community
Optimized for Cell Ranger compatibility
Inspired by the need for efficient large-scale data processing

🧪 Test Coverage

Comprehensive automated tests cover edge cases, paired-end deduplication, quality validation, error handling, and reporting.
The test suite ensures robust behavior for all major features and CLI options.

ℹ️ Notes

In dry-run mode (--dry-run), the tool logs only that a dry run was completed. It does not log all enabled features or options.

asomohammed/Fastq-combiner