A complete Nextflow pipeline for metagenomic data analysis from raw reads to functional/taxonomic annotation and bacterial growth rate calculation.
This pipeline integrates industry-standard tools to provide a complete workflow for metagenomic data analysis:
- Quality Control: FastQC, MultiQC
- Preprocessing: KneadData (quality filtering + host removal)
- Taxonomic Profiling: MetaPhlAn4
- Functional Profiling: HUMAnN3
- Assembly: MEGAHIT or SPAdes
- Gene Prediction: Prodigal
- Gene Quantification: BWA + SAMtools
- Gene Clustering: CD-HIT
- Functional Annotation: KEGG, CAZy, ARDB
- Genome Binning: MetaBAT2, MaxBin2, CONCOCT, DAS_Tool
- Bin Quality: CheckM
- Growth Rates: DEMIC
curl -s https://get.nextflow.io | bash
mv nextflow /usr/local/bin/
Create a CSV file (samplesheet.csv
) with your samples:
sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
# MetaPhlAn database
metaphlan --install --bowtie2db /path/to/metaphlan_db
# HUMAnN databases
humann_databases --download chocophlan full /path/to/humann_dbs --update-config yes
humann_databases --download uniref uniref90_diamond /path/to/humann_dbs --update-config yes
# CheckM database
mkdir -p /path/to/checkm_data
cd /path/to/checkm_data
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzf checkm_data_2015_01_16.tar.gz
checkm data setRoot /path/to/checkm_data
# Host genome (e.g., human GRCh38)
wget http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bowtie2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa human_GRCh38
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--host_genome /path/to/host_genome \
--metaphlan_db /path/to/metaphlan_db \
--humann_nucleotide_db /path/to/chocophlan \
--humann_protein_db /path/to/uniref90_diamond \
--checkm_db /path/to/checkm_data \
-profile docker
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--host_genome /path/to/host_genome \
--metaphlan_db /path/to/metaphlan_db \
--humann_nucleotide_db /path/to/chocophlan \
--humann_protein_db /path/to/uniref90_diamond \
--checkm_db /path/to/checkm_data \
--kegg_db /path/to/kegg_db \
--cazy_db /path/to/cazy_db \
-profile slurm
nextflow run main.nf \
--input s3://bucket/samplesheet.csv \
--outdir s3://bucket/results \
--host_genome s3://bucket/databases/host_genome \
--metaphlan_db s3://bucket/databases/metaphlan_db \
--humann_nucleotide_db s3://bucket/databases/chocophlan \
--humann_protein_db s3://bucket/databases/uniref90 \
-profile awsbatch \
-work-dir s3://bucket/work
Parameter | Description |
---|---|
--input |
Path to samplesheet CSV file |
--outdir |
Output directory for results |
Parameter | Description |
---|---|
--host_genome |
Path to host genome for decontamination (optional) |
--metaphlan_db |
Path to MetaPhlAn database |
--humann_protein_db |
Path to HUMAnN protein database |
--humann_nucleotide_db |
Path to HUMAnN nucleotide database |
--checkm_db |
Path to CheckM database |
--kegg_db |
Path to KEGG database (for BLAST) |
--cazy_db |
Path to CAZy database |
--ardb_db |
Path to ARDB database |
Parameter | Default | Description |
---|---|---|
--skip_qc |
false | Skip FastQC quality control |
--skip_kneaddata |
false | Skip KneadData preprocessing |
--skip_assembly |
false | Skip assembly step |
--skip_binning |
false | Skip genome binning |
--skip_growth_rates |
false | Skip growth rate calculation |
--skip_functional |
false | Skip functional annotation |
--skip_taxonomic |
false | Skip taxonomic profiling |
Parameter | Default | Description |
---|---|---|
--assembler |
megahit | Assembly tool: 'megahit' or 'spades' |
--coassembly |
false | Perform co-assembly of all samples |
--min_contig_length |
1000 | Minimum contig length to keep |
Parameter | Default | Description |
---|---|---|
--binning_tools |
metabat2,maxbin2 | Binning tools (comma-separated) |
--min_bin_completeness |
50 | Minimum bin completeness (%) |
--max_bin_contamination |
10 | Maximum bin contamination (%) |
Parameter | Default | Description |
---|---|---|
--max_cpus |
16 | Maximum CPUs per process |
--max_memory |
128.GB | Maximum memory per process |
--max_time |
240.h | Maximum time per process |
results/
├── 01_kneaddata/ # Quality-filtered reads
│ ├── sample1/
│ ├── sample2/
│ └── ...
├── 02_taxonomy/ # Taxonomic profiles
│ └── metaphlan/
│ ├── sample1/
│ └── ...
├── 03_functional/ # Functional profiles
│ └── humann/
│ ├── sample1/
│ └── ...
├── 04_assembly/ # Assembled contigs
│ ├── megahit/
│ └── filtered_contigs/
├── 05_gene_prediction/ # Predicted genes
│ └── prodigal/
├── 06_gene_quantification/ # Gene abundance
│ └── bwa/
├── 07_nr_genes/ # Non-redundant gene catalog
│ └── cdhit/
├── 08_functional_annotation/ # Gene annotations
│ ├── kegg/
│ ├── cazy/
│ └── ardb/
├── 09_binning/ # MAGs (bins)
│ ├── metabat2/
│ ├── maxbin2/
│ ├── dastool/
│ └── checkm/
├── 10_growth_rates/ # Bacterial growth rates
│ └── demic/
├── qc/ # Quality control reports
│ ├── fastqc/
│ └── multiqc/
└── pipeline_info/ # Pipeline execution info
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
└── pipeline_dag.svg
Uses Docker containers for all processes. Suitable for local execution.
nextflow run main.nf -profile docker [other options]
Uses Singularity containers. Ideal for HPC environments.
nextflow run main.nf -profile singularity [other options]
Uses Conda environments for each process.
nextflow run main.nf -profile conda [other options]
Configured for SLURM HPC clusters with Singularity.
nextflow run main.nf -profile slurm [other options]
Configured for AWS Batch execution.
nextflow run main.nf -profile awsbatch [other options]
# Only taxonomic and functional profiling (skip assembly)
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--skip_assembly \
--skip_binning \
--skip_growth_rates \
-profile docker
# Only assembly and binning
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--skip_taxonomic \
--skip_functional \
-profile docker
# Perform co-assembly instead of individual assemblies
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--coassembly \
-profile docker
# Adjust maximum resources
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--max_cpus 32 \
--max_memory 256.GB \
--max_time 480.h \
-profile slurm
Increase memory allocation:
nextflow run main.nf --max_memory 256.GB [other options]
Increase time limit:
nextflow run main.nf --max_time 480.h [other options]
Ensure all database paths are absolute and exist:
ls -l /path/to/metaphlan_db
ls -l /path/to/humann_dbs
Nextflow automatically caches completed processes. Resume with:
nextflow run main.nf [options] -resume
If you use this pipeline, please cite:
- Nextflow: Di Tommaso, P., et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology.
- KneadData: https://huttenhower.sph.harvard.edu/kneaddata/
- MetaPhlAn4: Blanco-Míguez, A., et al. (2023). Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology.
- HUMAnN3: Beghini, F., et al. (2021). Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife.
- MEGAHIT: Li, D., et al. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics.
- MetaBAT2: Kang, D.D., et al. (2019). MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ.
- CheckM: Parks, D.H., et al. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research.
- DEMIC: Korem, T., et al. (2015). Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science.
For questions or issues:
- GitHub Issues: https://github.com/ashoks773/metagenomics-nf-pipeline
- Email: ashoks773@gmail.com OR compbiosharma@gmail.com
MIT License
Based on workflows developed by Ashok K. Sharma: