Comprehensive Metagenomic Analysis Pipeline

A complete Nextflow pipeline for metagenomic data analysis, taking raw reads through taxonomic and functional profiling, assembly, genome binning, and bacterial growth rate estimation.

Overview

This pipeline integrates industry-standard tools to provide a complete workflow for metagenomic data analysis:

  1. Quality Control: FastQC, MultiQC
  2. Preprocessing: KneadData (quality filtering + host removal)
  3. Taxonomic Profiling: MetaPhlAn4
  4. Functional Profiling: HUMAnN3
  5. Assembly: MEGAHIT or SPAdes
  6. Gene Prediction: Prodigal
  7. Gene Quantification: BWA + SAMtools
  8. Gene Clustering: CD-HIT
  9. Functional Annotation: KEGG, CAZy, ARDB
  10. Genome Binning: MetaBAT2, MaxBin2, CONCOCT, DAS_Tool
  11. Bin Quality: CheckM
  12. Growth Rates: DEMIC

Quick Start

1. Install Nextflow

curl -s https://get.nextflow.io | bash
mv nextflow /usr/local/bin/

2. Prepare Input Samplesheet

Create a CSV file (samplesheet.csv) with your samples:

sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
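
For larger cohorts the samplesheet can be generated automatically. Below is a minimal bash sketch; it assumes paired files follow a *_R1.fastq.gz / *_R2.fastq.gz naming convention and the /path/to/reads directory is a placeholder, so adjust both to match your data.

# Build samplesheet.csv from a directory of paired FASTQ files
echo "sample,fastq_1,fastq_2" > samplesheet.csv
for r1 in /path/to/reads/*_R1.fastq.gz; do
  r2="${r1/_R1.fastq.gz/_R2.fastq.gz}"          # matching R2 file
  sample=$(basename "$r1" _R1.fastq.gz)          # sample name from filename
  echo "${sample},${r1},${r2}" >> samplesheet.csv
done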

3. Download Required Databases

# MetaPhlAn database
metaphlan --install --bowtie2db /path/to/metaphlan_db

# HUMAnN databases
humann_databases --download chocophlan full /path/to/humann_dbs --update-config yes
humann_databases --download uniref uniref90_diamond /path/to/humann_dbs --update-config yes

# CheckM database
mkdir -p /path/to/checkm_data
cd /path/to/checkm_data
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzf checkm_data_2015_01_16.tar.gz
checkm data setRoot /path/to/checkm_data

# Host genome (e.g., human GRCh38)
wget http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bowtie2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa human_GRCh38
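
Before launching a run it is worth confirming that every database directory exists and is non-empty; missing paths are a common cause of failures midway through a run. A minimal bash check, using the placeholder paths from the commands above:

# Sanity-check database locations before running the pipeline
for db in /path/to/metaphlan_db /path/to/humann_dbs /path/to/checkm_data; do
  if [ -d "$db" ] && [ -n "$(ls -A "$db" 2>/dev/null)" ]; then
    echo "OK:      $db"
  else
    echo "MISSING or EMPTY: $db"
  fi
done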

4. Run the Pipeline

Option A: Local execution with Docker

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --host_genome /path/to/host_genome \
  --metaphlan_db /path/to/metaphlan_db \
  --humann_nucleotide_db /path/to/chocophlan \
  --humann_protein_db /path/to/uniref90_diamond \
  --checkm_db /path/to/checkm_data \
  -profile docker

Option B: HPC with SLURM

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --host_genome /path/to/host_genome \
  --metaphlan_db /path/to/metaphlan_db \
  --humann_nucleotide_db /path/to/chocophlan \
  --humann_protein_db /path/to/uniref90_diamond \
  --checkm_db /path/to/checkm_data \
  --kegg_db /path/to/kegg_db \
  --cazy_db /path/to/cazy_db \
  -profile slurm

Option C: AWS Batch

nextflow run main.nf \
  --input s3://bucket/samplesheet.csv \
  --outdir s3://bucket/results \
  --host_genome s3://bucket/databases/host_genome \
  --metaphlan_db s3://bucket/databases/metaphlan_db \
  --humann_nucleotide_db s3://bucket/databases/chocophlan \
  --humann_protein_db s3://bucket/databases/uniref90 \
  -profile awsbatch \
  -work-dir s3://bucket/work
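
For this profile the databases must be readable from S3. One way to stage locally built databases, assuming the AWS CLI is configured and the bucket and prefixes below are placeholders:

# Upload databases and samplesheet so AWS Batch jobs can access them
aws s3 sync /path/to/metaphlan_db      s3://bucket/databases/metaphlan_db
aws s3 sync /path/to/chocophlan        s3://bucket/databases/chocophlan
aws s3 sync /path/to/uniref90_diamond  s3://bucket/databases/uniref90
aws s3 cp samplesheet.csv s3://bucket/samplesheet.csv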

Pipeline Parameters

Required Parameters

Parameter   Description
--input     Path to samplesheet CSV file
--outdir    Output directory for results

Database Parameters

Parameter                Description
--host_genome            Path to host genome for decontamination (optional)
--metaphlan_db           Path to MetaPhlAn database
--humann_protein_db      Path to HUMAnN protein database
--humann_nucleotide_db   Path to HUMAnN nucleotide database
--checkm_db              Path to CheckM database
--kegg_db                Path to KEGG database (for BLAST)
--cazy_db                Path to CAZy database
--ardb_db                Path to ARDB database

Pipeline Control Parameters

Parameter             Default   Description
--skip_qc             false     Skip FastQC quality control
--skip_kneaddata      false     Skip KneadData preprocessing
--skip_assembly       false     Skip assembly step
--skip_binning        false     Skip genome binning
--skip_growth_rates   false     Skip growth rate calculation
--skip_functional     false     Skip functional annotation
--skip_taxonomic      false     Skip taxonomic profiling

Assembly Parameters

Parameter             Default   Description
--assembler           megahit   Assembly tool: 'megahit' or 'spades'
--coassembly          false     Perform co-assembly of all samples
--min_contig_length   1000      Minimum contig length to keep
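
For example, to assemble with SPAdes and keep only longer contigs (the parameter values here are illustrative):

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --assembler spades \
  --min_contig_length 1500 \
  -profile docker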

Binning Parameters

Parameter                 Default            Description
--binning_tools           metabat2,maxbin2   Binning tools (comma-separated)
--min_bin_completeness    50                 Minimum bin completeness (%)
--max_bin_contamination   10                 Maximum bin contamination (%)
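
For example, to run an additional binner and apply stricter bin quality thresholds (values are illustrative; CONCOCT is listed as a supported binner in the Overview):

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --binning_tools metabat2,maxbin2,concoct \
  --min_bin_completeness 70 \
  --max_bin_contamination 5 \
  -profile docker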

Resource Parameters

Parameter      Default   Description
--max_cpus     16        Maximum CPUs per process
--max_memory   128.GB    Maximum memory per process
--max_time     240.h     Maximum time per process

Output Structure

results/
├── 01_kneaddata/              # Quality-filtered reads
│   ├── sample1/
│   ├── sample2/
│   └── ...
├── 02_taxonomy/               # Taxonomic profiles
│   └── metaphlan/
│       ├── sample1/
│       └── ...
├── 03_functional/             # Functional profiles
│   └── humann/
│       ├── sample1/
│       └── ...
├── 04_assembly/               # Assembled contigs
│   ├── megahit/
│   └── filtered_contigs/
├── 05_gene_prediction/        # Predicted genes
│   └── prodigal/
├── 06_gene_quantification/    # Gene abundance
│   └── bwa/
├── 07_nr_genes/               # Non-redundant gene catalog
│   └── cdhit/
├── 08_functional_annotation/  # Gene annotations
│   ├── kegg/
│   ├── cazy/
│   └── ardb/
├── 09_binning/                # MAGs (bins)
│   ├── metabat2/
│   ├── maxbin2/
│   ├── dastool/
│   └── checkm/
├── 10_growth_rates/           # Bacterial growth rates
│   └── demic/
├── qc/                        # Quality control reports
│   ├── fastqc/
│   └── multiqc/
└── pipeline_info/             # Pipeline execution info
    ├── execution_report.html
    ├── execution_timeline.html
    ├── execution_trace.txt
    └── pipeline_dag.svg

Profiles

Docker

Uses Docker containers for all processes. Suitable for local execution.

nextflow run main.nf -profile docker [other options]

Singularity

Uses Singularity containers. Ideal for HPC environments.

nextflow run main.nf -profile singularity [other options]

Conda

Uses Conda environments for each process.

nextflow run main.nf -profile conda [other options]

SLURM

Configured for SLURM HPC clusters with Singularity.

nextflow run main.nf -profile slurm [other options]

AWS Batch

Configured for AWS Batch execution.

nextflow run main.nf -profile awsbatch [other options]
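
To see which executor, container, and resource settings a profile resolves to before launching anything, Nextflow's built-in config command can be used from the pipeline directory (this is a generic Nextflow feature, not specific to this pipeline):

# Print the fully resolved configuration for a given profile
nextflow config -profile slurm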

Advanced Usage

Running Specific Modules

# Only taxonomic and functional profiling (skip assembly)
nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --skip_assembly \
  --skip_binning \
  --skip_growth_rates \
  -profile docker

# Only assembly and binning
nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --skip_taxonomic \
  --skip_functional \
  -profile docker

Co-assembly Mode

# Perform co-assembly instead of individual assemblies
nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --coassembly \
  -profile docker

Custom Resource Allocation

# Adjust maximum resources
nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --max_cpus 32 \
  --max_memory 256.GB \
  --max_time 480.h \
  -profile slurm

Troubleshooting

Common Issues

1. Out of Memory Errors

Increase memory allocation:

nextflow run main.nf --max_memory 256.GB [other options]
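
If only a single step (for example the assembler) runs out of memory, a process-specific override via a custom config file avoids raising the global limit. A minimal sketch; the process selector 'MEGAHIT' is an assumption and should be replaced with the actual process name used in main.nf:

# Create a custom config with a per-process override (process name is an assumption)
cat > custom.config <<'EOF'
process {
    withName: 'MEGAHIT' {
        memory = 200.GB
        cpus   = 16
    }
}
EOF

# Pass the custom config with -c
nextflow run main.nf --input samplesheet.csv --outdir results -profile docker -c custom.config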

2. Timeout Errors

Increase time limit:

nextflow run main.nf --max_time 480.h [other options]

3. Database Not Found

Ensure all database paths are absolute and exist:

ls -l /path/to/metaphlan_db
ls -l /path/to/humann_dbs

Resume Failed Runs

Nextflow automatically caches completed processes. Resume with:

nextflow run main.nf [options] -resume

Citation

If you use this pipeline, please cite:

  • Nextflow: Di Tommaso, P., et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology.
  • KneadData: https://huttenhower.sph.harvard.edu/kneaddata/
  • MetaPhlAn4: Blanco-Míguez, A., et al. (2023). Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology.
  • HUMAnN3: Beghini, F., et al. (2021). Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife.
  • MEGAHIT: Li, D., et al. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics.
  • MetaBAT2: Kang, D.D., et al. (2019). MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ.
  • CheckM: Parks, D.H., et al. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research.
  • DEMIC: Gao, Y., & Li, H. (2018). Quantifying and comparing bacterial growth dynamics in multiple metagenomic samples. Nature Methods.

Contact

For questions or issues, please open an issue on the GitHub repository.

License

MIT License

Acknowledgments

Based on workflows developed by Ashok K. Sharma.
