Investigating the performance of Oxford Nanopore long-read sequencing with respect to Illumina microarrays and short-read sequencing
This repository contains the complete workflow and analysis scripts for benchmarking Oxford Nanopore Technologies (ONT) long-read sequencing against established platforms (Illumina short-read sequencing and microarrays). The project evaluates the performance of ONT for detecting various genetic variants across different genomic contexts and examines the impact of experimental factors such as multiplexing, sequencing depth, and read length.
Oxford Nanopore Technologies (ONT) long-read sequencing (LRS) has emerged as a promising genomic analysis tool, yet comprehensive benchmarks against established platforms across diverse datasets remain limited. This study aimed to benchmark LRS performance against Illumina short-read sequencing (SRS) and microarrays for variant detection across different genomic contexts and to evaluate the impact of experimental factors. We sequenced 14 human genomes using the three platforms and evaluated the detection of single nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs), stratifying by high-complexity, low-complexity, and dark genome regions while assessing the effects of multiplexing, depth, and read length. LRS SNV accuracy was slightly lower than that of SRS in high-complexity regions (F-measure: 0.954 vs. 0.967) but showed comparable sensitivity in low-complexity regions. LRS showed robust performance for small (1–5 bp) indels in high-complexity regions (F-measure: 0.869), but agreement with SRS decreased significantly in low-complexity regions and for larger indel sizes. Within dark regions, LRS identified more indels than SRS but showed lower base-level accuracy. LRS identified 2.86 times more SVs than SRS, excelling at detecting large variants (>6 kb), with SV detection improving with sequencing depth. Sequencing depth strongly influenced variant calling performance, whereas multiplexing effects were minimal. Our findings provide valuable insights for optimising LRS applications in genomic research and diagnostics.
.
├── config/                  # Workflow configuration files
├── jobs/                    # Slurm job submission scripts
│   ├── benchmark/           # Benchmarking scripts
│   ├── illumina/            # Illumina processing scripts
│   ├── jupyter/             # Jupyter notebook environment setup
│   ├── ont/                 # ONT processing scripts
│   └── qc/                  # Quality control scripts
├── modules/                 # Nextflow modules
│   ├── indel_benchmark/     # Indel analysis modules
│   ├── setup/               # Data preparation modules
│   ├── shared/              # Shared utility modules
│   ├── snv_benchmark/       # SNV analysis modules
│   └── sv_consensus/        # Structural variant consensus modules
├── references/              # Genome and positional reference files
├── workflows/               # Nextflow sub-workflows
├── main.nf                  # Main Nextflow workflow
├── nextflow.config          # Nextflow configuration
├── ont-benchmark.ipynb      # Jupyter notebook with statistical analyses
├── sample_ids.csv           # ONT and Illumina sample IDs dictionary
└── seq_stats.csv            # Table containing experimental records for each flowcell
- Nextflow (24.10.2)
- Docker or Singularity
- Please see the conda environment for software requirements and dependencies
The analysis pipeline expects:
- Oxford Nanopore sequencing data (basecalled)
- Illumina short-read sequencing data (aligned and variant-called)
- Illumina microarray genotyping data
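Samples sequenced on both platforms are matched through sample_ids.csv (see the repository layout above). As a purely illustrative sketch, assuming hypothetical column names ont_id and illumina_id, the dictionary could be loaded like this:

import csv

# Map each ONT sample ID to its matched Illumina sample ID.
# NOTE: the column names "ont_id" and "illumina_id" are hypothetical;
# check the header of sample_ids.csv for the actual names.
with open("sample_ids.csv", newline="") as handle:
    sample_map = {row["ont_id"]: row["illumina_id"]
                  for row in csv.DictReader(handle)}

print(f"{len(sample_map)} paired samples loaded")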
To ensure the pipeline functions correctly and to optimise access to NCBI resources, please set the following Nextflow secrets before running the workflow:
nextflow secrets set NCBI_API_KEY <your_ncbi_api_key>
nextflow secrets set NCBI_EMAIL <your_ncbi_email>
These secrets are necessary for accessing NCBI resources during the analysis. By default, requests to the NCBI Datasets API and command-line tool are rate-limited to 5 requests per second (rps); using an API key increases this limit to 10 rps.
For more information on obtaining and using NCBI API keys, please refer to the NCBI Datasets API Keys Documentation.
You can verify that the secrets have been set correctly by listing them:
nextflow secrets list
For more information on managing secrets in Nextflow, refer to the Nextflow Secrets documentation.
nextflow run KHP-Informatics/ont-benchmark
or
sbatch jobs/benchmark/variant_benchmark.sh
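Standard Nextflow command-line options apply. For example, assuming container profiles named docker and singularity are defined in nextflow.config, a run could be launched with a specific runtime and resumed after an interruption:

nextflow run KHP-Informatics/ont-benchmark -profile singularity -resume

Here -profile and -resume are standard Nextflow options; the profile names are an assumption and should be checked against nextflow.config.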
Analysis results are stored in the ont-benchmark.ipynb Jupyter notebook, organised by variant type. Each benchmark includes:
- Precision, recall, and F-measure metrics
- Detailed comparison between ONT, Illumina, and microarray platforms
- Analysis of variant detection across different genomic contexts
- Impact assessment of sequencing parameters (depth, multiplexing, read length)
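For reference, the reported metrics follow the standard definitions, with the F-measure being the harmonic mean of precision and recall. A minimal sketch of the computation from benchmark call counts (not the notebook's actual code):

def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F-measure from true-positive,
    false-positive, and false-negative call counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}

# Example with arbitrary counts:
print(benchmark_metrics(tp=954, fp=40, fn=52))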
This project is licensed under the MIT License. You can freely use and modify the code, without warranty. See LICENSE for the full license text. The authors reserve the rights to the article content (see the citation below).
If you use this benchmark in your research, please cite: Santos, R., Lee, H., Williams, A., Baffour-Kyei, A., Lee, S.-H., Troakes, C., Al-Chalabi, A., Breen, G., & Iacoangeli, A. (2025). Investigating the Performance of Oxford Nanopore Long-Read Sequencing with Respect to Illumina Microarrays and Short-Read Sequencing. International Journal of Molecular Sciences, 26(10), 4492. https://doi.org/10.3390/ijms26104492