
Investigating the performance of Oxford Nanopore long-read sequencing with respect to Illumina microarrays and short-read sequencing

Nextflow · Jupyter · Python · Code style: Black · License: MIT

This repository contains the complete workflow and analysis scripts for benchmarking Oxford Nanopore Technologies (ONT) long-read sequencing against established platforms (Illumina short-read sequencing and microarrays). The project evaluates the performance of ONT for detecting various genetic variants across different genomic contexts and examines the impact of experimental factors such as multiplexing, sequencing depth, and read length.

Abstract

Oxford Nanopore Technologies (ONT) long-read sequencing (LRS) has emerged as a promising genomic analysis tool, yet comprehensive benchmarks with established platforms across diverse datasets remain limited. This study aimed to benchmark LRS performance against Illumina short-read sequencing (SRS) and microarrays for variant detection across different genomic contexts and to evaluate the impact of experimental factors. We sequenced 14 human genomes using the three platforms and evaluated the detection of single nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs), stratifying by high-complexity, low-complexity, and dark genome regions while assessing the effects of multiplexing, depth, and read length. LRS SNV accuracy was slightly lower than that of SRS in high-complexity regions (F-measure: 0.954 vs. 0.967) but showed comparable sensitivity in low-complexity regions. LRS showed robust performance for small (1–5 bp) indels in high-complexity regions (F-measure: 0.869), but SRS agreement decreased significantly in low-complexity regions and for larger indel sizes. Within dark regions, LRS identified more indels than SRS, but showed lower base-level accuracy. LRS identified 2.86 times more SVs than SRS, excelling at detecting large variants (>6 kb), with SV detection improving with sequencing depth. Sequencing depth strongly influenced variant calling performance, whereas multiplexing effects were minimal. Our findings provide valuable insights for optimising LRS applications in genomic research and diagnostics.

Project Structure

.
├── config/              # Workflow configuration files
├── jobs/                # Slurm job submission scripts
│   ├── benchmark/       # Benchmarking scripts
│   ├── illumina/        # Illumina processing scripts
│   ├── jupyter/         # Jupyter notebook environment setup
│   ├── ont/             # ONT processing scripts
│   └── qc/              # Quality control scripts
├── modules/             # Nextflow modules
│   ├── indel_benchmark/ # Indel analysis modules
│   ├── setup/           # Data preparation modules
│   ├── shared/          # Shared utility modules
│   ├── snv_benchmark/   # SNV analysis modules
│   └── sv_consensus/    # Structural variant consensus modules
├── references/          # Genome and positional reference files
├── workflows/           # Nextflow sub-workflows
├── main.nf              # Main Nextflow workflow
├── nextflow.config      # Nextflow configuration
├── ont-benchmark.ipynb  # Jupyter notebook with statistical analyses
├── sample_ids.csv       # ONT and Illumina sample IDs dictionary
└── seq_stats.csv        # Table of experimental records for each flow cell

Setup

Prerequisites

Nextflow Pipeline

Jupyter Notebook

Data Requirements

The analysis pipeline expects:

  1. Oxford Nanopore sequencing data (processed through basecalling)
  2. Illumina short-read sequencing data (aligned and variant-called)
  3. Illumina microarray genotyping data
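Before launching the pipeline, it can help to sanity-check that the ONT-to-Illumina sample mapping loads cleanly. The sketch below is illustrative and not part of the pipeline; the column names (`ont_id`, `illumina_id`) are assumptions, so check the actual header of `sample_ids.csv` in this repository.

```python
import csv


def load_sample_map(path="sample_ids.csv"):
    """Load the ONT/Illumina sample ID dictionary as {ont_id: illumina_id}.

    NOTE: the column names 'ont_id' and 'illumina_id' are hypothetical;
    adjust them to match the real header of sample_ids.csv.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        return {row["ont_id"]: row["illumina_id"] for row in reader}
```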

NCBI API Key Setup

To ensure the pipeline functions correctly and to optimise access to NCBI resources, set the following Nextflow secrets before running the workflow:

nextflow secrets set NCBI_API_KEY <your_ncbi_api_key>
nextflow secrets set NCBI_EMAIL <your_ncbi_email>

These secrets are necessary for accessing NCBI resources during the analysis. By default, the NCBI Datasets API and command-line tool requests are rate-limited to 5 requests per second (rps). Using an API key increases this limit to 10 rps.
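These limits also apply if you script your own NCBI lookups outside the pipeline. A minimal client-side throttle (an illustrative sketch, not code from this repository) could enforce the appropriate request interval depending on whether an API key is set:

```python
import time


class NcbiThrottle:
    """Space out requests to respect NCBI rate limits.

    NCBI allows 5 requests per second without an API key and
    10 requests per second with one, so the minimum interval
    between requests is derived from whether a key is available.
    """

    def __init__(self, api_key=None):
        self.interval = 1.0 / (10 if api_key else 5)  # seconds per request
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to keep under the rate limit."""
        now = time.monotonic()
        delay = self._last + self.interval - now
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

Call `throttle.wait()` immediately before each request; the first call returns at once, and subsequent calls sleep only as long as needed.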

For more information on obtaining and using NCBI API keys, please refer to the NCBI Datasets API Keys Documentation.

You can verify that the secrets have been set correctly by listing them:

nextflow secrets list

For more information on managing secrets in Nextflow, refer to the Nextflow Secrets documentation.

Usage

Running the Complete Workflow

nextflow run KHP-Informatics/ont-benchmark

or

sbatch jobs/benchmark/variant_benchmark.sh

Results

Analysis results are presented in the ont-benchmark.ipynb Jupyter notebook, organised by variant type. Each benchmark includes:

  1. Precision, recall, and F-measure metrics
  2. Detailed comparison between ONT, Illumina, and microarray platforms
  3. Analysis of variant detection across different genomic contexts
  4. Impact assessment of sequencing parameters (depth, multiplexing, read length)
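For reference, the three headline metrics are all derived from true-positive (TP), false-positive (FP), and false-negative (FN) call counts against a truth set. A minimal sketch of the relationship (not the notebook's actual code):

```python
def benchmark_metrics(tp, fp, fn):
    """Precision, recall, and F-measure (F1) from variant call counts.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f_measure
```

The F-measure is the harmonic mean of precision and recall, so it only approaches 1 when both are high, which is why it is used as the single summary figure in the abstract.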

License

This project is licensed under the MIT License: you are free to use and modify the code, without warranty. See LICENSE for the full license text. The authors reserve the rights to the associated article content (see Citation below).

Citation

If you use this benchmark in your research, please cite: Santos, R., Lee, H., Williams, A., Baffour-Kyei, A., Lee, S.-H., Troakes, C., Al-Chalabi, A., Breen, G., & Iacoangeli, A. (2025). Investigating the Performance of Oxford Nanopore Long-Read Sequencing with Respect to Illumina Microarrays and Short-Read Sequencing. International Journal of Molecular Sciences, 26(10), 4492. https://doi.org/10.3390/ijms26104492
