First, create the necessary folder structure as shown in the example workflow:
dataset/
profiles/
resources/
rules/
scripts/
results/
You can create these directories using the following command:
mkdir -p dataset profiles resources rules scripts results
This script (located in scripts/) generates reference files from GenBank.
Run the following command:
python3 scripts/generate_from_genbank.py --reference "AY426531.1" --output-dir dataset/
During execution, you may be asked to provide CDS annotations. You can use the following codes to specify the CDS automatically:
- [0]
- [product] or [leave empty for manual choice] to select proteins.
- [2]
The script will generate:
dataset/reference.fasta
dataset/genome_annotation.gff3
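Before moving on, it can help to confirm that both files were actually written. The following is a minimal sketch, not part of the workflow itself: the helper name `missing_outputs` is hypothetical, and the only things taken from this guide are the two file names and the `dataset/` directory.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_OUTPUTS = ("reference.fasta", "genome_annotation.gff3")


def missing_outputs(dataset_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are absent or empty in dataset_dir."""
    base = Path(dataset_dir)
    missing = []
    for name in expected:
        path = base / name
        # Treat a zero-byte file as missing: the script likely failed midway.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_outputs("dataset")
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All expected reference files are present.")
```

Running this from the workflow root after the generation step should report nothing missing; an empty file is flagged as well, since it usually points to an interrupted download.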
Modify pathogen.json to:
- Ensure file names match the generated reference files.
- Update attributes as needed.
- Adjust the Quality Control (QC) settings if necessary. If QC is not configured, Nextclade will not perform any checks.
For more details on configuration, refer to the Nextclade documentation.
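The first and third points above can be checked mechanically. The sketch below assumes that pathogen.json lists its file names in a top-level "files" object and its QC settings under a top-level "qc" key, and that the listed files sit in the same directory as pathogen.json. These are assumptions about the layout, so adjust the keys and the directory if your configuration differs. The function name `check_pathogen_config` is hypothetical.

```python
import json
from pathlib import Path


def check_pathogen_config(dataset_dir, config_name="pathogen.json"):
    """Return (unmatched, has_qc) for the config found in dataset_dir.

    unmatched maps each key of the "files" section to a declared file name
    that does not exist in dataset_dir; has_qc tells whether a "qc" section
    is present at all.
    """
    base = Path(dataset_dir)
    config = json.loads((base / config_name).read_text(encoding="utf-8"))
    # Assumed layout: {"files": {"reference": "reference.fasta", ...}, "qc": {...}}
    declared = config.get("files", {})
    unmatched = {key: name for key, name in declared.items()
                 if not (base / name).is_file()}
    return unmatched, "qc" in config


if __name__ == "__main__":
    # Only runs when a config is actually present in dataset/ (assumed location).
    if Path("dataset", "pathogen.json").is_file():
        unmatched, has_qc = check_pathogen_config("dataset")
        print("Unmatched file entries:", unmatched or "none")
        print("QC section present:", has_qc)
```

An empty `unmatched` result means every declared file name resolves to a real file; a `False` for the QC flag is a reminder that no quality checks would be run.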
- Copy the reference.gb file into the resources/ directory.
- Modify protein names as needed to match your requirements.
- Modify lines 1-18 to adjust paths and parameters.
- Ensure all necessary files for the Augur pipeline are present, including:
  - sequences.fasta & metadata.tsv (can be downloaded from NCBI Virus via ingest: FETCH_SEQUENCES==True)
  - auspice_config.json
- These files are essential for building the reference tree and running Nextclade.
The ingest process downloads sequences and metadata from NCBI Virus. For more details, refer to the Ingest Documentation.
The following packages must be installed to run the ingest process:
conda-forge/bioconda: csvtk, nextclade, tsv-utils, seqkit, zip, unzip, entrez-direct, ncbi-datasets-cli
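A quick way to see which of these packages are still missing is to look for their executables on the PATH. The sketch below is illustrative only. The mapping from package name to executable names is an assumption (for example, tsv-utils, entrez-direct and ncbi-datasets-cli install tools under other names than the package itself), and the helper `missing_packages` is hypothetical.

```python
import shutil

# Package name -> executables it is assumed to install (not from this guide).
PACKAGE_EXECUTABLES = {
    "csvtk": ["csvtk"],
    "nextclade": ["nextclade"],
    "tsv-utils": ["tsv-filter", "tsv-select"],
    "seqkit": ["seqkit"],
    "zip": ["zip"],
    "unzip": ["unzip"],
    "entrez-direct": ["esearch", "efetch"],
    "ncbi-datasets-cli": ["datasets", "dataformat"],
}


def missing_packages(package_executables=PACKAGE_EXECUTABLES, which=shutil.which):
    """Return the packages for which at least one executable is not on PATH."""
    return [package
            for package, executables in package_executables.items()
            if any(which(exe) is None for exe in executables)]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what you want.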
To create the auspice JSON and a Nextclade example dataset:
snakemake --cores 9 all
This runs Nextclade on the example sequences in out-dataset/sequences.fasta using the dataset in dataset. The results are saved to the test_out directory and contain alignment, aligned translations and a summary TSV file.
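To get a first impression of the run, the summary TSV can be tallied by QC status. This is a rough sketch under two assumptions that this guide does not confirm: that the summary file is named nextclade.tsv inside test_out, and that it has a column called qc.overallStatus. Check the header of your own file and adjust both if needed. The function `qc_status_counts` is a hypothetical helper.

```python
import csv
from collections import Counter
from pathlib import Path


def qc_status_counts(tsv_path, column="qc.overallStatus"):
    """Count how many rows of a tab-separated file carry each value of column."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Rows without the column (or with an empty value) are grouped together.
        return Counter(row.get(column) or "missing" for row in reader)


if __name__ == "__main__":
    summary = Path("test_out", "nextclade.tsv")  # assumed file name
    if summary.is_file():
        for status, count in qc_status_counts(summary).most_common():
            print(f"{status}\t{count}")
```

If QC is not configured in pathogen.json, expect the status column to be empty or absent, in which case every sequence is counted under "missing".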
One can also use the dataset in Nextclade Web by hosting the dataset through a local web server. For example, after installing node and running npm install -g serve, one can host the dataset via:
serve --cors out-dataset -l 3000
And open Nextclade Web with a URL parameter dataset-url pointing to the local web server:
https://master.clades.nextstrain.org/?dataset-url=http://localhost:3000
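If the dataset is served on a different host or port, the link has to be rebuilt. As a small sketch, the URL can be assembled with the standard library; the helper name `nextclade_web_url` is hypothetical, and the base address is simply the one used above. Note that `urlencode` percent-encodes the dataset address, which is equivalent to the unencoded form shown above once the browser decodes it.

```python
from urllib.parse import urlencode

# Base address copied from the link above.
NEXTCLADE_WEB = "https://master.clades.nextstrain.org/"


def nextclade_web_url(dataset_url, base=NEXTCLADE_WEB):
    """Build a Nextclade Web link whose dataset-url parameter points at dataset_url."""
    return f"{base}?{urlencode({'dataset-url': dataset_url})}"


if __name__ == "__main__":
    print(nextclade_web_url("http://localhost:3000"))
```

For the local server started above, this prints `https://master.clades.nextstrain.org/?dataset-url=http%3A%2F%2Flocalhost%3A3000`.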
Once the web page loads, you can click "Load example" and then click run to test. You may want to reduce the maximum number of nucleotide markers to 500 to prevent Nextclade from freezing (click "Settings" at the top right, then select the "Sequence view" tab and reduce "Max. nucleotide markers" to 500).
This guide provides a structured workflow for setting up Nextclade for Enterovirus D68. If you encounter issues, refer to the official documentation.