Automated bioinformatics workflow for the identification and characterisation of Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) arrays from bacterial genomes, with a focus on Campylobacter coli and C. jejuni.
For more detailed documentation, please look at the project website.
- Verify CRISPR screening functionality (CCTyper + CRISPRidentify); generate CRISPR-Cas array table and CRISPR spacer table
- Complete documentation for 'basic' functions:
  - User manual
  - Step-by-step process description
  - Output file descriptions
- Clean up code
  - Remove outdated/unnecessary steps
  - Apply standardised code formatting
- Add CRISPRidentify to workflow
  - And make sure it works on a clean install
- Combine results from CCTyper with CRISPRidentify
- Make and/or correct scripts for combining results into 'Output files' (write to `data/processed/`)
  - Concatenate MLST results
  - Enable spacer table creation script in Snakefile (add to `rule all`)
- Collect and combine results from geNomad and Jaeger
- Map spacers to genomes and phage/plasmid databases
- Add PADLOC for identifying other anti-phage systems
- Write documentation for output files
- Rewrite 'Problems encountered' into a rationale for our tool selection (as a separate document)
- Write a detailed and technical step-by-step description of the workflow
  - While reviewing the workflow, remove unnecessary pieces and clean up where possible
- Set up MkDocs-powered documentation (at least locally; integrate with GitHub Pages later)

(Note to self: add to this list when other ideas come to mind!)
- Campylobacter jejuni and C. coli genomes from AllTheBacteria

As of 2024-09-19, this includes 129,080 genomes! (Up from 104,146 before the incremental update, i.e. 24,934 additional genomes.)

Note: AllTheBacteria applies its own quality criteria for inclusion. These include:

- >= 99% species abundance (practically pure)
- >= 90% completeness (CheckM2)
- <= 5% contamination (CheckM2)
- total length between 100 kbp and 15 Mbp
- <= 2,000 contigs
- N50 >= 2,000
This repository includes scripts to automatically download genome files and metadata from AllTheBacteria (ATB). These can be run as follows:

```bash
git clone https://github.com/UtrechtUniversity/campylobacter-crisprscape.git
cd campylobacter-crisprscape
bash bin/prepare_genomes.sh
```
By default this downloads high-quality Campylobacter jejuni and C. coli genomes from the incremental update.
This `prepare_genomes.sh` script calls other scripts and has to be run from the repository's 'base' folder, as shown above.
The script itself contains a general description of how it works and how to use it.
In short, it:
1. Downloads the metadata from AllTheBacteria (see this script). By default, it downloads to the `data/ATB/` subdirectory; a different directory can be provided as a command-line argument.
2. Extracts the sample accession IDs of the species of interest, as defined in `config/species_of_interest.txt`. Edit this file if you want to run this workflow for different species! (See the sketch after this list.)
3. Looks up the metadata of the species of interest by filtering the ENA metadata file.
4. Finds the batches in AllTheBacteria that contain the species of interest.
5. Downloads the genome sequences of the species of interest (i.e., the batches identified in step 4).
6. Removes other species from the downloaded batches. (Batches may contain a mix of different species.)
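For illustration, step 2 boils down to matching the species names against the ENA metadata table. The snippet below is a minimal sketch, not the actual script; the metadata file name and column layout are hypothetical.

```bash
# Sketch of step 2 (hypothetical file name and column order):
# keep metadata rows matching the species listed in config/species_of_interest.txt
# and collect the sample accessions from the first column.
grep -F -f config/species_of_interest.txt data/ATB/ena_metadata.tsv \
    | cut -f 1 \
    > data/ATB/sample_accessions.txt
```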
As of February 2025, AllTheBacteria consists of an original set of genomes and an incremental update. The `prepare_genomes.sh` script can download either part, or all of the genomes at once, using the command-line options 'all', 'original', or 'update' (default: update).
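For example, to download the complete set instead of only the incremental update (assuming the option is passed as the script's first argument):

```bash
# Download both the original AllTheBacteria set and the incremental update.
bash bin/prepare_genomes.sh all
```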
The genomes are downloaded to the `data/tmp/ATB/` subdirectory. This is also the default input directory for the analysis workflow.
This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. It uses both a neural network classifier and a marker-based approach to calculate prediction scores. For the marker-based method, it requires a database, which can be downloaded using geNomad itself. If you have installed mamba, this can be done as follows:
```bash
mamba env create -f envs/genomad.yaml
mamba activate genomad
genomad download-database data/
```
Note that this will create the subdirectory `data/genomad_db/`, which is the default location also defined in `config/parameters.yaml`. The current version of the database, v1.7, uses 1.4 GB of disk space.
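In the workflow, geNomad is run by Snakemake, but it can also be run manually on a single assembly. A rough sketch using geNomad's `end-to-end` command and the database location above (the input and output paths are placeholders):

```bash
# Classify the contigs of one genome with geNomad, using the downloaded database.
mamba activate genomad
genomad end-to-end data/tmp/ATB/example_genome.fa data/tmp/genomad_example/ data/genomad_db/
```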
The `bin` folder also includes scripts to download and extract pre-selected databases for use in Spacepharer: PhageScope for annotated phage sequences and PLSDB for annotated plasmid sequences, both chosen for their broad taxonomic coverage. Running

```bash
bash bin/download_spacepharer_database.sh
```

downloads, extracts and merges both databases for use in Spacepharer. If you wish to use a different database or add to these, see `doc/spacepharer.md` for advice.
CRISPRscape uses CRISPRidentify as a second pass over CCTyper's output. However, this program's conda environment does not contain the program itself and, as of writing, a forked version is required for it to function properly. To install CRISPRidentify, run:

```bash
git clone https://github.com/Necopy-byte/CRISPRidentify bin/CRISPRidentify
```
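CRISPRidentify ships its own conda environment file. If you need to create that environment manually, a minimal sketch, assuming the file is called `environment.yml` in the cloned directory (see also the channel-priority note under 'Problems encountered'; the workflow itself uses an adjusted yml):

```bash
# CRISPRidentify's own yml only solves with flexible (or disabled) channel priority.
conda config --set channel_priority flexible
conda env create -f bin/CRISPRidentify/environment.yml
```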
The analysis itself is recorded as a Snakemake workflow. Its dependencies (bioinformatics tools) are handled by Snakemake using the conda package manager, or rather its faster drop-in replacement mamba. If you have not yet done so, please install mamba following the instructions found here: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html.
After installing mamba, Snakemake can be installed using their instructions: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#full-installation (Note: the workflow was tested with Snakemake version 8.20.3 and is expected to work with any version from 5 onwards.)
When Snakemake has been set up, you can test if the workflow is ready to be run (dry-run) with:
```bash
snakemake --profile config -n
```

If that returns no errors, run the workflow by removing the `-n` (dry-run) option:

```bash
snakemake --profile config
```
Note that the workflow is currently configured to run on the local machine (not on a high-performance computing (HPC) cluster or grid) and uses a maximum of 24 CPU threads. The number of threads to use can be configured in `config/config.yaml` (overall workflow) and `config/parameters.yaml` (per step/tool).
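If you prefer not to edit those files, Snakemake also accepts a core limit on the command line, which takes precedence over the profile's default:

```bash
# Limit this run of the workflow to 8 CPU threads in total.
snakemake --profile config --cores 8
```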
In its current state, the workflow:
- Determines multilocus sequence types (MLST) for Campylobacter jejuni/coli using the public typing scheme from PubMLST and pyMLST (version 2.2.1)
- Identifies CRISPR-Cas loci with CCTyper (version 1.8.0); the resulting loci are processed with CRISPRidentify (forked from version 1.2.1) to reduce false positives
  - This includes extra scripts to collect CRISPR-Cas information and extract sequences from the genome fasta files
- Collects all CRISPR spacers and creates clusters of identical spacers using CD-HIT-EST (version 4.8.1) (see the sketch after this list)
- Predicts whether contigs of the species of interest derive from chromosomal DNA, plasmids or viruses using both geNomad (version 1.8.0) and Jaeger (version 1.1.26)
- Predicts the potential targets of spacers and whether they target chromosomal DNA (of input genomes), plasmids or viruses using Spacepharer (version 1.0.1) and KMA
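As a rough illustration of the spacer clustering step (a minimal sketch with placeholder file names, not necessarily the exact command the workflow runs), identical spacers can be collapsed with CD-HIT-EST at 100% identity:

```bash
# Cluster identical CRISPR spacers: -c 1.0 requires 100% sequence identity,
# -d 0 keeps the full sequence names in the .clstr report.
cd-hit-est -i all_spacers.fasta -o spacer_clusters.fasta -c 1.0 -d 0
```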
Further steps will be added to the workflow after testing:

- Mash with CRISPR loci and whole genomes
- Map CRISPR spacers to all downloaded genomes (Bowtie, and KMA?), metagenome assemblies, other databases?
- Whole-genome MLST
- SpacerPlacer (see the input file format at https://github.com/fbaumdicker/SpacerPlacer?tab=readme-ov-file#spacer_fasta-input-format; also requires an extra conversion script?)
Unticked boxes indicate that documentation has not been written yet.
- ENA metadata
  - Cleaned-up and filtered metadata of included genomes
- Species classifications
  - Taxonomic classification by Sylph, as collected from AllTheBacteria
- Contig chromosome/plasmid/virus predictions
- CRISPR-Cas overview table
  - Output from CCTyper, collected and combined in one CSV file
  - Combined with CRISPRidentify; a filtered CRISPR table is created by adding Cas and orientation data to the CRISPRidentify CSV
- CRISPR spacer table
- MLST
  - Sequence types (ST) of all included genomes
- List of putative spacer targets
  - Output from mapping unique spacers to possible targets, separated by plasmid or phage and merged with database metadata
- List of anti-phage systems per genome
  - Output from PADLOC, combined in a single CSV file
- Genome comparison, all-vs-all
  - By whole-genome MLST, average nucleotide identity (ANI) or similar(?)
2024-09-12:
- CCTyper does not work from a conda installation (Russel88/CRISPRCasTyper#55)
- CCTyper cannot handle gzipped fasta files as input, so the input files need to be uncompressed
- RFPlasmid does not work from a local install (file not found, permission denied) or a conda installation (`rfplasmid --initialize` fails to download database files)
- A 'hybrid' RFPlasmid install works: activate the conda environment and then run the shared executable using the absolute path (/mnt/DGK_KLIF/data/rfplasmidweb/pip_package/package_files/RFPlasmid/rfplasmid.py), but it seems not to work when only one genome is present in the input directory.
2024-12-10:
- CCTyper returns different output when running single genomes compared to running one concatenated fasta file with the contigs of many genomes. (See `data/tmp/cctyper/batch_22/CRISPR_Cas-batch_22.tab` vs `data/tmp/cctyper/test/batch_22/CRISPR_Cas.tab`; the difference is 1 KB.) The separate method returns 7 contigs that the concatenated run did not find, and the concatenated method found 1 contig that the separate run did not find. These may be false positives (how do you check?), but for now I'm sticking with the separate method.
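For reference, a single-genome run with the separate method looks roughly like this (paths are placeholders and the exact options used in the workflow may differ):

```bash
# Run CCTyper on one uncompressed genome fasta, writing to a new output directory.
cctyper data/tmp/ATB/example_genome.fa data/tmp/cctyper/example_genome/
```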
2025-02-11:
- Spacepharer wants to be run locally within the same folder that is designated as the tmp folder and where the created databases are located. Following the easy-predict workflow is not recommended, as the created .fasta files are inconsistently recognized as actual fasta files.
- The PhageScope database claims that genomes can be filtered on criteria before download, but actually downloading these fastas fails with an error. Additionally, wget and curl do not download the databases in a way that Spacepharer can identify, requiring a manual upload.
2025-02-21:
- Spacepharer databases are best created using the example names "querysetDB" and "targetsetDB"; other names, such as "spacersetDB", cause unexpected errors.
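For example, the set databases can be built with those names as follows (a sketch based on the SpacePHARER examples; input paths are placeholders and the exact options used in this workflow may differ):

```bash
# Build the query (spacer) and target set databases using the recommended names.
spacepharer createsetdb spacers/*.fasta querysetDB tmp_spacepharer
spacepharer createsetdb targets/*.fasta targetsetDB tmp_spacepharer
```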
2025-06-27:
- The yml file provided by CRISPRidentify is only solvable using flexible or disabled channel priority in conda. For now, an adjusted yml file is used that is solvable in strict priority mode, but this makes strand prediction in CRISPRidentify non-functional.
```
.
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── Snakefile     <- Python-based workflow description
├── bin           <- Code and programs used in this project/experiment
├── config        <- Configuration of Snakemake workflow
├── data          <- All project data, divided in subfolders
│   ├── processed <- Final data, used for visualisation (e.g. tables)
│   ├── raw       <- Raw data, original, should not be modified (e.g. fastq files)
│   └── tmp       <- Intermediate data, derived from the raw data, but not yet ready for visualisation
├── doc           <- Project documentation, notes and experiment records
├── envs          <- Conda environments necessary to run the project/experiment
├── log           <- Log files from programs
└── results       <- Figures or reports generated from processed data
```
This project is licensed under the terms of the New BSD licence.
Please cite this project as described in the citation file.