Automated bioinformatics workflow for the identification and characterisation of Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) arrays from bacterial genomes, with a focus on Campylobacter coli and C. jejuni.
For more detailed documentation, please look at the project website.
- Verify CRISPR screening functionality (CCTyper + CRISPRidentify); generate CRISPR-Cas array table and CRISPR spacer table
- Complete documentation for 'basic' functions:
  - User manual
  - Step-by-step process description
  - Output file descriptions
- Clean up code
  - Remove outdated/unnecessary steps
  - Apply standardised code formatting
- Add CRISPRidentify to workflow
  - And make sure it works on a clean install
- Combine results from CCTyper with CRISPRidentify
- Make and/or correct scripts for combining results into 'Output files' (write to `data/processed/`)
  - Concatenate MLST results
  - Enable spacer table creation script in Snakefile (add to `rule all`)
- Collect and combine results from geNomad and Jaeger
- Map spacers to genomes and phage/plasmid databases
- Add PADLOC for identifying other anti-phage systems
- Write documentation for output files
- Rewrite 'Problems encountered' into a rationale for our tool selection (as a separate document)
- Write a detailed and technical step-by-step description of the workflow
  - While reviewing the workflow, remove unnecessary pieces and clean up where possible
- Set up MkDocs-powered documentation (at least locally; integrate with GitHub Pages later)

(Note to self: add to this list when other ideas come to mind!)
- Campylobacter jejuni and C. coli genomes from AllTheBacteria

As of 2024-09-19, this includes 129,080 genomes! (Up from 104,146 before the incremental update, i.e. 24,934 additional genomes.)

Note: AllTheBacteria applies its own quality criteria for inclusion. These include:

- >= 99% species abundance (practically pure)
- >= 90% completeness (CheckM2)
- <= 5% contamination (CheckM2)
- total length between 100 kbp and 15 Mbp
- <= 2,000 contigs
- N50 >= 2,000
This repository includes scripts to automatically download genome files and metadata from AllTheBacteria (ATB). These can be run as follows:

```bash
git clone https://github.com/UtrechtUniversity/campylobacter-crisprscape.git
cd campylobacter-crisprscape
bash bin/prepare_genomes.sh
```
By default this downloads high-quality Campylobacter jejuni and C. coli genomes from the incremental update.
This `prepare_genomes.sh` script calls other scripts and has to be run from the repository's 'base' folder, as shown above.
The script itself contains a general description of how it works and how to use it.
In short, it:
1. Downloads the metadata from AllTheBacteria (see this script). By default, it downloads to the `data/ATB/` subdirectory; a different directory can be provided as a command-line argument.
2. Extracts the sample accession IDs of the species of interest, as defined in `config/species_of_interest.txt`. Edit this file if you want to run this workflow for different species! (See the sketch after this list.)
3. Looks up the metadata of the species of interest by filtering the ENA metadata file.
4. Finds the batches in AllTheBacteria that contain the species of interest.
5. Downloads the genome sequences of the species of interest (i.e., the batches identified in step 4).
6. Removes other species from the downloaded batches. (Batches may contain a mix of different species.)
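For illustration, step 2 boils down to matching the species names against the ENA metadata table. The snippet below is a minimal sketch, not the actual script; the metadata file name and column layout are hypothetical.

```bash
# Sketch of step 2 (hypothetical file name and column order):
# keep metadata rows matching the species listed in config/species_of_interest.txt
# and collect the sample accessions from the first column.
grep -F -f config/species_of_interest.txt data/ATB/ena_metadata.tsv \
    | cut -f 1 \
    > data/ATB/sample_accessions.txt
```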
As of February 2025, AllTheBacteria consists of an original set of genomes and an incremental update. The `prepare_genomes.sh` script can download either part, or all of the genomes at once, using the command-line options 'all', 'original', or 'update' (default: update).
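For example, to download the complete set instead of only the incremental update (assuming the option is passed as the script's first argument):

```bash
# Download both the original AllTheBacteria set and the incremental update.
bash bin/prepare_genomes.sh all
```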
The genomes are downloaded to the `data/tmp/ATB/` subdirectory. This is also the default input directory for the analysis workflow.
This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. It uses both a neural network classifier and a marker-based approach to calculate prediction scores. For the marker-based method, it requires a database, which can be downloaded using geNomad itself. If you have installed mamba, this can be done as follows:
```bash
mamba env create -f envs/genomad.yaml
mamba activate genomad
genomad download-database data/
```
Note that this will create the subdirectory `data/genomad_db/`, which is the default location also defined in `config/parameters.yaml`. The current version of the database, v1.7, uses 1.4 GB of disk space.
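In the workflow, geNomad is run by Snakemake, but it can also be run manually on a single assembly. A rough sketch using geNomad's `end-to-end` command and the database location above (the input and output paths are placeholders):

```bash
# Classify the contigs of one genome with geNomad, using the downloaded database.
mamba activate genomad
genomad end-to-end data/tmp/ATB/example_genome.fa data/tmp/genomad_example/ data/genomad_db/
```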
The `bin` folder also includes scripts to download and extract pre-selected databases for use in Spacepharer: PhageScope for annotated phage sequences and PLSDB for annotated plasmid sequences, both chosen for their broad taxonomic coverage. Running

```bash
bash bin/download_spacepharer_database.sh
```

downloads, extracts and merges both databases for use in Spacepharer. If you wish to use a different database or add to these, see `doc/spacepharer.md` for advice.
CRISPRscape uses CRISPRidentify as a second pass over CCTyper's output. However, this program's conda environment does not contain the program itself and, as of writing, a forked version is required for it to function properly. To install CRISPRidentify, run:

```bash
git clone https://github.com/Necopy-byte/CRISPRidentify bin/CRISPRidentify
```
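CRISPRidentify ships its own conda environment file. If you need to create that environment manually, a minimal sketch, assuming the file is called `environment.yml` in the cloned directory (see also the channel-priority note under 'Problems encountered'; the workflow itself uses an adjusted yml):

```bash
# CRISPRidentify's own yml only solves with flexible (or disabled) channel priority.
conda config --set channel_priority flexible
conda env create -f bin/CRISPRidentify/environment.yml
```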
The analysis itself is recorded as a Snakemake workflow. Its dependencies (bioinformatics tools) are handled by Snakemake using the conda package manager, or rather its faster drop-in replacement mamba. If you have not yet done so, please install mamba following the instructions found here: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html.
After installing mamba, Snakemake can be installed using their instructions: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#full-installation (Note: the workflow was tested with Snakemake version 8.20.3 and is expected to work with any version from 5 onwards.)
When Snakemake has been set up, you can test if the workflow is ready to be run (dry-run) with:
```bash
snakemake --profile config -n
```

If that returns no errors, run the workflow by removing the `-n` (dry-run) option:

```bash
snakemake --profile config
```
Note that the workflow is currently configured to run on the local machine (not on a high-performance computing (HPC) cluster or grid) and uses a maximum of 24 CPU threads. The number of threads to use can be configured in `config/config.yaml` (overall workflow) and `config/parameters.yaml` (per step/tool).
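If you prefer not to edit those files, Snakemake also accepts a core limit on the command line, which takes precedence over the profile's default:

```bash
# Limit this run of the workflow to 8 CPU threads in total.
snakemake --profile config --cores 8
```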
In its current state, the workflow:
- Determines multilocus sequence types (MLST) for Campylobacter jejuni/coli using the public typing scheme from PubMLST and pyMLST (version 2.2.1)
- Identifies CRISPR-Cas loci with CCTyper (version 1.8.0); the resulting loci are processed with CRISPRidentify (forked from version 1.2.1) to reduce false positives
  - This includes extra scripts to collect CRISPR-Cas information and extract sequences from the genome fasta files
- Collects all CRISPR spacers and creates clusters of identical spacers using CD-HIT-EST (version 4.8.1) (see the sketch after this list)
- Predicts whether contigs of the species of interest derive from chromosomal DNA, plasmids or viruses using both geNomad (version 1.8.0) and Jaeger (version 1.1.26)
- Predicts the potential targets of spacers and whether they target chromosomal DNA (of input genomes), plasmids or viruses using Spacepharer (version 1.0.1) and KMA
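As a rough illustration of the spacer clustering step (a minimal sketch with placeholder file names, not necessarily the exact command the workflow runs), identical spacers can be collapsed with CD-HIT-EST at 100% identity:

```bash
# Cluster identical CRISPR spacers: -c 1.0 requires 100% sequence identity,
# -d 0 keeps the full sequence names in the .clstr report.
cd-hit-est -i all_spacers.fasta -o spacer_clusters.fasta -c 1.0 -d 0
```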
Further steps will be added to the workflow after testing:

- Mash with CRISPR loci and whole genomes
- Map CRISPR spacers to all downloaded genomes (Bowtie, and KMA?), metagenome assemblies, other databases?
- Whole-genome MLST
- SpacerPlacer (see the input file format at https://github.com/fbaumdicker/SpacerPlacer?tab=readme-ov-file#spacer_fasta-input-format; also requires an extra conversion script?)
Unticked boxes indicate that documentation has not been written yet.
- ENA metadata
  - Cleaned-up and filtered metadata of included genomes
- Species classifications
  - Taxonomic classification by Sylph, as collected from AllTheBacteria
- Contig chromosome/plasmid/virus predictions
- CRISPR-Cas overview table
  - Output from CCTyper, collected and combined in one CSV file
  - Combined with CRISPRidentify; a filtered CRISPR table is created by adding Cas and orientation data to the CRISPRidentify CSV
- CRISPR spacer table
- MLST
  - Sequence types (ST) of all included genomes
- List of putative spacer targets
  - Output from mapping unique spacers to possible targets, separated by plasmid or phage and merged with database metadata
- List of anti-phage systems per genome
  - Output from PADLOC, combined in a single CSV file
- Genome comparison, all-vs-all
  - By whole-genome MLST, average nucleotide identity (ANI) or similar(?)
2024-09-12:
- CCTyper does not work from a conda installation (Russel88/CRISPRCasTyper#55)
- CCTyper cannot handle gzipped fasta files as input, so the input files need to be uncompressed
- RFPlasmid does not work from a local install (file not found, permission denied) or a conda installation (`rfplasmid --initialize` fails to download database files)
- A 'hybrid' RFPlasmid install works: activate the conda environment and then run the shared executable using the absolute path (/mnt/DGK_KLIF/data/rfplasmidweb/pip_package/package_files/RFPlasmid/rfplasmid.py), but it seems not to work when only one genome is present in the input directory.
2024-12-10:
- CCTyper returns different output when running single genomes compared to running one concatenated fasta file with the contigs of many genomes. (See `data/tmp/cctyper/batch_22/CRISPR_Cas-batch_22.tab` vs `data/tmp/cctyper/test/batch_22/CRISPR_Cas.tab`; the difference is 1 KB.) The separate method returns 7 contigs that the concatenated run did not find, and the concatenated method found 1 contig that the separate run did not find. These may be false positives (how do you check?), but for now I'm sticking with the separate method.
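For reference, a single-genome run with the separate method looks roughly like this (paths are placeholders and the exact options used in the workflow may differ):

```bash
# Run CCTyper on one uncompressed genome fasta, writing to a new output directory.
cctyper data/tmp/ATB/example_genome.fa data/tmp/cctyper/example_genome/
```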
2025-02-11:
- Spacepharer wants to be run locally within the same folder that is designated as the tmp folder and where the created databases are located. Following the easy-predict workflow is not recommended, as the created .fasta files are inconsistently recognized as actual fasta files.
- The PhageScope database claims that genomes can be filtered on criteria before download, but actually downloading these fastas fails with an error. Additionally, wget and curl do not download the databases in a way that Spacepharer can identify, requiring a manual upload.
2025-02-21:
- Spacepharer databases are best created using the example names "querysetDB" and "targetsetDB"; other names, such as "spacersetDB", cause unexpected errors.
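For example, the set databases can be built with those names as follows (a sketch based on the SpacePHARER examples; input paths are placeholders and the exact options used in this workflow may differ):

```bash
# Build the query (spacer) and target set databases using the recommended names.
spacepharer createsetdb spacers/*.fasta querysetDB tmp_spacepharer
spacepharer createsetdb targets/*.fasta targetsetDB tmp_spacepharer
```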
2025-06-27:
- The yml file provided by CRISPRidentify is only solvable using flexible or disabled channel priority in conda. For now, an adjusted yml file is used that is solvable in strict priority mode, but this makes strand prediction in CRISPRidentify non-functional.
```
.
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── Snakefile     <- Python-based workflow description
├── bin           <- Code and programs used in this project/experiment
├── config        <- Configuration of Snakemake workflow
├── data          <- All project data, divided in subfolders
│   ├── processed <- Final data, used for visualisation (e.g. tables)
│   ├── raw       <- Raw data, original, should not be modified (e.g. fastq files)
│   └── tmp       <- Intermediate data, derived from the raw data, but not yet ready for visualisation
├── doc           <- Project documentation, notes and experiment records
├── envs          <- Conda environments necessary to run the project/experiment
├── log           <- Log files from programs
└── results       <- Figures or reports generated from processed data
```
This project is licensed under the terms of the New BSD licence.
Please cite this project as described in the citation file.