Pylint output: Your code has been rated at 9.35/10
- Author(s): Mark van de Streek
- Organization: Rijksinstituut voor Volksgezondheid en Milieu (RIVM)
- Department: Infectieziekteonderzoek, Diagnostiek en Laboratorium Surveillance (IDS)
- Start date: 02 - 09 - 2024
- Commissioned by: Roxanne Wolthuis & Boas van der Putten & Sohana Singh
Pacini-typing is a user-friendly application for the detection of DNA sequences and SNPs in both FASTA and FASTQ files. The application is designed to be used in a Linux-like environment and is easily executable via a YAML-based configuration scheme.
Pacini-typing is not limited to bacterial genomes, although it was primarily developed with Yersinia pestis and Vibrio cholerae as first real use cases. Performance in other species is not yet validated but the application is designed to be flexible.
Quick start command of the application:
pacini_typing --config path_to_config_file.yaml --input file_1.fastq file_2.fastq --search_mode SNPs
The structure of the YAML configuration file is explained in the Configuration file section, and the search modes are explained in the Modes of Pacini-typing section.
- Application information
- About this project
- Table of Contents
- Prerequisites
- Complete list of packages
- Installation
- Modes of Pacini-typing
- Configuration file
- Approach
- Getting Started
- Parameters & Usage
- Output
- Example Run of Pacini-typing
- Testing
- Issues
- Future Ideas
- License
- Contact
All required packages are available in a pre-defined conda environment. Steps to install this environment are found in the Automatic installation of the required packages section.
Pacini-typing requires:
- Linux-like environment with (mini) conda installed
- Python 3.10 or higher (developed on 3.12)
The following Python packages are required:
- pip>=24.2
- pyyaml>=6.0.2
- setuptools>=75.1.0
- cgecore>=2.0.1
The following tools are required:
- blast>=2.16.0
- kma>=1.4.15
The subcommands of blast (makeblastdb) and kma (kma_index) are also required in the PATH of the system. They will be installed automatically by the conda environment.
Package | Version |
---|---|
pip | >=24.2 |
pyyaml | >=6.0.2 |
setuptools | >=75.1.0 |
pandas | >=2.2.3 |
blast | >=2.16.0 |
kma | >=1.4.15 |
pytest | >=8.3.3 |
cgecore | >=2.0.1 |
Pacini-typing can be installed using the conda/mamba package manager. The package is available on the bioconda channel, under the name pacini_typing.
conda install bioconda::pacini_typing
Manual installation of the application is achieved by cloning the repository and installing the requirements:
- Clone the repository.
git clone https://github.com/RIVM-bioinformatics/Pacini-typing.git
- Go to the Pacini-typing directory.
cd Pacini-typing
At this point, the repository is cloned to your system. It is advised to install the required packages. This is achieved by following the steps in the section Installation of the required packages or by installing the required packages manually listed in the Prerequisites section.
- Install the package.
pip install .
Pacini-typing is now installed on your system. After installation, the application is executed by calling pacini_typing or Pacini-typing.
Additionally, the application can be executed by calling the original pacini_typing.py script in the pacini_typing directory with the following command:
python3 directory_of_clone/pacini_typing.py --help
For both macOS and Linux users, a complete conda environment, containing all required packages, is found in the root of the repository.
To install the environment, run the following command:
# For Linux users:
conda env create -f linux-environment.yaml -n pacini-typing
# or for macOS users:
conda env create -f mac-environment.yaml -n pacini-typing
After the environment is installed, activate the environment by running:
conda activate pacini-typing
Pacini-typing accepts both assembled FASTA contigs and paired-end FASTQ files as input. The application is executed using one of the following three search modes:
- genes: Search for genes in the input genome(s)
- SNPs: Search for SNPs in the input genome(s)
- both: Search for both genes and SNPs in the input genome(s)
In addition, Pacini-typing has two subcommands which can be used to (1) manually create a gene reference database or (2) manually run a query against the gene reference database. These subcommands are named makedatabase and query, respectively. More information about these subcommands is found in the Parameters & Usage section.
The configuration file of Pacini-typing delivers the required information for a run in an easy-to-use manner. The configuration file is a YAML-based file with paths to input files, database locations, and the genetic thresholds to use for a specific run.
Three pre-defined configuration schemes are available in the config directory of the repository:
- O1-scheme.yaml: Configuration file for pandemic serotype O1 of Vibrio cholerae
- O139-scheme.yaml: Configuration file for pandemic serotype O139 of Vibrio cholerae
- Yersinia-pestis-scheme.yaml: EXAMPLE configuration file for Yersinia pestis (since sharing pandemic-related genes is not allowed at the time of writing, this file is only an example and does not contain any real genes)
The schemes for Vibrio cholerae are based on real genetic patterns. These patterns, including gene sequences (O1/O139.fasta), can be used to detect pandemic serotypes O1 and O139 of Vibrio cholerae.
Example configuration file for Yersinia pestis related variants:
%YAML 1.2
---
metadata:
  # Metadata information that will be used in the output report
  filename: "Yersinia.yaml"
  id: "YP-01"
  type: "Y. pestis related variants"
  description: "Genetic pattern run config file for Yersinia pestis related variants"
  date_created: "2025-05-24"
# Path to the PointFinder script location,
# if not available, it will be installed here automatically
pointfinder_script_path: "/my_own_path/to/pacini_typing/pacini_typing/PointFinder.py"
database:
  # Name and path of the gene database
  name: "YP-01"
  path: "databases/YP-01"
  # Multi-fasta file with genes you want to search for
  target_genes_file: "/my_own_path/to/fasta/genes.fasta"
  # Multi-fasta file with genes in which the SNPs are located
  target_snps_file: "/my_own_path/to/fasta/SNPs.fasta"
  path_snps: "/my_own_path/to/database"
  # Species of the SNP database, used for naming
  species: "Yersinia"
global_settings:
  # Output directory for the run, mainly for genes
  run_output: "output/"
  # Custom output directory for SNPs, only required if search mode is SNPs or both
  run_output_snps: "output/snp/"
  # Percentage identity and coverage thresholds for the search of genes
  perc_ident: 95.0
  perc_cov: 80.0
pattern:
  # Searchable genes under 'gene' fields,
  # SNPs under 'SNP' fields
  - gene: "rfbV"
  - gene: "ctxA"
  - gene: "ctxB"
  # The name of the gene in which the SNP is located
  - SNP: "myGene"
    # The reference nucleotide sequence of the SNP (must be 3 nucleotides)
    ref: "TTT"
    # The alternative amino acid
    alt: "X"
    # The CODON position of the SNP in the gene
    pos: 123
Let's take a closer look at the SNP field, since this is a bit more complex:
- SNP: "myGene"
  ref: "TTT"
  alt: "X"
  pos: 123
- SNP: The name of the gene in which the SNP is located (genes are defined in the target_snps_file).
- ref: The reference nucleotide sequence of the SNP. Must be a codon (3 nucleotides).
- alt: The alternative amino acid that you want to search for (1-letter amino acid code).
- pos: The position of the CODON in the gene where the SNP is located. This is not the position of the nucleotide but the position of the codon in the gene. Getting this right is essential, because a SNP can only be matched at the exact codon position given here.
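Because pos counts codons rather than nucleotides, it can help to translate it into nucleotide coordinates when checking a scheme by hand. The following is a minimal sketch, assuming codon positions are 1-based; the helper names codon_to_nucleotide_range and check_snp_entry are hypothetical and not part of Pacini-typing.

```python
# Hypothetical helpers for sanity-checking one SNP entry; not part of Pacini-typing.
VALID_NUCLEOTIDES = set("ACGT")


def codon_to_nucleotide_range(codon_pos: int) -> tuple[int, int]:
    """Return the 1-based, inclusive nucleotide range of a 1-based codon position."""
    if codon_pos < 1:
        raise ValueError("codon position must be 1 or higher")
    start = (codon_pos - 1) * 3 + 1
    return start, start + 2


def check_snp_entry(entry: dict) -> list[str]:
    """Collect human-readable problems found in one SNP entry of the pattern list."""
    problems = []
    ref = str(entry.get("ref", "")).upper()
    if len(ref) != 3 or not set(ref) <= VALID_NUCLEOTIDES:
        problems.append("ref must be a codon of exactly 3 nucleotides (A, C, G, T)")
    alt = str(entry.get("alt", ""))
    if len(alt) != 1 or not alt.isalpha():
        problems.append("alt must be a 1-letter amino acid code")
    pos = entry.get("pos")
    if not isinstance(pos, int) or pos < 1:
        problems.append("pos must be a positive codon position")
    return problems


snp = {"SNP": "myGene", "ref": "TTT", "alt": "X", "pos": 123}
print(check_snp_entry(snp))            # an empty list means no problems were found
print(codon_to_nucleotide_range(123))  # codon 123 covers nucleotides 367 to 369
```

A check like this only validates the shape of the entry; whether the codon actually exists in the gene from target_snps_file still has to be verified against the sequence itself.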
Please note that there is also a field for PointFinder's script. This is due to the fact that PointFinder is not available via Pip or Conda and must be installed manually. If the script is not found at the specified path, Pacini-typing will try to install it automatically in the specified path. This is performed by a wget command in the snp_query_runner.py script of the application.
Global steps of the application are:
- Validating the input: Check if the input files are valid and if the configuration file is valid.
- Checking the availability of the required database: Check if the required databases (gene, SNP or both) are available in the specified path. This check also includes the database structure.
- Creating the database: If the database is not available, Pacini-typing will try to create the database.
- Check again if the database is available: If any of the required databases are not available, the application will exit with an error.
- Running the query: If the database is available, the application will prepare the query and execute it against the correct reference database.
- Parsing the results: The application will parse the results of the query. For genes, this process includes filtering according to the threshold values in the configuration file (coverage and identity).
- Creating the output: Pacini-typing will create a (CSV) output report. Depending on the user input, it will also create a FASTA file with the found sequences, a log file, or an archive containing all intermediate files of the run.
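The check, create, and check-again steps in the list above can be sketched as follows. This is an illustration only: the function names, the required file name example.index, and the fake_create stand-in are all assumptions and do not reflect how Pacini-typing implements these steps.

```python
import tempfile
from pathlib import Path
from typing import Callable


def database_available(db_dir: Path, required_files: list[str]) -> bool:
    """Return True when every expected database file exists in db_dir."""
    return all((db_dir / name).is_file() for name in required_files)


def ensure_database(
    db_dir: Path,
    required_files: list[str],
    create: Callable[[Path], None],
) -> None:
    """Check the database, try to create it once if missing, then check again."""
    if not database_available(db_dir, required_files):
        create(db_dir)
    if not database_available(db_dir, required_files):
        # Mirrors the "exit with an error" step of the list above
        raise SystemExit(f"Database in {db_dir} is not available")


def fake_create(db_dir: Path) -> None:
    """Stand-in for the real database creation step (assumption)."""
    db_dir.mkdir(parents=True, exist_ok=True)
    (db_dir / "example.index").touch()


with tempfile.TemporaryDirectory() as tmp:
    ensure_database(Path(tmp) / "YP-01", ["example.index"], fake_create)
    print("database ready")
```

Passing the creation step in as a callable keeps the control flow testable without running BLAST or KMA.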
The main logic for the parsing operation is present in the parsing module of the application. This module uses two design patterns:
- Strategy Pattern
- Filter Pattern
The strategy pattern consists of a base class in parsing_strategy.py, which is inherited by the classes in fasta_parser.py and fastq_parser.py. It is used to define the overarching features required for both methods of parsing.
The filter pattern is used to filter the hits based on the values in the configuration file. All specific filters inherit from the base class in filter_pattern.py.
Filters can easily be added by creating a new class that inherits from the Filter base class and implements its single method, apply(). Currently, two filters are implemented:
- coverage_filter.py: Filters the hits based on the percentage coverage value in the configuration file.
- identity_filter.py: Filters the hits based on the percentage identity value in the configuration file.
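To make the filter pattern concrete, here is a minimal sketch of a Filter base class with two threshold filters. Only the class name Filter and the method name apply() come from the text above; the hit dictionaries, the constructor arguments, the run_filters helper, and the inclusive comparison are assumptions made for illustration, and the hit values are made up.

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Base class: every concrete filter implements a single apply() method."""

    @abstractmethod
    def apply(self, hits: list[dict]) -> list[dict]:
        """Return only the hits that pass this filter."""


class CoverageFilter(Filter):
    def __init__(self, min_coverage: float) -> None:
        self.min_coverage = min_coverage

    def apply(self, hits: list[dict]) -> list[dict]:
        return [hit for hit in hits if hit["perc_cov"] >= self.min_coverage]


class IdentityFilter(Filter):
    def __init__(self, min_identity: float) -> None:
        self.min_identity = min_identity

    def apply(self, hits: list[dict]) -> list[dict]:
        return [hit for hit in hits if hit["perc_ident"] >= self.min_identity]


def run_filters(hits: list[dict], filters: list[Filter]) -> list[dict]:
    """Apply each filter in turn; a hit must pass all of them to be kept."""
    for current_filter in filters:
        hits = current_filter.apply(hits)
    return hits


# Made-up hits, for illustration only
hits = [
    {"gene": "rfbV", "perc_ident": 100.0, "perc_cov": 100.0},
    {"gene": "ctxA", "perc_ident": 91.2, "perc_cov": 99.0},
    {"gene": "ctxB", "perc_ident": 99.5, "perc_cov": 60.0},
]
kept = run_filters(hits, [CoverageFilter(80.0), IdentityFilter(95.0)])
print([hit["gene"] for hit in kept])
```

With this structure, adding a new criterion means writing one more subclass and appending it to the list passed to run_filters.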
The main SNP parsing methodology is implemented in the snp_parser.py file. Since this operation is quite different from the gene parsing, it does not use the strategy pattern. PointFinder's results also do not need to be filtered.
To get started with the application, you can run the following command to get the help of the application:
python pacini_typing.py --help
See the Parameters & Usage section for more information on how to run the application.
-h, --help
Shows the help of the pipeline.
usage: Pacini-typing [-h] [-v] [-V] [-c File] [-i File [File ...]]
[--save-intermediates] [--log-file] [-t Threads] [-f]
[-m {SNPs,genes,both}]
{makedatabase,query} ...
Bacterial Genotyping Tool for RIVM IDS-Bioinformatics
Either pick a subcommand to manually run the tool or
provide a predefined configuration file and your input file(s) (FASTA/FASTQ)
and let Pacini-typing do the work for you.
If using a configuration file, both the
--config and --input arguments are required.
options:
-h, --help show this help message and exit
-v, --verbose Increase output verbosity
-V, --version show program's version number and exit
-c File, --config File
Path to predefined configuration file
-i File [File ...], --input File [File ...]
Path to input file(s). Accepts 1 fasta file or 2 fastq files
--save-intermediates Save intermediate files of the run
--log-file Save log file of the run
-t Threads, --threads Threads
Number of threads to use (rounded to the nearest integer)
-f, --fasta-out Write found sequences to a FASTA output file
-m {SNPs,genes,both}, --search_mode {SNPs,genes,both}
Search mode to use. SNPs, genes or both.
Default is genes.
operations:
For more information on a specific command, type: pacini_typing <command> -h
{makedatabase,query}
makedatabase Create a new reference database
query Run query against reference database
See github.com/RIVM-Bioinformatics for more information
Pacini-typing can be used in two different ways:
- Using a pre-defined configuration file to run the application
- Manually creating a (gene) reference database and manually searching for genes. This consists of running pacini_typing with one of the additional subcommands makedatabase or query.
One of these two methods must be used to run the application.
Note: Manually searching for genetic variations does not result in parsing and creating output reports. This is only possible when using a configuration file (--config).
- -c, --config: path to the configuration file
- -i, --input: path to the input file(s). Accepts either 1 or 2 files. If providing 2 files, separate them with a space:
pacini_typing --input file_1.ext file_2.ext
- -h, --help: Shows the help for the makedatabase command
- -db_path, --database_path: path to the database directory
- -db_name, --database_name: name of the database
- -I, --input_file: path to the database file
- -db_type, --database_type: type of the database, choose between fasta or fastq
To run the above options, don't forget to add the makedatabase subcommand at the beginning of the command:
pacini_typing makedatabase -db_path [path_to_database_directory] -db_name [name_of_database] -I [path_to_input_file.ext] -db_type [fasta/fastq]
- query -h: Shows the help for the query command
- -db_path, --database_path: path to the database directory
- -db_name, --database_name: name of the database
- -p, --paired: path to the paired FASTQ files, separate with a space
- -s, --single: path to the single FASTQ files
- -o, --output: path to the output directory and prefix of the output files
Note: The -p and -s parameters are mutually exclusive. Only one of these parameters can be used at a time.
To run the above options, don't forget to add the query subcommand at the beginning of the command:
pacini_typing query -db_path [path_to_database_directory] -db_name [name_of_database] -p [path_to_paired_files.ext] -o [path_to_output_directory]
- -m, --search_mode: Search mode to use. Choose between SNPs, genes or both. Default is genes.
- -v, --verbose: Increase output verbosity
- -V, --version: Show program's version number and exit
- --save-intermediates: Save intermediate files of the run
- --log-file: Save log file of the run, named pacini_typing.log
- -t, --threads: Number of threads to use
- -f, --fasta-out: Write found sequences (hits) to a FASTA output file, named {prefix}_sequences.fasta
Note: The --save-intermediates and --fasta-out parameters cannot be used in combination with the makedatabase or query subcommands.
In config/accept_arguments.yaml, the accepted extensions for genome input files are defined. These extensions can be changed by the user. The default extensions are:
accepted_input_extensions:
- .fq
- .fastq
- .fq.gz
- .fastq.gz
- .fna
- .fsa
- .fasta
- .tar.gz
- .fasta.gz
- .scaffold.fasta
- .result.fasta
- .fa
Zipped files are automatically unzipped by Pacini-typing, so the user does not have to worry about this. The application will automatically detect the file type and parse it accordingly.
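As an illustration of how such an extension list can be applied, the sketch below checks a file name against the default extensions listed above. The function name has_accepted_extension is hypothetical, and the case-sensitive suffix match is an assumption; Pacini-typing's own validation may work differently.

```python
# Default extensions copied from config/accept_arguments.yaml as listed above
ACCEPTED_INPUT_EXTENSIONS = (
    ".fq", ".fastq", ".fq.gz", ".fastq.gz", ".fna", ".fsa", ".fasta",
    ".tar.gz", ".fasta.gz", ".scaffold.fasta", ".result.fasta", ".fa",
)


def has_accepted_extension(filename: str, accepted=ACCEPTED_INPUT_EXTENSIONS) -> bool:
    """Return True when the file name ends with one of the accepted extensions."""
    # str.endswith accepts a tuple, so multi-part suffixes like .fastq.gz work too
    return filename.endswith(tuple(accepted))


for name in ("ERR976461_1.fastq.gz", "assembly.scaffold.fasta", "notes.txt"):
    print(name, has_accepted_extension(name))
```

Matching on the full suffix rather than on the last dot-separated part is what allows compound extensions such as .fastq.gz to be recognized.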
pacini_typing --config [path_to_config_file.yaml] --input [path_to_input_file.ext]
The output of Pacini-typing consists of four possible files, depending on the parameters used:
- {prefix}_report.csv: report of found genetic variations
This report is the main output of Pacini-typing and is created when hit(s) are found and they meet the thresholds in the configuration file.
Example (for --search_mode genes):
ID,Input,Configuration,Type/Genes,Mode,Hits,Percentage Identity,Percentage Coverage,e-value
1,ERR976461,O1-scheme.yaml,V. cholerae O1 related genes,Gene,rfbV,100.0,100.0,1e-26
1,ERR976461,O1-scheme.yaml,V. cholerae O1 related genes,Gene,ctxA,100.0,100.0,1e-26
Example (for --search_mode SNPs):
ID,Input,Configuration,Type/Genes,Mode,Hits,Reference nucleotide,Alternative nucleotide,Position,Amino acid change
1,SAMN00115171,Yersinia.yaml,Y. pestis related variants,SNP,group_1234 p.V1I,CCC,CTC,1,V1I
2,SAMN00115171,Yersinia.yaml,Y. pestis related variants,SNP,group_5678 p.E7K,AAA,AAG,7,E7K
The Amino acid change column is formatted as p.<original amino acid><position><new amino acid>. For example, p.V1I means that amino acid V at position 1 was mutated into the new amino acid I.
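For downstream processing, the SNP report can be read with Python's standard csv module and the amino acid change split into its parts. This is a minimal sketch: the parse_aa_change helper is hypothetical, the report text is copied from the SNP example above, and treating the p. prefix as optional is an assumption, made because the example rows show the change without the prefix in the last column.

```python
import csv
import io
import re

# Optional "p." prefix, one-letter original residue, position, one-letter new residue
AA_CHANGE = re.compile(r"^(?:p\.)?([A-Za-z*])(\d+)([A-Za-z*])$")


def parse_aa_change(value: str) -> tuple[str, int, str]:
    """Split a change such as 'p.V1I' into (original, position, new)."""
    match = AA_CHANGE.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognized amino acid change: {value!r}")
    original, position, new = match.groups()
    return original, int(position), new


# Report content copied from the SNP example above
report_text = """\
ID,Input,Configuration,Type/Genes,Mode,Hits,Reference nucleotide,Alternative nucleotide,Position,Amino acid change
1,SAMN00115171,Yersinia.yaml,Y. pestis related variants,SNP,group_1234 p.V1I,CCC,CTC,1,V1I
2,SAMN00115171,Yersinia.yaml,Y. pestis related variants,SNP,group_5678 p.E7K,AAA,AAG,7,E7K
"""

for row in csv.DictReader(io.StringIO(report_text)):
    print(row["Hits"], parse_aa_change(row["Amino acid change"]))
```

In practice the io.StringIO wrapper would be replaced by opening the {prefix}_report.csv file of a run.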
- (optional with --log-file) pacini_typing.log: Log file containing information about the run
This optional log file contains the log output of the application. This file can be used to debug the application.
- (optional with --fasta-out) {prefix}_sequences.fasta: FASTA file containing the found sequences
This file contains the sequences of the (gene) hits as they were found in the input genome file(s), not the reference sequences that were searched for. The sequences are written in FASTA format.
Example:
>rfbV
ATGCCATGGAAGACCTACTCACGGAACTTGATGTATGCTGTCATAACTTTGATGTTGAATGTATTAAGCG
AATTTTACTTGATGCACCTACGGGTTATTCGCCACAAAAATGAGAATAAAATGAAAGTATTGCATGTATA
>ctxA
ATGCCATGGAAGACCTACTCACGGAACTTGATGTATGCTGTCATAACTTTGATGTTGAATGTATTAAGCG
AATTTTACTTGATGCACCTACGGGTTATTCGCCACAAAAATGAGAATAAAATGAAAGTATTGCATGTATA
Note: The prefix of the output files is the same as the prefix of the input file.
- (optional with --save-intermediates) {prefix}_intermediates_<SNP/gene>.tar.gz: Tarball containing all intermediate files of the run, including raw BLAST, KMA or PointFinder reports.
Based on:
- Input
- Run
Input is either 1 Assembled FASTA file OR 2 Paired FASTQ files.
Example of an assembled FASTA file containing the rfbV gene of the O1 serotype of Vibrio cholerae:
>rfbV_O1:1:AE003852
ATGCCATGGAAGACCTACTCACGGAACTTGATGTATGCTGTCATAACTTTGATGTTGAATGTATTAAGCG
AATTTTACTTGATGCACCTACGGGTTATTCGCCACAAAAATGAGAATAAAATGAAAGTATTGCATGTATA
Example of one of the two paired FASTQ files of Vibrio cholerae:
@ERR976461.1 1 length=100
CTACTATTAAGGAGCAGGATCTTTGTGGATAAGTGAAAAATGATCAACAAGATCATGCGATTCAGAAGGA
+ERR976461.1 1 length=100
CCCFFFFFHHHGHJJJJJJIJJJJJHIJJJJJC1:FHIIIIIJJIIJFIJGHIJJJJJJJIGIJJJJIJJ
Search for O1-related genes in the input FASTQ files using the pre-defined O1-scheme.yaml configuration file:
pacini_typing \
--input ERR976461_1.fastq ERR976461_2.fastq \
--config config/O1-scheme.yaml \
--search_mode genes
Pacini-typing contains a fairly broad test suite. The most useful tests are probably the end-to-end (E2E) tests, which are located in the tests/e2e directory of the repository. Most tests are also run online by a GitHub Actions workflow on every push to the repository.
All tests are written in the pytest framework. To run the tests, the following command can be used:
pytest -v tests/
A downside of some of the more valuable tests is that they depend on larger data files. These files are not included in the repository because of their size and the GitHub Organization's policy. Therefore, some tests must be skipped when running through a GitHub Actions workflow.
This skipping is done by a skip-if condition in the test file:
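import os

import pytest

# Marker that skips a test when the CI environment variable is set to "true"
skip_in_ci = pytest.mark.skipif(
    os.getenv("CI") == "true",
    reason="Test online (GitHub Action) not available due to dependencies",
)

@skip_in_ci
def test_example():
    # Test code
    ...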
When cloning/downloading the repository, these tests must be skipped too. This is achieved by running the following command:
CI=true pytest -v tests/
This simply uses the same strategy as the GitHub Actions workflow by setting the CI environment variable to true.
If encountering any issues:
- Any issues can be reported in the Issues section of this repository
- Contact the author(s) of the application
Implement biological typing based on the configuration file: A user can specify a biological type in the configuration file. Pacini-typing automatically processes the biological type in a report. This way the user can easily see which biological type is present in the entered genome(s). This is especially useful for microbiologists.
For example, the following configuration file can be used to specify pandemic cholera:
types:
name: "Pandemic cholera"
targets:
- one_of: ['rfbV:O1', 'wbfZ:O139']
- one_of: ['ctxA', 'ctxB']
- all_of:
- not: ['']
But this is still a future idea of the application.
This pipeline is licensed under an AGPL3 license. Detailed information can be found in the 'LICENSE' file in this repository.
- Contact person: Mark van de Streek
- Email mark.van.de.streek@rivm.nl or m.van.de.streek@st.hanze.nl