Meta-proteomics pipeline

In this directory, we provide a meta-proteomics pipeline that can be used to analyze the data generated by InstaNovo. The pipeline uses DDA mass spectrometry proteomics data and is built upon the software developed in relation to the InstaNovo algorithm, developed by the XX group at DTU Bioengineering.

The pipeline was developed as a result of a 7.5 ECTS special course at Bioengineering and the NNF Center for Biosustainability at the Technical University of Denmark (DTU) in the spring of 2025 by MSc Eng student Josefine Tvermoes Meineche, with supervision from Alberto Santos Delgado and Konstantinos Kalogeropoulos.


In summary, the pipeline includes the following steps:

  1. Initial data conversion (if needed): Conversion of raw DDA files to mzML format using msconvert (see the example command after this list). This step is only needed if the raw files are not already in mzML format.
  2. De novo sequencing: The mzML files are used as input for the InstaNovo algorithm, which performs de novo sequencing of the peptides. The output is a set of peptide sequences in CSV format.
  3. Data preprocessing and filtering: The peptide sequences are preprocessed to remove low-quality sequences and ensure robust mapping.
  4. Peptide mapping: Filtered peptides are mapped to a FASTA file of the host organism. Unmapped peptides are then mapped to other organisms of choice (meta-proteomics).
  5. BLAST search: Remaining unmapped peptides are blasted against a custom BLAST database built using blastdbcmd. The database is built from the NCBI nr database, filtered for organisms of choice. The BLAST results are used to identify potential matches for the unmapped peptides.
  6. Protein inference: Proteins and their abundances are inferred using XXX, developed at Bioengineering, DTU. Proteins are filtered based on coverage.
  7. Downstream data analysis: Using acore, a Python package developed at the NNF Center for Biosustainability, DTU, the data is analyzed and visualized. The analysis includes statistical tests, clustering, and visualization of the results.
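
For step 1, a typical msconvert call (part of ProteoWizard) might look like the line below; the paths are placeholders:

msconvert ./data/raw/*.raw --mzML -o ./data/mzml/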

Data preprocessing and filtering

After sequencing your data, the initial preprocessing is done. The script preprocessing.py iteratively removes low-confidence peptides based on a custom threshold, removes ox-strings, and harmonizes I and L residues. Furthermore, the script maps the peptides to a local FASTA file of the host organism as well as FASTA files of custom secondary organisms. The script also removes any peptides that are not in the FASTA file of the host organism. The script is run as follows:

python preprocessing.py \
  --instanovo_outputs <str> \
  --host_organism <int> \
  --meta_organisms <int,int,...> \
  --threshold <float>

Where:

  • instanovo_outputs: Path to the directory containing InstaNovo output CSV files with peptide sequences and log-probabilities.
  • host_organism (optional): NCBI Taxonomy ID of the host organism (e.g., 9606 for human). Used for protein mapping. Default: 9606.
  • meta_organisms (optional): Comma-separated list of Taxonomy IDs for meta organisms (e.g., 562,1280). Default: None.
  • threshold (optional): Confidence score threshold for filtering peptides. Must be between 0 and 1. Default: 0.95.

Tax IDs for organisms can be found at: https://www.ncbi.nlm.nih.gov/taxonomy.

Example usage

python preprocessing.py \
  --instanovo_outputs ./outputs/predictions/ \
  --host_organism 9606 \
  --meta_organisms 562,1280 \
  --threshold 0.98
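
Conceptually, the filtering and harmonization boil down to a few operations. The sketch below is a minimal illustration, not the script itself: the column names preds and log_probs follow the descriptions in this README, while the '(ox)' tag format, the exp(log-probability) confidence conversion, and the file paths are assumptions.

import math
import re
import pandas as pd

def harmonize_il(sequence: str) -> str:
    """Map L to I so the isobaric residues compare as equal."""
    return sequence.replace("L", "I")

def strip_ox_strings(peptide: str) -> str:
    """Drop parenthesised modification tags such as '(ox)'.
    The exact tag format in InstaNovo output is an assumption here."""
    return re.sub(r"\([^)]*\)", "", peptide)

def load_fasta_sequences(path: str) -> list[str]:
    """Read protein sequences from a FASTA file (no external dependencies)."""
    seqs, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(line)
    if current:
        seqs.append("".join(current))
    return [harmonize_il(s) for s in seqs]

# Confidence filtering, cleanup, and exact-match lookup against the host FASTA.
preds = pd.read_csv("outputs/predictions/run1.csv")
preds["confidence"] = preds["log_probs"].map(math.exp)  # assumed conversion
kept = preds[preds["confidence"] >= 0.95].copy()
kept["clean"] = kept["preds"].map(strip_ox_strings).map(harmonize_il)

host_proteins = load_fasta_sequences("fasta/host_9606.fasta")  # placeholder path
kept["mapped_to_host"] = kept["clean"].map(
    lambda p: any(p in protein for protein in host_proteins)
)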

BLAST

Remote BLASTp-based mapping of unmapped peptide sequences to the UniProtKB database via the EMBL-EBI public REST API. Uses taxonomy-based filtering to only return hits from the host organism (e.g., Homo sapiens) and optionally selected meta-organisms (e.g., Bacteria, Fungi).

The script reads all unmapped*.csv peptide files from a given directory, submits them to the BLAST service, and downloads the results as .tsv.gz files for reuse. It uses multiprocessing to parallelize submission.

python blast_UniProt.py 
  • The script assumes a directory structure:
    • workdir/mappings/ contains the unmapped*.csv peptide files.
    • workdir/blast/ is where BLAST result files will be saved.
  • Each input CSV must contain a preds column with peptide sequences.
  • Already processed sequences are skipped using .tsv.gz marker files.
  • Results are saved in gzip-compressed tab-separated format.

Required Setup

Update the script’s header to your working directory and email:

workdir = "/path/to/your/meta-proteomics-dir"
email = "your_email@example.com"

Taxonomy Filtering

The following taxonomic IDs are hardcoded into the script:

  • 9606 – Homo sapiens (host)
  • 2 – Bacteria
  • 4751 – Fungi

Update "taxids" in the submit_job function to customize.

Map BLAST results

Processes peptide BLAST results generated by the EMBL-EBI BLAST REST API. Loads compressed BLAST output files, summarizes hits by species, filters results by confidence thresholds, merges BLAST mappings back to unmapped peptide data, and generates summary plots.

The script saves filtered mappings and visualizations for host (human) and meta-organisms separately.

python blast_results_processing.py

Paths set inside the script:

  • WORKDIR: Base working directory.
  • BLAST_DIR: Directory containing BLAST result files (*.tsv.gz).
  • MAPPING_DIR: Directory containing unmapped peptide CSV files.
  • FIGURE_DIR: Directory where generated figures will be saved.
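
As a rough illustration of what "summarizes hits by species" involves, the sketch below loads the compressed result tables and tallies hits per organism. It is not the script's code: the species and identity column names are hypothetical stand-ins for whatever the stored tables actually contain, and the 90% identity cutoff is an arbitrary example.

import glob
import os
import pandas as pd

BLAST_DIR = "/path/to/workdir/blast"  # same role as the BLAST_DIR setting above

frames = []
for path in glob.glob(os.path.join(BLAST_DIR, "*.tsv.gz")):
    table = pd.read_csv(path, sep="\t")  # pandas decompresses .gz automatically
    table["peptide"] = os.path.basename(path).removesuffix(".tsv.gz")
    frames.append(table)

hits = pd.concat(frames, ignore_index=True)

# Count hits per organism above an example identity threshold.
species_counts = (
    hits[hits["identity"] >= 90.0]
    .groupby("species")
    .size()
    .sort_values(ascending=False)
)
print(species_counts.head(20))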

Output

  • Filtered BLAST results saved as CSV files in the mapping directory, separated into host (human) and meta-organism mappings.
  • Summary plots of BLAST statistics and species counts.
  • Logs printed with progress and key statistics.

Quantification using INQuant

The quantification is done using INQuant (https://github.com/UadKLab/INQuant). Some tuning parameters have been chosen and incorporated in the command-line execution here. Defaults are the same as those used by INQuant.

python quantification.py \
  --instanovo_processed_dir <str> \
  --mzml_dir <str> \
  --concat_files \
  --INQ_confidence_filter <float> \
  --INQ_cleavage_length <int> \
  --INQ_top_n_peptides <int> \
  --INQ_normalize_abundance <str>

Where:

  • instanovo_processed_dir: Directory containing preprocessed InstaNovo files.
  • mzml_dir: Directory containing .mzML files.
  • concat_files (optional): Concatenate mapping files before running INQuant. Default: True.
  • INQ_confidence_filter (optional): Minimum confidence threshold for keeping PSMs. Peptides with confidence below this value will be excluded. Default: 0.95.
  • INQ_cleavage_length (optional): Number of amino acids to include before and after the peptide sequence when reporting protein positions. Default: 4.
  • INQ_top_n_peptides (optional): Number of top-scoring peptides to use for protein quantification. Lower values increase stringency. Default: 5.
  • INQ_normalize_abundance (optional): Normalization method for abundance values. Options: median, mean, tic, false. Default: median.

Example usage

python quantification.py \
  --instanovo_processed_dir ./outputs/instanovo/ \
  --mzml_dir ./data/mzml/ \
  --concat_files \
  --INQ_confidence_filter 0.98 \
  --INQ_top_n_peptides 3 \
  --INQ_normalize_abundance mean
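
To make the normalization options concrete, median normalization typically rescales each sample so that all samples share a common median abundance. The sketch below shows that generic idea with pandas; it is not INQuant's implementation, and the sample data is made up.

import pandas as pd

def median_normalize(abundances: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so all samples share the same median abundance."""
    sample_medians = abundances.median(axis=0)
    target = sample_medians.mean()  # common reference level
    return abundances * (target / sample_medians)

# Example: proteins as rows, samples as columns
df = pd.DataFrame(
    {"sample_A": [10.0, 20.0, 30.0], "sample_B": [5.0, 12.0, 14.0]},
    index=["prot1", "prot2", "prot3"],
)
print(median_normalize(df).median(axis=0))  # per-sample medians are now equal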

Downstream analyses

Run the downstream analysis with default settings, or tune the missing-value filtering percentage and the number of KNN-imputation neighbors:

python blast_results_processing.py --work_dir /path/to/dir
python blast_results_processing.py --work_dir /path/to/dir --filt_missing_values 90 --KNN 3
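
Judging by the flag names, --filt_missing_values looks like a percentage cutoff for dropping features with too many missing values, and --KNN like the number of neighbors for KNN imputation. The sketch below shows that combination with scikit-learn as a hedged illustration; the actual script may implement it differently (e.g., via acore), and the example data is made up.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def filter_and_impute(df: pd.DataFrame, max_missing_pct: float = 90, k: int = 3) -> pd.DataFrame:
    """Drop features missing in more than max_missing_pct percent of samples,
    then impute the remaining gaps with k-nearest-neighbors."""
    missing_pct = df.isna().mean(axis=0) * 100
    kept = df.loc[:, missing_pct <= max_missing_pct]
    imputer = KNNImputer(n_neighbors=k)
    return pd.DataFrame(imputer.fit_transform(kept), index=kept.index, columns=kept.columns)

# Samples as rows, proteins as columns; NaN marks missing quantifications
data = pd.DataFrame(
    [[1.0, np.nan, 3.0], [2.0, 2.5, np.nan], [1.5, 2.0, 3.5]],
    columns=["protA", "protB", "protC"],
)
print(filter_and_impute(data, max_missing_pct=90, k=2))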
