Meta-proteomics pipeline

In this directory, we provide a meta-proteomics pipeline that can be used to analyze the data generated by InstaNovo. The pipeline uses DDA mass spectrometry proteomics data and is built upon the software developed in relation to the InstaNovo algorithm, developed by the XX group at DTU Bioengineering.

The pipeline was developed as a result of a 7.5 ECTS special course at Bioengineering and the NNF Center for Biosustainability at the Technical University of Denmark (DTU) in the spring of 2025 by MSc Eng student Josefine Tvermoes Meineche, with supervision from Alberto Santos Delgado and Konstantinos Kalogeropoulos.


In summary, the pipeline includes the following steps:

  1. Initial data conversion (if needed): Conversion of raw DDA files to mzML format using msconvert (see the example command after this list). This step is only needed if the raw files are not already in mzML format.
  2. De novo sequencing: The mzML files are used as input for the InstaNovo algorithm, which performs de novo sequencing of the peptides. The output is a set of peptide sequences in CSV format.
  3. Data preprocessing and filtering: The peptide sequences are preprocessed to remove low-quality sequences and ensure robust mapping.
  4. Peptide mapping: Filtered peptides are mapped to a FASTA file of the host organism. Unmapped peptides are then mapped to other organisms of choice (meta-proteomics).
  5. BLAST search: Remaining unmapped peptides are blasted against a custom BLAST database built using blastdbcmd. The database is built from the NCBI nr database, filtered for organisms of choice. The BLAST results are used to identify potential matches for the unmapped peptides.
  6. Protein inference: Proteins and their abundances are inferred using XXX, developed at Bioengineering, DTU. Proteins are filtered based on coverage.
  7. Downstream data analysis: Using acore, a Python package developed at the NNF Center for Biosustainability, DTU, the data is analyzed and visualized. The analysis includes statistical tests, clustering, and visualization of the results.
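
For step 1, a typical msconvert call (part of ProteoWizard) might look like the line below; the paths are placeholders:

msconvert ./data/raw/*.raw --mzML -o ./data/mzml/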

Data preprocessing and filtering

After sequencing your data, the initial preprocessing is done. The script preprocessing.py iteratively removes low-confidence peptides based on a custom threshold, removes ox-strings, and harmonizes I and L residues. Furthermore, the script maps the peptides to a local FASTA file of the host organism as well as FASTA files of custom secondary organisms. The script also removes any peptides that are not in the FASTA file of the host organism. The script is run as follows:

python preprocessing.py \
  --instanovo_outputs <str> \
  --host_organism <int> \
  --meta_organisms <int,int,...> \
  --threshold <float>

Where:

  • instanovo_outputs: Path to the directory containing InstaNovo output CSV files with peptide sequences and log-probabilities.
  • host_organism (optional): NCBI Taxonomy ID of the host organism (e.g., 9606 for human). Used for protein mapping. Default: 9606.
  • meta_organisms (optional): Comma-separated list of Taxonomy IDs for meta organisms (e.g., 562,1280). Default: None.
  • threshold (optional): Confidence score threshold for filtering peptides. Must be between 0 and 1. Default: 0.95.

Tax IDs for organisms can be found at: https://www.ncbi.nlm.nih.gov/taxonomy.

Example usage

python preprocessing.py \
  --instanovo_outputs ./outputs/predictions/ \
  --host_organism 9606 \
  --meta_organisms 562,1280 \
  --threshold 0.98
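
Conceptually, the filtering and harmonization boil down to a few operations. The sketch below is a minimal illustration, not the script itself: the column names preds and log_probs follow the descriptions in this README, while the '(ox)' tag format, the exp(log-probability) confidence conversion, and the file paths are assumptions.

import math
import re
import pandas as pd

def harmonize_il(sequence: str) -> str:
    """Map L to I so the isobaric residues compare as equal."""
    return sequence.replace("L", "I")

def strip_ox_strings(peptide: str) -> str:
    """Drop parenthesised modification tags such as '(ox)'.
    The exact tag format in InstaNovo output is an assumption here."""
    return re.sub(r"\([^)]*\)", "", peptide)

def load_fasta_sequences(path: str) -> list[str]:
    """Read protein sequences from a FASTA file (no external dependencies)."""
    seqs, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(line)
    if current:
        seqs.append("".join(current))
    return [harmonize_il(s) for s in seqs]

# Confidence filtering, cleanup, and exact-match lookup against the host FASTA.
preds = pd.read_csv("outputs/predictions/run1.csv")
preds["confidence"] = preds["log_probs"].map(math.exp)  # assumed conversion
kept = preds[preds["confidence"] >= 0.95].copy()
kept["clean"] = kept["preds"].map(strip_ox_strings).map(harmonize_il)

host_proteins = load_fasta_sequences("fasta/host_9606.fasta")  # placeholder path
kept["mapped_to_host"] = kept["clean"].map(
    lambda p: any(p in protein for protein in host_proteins)
)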

BLAST

Remote BLASTp-based mapping of unmapped peptide sequences to the UniProtKB database via the EMBL-EBI public REST API. Uses taxonomy-based filtering to only return hits from the host organism (e.g., Homo sapiens) and optionally selected meta-organisms (e.g., Bacteria, Fungi).

The script reads all unmapped*.csv peptide files from a given directory, submits them to the BLAST service, and downloads the results as .tsv.gz files for reuse. It uses multiprocessing to parallelize submission.

python blast_UniProt.py 
  • The script assumes a directory structure:
    • workdir/mappings/ contains the unmapped*.csv peptide files.
    • workdir/blast/ is where BLAST result files will be saved.
  • Each input CSV must contain a preds column with peptide sequences.
  • Already processed sequences are skipped using .tsv.gz marker files.
  • Results are saved in gzip-compressed tab-separated format.

Required Setup

Update the script’s header to your working directory and email:

workdir = "/path/to/your/meta-proteomics-dir"
email = "your_email@example.com"

Taxonomy Filtering

The following taxonomic IDs are hardcoded into the script:

  • 9606 – Homo sapiens (host)
  • 2 – Bacteria
  • 4751 – Fungi

Update "taxids" in the submit_job function to customize.

Map BLAST results

Processes peptide BLAST results generated by the EMBL-EBI BLAST REST API. Loads compressed BLAST output files, summarizes hits by species, filters results by confidence thresholds, merges BLAST mappings back to unmapped peptide data, and generates summary plots.

The script saves filtered mappings and visualizations for host (human) and meta-organisms separately.

python blast_results_processing.py

Paths set inside the script:

  • WORKDIR: Base working directory.
  • BLAST_DIR: Directory containing BLAST result files (*.tsv.gz).
  • MAPPING_DIR: Directory containing unmapped peptide CSV files.
  • FIGURE_DIR: Directory where generated figures will be saved.
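
As a rough illustration of what "summarizes hits by species" involves, the sketch below loads the compressed result tables and tallies hits per organism. It is not the script's code: the species and identity column names are hypothetical stand-ins for whatever the stored tables actually contain, and the 90% identity cutoff is an arbitrary example.

import glob
import os
import pandas as pd

BLAST_DIR = "/path/to/workdir/blast"  # same role as the BLAST_DIR setting above

frames = []
for path in glob.glob(os.path.join(BLAST_DIR, "*.tsv.gz")):
    table = pd.read_csv(path, sep="\t")  # pandas decompresses .gz automatically
    table["peptide"] = os.path.basename(path).removesuffix(".tsv.gz")
    frames.append(table)

hits = pd.concat(frames, ignore_index=True)

# Count hits per organism above an example identity threshold.
species_counts = (
    hits[hits["identity"] >= 90.0]
    .groupby("species")
    .size()
    .sort_values(ascending=False)
)
print(species_counts.head(20))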

Output

  • Filtered BLAST results saved as CSV files in the mapping directory, separated into host (human) and meta-organism mappings.
  • Summary plots of BLAST statistics and species counts.
  • Logs printed with progress and key statistics.

Quantification using INQuant

The quantification is done using INQuant (https://github.com/UadKLab/INQuant). Some tuning parameters have been chosen and incorporated in the command-line execution here. Defaults are the same as those used by INQuant.

python quantification.py \
  --instanovo_processed_dir <str> \
  --mzml_dir <str> \
  --concat_files \
  --INQ_confidence_filter <float> \
  --INQ_cleavage_length <int> \
  --INQ_top_n_peptides <int> \
  --INQ_normalize_abundance <str>

Where:

  • instanovo_processed_dir: Directory containing preprocessed InstaNovo files.
  • mzml_dir: Directory containing .mzML files.
  • concat_files (optional): Concatenate mapping files before running INQuant. Default: True.
  • INQ_confidence_filter (optional): Minimum confidence threshold for keeping PSMs. Peptides with confidence below this value will be excluded. Default: 0.95.
  • INQ_cleavage_length (optional): Number of amino acids to include before and after the peptide sequence when reporting protein positions. Default: 4.
  • INQ_top_n_peptides (optional): Number of top-scoring peptides to use for protein quantification. Lower values increase stringency. Default: 5.
  • INQ_normalize_abundance (optional): Normalization method for abundance values. Options: median, mean, tic, false. Default: median.

Example usage

python quantification.py \
  --instanovo_processed_dir ./outputs/instanovo/ \
  --mzml_dir ./data/mzml/ \
  --concat_files \
  --INQ_confidence_filter 0.98 \
  --INQ_top_n_peptides 3 \
  --INQ_normalize_abundance mean
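
To make the normalization options concrete, median normalization typically rescales each sample so that all samples share a common median abundance. The sketch below shows that generic idea with pandas; it is not INQuant's implementation, and the sample data is made up.

import pandas as pd

def median_normalize(abundances: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so all samples share the same median abundance."""
    sample_medians = abundances.median(axis=0)
    target = sample_medians.mean()  # common reference level
    return abundances * (target / sample_medians)

# Example: proteins as rows, samples as columns
df = pd.DataFrame(
    {"sample_A": [10.0, 20.0, 30.0], "sample_B": [5.0, 12.0, 14.0]},
    index=["prot1", "prot2", "prot3"],
)
print(median_normalize(df).median(axis=0))  # per-sample medians are now equal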

Downstream analyses

Run the downstream analysis with default settings, or tune the missing-value filtering percentage and the number of KNN-imputation neighbors:

python blast_results_processing.py --work_dir /path/to/dir
python blast_results_processing.py --work_dir /path/to/dir --filt_missing_values 90 --KNN 3
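
Judging by the flag names, --filt_missing_values looks like a percentage cutoff for dropping features with too many missing values, and --KNN like the number of neighbors for KNN imputation. The sketch below shows that combination with scikit-learn as a hedged illustration; the actual script may implement it differently (e.g., via acore), and the example data is made up.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def filter_and_impute(df: pd.DataFrame, max_missing_pct: float = 90, k: int = 3) -> pd.DataFrame:
    """Drop features missing in more than max_missing_pct percent of samples,
    then impute the remaining gaps with k-nearest-neighbors."""
    missing_pct = df.isna().mean(axis=0) * 100
    kept = df.loc[:, missing_pct <= max_missing_pct]
    imputer = KNNImputer(n_neighbors=k)
    return pd.DataFrame(imputer.fit_transform(kept), index=kept.index, columns=kept.columns)

# Samples as rows, proteins as columns; NaN marks missing quantifications
data = pd.DataFrame(
    [[1.0, np.nan, 3.0], [2.0, 2.5, np.nan], [1.5, 2.0, 3.5]],
    columns=["protA", "protB", "protC"],
)
print(filter_and_impute(data, max_missing_pct=90, k=2))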
