@pcruzparri
Contributor

New PR with changes from PR #797 and more.

Description:
This PR adds PTM stoichiometry classes for each structural level (protein group -> protein -> (base) peptide -> modification). The most important class to understand is probably QuantifiedPeptide: each object represents a collection of post-translationally modified variants that share the same base sequence. This contrasts with other Peptide classes, which represent a peptide with a given full sequence. The QuantifiedProtein class stores the QuantifiedPeptide objects, handles peptide-to-protein indexing, and computes the modification stoichiometry for the protein. The QuantifiedModification class is primarily a data class, and the QuantifiedProteinGroup class is mostly a container for the QuantifiedProtein members of the protein group. Lastly, the PositionFrequencyAnalysis class has a method that takes a list of (full sequence, protein group list, intensity) tuples and creates a collection of QuantifiedProteinGroup objects.

@codecov

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 82.22222% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.99%. Comparing base (a682b7a) to head (6389de1).
⚠️ Report is 2 commits behind head on master.

⚠️ Current head 6389de1 differs from pull request most recent head 302edd7

Please upload reports for the commit 302edd7 to get more accurate results.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...til/PositionFrequencyAnalysis/QuantifiedProtein.cs | 71.26% | 19 Missing and 6 partials ⚠️ |
| ...til/PositionFrequencyAnalysis/QuantifiedPeptide.cs | 81.42% | 12 Missing and 1 partial ⚠️ |
| ...tionFrequencyAnalysis/PositionFrequencyAnalysis.cs | 97.14% | 0 Missing and 1 partial ⚠️ |
| ...ositionFrequencyAnalysis/QuantifiedProteinGroup.cs | 93.33% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #955      +/-   ##
==========================================
+ Coverage   80.97%   80.99%   +0.01%     
==========================================
  Files         269      274       +5     
  Lines       38744    38969     +225     
  Branches     4228     4267      +39     
==========================================
+ Hits        31374    31561     +187     
- Misses       6639     6669      +30     
- Partials      731      739       +8     

| Files with missing lines | Coverage Δ |
|---|---|
| mzLib/MzLibUtil/ClassExtensions.cs | 100.00% <100.00%> (+1.68%) ⬆️ |
| ...ositionFrequencyAnalysis/QuantifiedModification.cs | 100.00% <100.00%> (ø) |
| ...tionFrequencyAnalysis/PositionFrequencyAnalysis.cs | 97.14% <97.14%> (ø) |
| ...ositionFrequencyAnalysis/QuantifiedProteinGroup.cs | 93.33% <93.33%> (ø) |
| ...til/PositionFrequencyAnalysis/QuantifiedPeptide.cs | 81.42% <81.42%> (ø) |
| ...til/PositionFrequencyAnalysis/QuantifiedProtein.cs | 71.26% <71.26%> (ø) |

@pcruzparri pcruzparri marked this pull request as ready for review October 2, 2025 18:53
@acesnik acesnik self-requested a review October 4, 2025 00:34
@acesnik
Collaborator

acesnik commented Oct 6, 2025

Hi there! @trishorts asked me to take a look at this one. It seems like a great pull request for getting modification ratios from results.

On stoichiometry, it's worth considering Jesper Olsen's work on stoichiometry measurements back in 2010 and what's been going on since then. Getting a true stoichiometry measure has required some tricks in the past to get the fully unoccupied quantity. I think the ratios might be useful, but I've struggled in the past when I've calculated such ratios to claim confidently that they're stoichiometries. You could consider using that original data to compare how ratios stack up to the measured stoichiometries if you're writing this up in a paper. https://www.science.org/doi/abs/10.1126/scisignal.2000475

I haven't looked at this logic in a long time, and it looks like it's been changed a bit, but do you think there is work to be done in mzLib/Omics/BioPolymer/VariantApplication.cs, or do you think there's logic to add to these new methods regarding sequence variations? Proteins with amino acid variations may have modifications listed within those data structures, https://github.com/acesnik/mzLib/blob/master/mzLib/Omics/BioPolymer/SequenceVariation.cs#L24. Should those be quantified and tested for these ratio calculations in the situations when amino acid variations are added?

@pcruzparri
Contributor Author

@acesnik thank you for your thoughts and the paper you shared! I had not come across it and still need to get through it more closely, but it seems like a great reference for a paper.

You are absolutely right that this calculation just outputs the residue-specific intensity ratios, and that is not the full picture of a site-occupancy calculation. I still need to compare the values obtained from this analysis to the residue-specific PSM count ratios we currently output in MetaMorpheus, but the goal is to output something quantitatively closer to a mod's stoichiometry. I'm thinking that the correction/scaling of intensities to account for peptide ionization efficiency would be a follow-up enhancement to this code (or maybe on the MetaMorpheus end?). There is still a little more exploring I'd like to do before deciding on which approaches we'd like to support on mzLib or MetaMorpheus for stoichiometry calculation, but at least whoever would like to implement their own normalization/scaling strategies can do so from the output being facilitated here in the meantime.
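
To make the distinction concrete, here is a minimal sketch of a residue-specific intensity ratio: the summed intensity of peptide forms carrying the modification at a site, divided by the summed intensity of all forms covering that site. It is an illustration only, not the code in this PR. `OccupancySketch`, `IntensityRatio`, and the example sequences and intensities are hypothetical, and no ionization-efficiency correction is applied, so the result is a ratio rather than a measured stoichiometry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input: peptide forms that cover one protein position, each with
// an intensity and a flag saying whether that residue carries the modification.
var forms = new List<(string FullSequence, bool ModifiedAtSite, double Intensity)>
{
    ("PEPT[Common Biological:Phosphorylation on T]IDE", true, 30.0),
    ("PEPTIDE", false, 70.0),
};

// 30 / (30 + 70) = 0.3 for the example values above.
double ratio = OccupancySketch.IntensityRatio(
    forms.Select(f => (f.ModifiedAtSite, f.Intensity)));
Console.WriteLine(ratio);

static class OccupancySketch
{
    // Returns modified intensity / total intensity, or NaN when nothing covers the site.
    public static double IntensityRatio(
        IEnumerable<(bool ModifiedAtSite, double Intensity)> forms)
    {
        double modified = 0, total = 0;
        foreach (var (isModified, intensity) in forms)
        {
            total += intensity;
            if (isModified) modified += intensity;
        }
        return total > 0 ? modified / total : double.NaN;
    }
}
```

Any normalization or scaling strategy would replace the raw intensities before this division; the ratio itself stays the same shape.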

As for sequence variations, variants in MetaMorpheus will get their own accessions, be treated as different proteins, and be placed accordingly into protein groups. Once the protein groups and variant sequences are extracted, this code will treat them like any other protein group. When using this code directly from mzLib, sequence variants would need to be passed in as different proteins. In either case, the code here (specifically, the SetUpQuantificationObjectsFromFullSequences() method) is written to take in only a full peptide sequence, the names of the protein groups the full sequence belongs to, and the peptide's intensity, and then create new quantification objects. All mod information is parsed from the full sequence string and localized using the full sequence and the provided protein sequences (which would require passing each variant protein sequence separately).
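
A rough sketch of the parsing and localization step described above, under stated assumptions: modifications appear as bracketed annotations inside the full sequence, and the peptide is located in a protein (or variant protein) sequence by a plain substring search. `SequenceSketch` and `StripModifications` are hypothetical helpers, not mzLib APIs; the example sequences are made up, and terminal-modification syntax is not handled.

```csharp
using System;
using System.Text;

// Hypothetical full sequence and protein sequence for illustration.
string fullSequence = "PEPT[Common Biological:Phosphorylation on T]IDE";
string proteinSequence = "MAAPEPTIDEK";

string baseSequence = SequenceSketch.StripModifications(fullSequence);

// IndexOf returns -1 when the peptide is absent, so a result of 0 means "not found".
int oneBasedStart = proteinSequence.IndexOf(baseSequence, StringComparison.Ordinal) + 1;
Console.WriteLine($"{baseSequence} starts at protein position {oneBasedStart}");

static class SequenceSketch
{
    // Drops everything inside square brackets, tracking depth so that
    // brackets nested inside a modification name are skipped as well.
    public static string StripModifications(string fullSequence)
    {
        var sb = new StringBuilder();
        int depth = 0;
        foreach (char c in fullSequence)
        {
            if (c == '[') depth++;
            else if (c == ']') depth--;
            else if (depth == 0) sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Because localization depends on the protein sequence supplied, a variant would be localized correctly only if its own sequence is passed in as a separate entry.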

Please let me know if you have further thoughts/concerns I can address!

…culation in mzlibutils were copied from the previous branch onto this one. Need to add/remake the tests next.
…s. Need tests for the protein groups and the occupancy set up (currently called CalculateOccupancies).
…ts population) from SetUpQuantificationObjects method for now.
Contributor

@Alexander-Sol Alexander-Sol left a comment

Left a couple of review comments. Two things that jump out:

  • Lack of summary comments. There are a lot of dictionaries in use, and I'm not always sure what the keys are just from looking at them. It would be nice to have summary comments that explain what each dictionary's keys and values are.
  • IdWithMotif doesn't include the modification type (the part before the colon).

{
public Dictionary<string, QuantifiedProteinGroup> ProteinGroups { get; private set; }

//public Dictionary<string, (QuantifiedPeptide QuantifiedPeptide, string ProteinGroups)> Peptides { get; private set; }
Contributor

Remove commented out property

/// all of the amino acids in that peptide.</returns>
///
public void SetUpQuantificationObjectsFromFullSequences(List<(string fullSeq, List<string> proteinGroups, double intensity)> peptides, Dictionary<string, string> proteinSequences=null)
{
Contributor

Instead of passing in a list of tuples, have you considered making a lightweight class to hold that information, such as a record class that stores the sequence, protein groups, and intensity? Also, it's not clear what the proteinGroups are and what information they contain. In IQuantifiableRecord, a tuple stores accessions, gene names, and organisms for the different protein groups. Just using accessions would probably work as well, but I would like guidance on what the proteinGroups actually are.
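
One possible shape for such a record, as a sketch only: `PeptideQuantInput` and its property names are hypothetical, the accession `P12345` is a placeholder, and treating each protein group entry as a plain string is an assumption that the question above would need to settle.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical inputs replacing the (fullSeq, proteinGroups, intensity) tuples.
var inputs = new List<PeptideQuantInput>
{
    new PeptideQuantInput(
        "PEPT[Common Biological:Phosphorylation on T]IDE",
        new List<string> { "P12345" },
        1.5e6),
    new PeptideQuantInput("PEPTIDE", new List<string> { "P12345" }, 4.0e6),
};

foreach (var input in inputs)
{
    Console.WriteLine(
        $"{input.FullSequence} in [{string.Join("|", input.ProteinGroups)}]: {input.Intensity}");
}

// Positional record: named, immutable properties instead of tuple fields.
public sealed record PeptideQuantInput(
    string FullSequence,
    IReadOnlyList<string> ProteinGroups,
    double Intensity);
```

A record gives each field a documented name and leaves room for extra metadata (gene names, organisms) without changing the method signature.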

{
ProteinGroups[pg] = new QuantifiedProteinGroup(pg);
}
var proteinGroup = ProteinGroups[pg];
Contributor

If there are actually multiple protein groups associated with the peptide, should we store the combined protein group in the dictionary? What would happen if we split first, as in line 40, then added them to the dictionary?

public string BaseSequence { get; set; }
public QuantifiedProtein ParentProtein { get; set; }
public int OneBasedStartIndexInProtein { get; set; }
public Dictionary<int, Dictionary<string, QuantifiedModification>> ModifiedAminoAcidPositions { get; set; }
Contributor

What string serves as the key in the <string, QuantMod> dictionary?

Contributor

If the string just stores position, could ModifiedAminoAcidPositions just be an <int, List> Dictionary? The position string seems redundant.

peptide.OneBasedStartIndexInProtein = Sequence.IndexOf(peptide.BaseSequence) + 1;
}
// if peptide has no modifications, add to all its positions
if (!peptide.ModifiedAminoAcidPositions.IsNotNullOrEmpty())
Contributor

IsNullOrEmpty() would be slightly cleaner

public string Name { get; set; }
public Dictionary<string, QuantifiedProtein> Proteins { get; set; }

public QuantifiedProteinGroup(string name, Dictionary<string, QuantifiedProtein> proteins = null)
Contributor

What is the string key here?

[Test]
public void TestQuantifiedModification()
{
var quantmod = new QuantifiedModification(idWithMotif: "TestMod: ModX on AAY", positionInPeptide: 1, positionInProtein: 2, intensity: 10);
Contributor

IdWithMotif just refers to the part after the colon. The full mod string is {Modification Type}: {Id with motif}
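
As an illustration of that convention, a small sketch that splits a full mod string at the first colon. `ModStringSketch.Split` is a hypothetical helper, not an mzLib method, and returning an empty type when no colon is present is an arbitrary choice made for the example.

```csharp
using System;

// The string used in the test above, split into its two parts.
var (modType, idWithMotif) = ModStringSketch.Split("TestMod: ModX on AAY");
Console.WriteLine($"type = '{modType}', idWithMotif = '{idWithMotif}'");

static class ModStringSketch
{
    // Splits "{Modification Type}: {Id with motif}" at the first colon.
    public static (string ModificationType, string IdWithMotif) Split(string fullModString)
    {
        int colon = fullModString.IndexOf(':');
        if (colon < 0)
        {
            // No type prefix present; treat the whole string as the id with motif.
            return (string.Empty, fullModString.Trim());
        }
        return (fullModString.Substring(0, colon).Trim(),
                fullModString.Substring(colon + 1).Trim());
    }
}
```

Under this split, the test's idWithMotif argument would be "ModX on AAY", with "TestMod" as the modification type.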

…tide input for setting up the protein groups and the quantifications.