PTM Stoichiometry #797

pcruzparri · 2024-08-27T19:29:32Z

Creating an mzLib method to calculate the stoichiometry (or site occupancy) of PTMs using the intensity of each quantified peak. The current inputs are the paths to the protein database file(s) (.xml) and the path to the AllQuantifiedPeaks.tsv file. The output, occupancyDict, is currently a dictionary of nested dictionaries with the following structure:

{{string PROTEIN1, {{int MAA1, {{string MODNAME1, double INTENSITY}, 
                                {string MODNAME2, double INTENSITY},
                                ...,
                                {string "Total", double INTENSITY}}} 
                    {int MAA2, {...}}, 
                   ...}},
 {string PROTEIN2, {...}},
 ...}

where PROTEINX is the protein accession, MAAX is the modified amino acid at protein position X, and MODNAME1 is the full label of the modification. For each MAAX, there is a "Total" key (instead of a modification name) that holds the total intensity of that amino acid measured in the quantified peaks file, including modified and unmodified peptides with that specific residue.

The general approach is to first get all of the modification intensities and record them in occupancyDict while building proteinSeqRangesSeen, a dictionary keyed by protein accession whose values are lists of (STARTINDEX, ENDINDEX, INTENSITY) tuples. This keeps track of the index ranges seen for each protein. Once all of the mods have been parsed, every modified amino acid that falls into one of those ranges has its "Total" intensity increased by that range's intensity.
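The two passes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `QuantifiedPeptide` record, its fields, and the `BuildOccupancy` method are hypothetical names, and peptides are assumed to already be mapped to one-based, inclusive protein positions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical input type: one quantified peptide mapped onto its protein.
// Start/End are assumed to be one-based, inclusive protein positions.
public record QuantifiedPeptide(string Accession, int Start, int End, double Intensity,
    Dictionary<int, string> ModsByProteinPosition);

public static class OccupancySketch
{
    public static Dictionary<string, Dictionary<int, Dictionary<string, double>>> BuildOccupancy(
        IEnumerable<QuantifiedPeptide> peptides)
    {
        var occupancyDict = new Dictionary<string, Dictionary<int, Dictionary<string, double>>>();
        var proteinSeqRangesSeen = new Dictionary<string, List<(int Start, int End, double Intensity)>>();

        // Pass 1: record modification intensities and remember every range seen per protein.
        foreach (var pep in peptides)
        {
            if (!occupancyDict.TryGetValue(pep.Accession, out var sites))
                occupancyDict[pep.Accession] = sites = new Dictionary<int, Dictionary<string, double>>();
            if (!proteinSeqRangesSeen.TryGetValue(pep.Accession, out var ranges))
                proteinSeqRangesSeen[pep.Accession] = ranges = new List<(int, int, double)>();
            ranges.Add((pep.Start, pep.End, pep.Intensity));

            foreach (var (position, modName) in pep.ModsByProteinPosition)
            {
                if (!sites.TryGetValue(position, out var mods))
                    sites[position] = mods = new Dictionary<string, double> { ["Total"] = 0 };
                mods.TryGetValue(modName, out var current); // current is 0 when the mod is new
                mods[modName] = current + pep.Intensity;
            }
        }

        // Pass 2: every recorded site covered by a range gains that range's intensity in "Total".
        foreach (var (accession, ranges) in proteinSeqRangesSeen)
            foreach (var (start, end, intensity) in ranges)
                foreach (var (position, mods) in occupancyDict[accession])
                    if (position >= start && position <= end)
                        mods["Total"] += intensity;

        return occupancyDict;
    }
}
```

For example, an unmodified peptide covering positions 1 to 10 with intensity 100 plus a second copy of that peptide carrying a mod at position 5 with intensity 50 would leave position 5 with 50 under the mod name and 150 under "Total".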

Following our discussion, I've listed below some items I'd like opinions on. I made them a task list primarily so I can keep track of what I've figured out.

Where should this code live in mzLib? The most reasonable suggestions so far are FlashLFQResults and Readers/QuantificationResults.
To interface this nicely with MetaMorpheus, what should the inputs be? My goal now is to look into how/where this will be integrated into MM, but any suggestions on places to look to figure this out are appreciated.
I have some ideas on making the code more efficient/succinct, especially foreseeing a lot more information about the peaks being readily available in MM (like the exact protein index for a peptide/peak). Any new ideas are welcomed.

Thanks in advance!

Alexander-Sol · 2024-08-28T19:02:37Z

mzLib/Test/FileReadingTests/TestPsmFromTsv.cs

+
+                        // get the localized modifications from the peptide full sequence and add any amino acid/modification combination not
+                        // seen yet to the occupancy dictionary
+                        foreach (KeyValuePair<int, List<string>> aaWithModList in peptideMods)


In situations like this, you can use "var aaWithModList" instead of specifying the actual class
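As a small illustration of the suggestion, the loop from the diff could be written with `var` as below. The wrapper class and method (`VarExample.Describe`) are made up so the snippet stands alone; only the loop header reflects the reviewed code.

```csharp
using System;
using System.Collections.Generic;

public static class VarExample
{
    // Hypothetical helper that just formats each entry, to give the loop a body.
    public static List<string> Describe(Dictionary<int, List<string>> peptideMods)
    {
        var lines = new List<string>();
        // `var` is inferred as KeyValuePair<int, List<string>> from the dictionary being enumerated.
        foreach (var aaWithModList in peptideMods)
            lines.Add($"{aaWithModList.Key}: {string.Join(", ", aaWithModList.Value)}");
        return lines;
    }
}
```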

nbollis

I think readers/Quant... is the best place for it. That way it can be used to find occupancy of the results from another software should that be desired.

To optimize the inputs and outputs of the function, you should break your test method into two: one test method that reads in all the data you need, and another method (not a test method) that gets called to calculate the occupancy. This will help you better understand what the method needs, and help us make recommendations.

pcruzparri · 2024-10-11T16:55:23Z

Requesting a second round of reviews! The second-to-last commit describes most of the changes in a little more detail. The work currently pending is to create a small enough subset of the raw data to write a test similar to the TestFlashLFQoutputRealData() test. More rigorous testing can be done with some of the identifications in the vignette data, since some base sequences have enough variation in fullSequence mods and positions to give better case coverage.

I'd be happy to hear about 1) code optimization, 2) the currently written tests, and 3) clarifications on code commenting. In a conversation, Nic suggested using objects for my main PTM calculation code rather than the 5-level-deep dictionary; thoughts on that would be useful as well. Ofc, anything else is useful. TIA!
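To make the object-based alternative concrete, here is one possible shape for it. This is only a sketch: `SiteOccupancy` and `ProteinOccupancy` are hypothetical types, not classes from this PR, and the `Occupancy` method assumes that occupancy is taken as the mod's intensity divided by the site's total intensity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types, for illustration only.
public class SiteOccupancy
{
    public int ProteinPosition { get; }
    public double TotalIntensity { get; set; }
    public Dictionary<string, double> IntensityByMod { get; } = new();

    public SiteOccupancy(int proteinPosition) => ProteinPosition = proteinPosition;

    // Assumed definition: fraction of the site's total intensity carrying the given mod.
    // Returns 0 when the mod was not seen or nothing was measured at the site.
    public double Occupancy(string modName) =>
        TotalIntensity > 0 && IntensityByMod.TryGetValue(modName, out var intensity)
            ? intensity / TotalIntensity
            : 0;
}

public class ProteinOccupancy
{
    public string Accession { get; }
    public Dictionary<int, SiteOccupancy> Sites { get; } = new();

    public ProteinOccupancy(string accession) => Accession = accession;
}
```

Compared with the nested dictionary, the "Total" entry becomes a typed property instead of a magic key, and each level gets a name that shows up in method signatures.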

codecov · 2024-10-11T22:23:36Z

Codecov Report

Attention: Patch coverage is 88.18182% with 26 lines in your changes missing coverage. Please review.

Project coverage is 77.83%. Comparing base (2f85ac1) to head (1fe2c97).

Files with missing lines	Patch %	Lines
mzLib/MzLibUtil/PositionFrequencyAnalysis.cs	88.62%	17 Missing and 2 partials ⚠️
mzLib/FlashLFQ/Peptide.cs	0.00%	7 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #797      +/-   ##
==========================================
+ Coverage   77.78%   77.83%   +0.04%     
==========================================
  Files         229      230       +1     
  Lines       34159    34351     +192     
  Branches     3539     3570      +31     
==========================================
+ Hits        26570    26736     +166     
- Misses       6985     7009      +24     
- Partials      604      606       +2

Files with missing lines	Coverage Δ
mzLib/FlashLFQ/FlashLFQResults.cs	`92.04% <ø> (ø)`
mzLib/MzLibUtil/ClassExtensions.cs	`100.00% <100.00%> (ø)`
mzLib/Omics/SpectrumMatch/SpectrumMatchFromTsv.cs	`97.05% <100.00%> (-0.29%)`	⬇️
mzLib/FlashLFQ/Peptide.cs	`92.85% <0.00%> (-7.15%)`	⬇️
mzLib/MzLibUtil/PositionFrequencyAnalysis.cs	`88.62% <88.62%> (ø)`

trishorts · 2024-11-01T15:28:31Z

mzLib/MzLibUtil/ClassExtensions.cs

+        {
+            // use a regex to get all modifications
+            string pattern = @"\[(.+?)\]";
+            Regex regex = new(pattern);


We need to make sure that this method never accidentally identifies
[hydroxylation]EPT[phospho] as a single mod for P[hydroxylation]EPT[phospho]IDE.
I'm not sure that ]EPT[ won't be ignored by your regex.

After finding an opening bracket, the regex will always match up to the next closing bracket, except (updated now) in the case where the closing bracket belongs to an ion charge state.
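The behavior of the lazy pattern on the sequence from the comment can be checked with a short snippet like the one below. The class and method names (`BracketRegexDemo`, `FindMods`, `FindModsGreedy`) are invented for the demonstration, and the charge-state exception mentioned above is not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketRegexDemo
{
    // Lazy quantifier: each match stops at the first closing bracket after the opening one.
    public static List<string> FindMods(string fullSequence) =>
        Collect(new Regex(@"\[(.+?)\]"), fullSequence);

    // Greedy variant, shown only for contrast: it runs on to the last closing bracket.
    public static List<string> FindModsGreedy(string fullSequence) =>
        Collect(new Regex(@"\[(.+)\]"), fullSequence);

    private static List<string> Collect(Regex regex, string input)
    {
        var mods = new List<string>();
        foreach (Match match in regex.Matches(input))
            mods.Add(match.Groups[1].Value); // group 1 is the text between the brackets
        return mods;
    }
}
```

For `P[hydroxylation]EPT[phospho]IDE`, `FindMods` returns `hydroxylation` and `phospho` as two separate entries, while `FindModsGreedy` returns the single string `hydroxylation]EPT[phospho`, which is the failure mode raised in the review. A mod name that itself contains square brackets would still need separate handling.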

trishorts · 2024-11-01T15:36:09Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+namespace MzLibUtil
+{
+    // Should this have all of the parent data (i.e. protein group, protein, peptide, peptide position)? Unnecessary for now, but probably useful later.
+    public class UtilModification


UtilModification => LocalizedModificationFromTsv
modName => IdWithMotif
position => PeptidePositionZeroIsNterminus

trishorts · 2024-11-01T15:36:33Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+    {
+        public string FullSequence { get; set; }
+        public string BaseSequence { get; set; }
+        public UtilProtein ParentProtein { get; set; }


maybe this should be ProteinGroup?

trishorts · 2024-11-01T15:38:04Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+        }
+    }
+
+    public class UtilProtein


flashlfq proteingroup

trishorts · 2024-11-01T15:42:13Z

it's possible that "Identification" should be the currency of this realm as it is what is passed into flashlfq by MM and it is what is generated by FlashLFQ when it is run alone on any acceptable input.

pcruzparri · 2025-02-16T10:08:55Z

Some pending changes:
Potentially modify the input to ProteinGroupsOccupancyByPeptide() to be some object and create an interface rather than a list.
Remove occupancy additions to FlashLFQ. These additions have been moved to MetaMorpheus.
Create new package for modification quantification within MzLib.

Side Note:
This branch/pr is a mess. I made a new clean branch on my fork from the master branch and cherry-picked the important, recent commits from this branch. I can change the pull request to track that branch unless there is opposition.

…. Changed the BioPolymerWithSetModsExtensions class to write full sequences separating the C-terminus with a dash. Updated some of the tests that failed because of the new notation of C-terminus mods. Some tests are still failing, and will be updated once happy with this general change.

…t handle ambiguity (or multiple mods at the same position). Modified the corresponding tests or commented them out in case we want to revert.

…ve amino acid positions depending on the length for the modification string and its index. Current approach fixes that.

…sitionFrequencyAnalysis UtilProtein class (now updates peptide mod positions to protein positions) and PFA argument (list of named tuple for clarity)

…ing to master and matching content

…d of FlashLFQ to output occupancy. Updated UtilClasses for correct UtilProtein.ModifiedAminoAcidPositionsInProtein positions.

…g work but untested. WIP.

…urrently does not adequately identify N-Terminus Mods. Make sure UtilProtein.SetProteinModsFromPeptides correctly adds terminal protein mods. Saving but needs more rigorous testing once ParseModifications updated (in separate PR) to correctly parse N-Terminus mods. WIP

…s sequences, since it covers most but not many interesting cases. Best to remove it to maintain code coverage. I will add some notes on the issue on the PR for future reference.

… not finished and there are errors.

…tion of occupancy implementation and fixed quantifiedprotein and quantifiedpeptide classes. Might need to update the inputs of occupancy calculation to handle position in proteins.

…ccupancies. More refactoring done. ProteinGroup occupancy seems to be working. Still polishing peptide occupancy. Metamorpheus branch updated to work with PG occupancy refactoring. Will add peptide occupancy output to metamorpheus branch.

…g peptides from different experiments.

…corporating updated. N-terminus mod writing.

pcruzparri · 2025-09-16T20:36:02Z

Stale and updated in new branch with updated and clean commit history. Pull request addressing these changes (and more) is now #955.

pcruzparri requested review from Alexander-Sol, nbollis and trishorts August 27, 2024 19:30

Alexander-Sol reviewed Aug 28, 2024

nbollis reviewed Sep 4, 2024