Make "Associate Proteins" more forgiving when peptide filter or digest settings would exclude the peptide #3169

nickshulman · 2024-09-25T02:06:20Z

Mike has had a lot of documents where a small fraction of peptides end up being "unmapped" even though all of the peptides were found by a peptide search engine using the same FASTA file that's being used for the protein association.

This change makes it so that if the peptide would otherwise be "unmapped" (that is, it does not satisfy the digestion criteria for any protein in the FASTA file) the peptide becomes associated with all proteins whose sequence contains the peptide sequence.
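
The rule described above can be sketched roughly as follows. This is an illustration only, not the code in ProteinAssociation.cs: it assumes a trypsin-like enzyme that cleaves after K or R, ignores the proline rule and the missed-cleavage limit, and the names `FallbackAssociationSketch`, `IsEnzymaticIn` and `Associate` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FallbackAssociationSketch
{
    // Assumption: trypsin-like enzyme cleaving after K or R (proline rule ignored).
    private static bool IsCleavageResidue(char aa)
    {
        return aa == 'K' || aa == 'R';
    }

    // True if some occurrence of the peptide in the protein has an
    // enzymatic cleavage site (or a protein terminus) at both ends.
    public static bool IsEnzymaticIn(string protein, string peptide)
    {
        if (string.IsNullOrEmpty(peptide))
            return false;
        for (int start = protein.IndexOf(peptide, StringComparison.Ordinal);
             start >= 0;
             start = protein.IndexOf(peptide, start + 1, StringComparison.Ordinal))
        {
            int end = start + peptide.Length; // index just past the peptide
            bool nTermOk = start == 0 || IsCleavageResidue(protein[start - 1]);
            bool cTermOk = end == protein.Length || IsCleavageResidue(protein[end - 1]);
            if (nTermOk && cTermOk)
                return true;
        }
        return false;
    }

    // proteins maps protein name to sequence. Returns sorted protein names.
    public static List<string> Associate(IDictionary<string, string> proteins, string peptide)
    {
        // First choice: proteins for which the peptide satisfies the digestion criteria.
        var strict = proteins.Where(p => IsEnzymaticIn(p.Value, peptide))
            .Select(p => p.Key).OrderBy(n => n, StringComparer.Ordinal).ToList();
        if (strict.Count > 0)
            return strict;
        // Fallback: the peptide would otherwise be "unmapped", so accept
        // every protein whose sequence merely contains it.
        return proteins.Where(p => p.Value.Contains(peptide))
            .Select(p => p.Key).OrderBy(n => n, StringComparer.Ordinal).ToList();
    }
}
```

With this shape, a peptide that is enzymatic in at least one protein is never spread across the proteins that only contain it as a substring; the fallback applies only when the strict list is empty.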

chambm · 2024-09-26T18:10:33Z

Sorry for the delay. Couldn't this just be a checkbox like "Apply document filters to protein association"?

nickshulman · 2024-09-26T18:59:34Z

> Sorry for the delay. Couldn't this just be a checkbox like "Apply document filters to protein association"?

We normally think of those filters as controlling whether the peptide is added to the document at all. Given that a peptide is in the document which does not pass the filter criteria, would a user actually want that peptide to be put into the "unmapped" peptide list because it had too many missed cleavages?

Mike's document contains peptides which do not pass the filter criteria because he used the "Add All" button in the Library Explorer (he probably also had to answer "yes" when Skyline told him that some of the peptides did not match the filter criteria).

By the way, there is a "Pick peptides matching" setting on the "Library" tab of the Peptide Settings dialog which asks whether peptides should be included, but that only affects the settings on the "Filter" tab of the "Peptide Settings" dialog.

This setting is sort of in the spirit of what you are proposing.

I think I was mistaken that Protein Association pays attention to any of the filter settings on the Filter tab in the Peptide Settings dialog. Actually, all that matter are "exclude ragged ends", "max missed cleavages" and the enzyme itself.
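
As an illustration of the "max missed cleavages" criterion mentioned above, here is a minimal sketch. It assumes a trypsin-like enzyme that cleaves after K or R and ignores the proline rule; it does not model "exclude ragged ends", and `MissedCleavageSketch` and its methods are hypothetical names rather than Skyline APIs.

```csharp
using System;
using System.Linq;

public static class MissedCleavageSketch
{
    // Assumption: trypsin-like cleavage after K or R; the "not before P" rule is ignored.
    public static int CountMissedCleavages(string peptide)
    {
        if (string.IsNullOrEmpty(peptide))
            return 0;
        // Every K or R before the final residue is a site the enzyme
        // could have cut but did not.
        return peptide.Take(peptide.Length - 1).Count(aa => aa == 'K' || aa == 'R');
    }

    public static bool PassesMaxMissedCleavages(string peptide, int maxMissedCleavages)
    {
        return CountMissedCleavages(peptide) <= maxMissedCleavages;
    }
}
```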

chambm · 2024-09-26T19:47:36Z

It seems users often expect the external search results to override the document settings, or at least the document settings that aren't exposed in the wizard. Perhaps the associate proteins logic should not do any filtering and rely on other Skyline code to do that?

nickshulman · 2024-09-26T21:52:04Z

> It seems users often expect the external search results to override the document settings, or at least the document settings that aren't exposed in the wizard. Perhaps the associate proteins logic should not do any filtering and rely on other Skyline code to do that?

I am not sure I understand what you mean by "filtering".
"Associate Proteins" does not make any decisions about which peptides are in the document, only which proteins they belong to. Currently, peptides which do not satisfy the Max Missed Cleavages or Ragged Ends setting are getting put into an "unmapped" peptide list.

If we wanted to make it so the "Add All" button on the Library Explorer was unnecessary to get all the found peptides into the document, the place to change that would be "FastaImporter.Import".

pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs

chambm · 2024-10-08T19:26:58Z

Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

nickshulman · 2024-10-08T21:17:23Z

> Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

It does not appear that there are any changes to which peptides are associated with which proteins.

That test passes on my computer. I wonder whether my extra ParallelEx.ForEach has introduced some sort of timing-related intermittent failure.

nickshulman · 2024-10-08T22:07:20Z

> Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

We have seen this intermittent failure in TestHugeAssociateProteins in the nightlies:
https://skyline.ms/testresults/home/development/Performance%20Tests/showRun.view?runId=70857

I imagine this PR makes the intermittent failure more likely.
I can provoke the failure in master by shuffling the proteins in "FindProteinMatches":

    var random = new Random((int)DateTime.UtcNow.Ticks);
    ParallelEx.ForEach(proteinSource.Proteins.OrderBy(p => random.Next()), fastaRecord =>

Should I try to figure out why the order that these proteins get processed affects the results, or should I just make it so that the results do not depend on the order in which the individual ParallelEx items finish?
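
One way the second option could look is sketched below. This is not the fix that went into the PR; it uses the standard `System.Threading.Tasks.Parallel` as a stand-in for Skyline's `ParallelEx`, and `OrderIndependentParallelSketch` and `FindMatches` are hypothetical names. Each parallel iteration writes only to its own slot, and the slots are merged afterwards in input order, so the result cannot depend on which iteration finishes first.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OrderIndependentParallelSketch
{
    // Returns the names of proteins containing the peptide, always in input order.
    public static List<string> FindMatches(KeyValuePair<string, string>[] proteins, string peptide)
    {
        var found = new bool[proteins.Length];
        // Each iteration touches only found[i], so no lock is needed and
        // the order in which iterations complete does not matter.
        Parallel.For(0, proteins.Length, i =>
        {
            found[i] = proteins[i].Value.Contains(peptide);
        });

        // Merge sequentially so the output order is deterministic.
        var result = new List<string>();
        for (int i = 0; i < proteins.Length; i++)
        {
            if (found[i])
                result.Add(proteins[i].Key);
        }
        return result;
    }
}
```

Appending to a shared list from inside the parallel loop would also find every match, but the list order would then vary from run to run, which is the kind of dependency that can surface as an intermittent test failure.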

chambm · 2024-10-09T02:15:53Z

Let's see what that does to the test time tonight.

chambm · 2024-10-09T16:36:13Z

pwiz_tools/Skyline/Test/ProteinAssociationTest.data/TwoProteins.fasta

@@ -0,0 +1,8 @@
+>Protein1


chambm · 2024-10-09

Do we want a pattern of many unit tests having their own .data directory with small test data files in them? I think you could just write this out to a temp file directly from the code to keep the repo tidier. I love the new ability to keep things in .data dir instead of a .zip file, but think we should still use it judiciously. Like .sky files. Those would be a pain to write directly from test code (as xml I mean; the settings can be generated programmatically). As would any big DSV file.

nickshulman · 2024-10-09

The only way I know to create and clean up a temporary directory is by using "TestFilesDir".
Do you know of an easier way to do that which does not require either a .zip file or .data folder?

chambm · 2024-10-09

How about:
    using var testDir = new TemporaryDirectory(Path.Combine(TestContext.TestRunDirectory, TestContext.TestName));
If it works we could definitely have a shortcut for that in AbstractUnitTest. :)

nickshulman · 2024-10-09

Yep, that works. Thanks!
I am probably going to add a method to ProteinAssociation which takes a ProteinSource so that it can be used without any file on disk.

nickshulman · 2024-10-09

Unfortunately, TestContext.TestRunDirectory is null when running from TestRunner.
I added a method ProteinAssociation.UseProteinSource so I do not need a file on disk.
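
For reference, a self-cleaning scratch directory that does not depend on TestContext could be sketched like this. It is not Skyline's `TemporaryDirectory` class; `ScratchDirectory`, `DirPath` and `WriteFile` are hypothetical names, and the sketch assumes that a randomly named folder under `Path.GetTempPath()` is an acceptable location for test files.

```csharp
using System;
using System.IO;

public sealed class ScratchDirectory : IDisposable
{
    public string DirPath { get; private set; }

    public ScratchDirectory()
    {
        // Random folder name under the system temp path, so no TestContext is needed.
        DirPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(DirPath);
    }

    // Writes a small text file (for example a FASTA file) and returns its full path.
    public string WriteFile(string fileName, string contents)
    {
        string fullPath = Path.Combine(DirPath, fileName);
        File.WriteAllText(fullPath, contents);
        return fullPath;
    }

    public void Dispose()
    {
        // Recursively remove the folder and everything written into it.
        if (Directory.Exists(DirPath))
            Directory.Delete(DirPath, true);
    }
}
```

A test would wrap this in a `using` block and call `WriteFile` with the FASTA text, so the file disappears when the block exits. Passing a protein source directly, as described above, avoids the file altogether.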

chambm

The huge test's timing on TeamCity is pretty volatile. When I look at all PRs, your changes don't really seem to make a big difference. But just looking at this branch (https://teamcity.labkey.org/test/5831976084300368231?currentProjectId=ProteoWizard&expandTestHistoryChartSection=true&branch=pull%2F3169), there's a 15 second difference between before and after the change from your new parallel loop to making it serial. I don't remember how much of the time in that test is spent on the parsimony stuff vs. the protein matching, so I had to run it myself. It's a pretty significant chunk, but a lot is also in the OkDialog saving the document tree. Can you check the performance impact of using foreach vs. ParallelEx.ForEach? I got 40s with this in PerfAssociateProteinsHugeTest:


            Stopwatch sw = Stopwatch.StartNew();
            var proteinsDlg = ShowDialog<AssociateProteinsDlg>(SkylineWindow.ShowAssociateProteinsDlg);
            if (type == ImportType.FASTA)
            {
                RunUI(() => proteinsDlg.FastaFileName = _fastaFile);
            }
            else
            {
                RunUI(proteinsDlg.UseBackgroundProteome);
            }

            WaitForCondition(() => !proteinsDlg.IsBusy);
            Console.WriteLine("Association time without parsimony: " + sw.ElapsedMilliseconds);
            //PauseTest();
            RunUI(() =>
            {
                Assert.AreEqual(425484, proteinsDlg.FinalResults.PeptidesMapped);
                Assert.AreEqual(0, proteinsDlg.FinalResults.PeptidesUnmapped);
                Assert.AreEqual(84198, proteinsDlg.FinalResults.ProteinsMapped);
                Assert.AreEqual(4281, proteinsDlg.FinalResults.ProteinsUnmapped);
            });
            CancelDialog(proteinsDlg, proteinsDlg.CancelDialog);
            return;

nickshulman · 2024-10-10T23:15:56Z

> Can you check the performance impact of using foreach vs. ParallelEx.ForEach?

I moved some of the stuff from the second loop into "ParallelEx.For" and that does seem to have made it go faster.

nickshulman added 3 commits September 24, 2024 12:29

Allow non-tryptic peptides to be associated with a protein so long as…

d21d2c5

… there is no other protein for which the peptide is tryptic

Fix problem where results were being updated at wrong spot in the loop

2fffb19

Fix incorrect use of "Parallel" instead of "ParallelEx"

56373de

nickshulman requested a review from chambm September 25, 2024 02:06

Merge branch 'master' into Skyline/work/20240923_AssociateProteinsLes…

3a84298

…sStrictFiltering

chambm reviewed Oct 8, 2024

pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs

Merge branch 'master' into Skyline/work/20240923_AssociateProteinsLes…

b7a47f5

…sStrictFiltering

Fix intermittent failure in TestHugeAssociateProteins

10f2de0

nickshulman added 2 commits October 9, 2024 08:28

Use Tuple<string, bool> in "ProteinPeptideMatches"

7b0e4e4

Add "ProteinAssociationTest"

71c918b

nickshulman marked this pull request as ready for review October 9, 2024 16:08

chambm self-requested a review October 9, 2024 16:19

chambm reviewed Oct 9, 2024

nickshulman added 8 commits October 9, 2024 12:14

Delete "TwoProteins.fasta" and use TemporaryDirectory instead

a67a923

Use ParallelEx.For to enumerate over ProteinPeptideMatches objects

2b433d1

Add method "ProteinAssociation.UseProteinSource"

6a0ae1b

Merge remote-tracking branch 'remotes/origin/master' into Skyline/wor…

da5516b

…k/20240923_AssociateProteinsLessStrictFiltering

Change "AssociateProteins" to take an Enzyme instead of passing in a …

1a71b69

…method which does the digestion.

Remove inaccurate comment

4235938

Use Enzyme from the document instead of passing it in.

e27a2f8

Fix TestAssociateProteins

234c1b0

Merge remote-tracking branch 'remotes/origin/master' into Skyline/wor…

fb7a62c

…k/20240923_AssociateProteinsLessStrictFiltering

nickshulman merged commit 2f7e0ac into master Oct 11, 2024
12 checks passed

nickshulman deleted the Skyline/work/20240923_AssociateProteinsLessStrictFiltering branch October 11, 2024 18:24
