Skip to content

Conversation

nickshulman
Copy link
Contributor

@nickshulman nickshulman commented Sep 25, 2024

Mike has had a lot of documents where a small fraction of peptides end up being "unmapped" even though all of the peptides were found by a peptide search engine using the same FASTA file that's being used for the protein association.

This change makes it so that if the peptide would otherwise be "unmapped" (that is, it does not satisfy the digestion criteria for any protein in the FASTA file) the peptide becomes associated with all proteins whose sequence contains the peptide sequence.

@nickshulman nickshulman requested a review from chambm September 25, 2024 02:06
@chambm
Copy link
Member

chambm commented Sep 26, 2024

Sorry for the delay. Couldn't this just be a checkbox like "Apply document filters to protein association"?

@nickshulman
Copy link
Contributor Author

Sorry for the delay. Couldn't this just be a checkbox like "Apply document filters to protein association"?

We normally think of those filters as controlling whether the peptide is added to the document at all. Given that a peptide is in the document which does not pass the filter criteria, would a user actually want that peptide to be put into the "unmapped" peptide list because it had too many missed cleavages?

Mike's document contains peptides which do not pass the filter criteria because he used the "Add All" button in the Library Explorer (he also probably also had to answer "yes" when Skyline told him that some of the peptides did not match the filter criteria).

By the way, there is a "Pick peptides matching" setting on the "Library" tab of the Peptide Settings dialog which asks whether peptides should be included, but that only affects the settings on the "Filter" tab of the "Peptide Settings" dialog.
image
This setting is sort of in the spirit of what you are proposing.

I think I was mistaken that Protein Association pays attention to any of the filter settings on the Filter tab in the Peptide Settings dialog. Actually, all that matter are "exclude ragged ends", "max missed cleavages" and the enzyme itself.

@chambm
Copy link
Member

chambm commented Sep 26, 2024

It seems users often expect the external search results to override the document settings, or at least the document settings that aren't exposed in the wizard. Perhaps the associate proteins logic should not do any filtering and rely on other Skyline code to do that?

@nickshulman
Copy link
Contributor Author

It seems users often expect the external search results to override the document settings, or at least the document settings that aren't exposed in the wizard. Perhaps the associate proteins logic should not do any filtering and rely on other Skyline code to do that?

I am not sure I understand what you mean by "filtering".
"Associate Proteins" does not make any decisions about which peptides are in the document, only which proteins they belong to. Currently, peptides which do not satisfy the Max Missed Cleavages or Ragged Ends setting are getting put into an "unmapped" peptide list.

If we wanted to make it so the "Add All" button on the Library Explorer was unnecessary to get all the found peptides into the document, the place to change that would be "FastaImporter.Import".

@chambm
Copy link
Member

chambm commented Oct 8, 2024

Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

@nickshulman
Copy link
Contributor Author

Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

It does not appear that there are any changes to which peptides are associated with which proteins.

That test passes on my computer. I wonder whether my extra ParallelEx.ForEach has introduced some sort of timing-related intermittent failure.

@nickshulman
Copy link
Contributor Author

nickshulman commented Oct 8, 2024

Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins?

We have seen this intermittent failure in TestHugeAssociateProteins in the nightlies:
https://skyline.ms/testresults/home/development/Performance%20Tests/showRun.view?runId=70857

I imagine this PR makes the intermittent failure more likely.
I can provoke the failure in master by shuffling the proteins in "FindProteinMatches":

var random = new Random((int)DateTime.UtcNow.Ticks);
ParallelEx.ForEach(proteinSource.Proteins.OrderBy(p=>random.Next()), fastaRecord =>

Should I try to figure out why the order that these proteins get processed affects the results, or should I just make it so that the results do not depend on the order in which the individual ParallelEx items finish?

@chambm
Copy link
Member

chambm commented Oct 9, 2024

Let's see what that does to the test time tonight.

@nickshulman nickshulman marked this pull request as ready for review October 9, 2024 16:08
@chambm chambm self-requested a review October 9, 2024 16:19
@@ -0,0 +1,8 @@
>Protein1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a pattern of many unit tests having their own .data directory with small test data files in them? I think you could just write this out to a temp file directly from the code to keep the repo tidier. I love the new ability to keep things in .data dir instead of a .zip file, but think we should still use it judiciously. Like .sky files. Those would be a pain to write directly from test code (as xml I mean; the settings can be generated programmatically). As would any big DSV file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way I know to create and clean up a temporary directory is by using "TestFilesDir".
Do you know of an easier way to do that which does not require either a .zip file or .data folder?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:
using var testDir = new TemporaryDirectory(Path.Combine(TestContext.TestRunDirectory, TestContext.TestName));
If it works we could definitely have a shortcut for that in AbstractUnitTest. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that works. Thanks!
I am probably going to add a method to ProteinAssociation which takes a ProteinSource so that it can be used without any file on disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, TestContext.TestRunDirectory is null when running from TestRunner.
I added a method ProteinAssociate.UseProteinSource so I do not need a file on disk.

Copy link
Member

@chambm chambm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The huge test's timing on TeamCity is pretty volatile. When I look at all PRs your changes don't really seem to make a big difference. But just looking at this branch (https://teamcity.labkey.org/test/5831976084300368231?currentProjectId=ProteoWizard&expandTestHistoryChartSection=true&branch=pull%2F3169), there's a 15 second difference between before and after the change from your new parallel loop to making it serial. I don't remember how much of the time in that test is doing the parsimony stuff vs. the protein matching though so I had to run it myself. It's a pretty significant chunk, but a lot is also in the OkDialog saving the document tree. Can you check the performance impact of using foreach vs. ParallelEx.Foreach? I got 40s with this in PerfAssociateProteinsHugeTest:


            Stopwatch sw = Stopwatch.StartNew();
            var proteinsDlg = ShowDialog<AssociateProteinsDlg>(SkylineWindow.ShowAssociateProteinsDlg);
            if (type == ImportType.FASTA)
            {
                RunUI(() => proteinsDlg.FastaFileName = _fastaFile);
            }
            else
            {
                RunUI(proteinsDlg.UseBackgroundProteome);
            }

            WaitForCondition(() => !proteinsDlg.IsBusy);
            Console.WriteLine("Association time without parsimony: " + sw.ElapsedMilliseconds);
            //PauseTest();
            RunUI(() =>
            {
                Assert.AreEqual(425484, proteinsDlg.FinalResults.PeptidesMapped);
                Assert.AreEqual(0, proteinsDlg.FinalResults.PeptidesUnmapped);
                Assert.AreEqual(84198, proteinsDlg.FinalResults.ProteinsMapped);
                Assert.AreEqual(4281, proteinsDlg.FinalResults.ProteinsUnmapped);
            });
            CancelDialog(proteinsDlg, proteinsDlg.CancelDialog);
            return;

…k/20240923_AssociateProteinsLessStrictFiltering
@nickshulman
Copy link
Contributor Author

Can you check the performance impact of using foreach vs. ParallelEx.Foreach?

I moved some of the stuff from the second loop into "ParallelEx.For" and that does seem to have made it go faster.

@nickshulman nickshulman merged commit 2f7e0ac into master Oct 11, 2024
12 checks passed
@nickshulman nickshulman deleted the Skyline/work/20240923_AssociateProteinsLessStrictFiltering branch October 11, 2024 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants