-
Notifications
You must be signed in to change notification settings - Fork 110
Make "Associate Proteins" more forgiving when peptide filter or digest settings would exclude the peptide #3169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make "Associate Proteins" more forgiving when peptide filter or digest settings would exclude the peptide #3169
Conversation
… there is no other protein for which the peptide is tryptic
Sorry for the delay. Couldn't this just be a checkbox like "Apply document filters to protein association"? |
It seems users often expect the external search results to override the document settings, or at least the document settings that aren't exposed in the wizard. Perhaps the associate proteins logic should not do any filtering and rely on other Skyline code to do that? |
I am not sure I understand what you mean by "filtering". If we wanted to make it so the "Add All" button on the Library Explorer was unnecessary to get all the found peptides into the document, the place to change that would be "FastaImporter.Import". |
Hmm, why would these changes slightly change gene level parsimony association in TestHugeAssociateProteins? |
It does not appear that there are any changes to which peptides are associated with which proteins. That test passes on my computer. I wonder whether my extra ParallelEx.ForEach has introduced some sort of timing-related intermittent failure. |
We have seen this intermittent failure in TestHugeAssociateProteins in the nightlies: I imagine this PR makes the intermittent failure more likely.
Should I try to figure out why the order that these proteins get processed affects the results, or should I just make it so that the results do not depend on the order in which the individual ParallelEx items finish? |
Let's see what that does to the test time tonight. |
@@ -0,0 +1,8 @@ | |||
>Protein1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a pattern of many unit tests having their own .data directory with small test data files in them? I think you could just write this out to a temp file directly from the code to keep the repo tidier. I love the new ability to keep things in .data dir instead of a .zip file, but think we should still use it judiciously. Like .sky files. Those would be a pain to write directly from test code (as xml I mean; the settings can be generated programmatically). As would any big DSV file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way I know to create and clean up a temporary directory is by using "TestFilesDir".
Do you know of an easier way to do that which does not require either a .zip file or .data folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
using var testDir = new TemporaryDirectory(Path.Combine(TestContext.TestRunDirectory, TestContext.TestName));
If it works we could definitely have a shortcut for that in AbstractUnitTest. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that works. Thanks!
I am probably going to add a method to ProteinAssociation which takes a ProteinSource so that it can be used without any file on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, TestContext.TestRunDirectory
is null when running from TestRunner.
I added a method ProteinAssociate.UseProteinSource
so I do not need a file on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The huge test's timing on TeamCity is pretty volatile. When I look at all PRs your changes don't really seem to make a big difference. But just looking at this branch (https://teamcity.labkey.org/test/5831976084300368231?currentProjectId=ProteoWizard&expandTestHistoryChartSection=true&branch=pull%2F3169), there's a 15 second difference between before and after the change from your new parallel loop to making it serial. I don't remember how much of the time in that test is doing the parsimony stuff vs. the protein matching though so I had to run it myself. It's a pretty significant chunk, but a lot is also in the OkDialog saving the document tree. Can you check the performance impact of using foreach vs. ParallelEx.Foreach? I got 40s with this in PerfAssociateProteinsHugeTest:
Stopwatch sw = Stopwatch.StartNew();
var proteinsDlg = ShowDialog<AssociateProteinsDlg>(SkylineWindow.ShowAssociateProteinsDlg);
if (type == ImportType.FASTA)
{
RunUI(() => proteinsDlg.FastaFileName = _fastaFile);
}
else
{
RunUI(proteinsDlg.UseBackgroundProteome);
}
WaitForCondition(() => !proteinsDlg.IsBusy);
Console.WriteLine("Association time without parsimony: " + sw.ElapsedMilliseconds);
//PauseTest();
RunUI(() =>
{
Assert.AreEqual(425484, proteinsDlg.FinalResults.PeptidesMapped);
Assert.AreEqual(0, proteinsDlg.FinalResults.PeptidesUnmapped);
Assert.AreEqual(84198, proteinsDlg.FinalResults.ProteinsMapped);
Assert.AreEqual(4281, proteinsDlg.FinalResults.ProteinsUnmapped);
});
CancelDialog(proteinsDlg, proteinsDlg.CancelDialog);
return;
…k/20240923_AssociateProteinsLessStrictFiltering
…method which does the digestion.
…k/20240923_AssociateProteinsLessStrictFiltering
I moved some of the stuff from the second loop into "ParallelEx.For" and that does seem to have made it go faster. |
Mike has had a lot of documents where a small fraction of peptides end up being "unmapped" even though all of the peptides were found by a peptide search engine using the same FASTA file that's being used for the protein association.
This change makes it so that if the peptide would otherwise be "unmapped" (that is, it does not satisfy the digestion criteria for any protein in the FASTA file) the peptide becomes associated with all proteins whose sequence contains the peptide sequence.