-
Notifications
You must be signed in to change notification settings - Fork 38
Update Handling of Sequence Variants (Part 5): VariantApplication Improved Specificity #968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Update Handling of Sequence Variants (Part 5): VariantApplication Improved Specificity #968
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #968 +/- ##
==========================================
- Coverage 81.00% 80.90% -0.11%
==========================================
Files 269 269
Lines 38826 38913 +87
Branches 4241 4260 +19
==========================================
+ Hits 31450 31481 +31
- Misses 6640 6681 +41
- Partials 736 751 +15
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the handling of sequence variants by improving coordinate validation, isoform-specific variant filtering, and API specificity for variant application. The main changes focus on preventing out-of-range variants from being applied and refining the combinatorics parameters to differentiate between total isoform counts and per-isoform variant limits.
- Adds pruning of out-of-range sequence variants whose coordinates exceed the known sequence length
- Introduces isoform-aware variant filtering by checking
<location sequence="...">attributes to ensure variants only apply to their target isoform - Refactors variant application parameters to distinguish between
consensusPlusVariantIsoforms(total result cap) andmaxVariantsPerIsoform(per-isoform combinatoric limit)
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs | Added LocationSequenceId field, PruneOutOfRangeSequenceVariants() method, and isoform filtering logic to prevent applying variants to incorrect sequences |
| mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs | Updated to use VariantCallFormatDataString instead of Description for VCF data |
| mzLib/UsefulProteomicsDatabases/DecoyGeneration/RnaDecoyGenerator.cs | Updated references from Description.Description to VariantCallFormatDataString.Description |
| mzLib/UsefulProteomicsDatabases/DecoyGeneration/DecoyProteinGenerator.cs | Updated references from Description to VariantCallFormatDataString |
| mzLib/Test/TestPeptideWithSetMods.cs | Replaced simple assertion test with comprehensive TSV-based regression test for decoy truncations |
| mzLib/Test/DatabaseTests/TestProteomicsReadWrite.cs | Updated test assertions to use VariantCallFormatDataString |
| mzLib/Test/DatabaseTests/TestProteinReader.cs | Updated test assertions to use VariantCallFormatDataString |
| mzLib/Omics/BioPolymer/VariantCallFormat.cs | Updated comment reference from Description to VariantCallFormatDataString |
| mzLib/Omics/BioPolymer/VariantApplication.cs | Introduced new method GetConsensusAndVariantBioPolymers with clearer parameters, deprecated existing GetVariantBioPolymers, and updated all references to VariantCallFormatDataString |
| mzLib/Omics/BioPolymer/SequenceVariation.cs | Refactored to separate Description (string) from VariantCallFormatDataString (VCF object), added constructor overloads, and enhanced documentation |
| mzLib/Test/Test.csproj | Added truncationsExpected.tsv as test resource |
| mzLib/Test/DatabaseTests/truncationsExpected.tsv | New file containing expected truncation test data |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object. | ||
| /// This method is called when the end of an <entry> element is reached during XML parsing. | ||
| /// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications, | ||
| /// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites) | ||
| /// into a new <see cref="Protein"/> instance. | ||
| /// After construction, the internal state is cleared to prepare for the next entry. |
Copilot
AI
Oct 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The XML comment incorrectly describes the ParseElement method (line 77), which handles element starts, not the finalization of an entry. The comment should describe element parsing/attribute extraction, not entry finalization.
| /// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object. | |
| /// This method is called when the end of an <entry> element is reached during XML parsing. | |
| /// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications, | |
| /// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites) | |
| /// into a new <see cref="Protein"/> instance. | |
| /// After construction, the internal state is cleared to prepare for the next entry. | |
| /// Handles the start of an XML element during protein database parsing. | |
| /// This method is called when a new XML element is encountered and is responsible for | |
| /// extracting relevant attributes and data from the element, updating the internal state | |
| /// of the parser as needed (e.g., capturing sequence information, modifications, gene names, etc.). |
| else | ||
| { | ||
| type += ("null-null"); | ||
| type += "null-null"; |
Copilot
AI
Oct 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Missing parentheses around the string literal. The previous line 564 uses type += (\"null-null\"). For consistency, this line should also use parentheses.
| type += "null-null"; | |
| type += ("null-null"); |
| return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinatorics, minAlleleDepth); | ||
|
|
||
| // Otherwise, do genotype/allele-depth-aware application with combinatorics limited for heterozygous sites | ||
| return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinitorics: consensusPlusVariantIsoforms, minAlleleDepth); |
Copilot
AI
Oct 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.
| return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinitorics: consensusPlusVariantIsoforms, minAlleleDepth); | |
| return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinatorics: consensusPlusVariantIsoforms, minAlleleDepth); |
| /// </param> | ||
| /// <param name="minAlleleDepth">Minimum AD (Allele Depth) per sample for an allele to be considered in application.</param> | ||
| /// <returns>A list of concrete variant biopolymers across all individuals encoded in the VCF payloads.</returns> | ||
| public static List<TBioPolymerType> ApplyVariants<TBioPolymerType>(TBioPolymerType protein, IEnumerable<SequenceVariation> sequenceVariations, int maxAllowedVariantsForCombinitorics, int minAlleleDepth) |
Copilot
AI
Oct 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.
|
|
||
| bool tooManyHeterozygousVariants = uniqueEffectsToApply.Count(v => v.Description.Heterozygous[individual]) > maxAllowedVariantsForCombinitorics; | ||
| // Whether to limit combinatorial branching for this individual | ||
| bool tooManyHeterozygousVariants = uniqueEffectsToApply.Count(v => v.VariantCallFormatDataString.Heterozygous[individual]) > maxAllowedVariantsForCombinitorics; |
Copilot
AI
Oct 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.
Current behavior of VariantApplication has one variable maxVariantsForCombinatorics, which is a number for how many total sequences to return. This is ambiguous in that it doesn't specify a limit on the number of variants per isoform. This PR adds parameters that will limit the number of variants per isoform. This is just a start for a big change that will handle all of the combinatorics.