-
Notifications
You must be signed in to change notification settings - Fork 38
Update Handling of Sequence Variants (Part 3): Protein xml entry #965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances sequence variant handling in protein XML parsing by introducing isoform-specific targeting, coordinate validation, and improved modification handling. The changes replace SequenceVariantDescription with the more robust VariantCallFormat class and add safeguards against out-of-range variants.
Key changes:
- Added
LocationSequenceIdparameter to restrict sequence variants to specific protein isoforms - Implemented
PruneOutOfRangeSequenceVariants()to validate variant coordinates against sequence length - Replaced
SequenceVariantDescriptionwith comprehensiveVariantCallFormatclass featuring enhanced VCF parsing and zygosity classification
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| ProteinXmlEntry.cs | Added isoform filtering via LocationSequenceId, coordinate validation with PruneOutOfRangeSequenceVariants(), improved documentation, and reorganized modification parsing |
| ProteinDbLoader.cs | Added automatic conversion of nucleotide substitution modifications to sequence variants during protein loading |
| VariantCallFormat.cs | New comprehensive VCF parser with proper tab handling, zygosity classification, and multi-sample genotype support |
| SequenceVariation.cs | Updated to use VariantCallFormat instead of deprecated SequenceVariantDescription |
| SequenceVariantDescription.cs | Removed deprecated class replaced by VariantCallFormat |
| TestDbLoader.cs | Removed extraneous blank line |
| Test.csproj | Added comprehensive VCF test file to build output |
| vcf_comprehensive_examples.vcf | New test file with 8 VCF examples covering various genotype patterns |
| VariantTests.cs | New test suite validating VCF parsing, coordinate validation, and nucleotide substitution conversion |
| TestVariantProtein.cs | Updated test to use VariantCallFormat instead of SequenceVariantDescription |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| Sequence = ProteinDbLoader.SanitizeAminoAcidSequence(Sequence, 'X'); | ||
|
|
||
| ParseAnnotatedMods(OneBasedModifications, modTypesToExclude, unknownModifications, AnnotatedMods); | ||
| //prune any sequence variants whose coordinates exceed the known sequence length |
Copilot
AI
Oct 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 444 should use proper capitalization and punctuation: 'Prune any sequence variants whose coordinates exceed the known sequence length.' Comments should follow standard sentence structure for consistency.
| //prune any sequence variants whose coordinates exceed the known sequence length | |
| // Prune any sequence variants whose coordinates exceed the known sequence length. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #965 +/- ##
==========================================
- Coverage 81.00% 80.98% -0.02%
==========================================
Files 269 269
Lines 38826 38869 +43
Branches 4241 4248 +7
==========================================
+ Hits 31450 31478 +28
- Misses 6640 6651 +11
- Partials 736 740 +4
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object. | ||
| /// This method is called when the end of an <entry> element is reached during XML parsing. | ||
| /// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications, | ||
| /// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites) | ||
| /// into a new <see cref="Protein"/> instance. | ||
| /// After construction, the internal state is cleared to prepare for the next entry. | ||
| /// </summary> | ||
| /// <param name="xml">The <see cref="XmlReader"/> positioned at the end of the <entry> element.</param> | ||
| /// <param name="isContaminant">Indicates whether the protein is a contaminant.</param> | ||
| /// <param name="proteinDbLocation">The file path or identifier of the protein database source.</param> | ||
| /// <param name="modTypesToExclude">A collection of modification types to exclude from the protein.</param> | ||
| /// <param name="unknownModifications">A dictionary to collect modifications that could not be resolved.</param> | ||
| /// <param name="decoyIdentifier">A string used to identify decoy proteins (default: "DECOY").</param> | ||
| /// <returns> | ||
| /// A constructed <see cref="Protein"/> object containing all parsed and resolved information, | ||
| /// or <c>null</c> if the entry is incomplete. | ||
| /// </returns> |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The XML documentation on lines 60-76 describes method behavior but is placed above ParseElement() at line 77, which accepts only elementName and xml parameters. This documentation appears to belong to a different method, likely ParseEntryEndElement() at line 433, which has matching parameters.
| /// Handles the end of an XML element during protein database parsing, updating the internal state or finalizing objects as needed. | ||
| /// Depending on the element name, this method processes and stores feature, subfeature, database reference, gene, and organism information, | ||
| /// or, if the end of an <entry> element is reached, constructs and returns a fully populated <see cref="Protein"/> object. | ||
| /// For <feature> and <subfeature> elements, it attaches modifications or proteolytic products. | ||
| /// For <dbReference>, it records database cross-references. | ||
| /// For <gene> and <organism>, it updates parsing state flags. | ||
| /// For <entry>, it aggregates all parsed data, resolves modifications, and returns a new <see cref="Protein"/> instance, | ||
| /// clearing the internal state for the next entry. |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation states that for <entry> elements the method 'returns a new Protein instance', but ParseEndElement() at line 326 has return type Protein, not a description of processing different element types. This documentation appears to duplicate or misrepresent the method's actual behavior, which delegates to ParseEntryEndElement() for entry processing.
| /// Handles the end of an XML element during protein database parsing, updating the internal state or finalizing objects as needed. | |
| /// Depending on the element name, this method processes and stores feature, subfeature, database reference, gene, and organism information, | |
| /// or, if the end of an <entry> element is reached, constructs and returns a fully populated <see cref="Protein"/> object. | |
| /// For <feature> and <subfeature> elements, it attaches modifications or proteolytic products. | |
| /// For <dbReference>, it records database cross-references. | |
| /// For <gene> and <organism>, it updates parsing state flags. | |
| /// For <entry>, it aggregates all parsed data, resolves modifications, and returns a new <see cref="Protein"/> instance, | |
| /// clearing the internal state for the next entry. | |
| /// Handles the end of an XML element during protein database parsing by dispatching processing based on the element name. | |
| /// For <feature> and <subfeature> elements, this method updates internal state and attaches modifications or proteolytic products. | |
| /// For <dbReference>, it records database cross-references. | |
| /// For <gene> and <organism>, it updates parsing state flags. | |
| /// For <entry> elements, this method delegates processing to <see cref="ParseEntryEndElement"/>, which constructs and returns a fully populated <see cref="Protein"/> object. | |
| /// In all other cases, this method returns <c>null</c>. |
| public VariantCallFormat? VariantCallFormatData { get; } | ||
| [Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")] | ||
| public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData; |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The property VariantCallFormatData is declared but never initialized in any constructor. All constructors only set Description, leaving VariantCallFormatData always null. Either add initialization logic or remove this unused property.
| public VariantCallFormat? VariantCallFormatData { get; } | |
| [Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")] | |
| public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData; |
|
|
||
| /// <summary>Optional multi-sample VCF record describing the variant (can be null or collapsed).</summary> | ||
| public VariantCallFormat? VariantCallFormatData { get; } | ||
| [Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")] |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The obsolete message directs users to VariantCallFormatData, but that property is never initialized and will always be null. The obsolete attribute should either reference only Description, or VariantCallFormatData needs proper initialization.
| [Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")] | |
| [Obsolete("Use Description/SearchableAnnotation for text.")] |
| /// <summary>Optional multi-sample VCF record describing the variant (can be null or collapsed).</summary> | ||
| public VariantCallFormat? VariantCallFormatData { get; } | ||
| [Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")] | ||
| public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is up with this?
Added a new parameter: LocationSequenceId that allows sequence variations to be targeted/limited to specific isoforms.
New method PruneOutOfRangeSequenceVariants that prevent sequence variants from the XML to be added if their locations are beyond the bounds of the sequence.
Added code to handling proper handling of modifications and sequence variants and coordinates