Skip to content

Conversation

@trishorts
Copy link
Contributor

@trishorts trishorts commented Oct 23, 2025

Added a new parameter: LocationSequenceId that allows sequence variations to be targeted/limited to specific isoforms.
New method PruneOutOfRangeSequenceVariants that prevent sequence variants from the XML to be added if their locations are beyond the bounds of the sequence.
Added code to handling proper handling of modifications and sequence variants and coordinates

@trishorts trishorts requested a review from Copilot October 23, 2025 16:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances sequence variant handling in protein XML parsing by introducing isoform-specific targeting, coordinate validation, and improved modification handling. The changes replace SequenceVariantDescription with the more robust VariantCallFormat class and add safeguards against out-of-range variants.

Key changes:

  • Added LocationSequenceId parameter to restrict sequence variants to specific protein isoforms
  • Implemented PruneOutOfRangeSequenceVariants() to validate variant coordinates against sequence length
  • Replaced SequenceVariantDescription with comprehensive VariantCallFormat class featuring enhanced VCF parsing and zygosity classification

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
ProteinXmlEntry.cs Added isoform filtering via LocationSequenceId, coordinate validation with PruneOutOfRangeSequenceVariants(), improved documentation, and reorganized modification parsing
ProteinDbLoader.cs Added automatic conversion of nucleotide substitution modifications to sequence variants during protein loading
VariantCallFormat.cs New comprehensive VCF parser with proper tab handling, zygosity classification, and multi-sample genotype support
SequenceVariation.cs Updated to use VariantCallFormat instead of deprecated SequenceVariantDescription
SequenceVariantDescription.cs Removed deprecated class replaced by VariantCallFormat
TestDbLoader.cs Removed extraneous blank line
Test.csproj Added comprehensive VCF test file to build output
vcf_comprehensive_examples.vcf New test file with 8 VCF examples covering various genotype patterns
VariantTests.cs New test suite validating VCF parsing, coordinate validation, and nucleotide substitution conversion
TestVariantProtein.cs Updated test to use VariantCallFormat instead of SequenceVariantDescription

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Sequence = ProteinDbLoader.SanitizeAminoAcidSequence(Sequence, 'X');

ParseAnnotatedMods(OneBasedModifications, modTypesToExclude, unknownModifications, AnnotatedMods);
//prune any sequence variants whose coordinates exceed the known sequence length
Copy link

Copilot AI Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 444 should use proper capitalization and punctuation: 'Prune any sequence variants whose coordinates exceed the known sequence length.' Comments should follow standard sentence structure for consistency.

Suggested change
//prune any sequence variants whose coordinates exceed the known sequence length
// Prune any sequence variants whose coordinates exceed the known sequence length.

Copilot uses AI. Check for mistakes.
@codecov
Copy link

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 86.86869% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.98%. Comparing base (57f1327) to head (9bdc949).

Files with missing lines Patch % Lines
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs 88.29% 7 Missing and 4 partials ⚠️
mzLib/Omics/BioPolymer/SequenceVariation.cs 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #965      +/-   ##
==========================================
- Coverage   81.00%   80.98%   -0.02%     
==========================================
  Files         269      269              
  Lines       38826    38869      +43     
  Branches     4241     4248       +7     
==========================================
+ Hits        31450    31478      +28     
- Misses       6640     6651      +11     
- Partials      736      740       +4     
Files with missing lines Coverage Δ
mzLib/Omics/BioPolymer/VariantCallFormat.cs 86.95% <100.00%> (ø)
mzLib/Omics/BioPolymer/SequenceVariation.cs 83.05% <0.00%> (-2.92%) ⬇️
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs 96.77% <88.29%> (-2.76%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +60 to +76
/// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object.
/// This method is called when the end of an &lt;entry&gt; element is reached during XML parsing.
/// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications,
/// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites)
/// into a new <see cref="Protein"/> instance.
/// After construction, the internal state is cleared to prepare for the next entry.
/// </summary>
/// <param name="xml">The <see cref="XmlReader"/> positioned at the end of the &lt;entry&gt; element.</param>
/// <param name="isContaminant">Indicates whether the protein is a contaminant.</param>
/// <param name="proteinDbLocation">The file path or identifier of the protein database source.</param>
/// <param name="modTypesToExclude">A collection of modification types to exclude from the protein.</param>
/// <param name="unknownModifications">A dictionary to collect modifications that could not be resolved.</param>
/// <param name="decoyIdentifier">A string used to identify decoy proteins (default: "DECOY").</param>
/// <returns>
/// A constructed <see cref="Protein"/> object containing all parsed and resolved information,
/// or <c>null</c> if the entry is incomplete.
/// </returns>
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML documentation on lines 60-76 describes method behavior but is placed above ParseElement() at line 77, which accepts only elementName and xml parameters. This documentation appears to belong to a different method, likely ParseEntryEndElement() at line 433, which has matching parameters.

Copilot uses AI. Check for mistakes.
Comment on lines +307 to +314
/// Handles the end of an XML element during protein database parsing, updating the internal state or finalizing objects as needed.
/// Depending on the element name, this method processes and stores feature, subfeature, database reference, gene, and organism information,
/// or, if the end of an &lt;entry&gt; element is reached, constructs and returns a fully populated <see cref="Protein"/> object.
/// For &lt;feature&gt; and &lt;subfeature&gt; elements, it attaches modifications or proteolytic products.
/// For &lt;dbReference&gt;, it records database cross-references.
/// For &lt;gene&gt; and &lt;organism&gt;, it updates parsing state flags.
/// For &lt;entry&gt;, it aggregates all parsed data, resolves modifications, and returns a new <see cref="Protein"/> instance,
/// clearing the internal state for the next entry.
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states that for <entry> elements the method 'returns a new Protein instance', but ParseEndElement() at line 326 has return type Protein, not a description of processing different element types. This documentation appears to duplicate or misrepresent the method's actual behavior, which delegates to ParseEntryEndElement() for entry processing.

Suggested change
/// Handles the end of an XML element during protein database parsing, updating the internal state or finalizing objects as needed.
/// Depending on the element name, this method processes and stores feature, subfeature, database reference, gene, and organism information,
/// or, if the end of an &lt;entry&gt; element is reached, constructs and returns a fully populated <see cref="Protein"/> object.
/// For &lt;feature&gt; and &lt;subfeature&gt; elements, it attaches modifications or proteolytic products.
/// For &lt;dbReference&gt;, it records database cross-references.
/// For &lt;gene&gt; and &lt;organism&gt;, it updates parsing state flags.
/// For &lt;entry&gt;, it aggregates all parsed data, resolves modifications, and returns a new <see cref="Protein"/> instance,
/// clearing the internal state for the next entry.
/// Handles the end of an XML element during protein database parsing by dispatching processing based on the element name.
/// For &lt;feature&gt; and &lt;subfeature&gt; elements, this method updates internal state and attaches modifications or proteolytic products.
/// For &lt;dbReference&gt;, it records database cross-references.
/// For &lt;gene&gt; and &lt;organism&gt;, it updates parsing state flags.
/// For &lt;entry&gt; elements, this method delegates processing to <see cref="ParseEntryEndElement"/>, which constructs and returns a fully populated <see cref="Protein"/> object.
/// In all other cases, this method returns <c>null</c>.

Copilot uses AI. Check for mistakes.
Comment on lines +66 to +68
public VariantCallFormat? VariantCallFormatData { get; }
[Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")]
public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData;
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property VariantCallFormatData is declared but never initialized in any constructor. All constructors only set Description, leaving VariantCallFormatData always null. Either add initialization logic or remove this unused property.

Suggested change
public VariantCallFormat? VariantCallFormatData { get; }
[Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")]
public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData;

Copilot uses AI. Check for mistakes.

/// <summary>Optional multi-sample VCF record describing the variant (can be null or collapsed).</summary>
public VariantCallFormat? VariantCallFormatData { get; }
[Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")]
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The obsolete message directs users to VariantCallFormatData, but that property is never initialized and will always be null. The obsolete attribute should either reference only Description, or VariantCallFormatData needs proper initialization.

Suggested change
[Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")]
[Obsolete("Use Description/SearchableAnnotation for text.")]

Copilot uses AI. Check for mistakes.
Alexander-Sol
Alexander-Sol previously approved these changes Oct 24, 2025
/// <summary>Optional multi-sample VCF record describing the variant (can be null or collapsed).</summary>
public VariantCallFormat? VariantCallFormatData { get; }
[Obsolete("Use VariantCallFormatData for structured data or Description/SearchableAnnotation for text.")]
public VariantCallFormat? LegacyVariantDescription => VariantCallFormatData;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is up with this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants