Skip to content

Conversation

@trishorts
Copy link
Contributor

@trishorts trishorts commented Oct 25, 2025

Current behavior of VariantApplication has one variable maxVariantsForCombinatorics, which is a number for how many total sequences to return. This is ambiguous in that it doesn't specify a limit on the number of variants per isoform. This PR adds parameters that will limit the number of variants per isoform. This is just a start for a big change that will handle all of the combinatorics.

@codecov
Copy link

codecov bot commented Oct 25, 2025

Codecov Report

❌ Patch coverage is 71.10092% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.90%. Comparing base (57f1327) to head (942a930).

Files with missing lines Patch % Lines
mzLib/Omics/BioPolymer/VariantApplication.cs 62.96% 20 Missing and 10 partials ⚠️
mzLib/Omics/BioPolymer/SequenceVariation.cs 51.72% 11 Missing and 3 partials ⚠️
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs 88.54% 7 Missing and 4 partials ⚠️
...Databases/DecoyGeneration/DecoyProteinGenerator.cs 12.50% 0 Missing and 7 partials ⚠️
...micsDatabases/DecoyGeneration/RnaDecoyGenerator.cs 50.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #968      +/-   ##
==========================================
- Coverage   81.00%   80.90%   -0.11%     
==========================================
  Files         269      269              
  Lines       38826    38913      +87     
  Branches     4241     4260      +19     
==========================================
+ Hits        31450    31481      +31     
- Misses       6640     6681      +41     
- Partials      736      751      +15     
Files with missing lines Coverage Δ
mzLib/Omics/BioPolymer/VariantCallFormat.cs 86.95% <ø> (ø)
mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs 95.48% <100.00%> (ø)
...micsDatabases/DecoyGeneration/RnaDecoyGenerator.cs 83.72% <50.00%> (ø)
...Databases/DecoyGeneration/DecoyProteinGenerator.cs 90.79% <12.50%> (ø)
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs 96.78% <88.54%> (-2.75%) ⬇️
mzLib/Omics/BioPolymer/SequenceVariation.cs 71.25% <51.72%> (-14.72%) ⬇️
mzLib/Omics/BioPolymer/VariantApplication.cs 79.11% <62.96%> (-2.66%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@trishorts trishorts changed the title Variant application m in changes Update Handling of Sequence Variants (Part 5): VariantApplication Improved Specificity Oct 25, 2025
@trishorts trishorts requested a review from Copilot October 25, 2025 14:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the handling of sequence variants by improving coordinate validation, isoform-specific variant filtering, and API specificity for variant application. The main changes focus on preventing out-of-range variants from being applied and refining the combinatorics parameters to differentiate between total isoform counts and per-isoform variant limits.

  • Adds pruning of out-of-range sequence variants whose coordinates exceed the known sequence length
  • Introduces isoform-aware variant filtering by checking <location sequence="..."> attributes to ensure variants only apply to their target isoform
  • Refactors variant application parameters to distinguish between consensusPlusVariantIsoforms (total result cap) and maxVariantsPerIsoform (per-isoform combinatoric limit)

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs Added LocationSequenceId field, PruneOutOfRangeSequenceVariants() method, and isoform filtering logic to prevent applying variants to incorrect sequences
mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs Updated to use VariantCallFormatDataString instead of Description for VCF data
mzLib/UsefulProteomicsDatabases/DecoyGeneration/RnaDecoyGenerator.cs Updated references from Description.Description to VariantCallFormatDataString.Description
mzLib/UsefulProteomicsDatabases/DecoyGeneration/DecoyProteinGenerator.cs Updated references from Description to VariantCallFormatDataString
mzLib/Test/TestPeptideWithSetMods.cs Replaced simple assertion test with comprehensive TSV-based regression test for decoy truncations
mzLib/Test/DatabaseTests/TestProteomicsReadWrite.cs Updated test assertions to use VariantCallFormatDataString
mzLib/Test/DatabaseTests/TestProteinReader.cs Updated test assertions to use VariantCallFormatDataString
mzLib/Omics/BioPolymer/VariantCallFormat.cs Updated comment reference from Description to VariantCallFormatDataString
mzLib/Omics/BioPolymer/VariantApplication.cs Introduced new method GetConsensusAndVariantBioPolymers with clearer parameters, deprecated existing GetVariantBioPolymers, and updated all references to VariantCallFormatDataString
mzLib/Omics/BioPolymer/SequenceVariation.cs Refactored to separate Description (string) from VariantCallFormatDataString (VCF object), added constructor overloads, and enhanced documentation
mzLib/Test/Test.csproj Added truncationsExpected.tsv as test resource
mzLib/Test/DatabaseTests/truncationsExpected.tsv New file containing expected truncation test data

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +60 to +65
/// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object.
/// This method is called when the end of an &lt;entry&gt; element is reached during XML parsing.
/// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications,
/// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites)
/// into a new <see cref="Protein"/> instance.
/// After construction, the internal state is cleared to prepare for the next entry.
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML comment incorrectly describes the ParseElement method (line 77), which handles element starts, not the finalization of an entry. The comment should describe element parsing/attribute extraction, not entry finalization.

Suggested change
/// Finalizes the parsing of a protein XML entry and constructs a <see cref="Protein"/> object.
/// This method is called when the end of an &lt;entry&gt; element is reached during XML parsing.
/// It sanitizes the sequence, prunes out-of-range sequence variants, resolves and attaches modifications,
/// and aggregates all parsed data (such as gene names, proteolysis products, sequence variations, disulfide bonds, and splice sites)
/// into a new <see cref="Protein"/> instance.
/// After construction, the internal state is cleared to prepare for the next entry.
/// Handles the start of an XML element during protein database parsing.
/// This method is called when a new XML element is encountered and is responsible for
/// extracting relevant attributes and data from the element, updating the internal state
/// of the parser as needed (e.g., capturing sequence information, modifications, gene names, etc.).

Copilot uses AI. Check for mistakes.
else
{
type += ("null-null");
type += "null-null";
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Missing parentheses around the string literal. The previous line 564 uses type += (\"null-null\"). For consistency, this line should also use parentheses.

Suggested change
type += "null-null";
type += ("null-null");

Copilot uses AI. Check for mistakes.
return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinatorics, minAlleleDepth);

// Otherwise, do genotype/allele-depth-aware application with combinatorics limited for heterozygous sites
return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinitorics: consensusPlusVariantIsoforms, minAlleleDepth);
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.

Suggested change
return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinitorics: consensusPlusVariantIsoforms, minAlleleDepth);
return ApplyVariants(protein, protein.SequenceVariations, maxAllowedVariantsForCombinatorics: consensusPlusVariantIsoforms, minAlleleDepth);

Copilot uses AI. Check for mistakes.
/// </param>
/// <param name="minAlleleDepth">Minimum AD (Allele Depth) per sample for an allele to be considered in application.</param>
/// <returns>A list of concrete variant biopolymers across all individuals encoded in the VCF payloads.</returns>
public static List<TBioPolymerType> ApplyVariants<TBioPolymerType>(TBioPolymerType protein, IEnumerable<SequenceVariation> sequenceVariations, int maxAllowedVariantsForCombinitorics, int minAlleleDepth)
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.

Copilot uses AI. Check for mistakes.

bool tooManyHeterozygousVariants = uniqueEffectsToApply.Count(v => v.Description.Heterozygous[individual]) > maxAllowedVariantsForCombinitorics;
// Whether to limit combinatorial branching for this individual
bool tooManyHeterozygousVariants = uniqueEffectsToApply.Count(v => v.VariantCallFormatDataString.Heterozygous[individual]) > maxAllowedVariantsForCombinitorics;
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'Combinitorics' to 'Combinatorics' in parameter name.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant