-
Notifications
You must be signed in to change notification settings - Fork 38
Whats wrong with variant application and xml #956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Whats wrong with variant application and xml #956
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses two issues with variant application in protein XML parsing: 1) prevents crashes when overlapping sequence variants are applied together, and 2) filters out variants that target alternate protein isoforms rather than the base sequence.
- Adds validation to prevent application of overlapping sequence variants that would cause position conflicts
- Filters out sequence variants that reference alternate isoforms (e.g., "Q96J88-3") to avoid misapplication
- Refactors the ParseAnnotatedMods method for improved readability
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| ProteinXmlEntry.cs | Adds isoform filtering logic and refactors ParseAnnotatedMods method |
| VariantApplication.cs | Adds overlap validation for variant combinations |
| TestProteomicsReadWrite.cs | Adds tests for long substitution conflicts and alternate isoform handling |
| Test.csproj | Includes new test XML files in build output |
| longSubstitution.xml | Test case with overlapping variants |
| sequenceVariantOnAlternateIsoform.xml | Test case with variants on alternate isoforms |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| appliesToThisSequence = | ||
| LocationSequenceId.Equals(acc, StringComparison.OrdinalIgnoreCase) | ||
| || (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase)); |
Copilot
AI
Sep 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The hardcoded '-1' suffix check assumes the base isoform is always numbered as '-1', but this may not be consistent across all protein databases. Consider using a more flexible approach or documenting this assumption.
| public static List<TBioPolymerType> GetVariantBioPolymers<TBioPolymerType>(this TBioPolymerType protein, int maxAllowedVariantsForCombinatorics = 4, int minAlleleDepth = 1) | ||
| where TBioPolymerType : IHasSequenceVariants | ||
| { | ||
| if(maxAllowedVariantsForCombinatorics == 0) |
Copilot
AI
Sep 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after 'if' keyword violates C# formatting conventions.
| if(maxAllowedVariantsForCombinatorics == 0) | |
| if (maxAllowedVariantsForCombinatorics == 0) |
| { | ||
| foreach (var combo in GetCombinations(variations, size)) | ||
| { | ||
| if(!ValidCombination(combo.ToList())) |
Copilot
AI
Sep 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after 'if' keyword violates C# formatting conventions.
| if(!ValidCombination(combo.ToList())) | |
| if (!ValidCombination(combo.ToList())) |
| { | ||
| foreach (var combo in GetCombinations(variations, size)) | ||
| { | ||
| if(!ValidCombination(combo.ToList())) |
Copilot
AI
Sep 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling ToList() on each combination creates unnecessary allocations. Consider changing ValidCombination to accept IEnumerable instead of List.
| if(!ValidCombination(combo.ToList())) | |
| if(!ValidCombination(combo)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| string acc = Accession ?? ""; | ||
| appliesToThisSequence = | ||
| LocationSequenceId.Equals(acc, StringComparison.OrdinalIgnoreCase) | ||
| || (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase)); |
Copilot
AI
Sep 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded '-1' suffix check assumes a specific isoform naming pattern. Consider using a more flexible approach that checks if LocationSequenceId starts with the accession followed by a dash, or make this configurable to handle different isoform naming conventions.
| || (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase)); | |
| || (!string.IsNullOrEmpty(acc) && LocationSequenceId.StartsWith(acc + "-", StringComparison.OrdinalIgnoreCase)); |
…ub.com/trishorts/mzLib into whatsWrongWithVariantApplicationAndXml
… combines sequence variants with the same effect
…iantApplicationAndXml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some issues left over from copilot:
- Rewrote entire sections of code that should not have been affected by changing how variants operate.
- Deleted a bunch of comments unnecessarily.
- Commented about what change in the code occurred due to your copilot prompt, but does not add any understanding to what is happening and why in the function.
Max variant isoforms being 1 to get the canonical out feels bad. The sequence we get out is not a variant isoform.
| public int CodingDomainSequenceLengthIncludingStopCodon { get; } | ||
| public int OneBasedProteinPosition { get; } | ||
| public int ProteinLength { get; } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why delete this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all comments back in snpeff
| /// Positions are validated in <see cref="AreValid"/> against the altered span (<see cref="VariantSequence"/>). | ||
| /// </summary> | ||
| public int OneBasedEndPosition { get; } | ||
| public Dictionary<int, List<Modification>> OneBasedModifications { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to store the full mod dictionary here? This would (likely) more than double the memory footprint of the modifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
short answer is yes.
| OriginalSequence?.Length > 1 && | ||
| VariantSequence?.Length == OriginalSequence.Length && | ||
| OriginalSequence != VariantSequence && | ||
| !IsPointSubstitution; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!IsPointSubstitution is a redundant check as the length check already does this.
| /// <summary> | ||
| /// True if variant introduces a stop (* at end). | ||
| /// </summary> | ||
| public bool IsStopGain => VariantSequence?.EndsWith("*", StringComparison.Ordinal) == true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if bool == true?
| /// Heuristic frameshift flag: length difference not equal & not simple stop gain only. | ||
| /// (Refine if you have explicit annotation elsewhere.) | ||
| /// </summary> | ||
| public bool IsLikelyFrameshift => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by frame shift? How can we evaluate a frame shift from only protein sequences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the entirety of XML writing change here? I would assume only the variant relevant components needed to change?
| @@ -30,7 +31,7 @@ public class ProteinXmlEntry | |||
| public string FeatureDescription { get; private set; } | |||
| public string SubFeatureType { get; private set; } | |||
| public string SubFeatureDescription { get; private set; } | |||
| public string OriginalValue { get; private set; } = ""; // if no content is found, assume it is empty, not null (e.g. <original>A</original><variation/> for a deletion event) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete these comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistake. restored to former glory
| @@ -172,19 +157,7 @@ private void ParseEntryAttributes(XmlReader xml) | |||
| DatabaseVersionEntryTag = xml.GetAttribute("version"); | |||
| XmlnsEntryTag = xml.GetAttribute("xmlns"); | |||
| } | |||
| /// <summary> | |||
| /// Parses some attributes of a <sequence> XML element and assigns their values to the corresponding properties of the ProteinXmlEntry. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment was deleted again :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistake. restored to former glory
| /// <summary> | ||
| /// Computes the monoisotopic mass of the given sequence. | ||
| /// Returns 0 if the sequence is empty. | ||
| /// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistake. restored to former glory
| Sequence = ProteinDbLoader.SanitizeAminoAcidSequence(Sequence, 'X'); | ||
|
|
||
| // NEW: prune any sequence variants whose coordinates exceed the now-known sequence length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover uninformative AI comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised comment
…iantApplicationAndXml
| public string SimpleString() | ||
| { | ||
| return OriginalSequence + OneBasedBeginPosition.ToString() + VariantSequence; | ||
| if (OneBasedBeginPosition == OneBasedEndPosition || (OriginalSequence?.Length ?? 0) <= 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make this the ToString() method. Is there another ToString method that I missed?
| segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedBeginPosition <= OneBasedEndPosition; | ||
|
|
||
| internal bool Intersects(TruncationProduct segment) => | ||
| segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedBeginPosition <= OneBasedEndPosition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Intersects methods will only return true if the segment is a subset of this SequenceVariation, but not if they merely overlap, e.g.,
segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedEndPosition <= OneBasedEndPosition && segment.OneBasedBeginPosition OneBasedEndPosition
Is this the intended behavior?
| public string[] Effects { get; } = Array.Empty<string>(); | ||
| public string PutativeImpact { get; } = string.Empty; | ||
| public string GeneName { get; } = string.Empty; | ||
| public string GeneID { get; } = string.Empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we initializing fields in this fashion? This isn't done anywhere else in the code base
|
|
||
| public string[] Warnings { get; } | ||
| public string[] Warnings { get; } = Array.Empty<string>(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can wer remove lines 70-72, as they're unused?
| if (int.TryParse(a[14], out y)) DistanceToFeature = y; | ||
| Warnings = a[15].Split('&'); | ||
|
|
||
| string Get(int idx) => idx >= 0 && idx < a.Length ? a[idx] : string.Empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool use of local functions
| HGVSNotationDnaLevel = Get(9); | ||
| HGVSNotationProteinLevel = Get(10); | ||
|
|
||
| void ParseSlashField(string value, ref int first, ref int second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a private method?
| ParseSlashField(Get(13), ref pos, ref len); | ||
| OneBasedProteinPosition = pos; | ||
| ProteinLength = len; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use this syntax? Why not just continually redefine/overwrite the post and len variables
| } | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this
| true, DecoyType.None, new List<Modification>(), false, new List<string>(), | ||
| out um, | ||
| maxSequenceVariantsPerIsoform: 4, minAlleleDepth: 1, totalConsensusPlusVariantIsoforms: 1); | ||
| Assert.That(new_proteins.First().OneBasedPossibleLocalizedModifications.First().Value.First().NeutralLosses.First().Value.Count == 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the default behavior doesn't change, then should we be modifying method calls in test that aren't explicitly about variants?
| } | ||
|
|
||
| //[Test] | ||
| //[Explicit("Long-running diagnostic; generates protein_variant_log.txt with per-protein variant expansion results.")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes made are far reaching and extensive, which makes this PR difficult to review.
If possible, please reduce the size of this PR, eliminating unnecessary changes and all changes not directly related to sequence variants. If this PR can be split up into multiple, smaller PRs, that would be preferred.
| using MzLibUtil; | ||
| using Omics.BioPolymer; | ||
| using Omics.Modifications; | ||
| using System.Reflection; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is concerning
| { | ||
| var seq = created?.BaseSequence; | ||
| if (!string.IsNullOrEmpty(seq)) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside this try statement, you're using reflection to access properties of your variant. This is concerning, as reflection should rarely be used in production code. It's slow and finicky.
Can you refactor this to avoid using reflection?
| writer.WriteEndElement(); // feature or subfeature | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large chunks of this file were re-written. Are all rewrites necessary?
Overview of Major Changes
Core Additions (SequenceVariation)
• New semantic / convenience properties:
• SearchableAnnotation (VCF line preferred over Description).
• ReferenceAllele / AlternateAllele passthrough to VariantCallFormat.
• Classification flags: IsPointSubstitution, IsMultiResidueSubstitution, IsInsertion, IsDeletion, IsStopGain, IsLikelyFrameshift.
• LegacyVariantDescription (Obsolete shim).
• Genotype expansion:
• SplitPerGenotype(...) creates per‑sample variants with depth filtering, heterozygous handling, and optional reference emission (suppresses invalid no‑ops).
• Combination utilities:
• CombineEquivalent(...) merges equivalent variants (merges modifications, aggregates descriptions).
• Modification management:
• TryAddModification / AddModifications with positional validation against edited span, stop gains, deletions.
• Validation of OneBasedModifications during construction (throws on invalid positions).
• Enhanced validation logic:
• AreValid() now rejects “no‑op” variants (OriginalSequence == VariantSequence and no variant‑specific modifications).
• Frameshift heuristic (length change without insertion/deletion classification and not a stop gain).
• Equality / hashing:
• Description intentionally excluded from Equals / GetHashCode.
• Modification dictionaries compared as position + multiset of modification identifiers.
• SimpleString() retained; spans use Begin-End form.
• Robust constructor overload accepting parsed VariantCallFormat or raw VCF string.
Core Additions (VariantCallFormat)
• Lightweight VCF parser with:
• REF / ALT allele capture.
• ANN (SnpEff) extraction via SnpEffAnnotation (assumed existing).
• FORMAT token parsing into per‑sample dictionaries.
• GT parsing (handles '/', '|' separators; preserves '.' as missing).
• AD parsing with validation + numeric extraction (TryParseAD).
• AlleleIndex resolution against ANN’s allele.
• Zygosity classification per sample (Unknown / Homozygous / Heterozygous).
• Legacy boolean maps (Homozygous / Heterozygous) retained.
• Truncated line tolerance (IsTruncated).
• Validation helpers for restricted vocabularies.
Core Changes / Refactors (VariantApplication)
• GetVariantBioPolymers:
• Early exits clarified; over‑strict validation fallback (uses raw variants if all fail AreValid()).
• Uses new ApplyAllVariantCombinations pipeline.
• ApplyAllVariantCombinations:
• Genotype-aware expansion via SplitPerGenotype before combination enumeration.
• Collapses equivalent variants first (CombineEquivalent).
• Filters no‑ops aggressively.
• Deterministic combination generation; guards for invalid / overlapping variants (ValidCombination).
• Improved safety / resilience:
• Widespread try/catch wrappers prevent a single malformed variant from collapsing processing.
• SafeAreValid helper.
• ApplySingleVariant:
• Rebuilds SequenceVariation instance for applied result (keeping original span semantics).
• Adjusts truncation products, modifications, and subsequent variant coordinates.
• Handles premature stops (sequence truncation) and length recalculations (mass & length attribute normalization via reflection).
• Index adjustment helpers:
• AdjustSequenceVariationIndices: careful overlap & same VCF record handling.
• AdjustTruncationProductIndices: stop-gain aware length clipping.
• AdjustModificationIndices: preserves or shifts mods, avoids out-of-range issues.
• Name/accession formatting:
• GetVariantName aggregates de‑duplicated descriptors (VCF preferred).
• CombineDescriptions / CombineSimpleStrings centralize formatting.
• Nucleotide substitution mod conversion:
• ConvertNucleotideSubstitutionModificationsToSequenceVariants converts “nucleotide substitution” modifications into SequenceVariation entries, pruning the original modifications.
• Data sanitation:
• SanitizeVariantData(...) drops invalid variants, prunes impossible modification coordinates, reconciles applied variant list, yields human-readable notes.
Behavioral / Validation Shifts (Potential Breaking Changes)
• “No‑op” variants (no sequence change and no variant-specific modifications) now invalid (throw in constructor or AreValid() returns false). Existing workflows that stored reference-only variants must adapt (e.g., attach a modification or skip creation).
• Equality semantics changed (Description ignored, modifications included). Hash-based collections may re-cluster.
• Modification coordinate validation is stricter (invalid positions now throw on construction rather than being silently kept).
• SplitPerGenotype filters out heterozygous reference “ref copy” when it would be a no‑op (as asserted in tests).
• Frameshift heuristic might classify certain previous substitutions differently (downstream logic relying on old classification may need review).
New Test Coverage (SequenceVariationNewPropertiesTests)
• Verifies SearchableAnnotation precedence (VCF over description).
• Checks allele passthrough, classification predicates, stop-gain vs frameshift disambiguation.
• Ensures constructor rejects no‑ops unless variant-specific modifications exist.
• Validates modification position enforcement & deletion + modification rejection.
• Genotype splitting logic (only alt entries produced when ref copy invalid).
• CombineEquivalent merges descriptions and modifications.
• Equality semantics ignoring Description.
• Convenience constructor end coordinate logic.
• SimpleString formatting rules.
• LegacyVariantDescription continuity.
Reliability / Robustness Improvements
• Defensive cloning of modification lists when combining or splitting variants.
• Frequency-based multiset comparison for modifications (stable deterministic hashing).
• Reflection-based safe mass/length updates tolerant of missing or ambiguous residues.
• Comprehensive guards to prevent index drift after frameshifts / stops.
• SanitizeVariantData provides a pre-write data hygiene pipeline.
Performance Considerations
• Combination generation uses iterative index method (avoids recursion allocations).
• Early dismissal of invalid or depth-insufficient genotypes.
• OrderBy stages localized (only where deterministic ordering needed).
• HashCode struct used for consolidated hashing.
Suggested Upgrade Notes for Consumers