Whats wrong with variant application and xml #956

trishorts · 2025-09-17T19:44:32Z

Overview of Major Changes

SequenceVariantDescription.cs renamed VariantCallFormat.cs – this was done for accuracy. All of the parameters of the old class actually map onto VCF file format. This class was not meant to contain generic data about sequence variants. So renaming improves expectations.
VariantCallFormat.cs extended in updated. All of the information about reading VCF files was done several years ago and unsurprisingly there have been updates to the information provided. I examined what is presently reported in VCF files and enabled the newly named method to be able to retrieve all of that information accurately. This revised class should now be able to handle most anything that we throw at it. Depth and Zygosity are key parameters. We use depth to filter which variants get applied. And Zygosity tells us if we have a complete or partial change when applying the variant.
SnpEffAnnotation revisions. Similar story to VCF. Revisions were mostly to better handle string splitting. SnpEff appears out of favor. I don’t expect lots of use. We are keeping this for backwards compatibility. The large block of commented code is for historical reference. It was and is not used. But, it does explain some things. Comments explaining the comments are added.
SequenceVariation revisions. One stop to handle variants from GPTMD, UniProt, VCF. There are a couple of versions because information passed can be different. Sometimes there is only a begin position and sometimes begin and end. We convert everything to the latter. Sequence variants have a modification dictionary because sequence variants can have modifications. In the case where we have a point mutation, it might seem like overkill. But we could have two different mods at the same point. In other cases where we have multi-amino acid change, the modification can be somewhere along the string. Keep a list of mods and positions helps. More later in VariantApplication. Description sometimes comes from VCF and sometimes it comes from UniProt. There is some difference in handling of this parameter. AreValid was a key method as not everything sent in is actually valid. Go figure. Split multi-sample VCF metadata was super confusing to me. But, it turns out that not every line in a VCF describes a single observation. In fact you could have a 100 different samples on the same vcf line. Each would have their own read depth and zygosity. The code here handles the variety of samples thrown at it.
VariantApplication – It was not straightforward deciding how to combine UniProt variants and VCF variants because VCF has depths and zygosity and uniprot does not. Also, counts were bugged. There was no way to specify how many variants per isoform and how many total isoforms. This had to be redone. And many old unit tests were effect by this change. A key in generating combinations of isoforms was making sure that there were no variant collisions. This happens routinely when you have two variants attempting to change the same sequence or sequence that has totally disappeared once a big truncation is applied. This is new code. There are a couple key aspects of applying variants. One, the sequence length can change, which changes all modification coordinates. Two, and modifications on a sequence variant need to get promoted to the variant protein modification dictionary so they are in the correct spot for digestion or not, etc. Then, the applied sequence variant needs to be removed from the list of sequence variants so its not hanging around like a sausage. Formerly, one only got variants from this method. Now it returns the consensus too. Still working to resolve best handling of this.
PeptideWithSetModifications. There were some big changes in the IntersectsAndIdentifiesVariation method. This method is replete with edge cases and was formerly bugged. Revised so all the old unit tests pass and better handling of edge cases. Many more comments.
DecoyProteinGenerator. Manual addition of nics write decoys method. This can be complicated with variants. When reversing a sequence one also needs to adjust all the coordinates of the sequence variants and their modifications. An extra complication is when the variant is at the start and we don’t reverse the start.
RNAdecoygenerator see decoyproteingenerator.
Proteindbloader has new parameters for LoadProteinXML that indicate the desired number of variants per isoform, required minimum sequencing depth and total combinatorial number of isoforms desired. Method converts and variants discovered as modifications in GPTMD to actual sequence variants. All constructors are temporarily retained that convert maxHeterozygousVariants to maxVariantIsoforms etc.
ProteinDbWriter. Fasta change b/c write decoys changes.

Core Additions (SequenceVariation)
• New semantic / convenience properties:
• SearchableAnnotation (VCF line preferred over Description).
• ReferenceAllele / AlternateAllele passthrough to VariantCallFormat.
• Classification flags: IsPointSubstitution, IsMultiResidueSubstitution, IsInsertion, IsDeletion, IsStopGain, IsLikelyFrameshift.
• LegacyVariantDescription (Obsolete shim).
• Genotype expansion:
• SplitPerGenotype(...) creates per‑sample variants with depth filtering, heterozygous handling, and optional reference emission (suppresses invalid no‑ops).
• Combination utilities:
• CombineEquivalent(...) merges equivalent variants (merges modifications, aggregates descriptions).
• Modification management:
• TryAddModification / AddModifications with positional validation against edited span, stop gains, deletions.
• Validation of OneBasedModifications during construction (throws on invalid positions).
• Enhanced validation logic:
• AreValid() now rejects “no‑op” variants (OriginalSequence == VariantSequence and no variant‑specific modifications).
• Frameshift heuristic (length change without insertion/deletion classification and not a stop gain).
• Equality / hashing:
• Description intentionally excluded from Equals / GetHashCode.
• Modification dictionaries compared as position + multiset of modification identifiers.
• SimpleString() retained; spans use Begin-End form.
• Robust constructor overload accepting parsed VariantCallFormat or raw VCF string.
Core Additions (VariantCallFormat)
• Lightweight VCF parser with:
• REF / ALT allele capture.
• ANN (SnpEff) extraction via SnpEffAnnotation (assumed existing).
• FORMAT token parsing into per‑sample dictionaries.
• GT parsing (handles '/', '|' separators; preserves '.' as missing).
• AD parsing with validation + numeric extraction (TryParseAD).
• AlleleIndex resolution against ANN’s allele.
• Zygosity classification per sample (Unknown / Homozygous / Heterozygous).
• Legacy boolean maps (Homozygous / Heterozygous) retained.
• Truncated line tolerance (IsTruncated).
• Validation helpers for restricted vocabularies.
Core Changes / Refactors (VariantApplication)
• GetVariantBioPolymers:
• Early exits clarified; over‑strict validation fallback (uses raw variants if all fail AreValid()).
• Uses new ApplyAllVariantCombinations pipeline.
• ApplyAllVariantCombinations:
• Genotype-aware expansion via SplitPerGenotype before combination enumeration.
• Collapses equivalent variants first (CombineEquivalent).
• Filters no‑ops aggressively.
• Deterministic combination generation; guards for invalid / overlapping variants (ValidCombination).
• Improved safety / resilience:
• Widespread try/catch wrappers prevent a single malformed variant from collapsing processing.
• SafeAreValid helper.
• ApplySingleVariant:
• Rebuilds SequenceVariation instance for applied result (keeping original span semantics).
• Adjusts truncation products, modifications, and subsequent variant coordinates.
• Handles premature stops (sequence truncation) and length recalculations (mass & length attribute normalization via reflection).
• Index adjustment helpers:
• AdjustSequenceVariationIndices: careful overlap & same VCF record handling.
• AdjustTruncationProductIndices: stop-gain aware length clipping.
• AdjustModificationIndices: preserves or shifts mods, avoids out-of-range issues.
• Name/accession formatting:
• GetVariantName aggregates de‑duplicated descriptors (VCF preferred).
• CombineDescriptions / CombineSimpleStrings centralize formatting.
• Nucleotide substitution mod conversion:
• ConvertNucleotideSubstitutionModificationsToSequenceVariants converts “nucleotide substitution” modifications into SequenceVariation entries, pruning the original modifications.
• Data sanitation:
• SanitizeVariantData(...) drops invalid variants, prunes impossible modification coordinates, reconciles applied variant list, yields human-readable notes.
Behavioral / Validation Shifts (Potential Breaking Changes)
• “No‑op” variants (no sequence change and no variant-specific modifications) now invalid (throw in constructor or AreValid() returns false). Existing workflows that stored reference-only variants must adapt (e.g., attach a modification or skip creation).
• Equality semantics changed (Description ignored, modifications included). Hash-based collections may re-cluster.
• Modification coordinate validation is stricter (invalid positions now throw on construction rather than being silently kept).
• SplitPerGenotype filters out heterozygous reference “ref copy” when it would be a no‑op (as asserted in tests).
• Frameshift heuristic might classify certain previous substitutions differently (downstream logic relying on old classification may need review).
New Test Coverage (SequenceVariationNewPropertiesTests)
• Verifies SearchableAnnotation precedence (VCF over description).
• Checks allele passthrough, classification predicates, stop-gain vs frameshift disambiguation.
• Ensures constructor rejects no‑ops unless variant-specific modifications exist.
• Validates modification position enforcement & deletion + modification rejection.
• Genotype splitting logic (only alt entries produced when ref copy invalid).
• CombineEquivalent merges descriptions and modifications.
• Equality semantics ignoring Description.
• Convenience constructor end coordinate logic.
• SimpleString formatting rules.
• LegacyVariantDescription continuity.
Reliability / Robustness Improvements
• Defensive cloning of modification lists when combining or splitting variants.
• Frequency-based multiset comparison for modifications (stable deterministic hashing).
• Reflection-based safe mass/length updates tolerant of missing or ambiguous residues.
• Comprehensive guards to prevent index drift after frameshifts / stops.
• SanitizeVariantData provides a pre-write data hygiene pipeline.
Performance Considerations
• Combination generation uses iterative index method (avoids recursion allocations).
• Early dismissal of invalid or depth-insufficient genotypes.
• OrderBy stages localized (only where deterministic ordering needed).
• HashCode struct used for consolidated hashing.
Suggested Upgrade Notes for Consumers

Audit any creation of “reference-only” variants; adapt to new validation or skip creating them.
Revisit equality-based dictionaries/sets: Description no longer differentiates variants.
Ensure downstream tools expecting original modification leniency handle earlier exceptions.
If using pre-existing variant combination logic, validate new genotype expansion doesn’t inflate search space unexpectedly; tune maxSequenceVariantsPerIsoform / maxSequenceVariantIsoforms.
Review reporting paths that consumed previous variant name/accession formatting (they may now be longer with multiple descriptors).

…lternate isoforms

codecov · 2025-09-18T15:56:19Z

Codecov Report

❌ Patch coverage is 88.07286% with 203 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.89%. Comparing base (a682b7a) to head (3d73744).

Files with missing lines	Patch %	Lines
mzLib/Omics/BioPolymer/VariantApplication.cs	81.73%	49 Missing and 48 partials ⚠️
mzLib/Omics/BioPolymer/SequenceVariation.cs	91.96%	1 Missing and 35 partials ⚠️
...Databases/DecoyGeneration/DecoyProteinGenerator.cs	88.97%	14 Missing and 13 partials ⚠️
...roteolyticDigestion/PeptideWithSetModifications.cs	88.88%	5 Missing and 12 partials ⚠️
mzLib/Omics/BioPolymer/VariantCallFormat.cs	88.13%	3 Missing and 11 partials ⚠️
...micsDatabases/DecoyGeneration/RnaDecoyGenerator.cs	94.48%	6 Missing and 1 partial ⚠️
mzLib/Omics/BioPolymer/SnpEffAnnotation.cs	93.67%	0 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #956      +/-   ##
==========================================
- Coverage   80.97%   80.89%   -0.09%     
==========================================
  Files         269      269              
  Lines       38744    39999    +1255     
  Branches     4228     4551     +323     
==========================================
+ Hits        31374    32357     +983     
- Misses       6639     6779     +140     
- Partials      731      863     +132

Files with missing lines	Coverage Δ
mzLib/Omics/Modifications/Modification.cs	`97.47% <ø> (ø)`
mzLib/Proteomics/Protein/Protein.cs	`97.23% <100.00%> (-0.19%)`	⬇️
mzLib/UsefulProteomicsDatabases/ProteinDbLoader.cs	`95.29% <ø> (-0.73%)`	⬇️
mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs	`91.32% <ø> (-4.16%)`	⬇️
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs	`96.17% <ø> (-3.35%)`	⬇️
...ProteomicsDatabases/Transcriptomics/RnaDbLoader.cs	`78.70% <ø> (-14.99%)`	⬇️
mzLib/Omics/BioPolymer/SnpEffAnnotation.cs	`88.32% <93.67%> (+9.40%)`	⬆️
...micsDatabases/DecoyGeneration/RnaDecoyGenerator.cs	`94.73% <94.48%> (+11.01%)`	⬆️
mzLib/Omics/BioPolymer/VariantCallFormat.cs	`88.13% <88.13%> (ø)`
...roteolyticDigestion/PeptideWithSetModifications.cs	`92.66% <88.88%> (-3.27%)`	⬇️
... and 3 more

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ants

Copilot

Pull Request Overview

This PR addresses two issues with variant application in protein XML parsing: 1) prevents crashes when overlapping sequence variants are applied together, and 2) filters out variants that target alternate protein isoforms rather than the base sequence.

Adds validation to prevent application of overlapping sequence variants that would cause position conflicts
Filters out sequence variants that reference alternate isoforms (e.g., "Q96J88-3") to avoid misapplication
Refactors the ParseAnnotatedMods method for improved readability

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
ProteinXmlEntry.cs	Adds isoform filtering logic and refactors ParseAnnotatedMods method
VariantApplication.cs	Adds overlap validation for variant combinations
TestProteomicsReadWrite.cs	Adds tests for long substitution conflicts and alternate isoform handling
Test.csproj	Includes new test XML files in build output
longSubstitution.xml	Test case with overlapping variants
sequenceVariantOnAlternateIsoform.xml	Test case with variants on alternate isoforms

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-18T16:29:56Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

+                    appliesToThisSequence =
+                        LocationSequenceId.Equals(acc, StringComparison.OrdinalIgnoreCase)
+                        || (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase));


[nitpick] The hardcoded '-1' suffix check assumes the base isoform is always numbered as '-1', but this may not be consistent across all protein databases. Consider using a more flexible approach or documenting this assumption.

Copilot · 2025-09-18T16:29:56Z

mzLib/Omics/BioPolymer/VariantApplication.cs

        public static List<TBioPolymerType> GetVariantBioPolymers<TBioPolymerType>(this TBioPolymerType protein, int maxAllowedVariantsForCombinatorics = 4, int minAlleleDepth = 1)
            where TBioPolymerType : IHasSequenceVariants
        {
+            if(maxAllowedVariantsForCombinatorics == 0)


Missing space after 'if' keyword violates C# formatting conventions.

Suggested change

if(maxAllowedVariantsForCombinatorics == 0)

if (maxAllowedVariantsForCombinatorics == 0)

Copilot · 2025-09-18T16:29:56Z

mzLib/Omics/BioPolymer/VariantApplication.cs

            {
                foreach (var combo in GetCombinations(variations, size))
                {
+                    if(!ValidCombination(combo.ToList()))


Missing space after 'if' keyword violates C# formatting conventions.

Suggested change

if(!ValidCombination(combo.ToList()))

if (!ValidCombination(combo.ToList()))

Copilot · 2025-09-18T16:29:57Z

mzLib/Omics/BioPolymer/VariantApplication.cs

            {
                foreach (var combo in GetCombinations(variations, size))
                {
+                    if(!ValidCombination(combo.ToList()))


Calling ToList() on each combination creates unnecessary allocations. Consider changing ValidCombination to accept IEnumerable instead of List.

Suggested change

if(!ValidCombination(combo.ToList()))

if(!ValidCombination(combo))

Copilot

Pull Request Overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-18T23:17:51Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

+                    string acc = Accession ?? "";
+                    appliesToThisSequence =
+                        LocationSequenceId.Equals(acc, StringComparison.OrdinalIgnoreCase)
+                        || (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase));


The hardcoded '-1' suffix check assumes a specific isoform naming pattern. Consider using a more flexible approach that checks if LocationSequenceId starts with the accession followed by a dash, or make this configurable to handle different isoform naming conventions.

Suggested change

|| (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase));

|| (!string.IsNullOrEmpty(acc) && LocationSequenceId.StartsWith(acc + "-", StringComparison.OrdinalIgnoreCase));

…uring read

…unit tests

…iantDescription

…ub.com/trishorts/mzLib into whatsWrongWithVariantApplicationAndXml

…mments.

… combines sequence variants with the same effect

…iantApplicationAndXml

nbollis

Some issues left over from copilot:

Rewrote entire sections of code that should not have been affected by changing how variants operate.
Deleted a bunch of comments unnecessarily.
Commented about what change in the code occurred due to your copilot prompt, but does not add any understanding to what is happening and why in the function.

Max variant isoforms being 1 to get the canonical out feels bad. The sequence we get out is not a variant isoform.

nbollis · 2025-10-21T21:27:53Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

        public int CodingDomainSequenceLengthIncludingStopCodon { get; }
        public int OneBasedProteinPosition { get; }
        public int ProteinLength { get; }
-


Why delete this comment?

all comments back in snpeff

nbollis · 2025-10-21T21:39:28Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

+        /// Positions are validated in <see cref="AreValid"/> against the altered span (<see cref="VariantSequence"/>).
        /// </summary>
-        public int OneBasedEndPosition { get; }
+        public Dictionary<int, List<Modification>> OneBasedModifications { get; }


Do we need to store the full mod dictionary here? This would (likely) more than double the memory footprint of the modifications.

short answer is yes.

nbollis · 2025-10-21T21:40:03Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

+            OriginalSequence?.Length > 1 &&
+            VariantSequence?.Length == OriginalSequence.Length &&
+            OriginalSequence != VariantSequence &&
+            !IsPointSubstitution;


!IsPointSubstitution is a redundant check as the length check already does this.

nbollis · 2025-10-21T21:40:20Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

+        /// <summary>
+        /// True if variant introduces a stop (* at end).
+        /// </summary>
+        public bool IsStopGain => VariantSequence?.EndsWith("*", StringComparison.Ordinal) == true;


if bool == true?

nbollis · 2025-10-21T21:41:44Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

+        /// Heuristic frameshift flag: length difference not equal & not simple stop gain only.
+        /// (Refine if you have explicit annotation elsewhere.)
+        /// </summary>
+        public bool IsLikelyFrameshift =>


What do you mean by frame shift? How can we evaluate a frame shift from only protein sequences?

nbollis · 2025-10-21T23:32:01Z

mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs

Why did the entirety of XML writing change here? I would assume only the variant relevant components needed to change?

nbollis · 2025-10-21T23:33:07Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

@@ -30,7 +31,7 @@ public class ProteinXmlEntry
        public string FeatureDescription { get; private set; }
        public string SubFeatureType { get; private set; }
        public string SubFeatureDescription { get; private set; }
-        public string OriginalValue { get; private set; } = ""; // if no content is found, assume it is empty, not null (e.g. <original>A</original><variation/> for a deletion event)


why delete these comments?

mistake. restored to former glory

nbollis · 2025-10-21T23:33:39Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

@@ -172,19 +157,7 @@ private void ParseEntryAttributes(XmlReader xml)
            DatabaseVersionEntryTag = xml.GetAttribute("version");
            XmlnsEntryTag = xml.GetAttribute("xmlns");
        }
-        /// <summary>
-        /// Parses some attributes of a &lt;sequence&gt; XML element and assigns their values to the corresponding properties of the ProteinXmlEntry.


Comment was deleted again :(

mistake. restored to former glory

nbollis · 2025-10-21T23:33:53Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

-        /// <summary>
-        /// Computes the monoisotopic mass of the given sequence.
-        /// Returns 0 if the sequence is empty.
-        /// </summary>


mistake. restored to former glory

nbollis · 2025-10-21T23:34:28Z

mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs

                Sequence = ProteinDbLoader.SanitizeAminoAcidSequence(Sequence, 'X');

+                // NEW: prune any sequence variants whose coordinates exceed the now-known sequence length


leftover uninformative AI comment

revised comment

…iantApplicationAndXml

Alexander-Sol · 2025-10-22T18:50:57Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

        public string SimpleString()
        {
-            return OriginalSequence + OneBasedBeginPosition.ToString() + VariantSequence;
+            if (OneBasedBeginPosition == OneBasedEndPosition || (OriginalSequence?.Length ?? 0) <= 1)


Why not just make this the ToString() method. Is there another ToString method that I missed?

Alexander-Sol · 2025-10-22T18:55:04Z

mzLib/Omics/BioPolymer/SequenceVariation.cs

+            segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedBeginPosition <= OneBasedEndPosition;
+
+        internal bool Intersects(TruncationProduct segment) =>
+            segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedBeginPosition <= OneBasedEndPosition;


These Intersects methods will only return true if the segment is a subset of this SequenceVariation, but not if they merely overlap, e.g.,
segment.OneBasedEndPosition >= OneBasedBeginPosition && segment.OneBasedEndPosition <= OneBasedEndPosition && segment.OneBasedBeginPosition OneBasedEndPosition

Is this the intended behavior?

Alexander-Sol · 2025-10-22T19:08:06Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

+        public string[] Effects { get; } = Array.Empty<string>();
+        public string PutativeImpact { get; } = string.Empty;
+        public string GeneName { get; } = string.Empty;
+        public string GeneID { get; } = string.Empty;


Why are we initializing fields in this fashion? This isn't done anywhere else in the code base

Alexander-Sol · 2025-10-22T19:08:52Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

-
-        public string[] Warnings { get; }
+        public string[] Warnings { get; } = Array.Empty<string>();



Can wer remove lines 70-72, as they're unused?

Alexander-Sol · 2025-10-22T19:11:42Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

-            if (int.TryParse(a[14], out y)) DistanceToFeature = y;
-            Warnings = a[15].Split('&');

+            string Get(int idx) => idx >= 0 && idx < a.Length ? a[idx] : string.Empty;


Cool use of local functions

Alexander-Sol · 2025-10-22T19:12:05Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

+            HGVSNotationDnaLevel = Get(9);
+            HGVSNotationProteinLevel = Get(10);
+
+            void ParseSlashField(string value, ref int first, ref int second)


Could this be a private method?

Alexander-Sol · 2025-10-22T19:14:40Z

mzLib/Omics/BioPolymer/SnpEffAnnotation.cs

+                ParseSlashField(Get(13), ref pos, ref len);
+                OneBasedProteinPosition = pos;
+                ProteinLength = len;
+            }


Why use this syntax? Why not just continually redefine/overwrite the post and len variables

Alexander-Sol · 2025-10-22T19:18:33Z

mzLib/Omics/Modifications/Modification.cs

Remove this

Alexander-Sol · 2025-10-22T19:30:21Z

mzLib/Test/DatabaseTests/TestDatabaseLoaders.cs

+                true, DecoyType.None, new List<Modification>(), false, new List<string>(),
+                out um,
+                maxSequenceVariantsPerIsoform: 4, minAlleleDepth: 1, totalConsensusPlusVariantIsoforms: 1);
            Assert.That(new_proteins.First().OneBasedPossibleLocalizedModifications.First().Value.First().NeutralLosses.First().Value.Count == 2);


If the default behavior doesn't change, then should we be modifying method calls in test that aren't explicitly about variants?

Alexander-Sol · 2025-10-22T19:32:58Z

mzLib/Test/DatabaseTests/TestProteomicsReadWrite.cs

        }
+
+        //[Test]
+        //[Explicit("Long-running diagnostic; generates protein_variant_log.txt with per-protein variant expansion results.")]


Remove commented code

Alexander-Sol

The changes made are far reaching and extensive, which makes this PR difficult to review.

If possible, please reduce the size of this PR, eliminating unnecessary changes and all changes not directly related to sequence variants. If this PR can be split up into multiple, smaller PRs, that would be preferred.

Alexander-Sol · 2025-10-22T20:12:47Z

mzLib/Omics/BioPolymer/VariantApplication.cs

 using MzLibUtil;
-using Omics.BioPolymer;
 using Omics.Modifications;
+using System.Reflection; 


This is concerning

Alexander-Sol · 2025-10-22T20:40:58Z

mzLib/Omics/BioPolymer/VariantApplication.cs

+            {
+                var seq = created?.BaseSequence;
+                if (!string.IsNullOrEmpty(seq))
+                {


Inside this try statement, you're using reflection to access properties of your variant. This is concerning, as reflection should rarely be used in production code. It's slow and finicky.

Can you refactor this to avoid using reflection?

Alexander-Sol · 2025-10-22T21:27:20Z

mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs

+                    writer.WriteEndElement(); // feature or subfeature
+                }
+            }
+        }


Large chunks of this file were re-written. Are all rewrites necessary?

trishorts added 2 commits September 17, 2025 13:02

d

650f2fa

its broke

132b6ca

trishorts added the bug label Sep 17, 2025

trishorts added 2 commits September 17, 2025 15:06

string unneccessary lines from xml

aed5b13

eliminates conflicting variant combinations and ignores variants on a…

afb11e2

…lternate isoforms

trishorts requested a review from Copilot September 18, 2025 15:48

eliminate unused code

d3f8ab2

enable return of base proteins from xml with no applied sequence vari…

87d1bd1

…ants

trishorts requested review from Copilot, nbollis and zhuoxinshi and removed request for Copilot and nbollis September 18, 2025 16:28

Copilot AI reviewed Sep 18, 2025

View reviewed changes

trishorts requested a review from Copilot September 18, 2025 23:17

trishorts marked this pull request as ready for review September 18, 2025 23:17

Merge branch 'master' into whatsWrongWithVariantApplicationAndXml

4720899

Copilot AI reviewed Sep 18, 2025

View reviewed changes

zhuoxinshi previously approved these changes Sep 19, 2025

View reviewed changes

trishorts added 6 commits September 22, 2025 11:10

nucleotide substitution modifications converted to squence variants d…

71543b0

…uring read

correctly set max variants per isoform and max isoform variants with …

3f90baf

…unit tests

still trouble correctly handling VCF sequence variants

312a385

add extensive comments to SequenceVariantDescription

197a4f8

comprehensive tests for VariantCallFormat formerly called SequenceVar…

241d2a6

…iantDescription

Merge branch 'whatsWrongWithVariantApplicationAndXml' of https://gith…

b2fb89f

…ub.com/trishorts/mzLib into whatsWrongWithVariantApplicationAndXml

trishorts dismissed zhuoxinshi’s stale review via b2fb89f September 25, 2025 15:24

trishorts added 3 commits September 25, 2025 13:03

update VariantCallFormat to improve handling of zygosity and added co…

b09b482

…mments.

new code that splits sequence variants by genotypes and new code that…

6ca8a19

… combines sequence variants with the same effect

we now have one method to apply all variants

7f94034

trishorts added 2 commits October 21, 2025 11:55

organization

1020526

Merge remote-tracking branch 'upstream/master' into whatsWrongWithVar…

085116c

…iantApplicationAndXml

trishorts added the ready for review label Oct 21, 2025

nbollis requested changes Oct 21, 2025

View reviewed changes

trishorts added 12 commits October 22, 2025 10:28

j

4319338

tests

c4d52e8

okay so peptide with set mods intersects was bugged

8292548

ok cool

0618dce

unused code

a2eb07c

summary comments

663d07e

g

01b9de2

comments

6029fda

Merge remote-tracking branch 'upstream/master' into whatsWrongWithVar…

d90c2de

…iantApplicationAndXml

return comments to snpeff

ed48f63

rename maxIsoforms

4dfbd11

better not perfect

3d73744

Alexander-Sol reviewed Oct 22, 2025

View reviewed changes

mzLib/Omics/Modifications/Modification.cs

}

}

Copy link

Contributor

Alexander-Sol Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

Alexander-Sol reviewed Oct 22, 2025

View reviewed changes

Alexander-Sol requested changes Oct 22, 2025

View reviewed changes

Alexander-Sol reviewed Oct 22, 2025

View reviewed changes

	if(maxAllowedVariantsForCombinatorics == 0)
	if (maxAllowedVariantsForCombinatorics == 0)

	if(!ValidCombination(combo.ToList()))
	if (!ValidCombination(combo.ToList()))

	if(!ValidCombination(combo.ToList()))
	if(!ValidCombination(combo))

	\|\| (!string.IsNullOrEmpty(acc) && LocationSequenceId.Equals($"{acc}-1", StringComparison.OrdinalIgnoreCase));
	\|\| (!string.IsNullOrEmpty(acc) && LocationSequenceId.StartsWith(acc + "-", StringComparison.OrdinalIgnoreCase));

		Sequence = ProteinDbLoader.SanitizeAminoAcidSequence(Sequence, 'X');

		// NEW: prune any sequence variants whose coordinates exceed the now-known sequence length


		public string[] Warnings { get; }
		public string[] Warnings { get; } = Array.Empty<string>();

Whats wrong with variant application and xml #956

Are you sure you want to change the base?

Whats wrong with variant application and xml #956

Uh oh!

Conversation

trishorts commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

nbollis left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Alexander-Sol Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trishorts commented Sep 17, 2025 •

edited

Loading

codecov bot commented Sep 18, 2025 •

edited

Loading

nbollis left a comment •

edited

Loading

Alexander-Sol Oct 22, 2025 •

edited

Loading

Alexander-Sol left a comment •

edited

Loading