Update to motif handling #55

@LeonHafner

Description of feature

I'm using this issue to sort my thoughts regarding the problem with duplicated motifs and to guide the implementation.


Summary

When a user provides a motif file containing multiple motifs with the same symbol name but different PWMs (and thus different JASPAR IDs), our current pipeline runs into ambiguity and statistical issues. The pipeline operates primarily on symbols, but symbols are not unique identifiers, which causes problems for both downstream tools (like FIMO) and statistical testing.


Problems

1. PWM selection ambiguity

  • If multiple motifs share the same symbol, we could average their affinities for downstream analysis.
  • However, this prevents us from identifying the correct PWM to use for tools like FIMO, which require a single PWM per motif.

2. Statistical testing violations

  • We currently use the Mann–Whitney U test, which assumes sample independence between:
    • Foreground: affinities of the TF being tested.
    • Background: affinities of all other TFs.
  • Related TFs (same symbol, different IDs) are more similar to each other than to unrelated TFs, breaking this assumption.

Proposed Solutions

Option 1: Keep only the first occurrence per symbol

  • Remove all other motifs with the same symbol.
  • Guarantees statistical testing assumptions remain valid.
  • Fully compatible with downstream tools like FIMO.
  • Warn the user if motifs are removed.
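
A minimal sketch of Option 1, assuming motifs arrive as `(motif_id, symbol)` pairs in file order and that symbols are compared by exact string match. The function name `keep_first_per_symbol` and the IDs and symbols in the example are hypothetical, not taken from the pipeline.

```python
import warnings


def keep_first_per_symbol(motifs):
    """Keep the first motif per symbol; return (kept, removed).

    `motifs` is a list of (motif_id, symbol) tuples in file order.
    Assumption: symbols are compared by exact string match.
    """
    seen = set()
    kept, removed = [], []
    for motif_id, symbol in motifs:
        if symbol in seen:
            removed.append((motif_id, symbol))
        else:
            seen.add(symbol)
            kept.append((motif_id, symbol))
    if removed:
        # Tell the user which motifs were dropped.
        dropped = ", ".join(f"{mid} ({sym})" for mid, sym in removed)
        warnings.warn(f"Removed {len(removed)} motif(s) with duplicated symbols: {dropped}")
    return kept, removed


# Hypothetical IDs and symbols, for illustration only.
example = [("M0001.1", "TFA"), ("M0002.1", "TFB"), ("M0001.2", "TFA")]
kept, removed = keep_first_per_symbol(example)
```

Here "first" depends on the order of the motif file, so the PWM that survives is only as meaningful as that ordering.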

Option 2: Merge motifs after affinity computation

  • Use all motifs as input.
  • After computing affinities with STARE, merge motifs with the same symbol.
  • Ensures valid statistical testing by keeping only one TF per symbol-family.
  • A merged symbol cannot be mapped back to a specific PWM, so downstream tools like FIMO must be disabled.
  • Warn the user about disabled downstream steps.
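
The merge step of Option 2 could look roughly like the sketch below. It assumes affinities are available as one list of per-region values per motif ID, and it merges by taking the per-region mean. The issue does not fix the merge function, so the mean is an assumption here, as are the name `merge_affinities_by_symbol` and all example values.

```python
from collections import defaultdict
from statistics import mean


def merge_affinities_by_symbol(affinities, id_to_symbol):
    """Collapse motif-level affinities to one vector per symbol.

    affinities:   {motif_id: [affinity per region]}
    id_to_symbol: {motif_id: symbol}
    Assumptions: all vectors have the same length and order of regions,
    and merging is done with the per-region mean.
    """
    grouped = defaultdict(list)
    for motif_id, values in affinities.items():
        grouped[id_to_symbol[motif_id]].append(values)
    # zip(*rows) iterates region by region across all motifs of a symbol.
    return {
        symbol: [mean(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Hypothetical IDs, symbols, and values, for illustration only.
affinities = {"M0001.1": [1.0, 3.0], "M0001.2": [3.0, 5.0], "M0002.1": [0.5, 0.5]}
id_to_symbol = {"M0001.1": "TFA", "M0001.2": "TFA", "M0002.1": "TFB"}
merged = merge_affinities_by_symbol(affinities, id_to_symbol)
```

After this step the result is keyed by symbol only, which is why no individual PWM can be handed to FIMO.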

Option 3: Adaptive background adjustment (most complex)

  • Keep all motifs, including duplicates by symbol.
  • For each statistical test:
    • Remove TFs from the background that share the foreground TF’s symbol.
    • Merge background TFs with the same symbol.
  • Ensures no related TFs are in the same test group.
  • Produces p-values for each TF-ID, enabling direct PWM mapping.
  • Fully compatible with downstream tools like FIMO.
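
The background construction in Option 3 can be sketched as follows. The sketch only builds the two groups for a single test and leaves the Mann–Whitney U test itself out. It assumes the same per-motif affinity layout as above and again uses the per-region mean to merge background motifs, which the issue does not specify. `build_test_groups` and the example data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_test_groups(foreground_id, affinities, id_to_symbol):
    """Return (foreground, background) for testing one TF-ID.

    foreground: affinity vector of the tested motif ID.
    background: {symbol: merged affinity vector}, excluding every motif
                that shares the foreground symbol.
    Assumption: background motifs with the same symbol are merged by mean.
    """
    fg_symbol = id_to_symbol[foreground_id]
    grouped = defaultdict(list)
    for motif_id, values in affinities.items():
        symbol = id_to_symbol[motif_id]
        if symbol == fg_symbol:
            # Skips the foreground motif itself and its same-symbol relatives.
            continue
        grouped[symbol].append(values)
    background = {
        symbol: [mean(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }
    return affinities[foreground_id], background


# Hypothetical IDs, symbols, and values, for illustration only.
affinities = {
    "M0001.1": [1.0, 2.0],
    "M0001.2": [1.5, 2.5],
    "M0002.1": [0.25, 0.5],
    "M0002.2": [0.75, 1.0],
    "M0003.1": [0.125, 0.25],
}
id_to_symbol = {
    "M0001.1": "TFA", "M0001.2": "TFA",
    "M0002.1": "TFB", "M0002.2": "TFB",
    "M0003.1": "TFC",
}
foreground, background = build_test_groups("M0001.1", affinities, id_to_symbol)
```

Because the loop runs once per motif ID, each ID gets its own test and p-value, which keeps the mapping to a single PWM intact. The cost is that the background has to be rebuilt for every test.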

Implementation Plan

  • Add a pipeline parameter to select the strategy for handling duplicate motifs: remove (Option 1), merge (Option 2), or keep (Option 3).
  • Implement each option as described above.
  • Add user warnings wherever motifs are removed or downstream tool compatibility is affected.
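
One way the parameter and its warnings might be wired up is sketched below. Python is used only for illustration, since the issue does not say where the parameter will be parsed. The names `DuplicateMotifStrategy`, `parse_strategy`, and `strategy_warnings` are hypothetical, and the warning texts are placeholders.

```python
from enum import Enum


class DuplicateMotifStrategy(Enum):
    REMOVE = "remove"  # Option 1: keep only the first occurrence per symbol
    MERGE = "merge"    # Option 2: merge motifs after affinity computation
    KEEP = "keep"      # Option 3: adaptive background adjustment


def parse_strategy(value):
    """Validate the user-supplied parameter value (case-insensitive)."""
    try:
        return DuplicateMotifStrategy(value.strip().lower())
    except ValueError:
        valid = ", ".join(s.value for s in DuplicateMotifStrategy)
        raise ValueError(
            f"Unknown duplicate-motif strategy {value!r}; expected one of: {valid}"
        ) from None


def strategy_warnings(strategy, n_removed=0):
    """Collect the user-facing warnings implied by the chosen strategy."""
    messages = []
    if strategy is DuplicateMotifStrategy.REMOVE and n_removed > 0:
        messages.append(f"{n_removed} motif(s) with duplicated symbols were removed.")
    if strategy is DuplicateMotifStrategy.MERGE:
        messages.append(
            "Motifs were merged by symbol; downstream steps that need a "
            "single PWM per motif (e.g. FIMO) are disabled."
        )
    return messages


strategy = parse_strategy("merge")
messages = strategy_warnings(strategy)
```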


Labels: enhancement (New feature or request)
