-
-
Notifications
You must be signed in to change notification settings - Fork 23
Open
Labels
Description
Summary
The current implementation of is_subsequence_in_list
checks for contiguous subsequences only.
However, the GSP algorithm’s standard definition requires ordered but non-contiguous subsequences to be considered valid.
This causes patterns like ('a','c')
to be missed even when their support exceeds the minimum threshold.
Steps to Reproduce
from gsppy.gsp import GSP
sequences = [
['a','b','c'],
['a','c'],
['b','c','a'],
['a','b','c','d'],
]
gsp = GSP(sequences)
freq = gsp.search(min_support=0.5)
# Expected: ('a','c') appears (support = 3)
# Observed: pattern missing
Expected Behavior
('a','c')
should be identified as a frequent subsequence (support = 3 / 4 ≥ 0.5).
Observed Behavior
Pattern omitted because the current matcher only detects contiguous subsequences.
Root Cause
In gsppy/utils.py
, is_subsequence_in_list
uses slice equality:
any(sequence[i:i+len_sub] == subsequence ...)
This enforces contiguity rather than order-only matching.
Proposed Fix
Replace the contiguous slice check with an order-based matcher (see PR #XXX).
Optionally add parameters:
contiguous
(bool) – enforce adjacency when needed.max_gap
(int | None) – constrain gap length.
Impact
- Fixes support calculation for all non-contiguous subsequences.
- Aligns implementation with canonical GSP semantics.
- Adds flexibility for future gap-constrained pattern mining.