Skip to content

GSP fails to detect non-contiguous subsequences (e.g., ('a','c')) due to contiguous matcher #115

@project-2-2-2

Description

@project-2-2-2

Summary

The current implementation of is_subsequence_in_list checks for contiguous subsequences only.
However, the GSP algorithm’s standard definition requires ordered but non-contiguous subsequences to be considered valid.
This causes patterns like ('a','c') to be missed even when their support exceeds the minimum threshold.

Steps to Reproduce

from gsppy.gsp import GSP

sequences = [
    ['a','b','c'],
    ['a','c'],
    ['b','c','a'],
    ['a','b','c','d'],
]

gsp = GSP(sequences)
freq = gsp.search(min_support=0.5)

# Expected: ('a','c') appears (support = 3)
# Observed: pattern missing

Expected Behavior

('a','c') should be identified as a frequent subsequence (support = 3 / 4 ≥ 0.5).

Observed Behavior

Pattern omitted because the current matcher only detects contiguous subsequences.

Root Cause

In gsppy/utils.py, is_subsequence_in_list uses slice equality:

any(sequence[i:i+len_sub] == subsequence ...)

This enforces contiguity rather than order-only matching.

Proposed Fix

Replace the contiguous slice check with an order-based matcher (see PR #XXX).
Optionally add parameters:

  • contiguous (bool) – enforce adjacency when needed.
  • max_gap (int | None) – constrain gap length.

Impact

  • Fixes support calculation for all non-contiguous subsequences.
  • Aligns implementation with canonical GSP semantics.
  • Adds flexibility for future gap-constrained pattern mining.

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions