This repository was archived by the owner on Apr 24, 2025. It is now read-only.

Don't expand character match length with flag i (unless using a new flag) #351

@slevithan

Currently, Oniguruma sometimes applies Unicode's SpecialCasing.txt rules when using flag i, which can lengthen the match of a character, character class, or set (like \w or \S). For example, (?i)^ß$ matches 'ss', and (?i)^ss$ matches 'ß'.
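To make the length change concrete, here is a minimal sketch of the difference between simple (one-to-one) and full (possibly one-to-many) case folding. The `FOLDS` table is a tiny hand-copied subset of Unicode case-folding data, and `fold` is a hypothetical helper written for illustration; neither reflects how Oniguruma is implemented.

```javascript
// Illustrative subset of Unicode case-folding data (not a complete table).
// "simple" folds are always a single code point; "full" folds may be longer.
const FOLDS = {
  '\u00DF': { simple: '\u00DF', full: 'ss' }, // ß
  '\u1E9E': { simple: '\u00DF', full: 'ss' }, // ẞ
  '\u017F': { simple: 's', full: 's' },       // ſ
  '\uFB01': { simple: '\uFB01', full: 'fi' }, // ﬁ
};

// Hypothetical helper: fold a string one code point at a time.
// Characters missing from the table fall back to toLowerCase().
function fold(str, mode) {
  let out = '';
  for (const ch of str) {
    const entry = FOLDS[ch];
    out += entry ? entry[mode] : ch.toLowerCase();
  }
  return out;
}

// Full folding makes 'ß' and 'SS' compare equal, at the cost of changing length.
console.log(fold('ß', 'full') === fold('SS', 'full'));     // true
console.log(fold('ß', 'full').length);                     // 2

// Simple folding keeps one code point per input code point.
console.log(fold('ß', 'simple') === fold('SS', 'simple')); // false
console.log(fold('ẞ', 'simple') === fold('ß', 'simple'));  // true
```

A matcher built on simple folding can compare one character at a time. A matcher built on full folding has to consider that one pattern character may consume two target characters, and the reverse.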

I don't think Oniguruma should do that, unless the behavior is applied behind a dedicated flag or option. And if such a flag were added, the behavior could be applied more consistently than it is now, since users would be opting in and the performance implications could be documented.

Following is my understanding of the reasons for and against the current behavior. Are there additional reasons I'm missing?

Reasons to continue expanding the length of a match

  • Changing it now would be a breaking change.
  • It follows Unicode recommendations and the Unicode org's ICU regex engine.
  • It might be a big or complex change in the code (a lot of work).
  • It sometimes solves a real issue with complex case differences.

Opinion: Even though it solves a real casing problem, the problem usually isn't relevant in the context of regular expressions. And when it is, there's usually an easy workaround, or the problem didn't need solving in the first place (for example, because the user wrote \w+ and so it would already match both ß and ss).
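As an illustration of the kind of workaround meant here, the equivalence can be spelled out with an explicit alternation exactly where it is wanted. This is a minimal sketch in JavaScript, a flavor that does not expand match length; the example word and the variable name are only illustrative.

```javascript
// Opt in to the ß/ss equivalence explicitly, only where it is intended.
const street = /^stra(?:ß|ss)e$/iu;

console.log(street.test('Straße'));  // true
console.log(street.test('STRASSE')); // true
console.log(street.test('strase'));  // false
```

Because the alternation is visible in the pattern, the reader can see that this position may consume either one or two characters.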

Reasons to stop expanding the length

  • It is currently applied inconsistently anyway, based on complicated and nonintuitive conditions that few users will understand (I will show examples below).
  • The fact that e.g. (?i)\w and (?i)[\w] are not equivalent makes it hard to reason about or refactor regexes (similar to "Flag W should not interfere with Unicode case folding from flag i", #349).
  • It hurts performance (sometimes catastrophically), as discussed in "Make character classes atomic with flag i", #350.
  • It cannot be fully "fixed" (applied consistently) without hurting performance even more, for things like the dot (.).
  • It is not portable with other regex flavors that don't do this, including Perl, PCRE2, JavaScript, Java, .NET, and Rust.
  • Most of the time, the behavior is surprising (this would change if users had to opt in).
  • Most of the time, the behavior is undesired, since it is simply wrong in terms of following user intent. For example, if someone uses \S in a character class, in essentially 100% of cases they do not mean to match 'ss', 'ff', 'fl', etc. They mean "any single character that is not whitespace". And despite "ss" being the case conversion of a single character, it is not itself a single character in any context/language.

If you accept my statements above, unfortunately it means that, in exchange for ① the added complexity in the engine, ② the inconsistency/unpredictability for users, and ③ the resulting performance problems, users get behavior that in almost all cases they didn't expect or want.

Recent precedent from JavaScript

JavaScript is an interesting regex flavor to compare to, because in ES2024 it added flag v (unicodeSets), which allows character classes and Unicode properties to match more than one character at a time, using a few specific "properties of strings" like \p{RGI_Emoji} (which can also be used in character classes) or the new syntax […\q{…|…}]. However, even though JavaScript character classes and Unicode properties can now match more than one character, and even though flags u/v change flag i to use Unicode case folding, JavaScript chose not to apply Unicode's special casing rules that change match length (like ß → ss).

Current Oniguruma behavior

Following are the tests I ran to help me understand the current behavior. Each entry shows the regex and the target string. r is for raw strings (without backslash escaping).

✅ = match
❌ = no match
🤔 = inconsistent or questionable behavior
🤯 = very surprising

[
  // Single `s` doesn't map to small sharp s (German eszett, ß) or its case equivalents
  [r`(?i)^s$`, 'ß'], // ❌
  [r`(?i)^s$`, 'ss'], // ❌
  [r`(?i)^[s]$`, 'ß'], // ❌
  [r`(?i)^[s]$`, 'ss'], // ❌
  [r`(?i)^ß$`, 's'], // ❌
  [r`(?i)^[ß]$`, 's'], // ❌

  // Single `s` does map to its case equivalent small long s
  [r`(?i)^s$`, 'ſ'], // ✅

  // Single `ß` maps to `ss` and its case equivalents
  [r`(?i)^ß$`, 'ß'], // ✅
  [r`(?i)^ß$`, 'ss'], // ✅
  [r`(?i)^ß$`, 'SS'], // ✅
  [r`(?i)^ß$`, 'ſſ'], // ✅
  [r`(?i)^ß$`, 'sS'], // ✅
  [r`(?i)^ß$`, 'sſ'], // ✅
  [r`(?i)^ß$`, 'Ss'], // ✅
  [r`(?i)^ß$`, 'Sſ'], // ✅
  [r`(?i)^ß$`, 'ſs'], // ✅
  [r`(?i)^ß$`, 'ſS'], // ✅
  [r`(?i)^ß$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
  [r`(?i)^ẞ$`, 'ß'], // ✅ Uppercase `ẞ` in pattern

  // The same, within a positive class
  [r`(?i)^[ß]$`, 'ß'], // ✅
  [r`(?i)^[ß]$`, 'ss'], // ✅
  [r`(?i)^[ß]$`, 'SS'], // ✅
  [r`(?i)^[ß]$`, 'ſſ'], // ✅
  [r`(?i)^[ß]$`, 'sS'], // ✅
  [r`(?i)^[ß]$`, 'sſ'], // ✅
  [r`(?i)^[ß]$`, 'Ss'], // ✅
  [r`(?i)^[ß]$`, 'Sſ'], // ✅
  [r`(?i)^[ß]$`, 'ſs'], // ✅
  [r`(?i)^[ß]$`, 'ſS'], // ✅
  [r`(?i)^[ß]$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
  [r`(?i)^[ẞ]$`, 'ß'], // ✅ Uppercase `ẞ` in pattern

  // Negated class basics; nothing surprising here
  [r`(?i)^[^ß]$`, 'ß'], // ❌
  [r`(?i)^[^ß]$`, 'ss'], // ❌
  [r`(?i)^[^s]$`, 'ß'], // ✅
  [r`(?i)^[^s]$`, 'ss'], // ❌
  [r`(?i)^[^ſ]$`, 'ß'], // ✅
  [r`(?i)^[^ſ]$`, 'ss'], // ❌
  [r`(?i)^[^ẞ]$`, 'ß'], // ❌ Uppercase `ẞ` in pattern
  [r`(?i)^[^ẞ]$`, 'ss'], // ❌ Uppercase `ẞ` in pattern

  // Other representations of exactly `ß` are OK
  [r`(?i)^\x{DF}$`, 'ss'], // ✅

  // But not sets that include `ß` 🤔
  [r`(?i)^\w$`, 'ss'], // ❌
  [r`(?i)^\p{Word}$`, 'ss'], // ❌
  [r`(?i)^\D$`, 'ss'], // ❌
  [r`(?i)^.$`, 'ss'], // ❌
  [r`(?i)^\O$`, 'ss'], // ❌
  [r`(?i)^\p{Any}$`, 'ss'], // ❌

  // Within positive classes, other representations of `ß`, and sets/ranges that include `ß`, are OK
  [r`(?i)^[\x{DF}]$`, 'ss'], // ✅
  [r`(?i)^[\x{DE}-\x{E0}]$`, 'ss'], // ✅
  [r`(?i)^[\w]$`, 'ss'], // ✅
  [r`(?i)^[\p{Word}]$`, 'ss'], // ✅
  [r`(?i)^[[:word:]]$`, 'ss'], // ✅
  [r`(?i)^[\D]$`, 'ss'], // ✅
  [r`(?i)^[\P{M}]$`, 'ss'], // ✅
  [r`(?i)^[\p{Any}]$`, 'ss'], // ✅

  // But not within negated classes 🤔
  [r`(?i)^[^[^\x{DF}]]$`, 'ss'], // ❌
  [r`(?i)^[^\0]$`, 'ss'], // ❌
  [r`(?i)^[^\W]$`, 'ss'], // ❌
  [r`(?i)^[^\d]$`, 'ss'], // ❌
  [r`(?i)^[^\p{M}]$`, 'ss'], // ❌

  // The negation rule is about negation of the outermost class, only 🤔
  [r`(?i)^[^[\W]]$`, 'ss'], // ❌
  [r`(?i)^[[^\W]]$`, 'ss'], // ✅ 🤯
  [r`(?i)^[\w&&[^\W]]$`, 'ss'], // ✅ 🤯

  // Flags `W` and `P` exclude `ß` from `\w`
  [r`(?iP)^[\w]$`, 'ss'], // ❌
  [r`(?iW)^[\w]$`, 'ss'], // ❌
  [r`(?iW)^\w$`, 'ss'], // ❌
  [r`(?iW)^[ß]$`, 'ss'], // ✅
  [r`(?iW)^ß$`, 'ss'], // ✅
  [r`(?iW)^\x{DF}$`, 'ss'], // ✅

  // Quantifier basics; nothing surprising here
  [r`(?i)^ß{2}$`, 'ßß'], // ✅
  [r`(?i)^ß{2}$`, 'ss'], // ❌
  [r`(?i)^ß{2}$`, 'ssss'], // ✅
  [r`(?i)^[ß]{2}$`, 'ss'], // ❌
  [r`(?i)^[ß]{2}$`, 'ssss'], // ✅
  [r`(?i)^[^ß]{2}$`, 'ss'], // ✅
  [r`(?i)^[^ß]{2}$`, 'ssss'], // ❌

  // Character classes are affected by backtracking (bad news for performance!) 🤔
  [r`(?i)^[\w]{2}$`, 'ss'], // ✅
  [r`(?i)^[\w]{2}$`, 'sss'], // ✅
  [r`(?i)^[\w]{2}$`, 'ssss'], // ✅

  // In the reverse direction
  [r`(?i)^ss$`, 'ss'], // ✅
  [r`(?i)^ss$`, 'ß'], // ✅
  [r`(?i)^ſſ$`, 'ß'], // ✅
  [r`(?i)^ss$`, 'ẞ'], // ✅ Uppercase `ẞ`

  // In the reverse direction with quantifiers; nothing surprising here
  [r`(?i)^s{2}$`, 'ß'], // ❌
  [r`(?i)^ss{2}$`, 'ßß'], // ❌
  [r`(?i)^(?:ss){2}$`, 'ßß'], // ✅
]
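A table in this shape can be driven by a small runner. The sketch below is an assumption about how such a harness might look, not the one used for the results above: `runCases` and `jsMatches` are hypothetical names, the third element of each case (the expected result) is added for illustration, and the stand-in matcher uses JavaScript's own RegExp rather than Oniguruma, so its expectations reflect JavaScript behavior.

```javascript
// `r` is assumed to be String.raw, so backslashes in patterns need no escaping.
const r = String.raw;

// Hypothetical runner: `matches(pattern, target)` is supplied by the caller
// and should wrap whichever regex engine is under test.
function runCases(cases, matches) {
  return cases.map(([pattern, target, expected]) => {
    const actual = matches(pattern, target);
    return { pattern, target, expected, actual, ok: actual === expected };
  });
}

// Stand-in matcher using JavaScript RegExp (NOT Oniguruma): it strips a
// leading (?i) and applies flags i and u instead.
function jsMatches(pattern, target) {
  const source = pattern.replace(/^\(\?i\)/, '');
  return new RegExp(source, 'iu').test(target);
}

// Expected values below are for JavaScript, which does not expand match length.
const results = runCases(
  [
    [r`(?i)^s$`, 'ſ', true],
    [r`(?i)^ß$`, 'ss', false],
    [r`(?i)^\w$`, 'ss', false],
  ],
  jsMatches
);

for (const { pattern, target, ok } of results) {
  console.log(ok ? 'pass' : 'FAIL', pattern, JSON.stringify(target));
}
```

Swapping `jsMatches` for a function that calls an Oniguruma binding would let the same table be replayed against Oniguruma, with the expected values taken from the ✅/❌ markers above.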
