Don't expand character match length with flag `i` (unless using a new flag) #351
Description
Currently, Oniguruma sometimes applies Unicode's SpecialCasing.txt rules when using flag `i`, which can lengthen the match of a character, character class, or set (like `\w` or `\S`). For example, `(?i)^ß$` matches `'ss'`, and `(?i)^ss$` matches `'ß'`.
I don't think Oniguruma should do that unless the behavior is applied behind a dedicated flag or option. If such a flag were added, the behavior could be applied more consistently than it is now, since users would be opting in and the performance implications could be documented.
Following is my understanding of the reasons for and against the current behavior. Are there additional reasons I'm missing?
Reasons to continue expanding the length of a match
- Changing it now would be a breaking change.
- It follows Unicode recommendations and the Unicode org's ICU regex engine.
- It might be a big or complex change in the code (a lot of work).
- It sometimes solves a real issue with complex case differences.
Opinion: Even though it solves a real casing problem, the problem usually isn't relevant in the context of regular expressions. And when it is, there's usually an easy workaround, or the problem didn't need solving in the first place (for example, because the user wrote `\w+` and so it would already match both `ß` and `ss`).
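As a minimal sketch of the kind of workaround meant above, a pattern can spell out both forms with an explicit alternation instead of relying on the engine to expand match length. JavaScript is used here only as a convenient flavor whose flag `i` never expands match length; the variable names are illustrative.

```javascript
// Explicit alternation: the pattern author opts in to matching both forms,
// so no engine-level length expansion is needed.
const eszettOrSs = /^(?:ß|ss)$/iu;

// Try a few targets; only the lone 's' should fail.
const results = ['ß', 'ss', 'SS', 'ẞ', 's'].map((target) => [
  target,
  eszettOrSs.test(target),
]);

for (const [target, matched] of results) {
  console.log(`${matched ? '✅' : '❌'} '${target}'`);
}
```

The same alternation is portable to the other flavors listed in this issue, since it does not depend on any special casing behavior.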
Reasons to stop expanding the length
- It is currently applied inconsistently anyway, based on complicated and nonintuitive conditions that few users will understand (I will show examples below).
- The fact that e.g. `(?i)\w` and `(?i)[\w]` are not equivalent makes it hard to reason about or refactor regexes (similar to "Flag `W` should not interfere with Unicode case folding from flag `i`" #349).
- It hurts performance (sometimes catastrophically), as discussed in "Make character classes atomic with flag `i`" #350.
- It cannot be fully "fixed" (applied consistently) without hurting performance even more, for things like the dot `.`.
- It is not portable with other regex flavors that don't do this, including Perl, PCRE2, JavaScript, Java, .NET, and Rust.
- Most of the time, the behavior is surprising (this would change if users had to opt in).
- Most of the time, the behavior is undesired, since it is simply wrong in terms of following user intent. For example, if someone uses `\S` in a character class, in essentially 100% of cases they do not mean to match `'ss'`, `'ff'`, `'fl'`, etc. They mean "any single character that is not whitespace". And despite "ss" being the case conversion of a single character, it is not itself a single character in any context/language.
If you accept my statements above, unfortunately it means that, in exchange for ① the added complexity in the engine, ② the inconsistency/unpredictability for users, and ③ the resulting performance problems, users get behavior that in almost all cases they didn't expect or want.
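To make the multi-character strings mentioned above concrete, the sketch below uses JavaScript's `String.prototype.toUpperCase`, which applies full case conversion. It is only an illustration of where strings like `'ss'`, `'ff'`, and `'fl'` come from; it is not Oniguruma's folding code, and the variable names are illustrative.

```javascript
// Full case conversion can turn one character into two.
// \u00DF = ß, \uFB00 = ﬀ (ligature ff), \uFB02 = ﬂ (ligature fl)
const expansions = ['\u00DF', '\uFB00', '\uFB02'].map((ch) => ({
  char: ch,
  upper: ch.toUpperCase(),
  lengthBefore: ch.length,
  lengthAfter: ch.toUpperCase().length,
}));
console.log(expansions);

// What users typically mean by \S: exactly one non-whitespace character.
const singleNonSpace = /^\S$/u;
const oneChar = singleNonSpace.test('s');
const twoChars = singleNonSpace.test('ss');
console.log({ oneChar, twoChars });
```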
Recent precedent from JavaScript
JavaScript is an interesting regex flavor to compare to, because in version ES2024 it added flag `v` (`unicodeSets`), which allows character classes and Unicode properties to match more than one character at a time using a few specific "properties of strings" like `\p{RGI_Emoji}` (which can also be used in character classes) or the new syntax `[…\q{…|…}]`. However, even though JavaScript character classes and Unicode properties can now match more than one character, and even though JavaScript flags `u`/`v` change flag `i` to use Unicode case folding, JavaScript nevertheless did not choose to apply Unicode's special casing rules that change match length (like `ß` ↔ `ss`).
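The JavaScript behavior described above can be checked directly. This is a small sketch using flag `u` (chosen because it runs on older runtimes than flag `v`; for these cases I would expect `v` to behave the same); the property names in `checks` are illustrative.

```javascript
const checks = {
  // Length-changing mappings are not applied, in either direction.
  eszettMatchesSs: /^ß$/iu.test('ss'),
  ssMatchesEszett: /^ss$/iu.test('ß'),
  // Single-character Unicode case folding is applied with flag u...
  sMatchesLongS: /^s$/iu.test('ſ'),
  eszettMatchesCapitalEszett: /^ß$/iu.test('ẞ'),
  // ...but not without it.
  sMatchesLongSWithoutU: /^s$/i.test('ſ'),
};
console.log(checks);
```

So JavaScript keeps one-to-one folding such as `s` ↔ `ſ` while leaving out the one-to-many mappings that this issue is about.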
Current Oniguruma behavior
Following are the tests I ran to help me understand the current behavior. Each entry shows the regex and target string for one test. `r` marks raw strings (without backslash escaping).
✅ = match
❌ = no match
🤔 = inconsistent or questionable behavior
🤯 = very surprising
[
// Single `s` doesn't map to small sharp s (German eszett, ß) or its case equivalents
[r`(?i)^s$`, 'ß'], // ❌
[r`(?i)^s$`, 'ss'], // ❌
[r`(?i)^[s]$`, 'ß'], // ❌
[r`(?i)^[s]$`, 'ss'], // ❌
[r`(?i)^ß$`, 's'], // ❌
[r`(?i)^[ß]$`, 's'], // ❌
// Single `s` does map to its case equivalent small long s
[r`(?i)^s$`, 'ſ'], // ✅
// Single `ß` maps to `ss` and its case equivalents
[r`(?i)^ß$`, 'ß'], // ✅
[r`(?i)^ß$`, 'ss'], // ✅
[r`(?i)^ß$`, 'SS'], // ✅
[r`(?i)^ß$`, 'ſſ'], // ✅
[r`(?i)^ß$`, 'sS'], // ✅
[r`(?i)^ß$`, 'sſ'], // ✅
[r`(?i)^ß$`, 'Ss'], // ✅
[r`(?i)^ß$`, 'Sſ'], // ✅
[r`(?i)^ß$`, 'ſs'], // ✅
[r`(?i)^ß$`, 'ſS'], // ✅
[r`(?i)^ß$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
[r`(?i)^ẞ$`, 'ß'], // ✅ Uppercase `ẞ` in pattern
// The same, within a positive class
[r`(?i)^[ß]$`, 'ß'], // ✅
[r`(?i)^[ß]$`, 'ss'], // ✅
[r`(?i)^[ß]$`, 'SS'], // ✅
[r`(?i)^[ß]$`, 'ſſ'], // ✅
[r`(?i)^[ß]$`, 'sS'], // ✅
[r`(?i)^[ß]$`, 'sſ'], // ✅
[r`(?i)^[ß]$`, 'Ss'], // ✅
[r`(?i)^[ß]$`, 'Sſ'], // ✅
[r`(?i)^[ß]$`, 'ſs'], // ✅
[r`(?i)^[ß]$`, 'ſS'], // ✅
[r`(?i)^[ß]$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
[r`(?i)^[ẞ]$`, 'ß'], // ✅ Uppercase `ẞ` in pattern
// Negated class basics; nothing surprising here
[r`(?i)^[^ß]$`, 'ß'], // ❌
[r`(?i)^[^ß]$`, 'ss'], // ❌
[r`(?i)^[^s]$`, 'ß'], // ✅
[r`(?i)^[^s]$`, 'ss'], // ❌
[r`(?i)^[^ſ]$`, 'ß'], // ✅
[r`(?i)^[^ſ]$`, 'ss'], // ❌
[r`(?i)^[^ẞ]$`, 'ß'], // ❌ Uppercase `ẞ` in pattern
[r`(?i)^[^ẞ]$`, 'ss'], // ❌ Uppercase `ẞ` in pattern
// Other representations of exactly `ß` are OK
[r`(?i)^\x{DF}$`, 'ss'], // ✅
// But not sets that include `ß` 🤔
[r`(?i)^\w$`, 'ss'], // ❌
[r`(?i)^\p{Word}$`, 'ss'], // ❌
[r`(?i)^\D$`, 'ss'], // ❌
[r`(?i)^.$`, 'ss'], // ❌
[r`(?i)^\O$`, 'ss'], // ❌
[r`(?i)^\p{Any}$`, 'ss'], // ❌
// Within positive classes, other representations of `ß`, and sets/ranges that include `ß`, are OK
[r`(?i)^[\x{DF}]$`, 'ss'], // ✅
[r`(?i)^[\x{DE}-\x{E0}]$`, 'ss'], // ✅
[r`(?i)^[\w]$`, 'ss'], // ✅
[r`(?i)^[\p{Word}]$`, 'ss'], // ✅
[r`(?i)^[[:word:]]$`, 'ss'], // ✅
[r`(?i)^[\D]$`, 'ss'], // ✅
[r`(?i)^[\P{M}]$`, 'ss'], // ✅
[r`(?i)^[\p{Any}]$`, 'ss'], // ✅
// But not within negated classes 🤔
[r`(?i)^[^[^\x{DF}]]$`, 'ss'], // ❌
[r`(?i)^[^\0]$`, 'ss'], // ❌
[r`(?i)^[^\W]$`, 'ss'], // ❌
[r`(?i)^[^\d]$`, 'ss'], // ❌
[r`(?i)^[^\p{M}]$`, 'ss'], // ❌
// The negation rule is about negation of the outermost class, only 🤔
[r`(?i)^[^[\W]]$`, 'ss'], // ❌
[r`(?i)^[[^\W]]$`, 'ss'], // ✅ 🤯
[r`(?i)^[\w&&[^\W]]$`, 'ss'], // ✅ 🤯
// Flags `W` and `P` exclude `ß` from `\w`
[r`(?iP)^[\w]$`, 'ss'], // ❌
[r`(?iW)^[\w]$`, 'ss'], // ❌
[r`(?iW)^\w$`, 'ss'], // ❌
[r`(?iW)^[ß]$`, 'ss'], // ✅
[r`(?iW)^ß$`, 'ss'], // ✅
[r`(?iW)^\x{DF}$`, 'ss'], // ✅
// Quantifier basics; nothing surprising here
[r`(?i)^ß{2}$`, 'ßß'], // ✅
[r`(?i)^ß{2}$`, 'ss'], // ❌
[r`(?i)^ß{2}$`, 'ssss'], // ✅
[r`(?i)^[ß]{2}$`, 'ss'], // ❌
[r`(?i)^[ß]{2}$`, 'ssss'], // ✅
[r`(?i)^[^ß]{2}$`, 'ss'], // ✅
[r`(?i)^[^ß]{2}$`, 'ssss'], // ❌
// Character classes are affected by backtracking (bad news for performance!) 🤔
[r`(?i)^[\w]{2}$`, 'ss'], // ✅
[r`(?i)^[\w]{2}$`, 'sss'], // ✅
[r`(?i)^[\w]{2}$`, 'ssss'], // ✅
// In the reverse direction
[r`(?i)^ss$`, 'ss'], // ✅
[r`(?i)^ss$`, 'ß'], // ✅
[r`(?i)^ſſ$`, 'ß'], // ✅
[r`(?i)^ss$`, 'ẞ'], // ✅ Uppercase `ẞ`
// In the reverse direction with quantifiers; nothing surprising here
[r`(?i)^s{2}$`, 'ß'], // ❌
[r`(?i)^ss{2}$`, 'ßß'], // ❌
[r`(?i)^(?:ss){2}$`, 'ßß'], // ✅
]
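For anyone who wants to rerun a list like the one above, here is a minimal harness sketch. It assumes a hypothetical adapter `matchFn(pattern, target)` that you would implement against whichever Oniguruma binding you use; no real binding API is shown. The `jsStandIn` function is only a stand-in so the sketch runs on its own, and it does not reproduce Oniguruma's results.

```javascript
// `r` builds raw strings, so backslashes in patterns need no escaping.
const r = String.raw;

// Runs [pattern, target] pairs through a caller-supplied engine function.
// `matchFn` is a hypothetical adapter, not part of any real library.
function runCases(cases, matchFn) {
  return cases.map(([pattern, target]) => {
    const matched = Boolean(matchFn(pattern, target));
    return { pattern, target, matched, mark: matched ? '✅' : '❌' };
  });
}

// Stand-in engine for demonstration only: strips a leading `(?i)` and uses
// JavaScript's RegExp with flags `iu`. This is NOT Oniguruma's behavior.
function jsStandIn(pattern, target) {
  const caseInsensitive = pattern.startsWith('(?i)');
  const source = caseInsensitive ? pattern.slice(4) : pattern;
  return new RegExp(source, caseInsensitive ? 'iu' : 'u').test(target);
}

const demo = runCases(
  [
    [r`(?i)^s$`, 'ſ'],
    [r`(?i)^ß$`, 'ss'],
    [r`(?i)^\w$`, 's'],
  ],
  jsStandIn,
);

for (const { pattern, target, mark } of demo) {
  console.log(`${mark} ${pattern} on '${target}'`);
}
```

Swapping `jsStandIn` for a real Oniguruma adapter and comparing the marks against the expected ones in the list would show where a given build differs.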