why doesn't ripgrep case insensitive search work as expected with the Turkish dotless i? #2221
Question and answer taken from: https://news.ycombinator.com/item?id=31471112
The specific reason is hard to articulate precisely, but it basically boils down to "difficult to implement." The UTS#18 spec is a tortured document. I think it's better that it exists than not, but if you look at its history, it's undergone quite a bit of evolution. For example, there used to be a "level 3" of UTS#18, but it was retracted: https://unicode.org/reports/tr18/#Tailored_Support

And to be clear, in order to implement the Turkish dotless 'i' stuff correctly, your implementation needs to have that "level 3" support for custom tailoring based on locale. So you could actually elevate your question to the Unicode consortium itself.

I'm not plugged into the Unicode consortium and its decision making process, but based on what I've read and my experience implementing regex engines, the answer to your question is reasonably simple: it is difficult to implement. ripgrep doesn't even have "level 2" support in its regex engine, never mind a retracted "level 3" support for custom tailoring. And indeed, most regex engines don't bother with level 2 either. Hell, many don't bother with level 1.

OK OK, so what is this "difficulty"? The issue comes from how regex engines are implemented. And even that is hard to explain, because regex engines are themselves split into two major ideas: unbounded backtracking regex engines that typically support oodles of features (think Perl and PCRE) and regex engines based on finite automata. (Hybrids exist too!) I personally don't know so much about the former, but know a lot about the latter. So that's what I'll speak to.

Before the era of Unicode, most things just assumed ASCII, everything was byte oriented and things were glorious. If you wanted to implement a DFA, its alphabet just consisted of the obvious: the 256 possible byte values. That means your transition table had states as rows and each possible byte value as columns.
Depending on how big your state pointers are, even this is quite massive! (Assuming state pointers are the size of an actual pointer, then on x86_64 targets, just 10 states would use 10x256x8 = ~20KB of memory. Yikes.)

But once Unicode came along, your regex engine really wants to know about codepoints. For example, what does '[^a]' match? Does it match any byte except for 'a'? Well, that would be just horrendous on UTF-8 encoded text, because it might give you a match in the middle of a codepoint. No, '[^a]' wants to match "every codepoint except for 'a'."

So then you think: well, now your alphabet is just the set of all Unicode codepoints. Well, that's huge. What happens to your transition table size? It's intractable, so you switch to a sparse representation, e.g., using a hashmap to map the current state and the current codepoint to the next state. Well... Ouch. A hashmap lookup for every transition, when previously it was just some simple arithmetic and a pointer dereference? You're looking at a huge slowdown. Too huge to be practical.

So what do you do? Well, you build UTF-8 into your automaton itself. It makes the automaton bigger, but you retain your small alphabet size. Here, I'll show you. The first example is byte oriented while the second is Unicode aware:
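(The two automaton listings referenced above didn't survive the copy from the original thread. As a rough stand-in — in Python rather than ripgrep's actual Rust internals, purely for illustration — here is the table-size arithmetic and a demonstration of why a byte-oriented '[^a]' misbehaves on UTF-8:)

```python
import re

# Dense DFA transition table cost: states x alphabet x pointer size.
# Assumes 8-byte state pointers and a 256-entry byte alphabet.
states, alphabet, ptr_size = 10, 256, 8
print(states * alphabet * ptr_size)  # 20480 bytes, i.e. ~20KB

# Byte-oriented '[^a]' on UTF-8 text happily matches *inside* a codepoint:
text = "λa".encode("utf-8")          # b'\xce\xbba'
print(re.findall(rb"[^a]", text))    # [b'\xce', b'\xbb'] -- the two bytes of 'λ'

# Codepoint-oriented '[^a]' yields one match for the whole codepoint:
print(re.findall(r"[^a]", "λa"))     # ['λ']
```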
This doesn't look like a huge increase in complexity, but that's only because '[^a]' is simple. Try using something like '\w' and you need hundreds of states.

But that's just codepoints. UTS#18 level 2 support requires "full" case folding, which includes the possibility of some codepoints mapping to multiple codepoints when doing caseless matching. For example, 'ß' should match 'SS', but the latter is two codepoints, not one. So that is considered part of "full" case folding. "Simple" case folding, which is all that is required by UTS#18 level 1, limits itself to caseless matching for codepoints that are 1-to-1. That is, codepoints whose case folding maps to exactly one other codepoint. UTS#18 even talks about this [1], noting specifically that it is difficult for regex engines to support. Hell, it looks like even "full" case folding has been retracted from "level 2" support. [2]

The reason why "full" case folding is difficult is that regex engine designs are oriented around the codepoint as the logical unit on which to match. If "full" case folding were permitted, that would mean, for example, that '(?i)[^a]' would actually be able to match more than one codepoint. This turns out to be exceptionally difficult to implement, at least in finite automata based regex engines.

Now, I don't believe the Turkish dotless-i problem involves multiple codepoints, but it does require custom tailoring. And that means the regex engine would need to be parameterized over a locale. AFAIK, the only regex engines that even attempt this are POSIX regex engines and maybe ICU's regex engine. Otherwise, any custom tailoring that's needed is left up to the application.

The bottom line is that custom tailoring and "full" case matching don't tend to matter enough to be worth implementing correctly in most regex engines. Usually the application can work around it if it cares enough. For example, the application could replace dotless-i/dotted-I with dotted-i/dotless-I before running a regex query.
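(A quick way to see simple-vs-full folding and the locale problem in practice. This uses Python's re, which, like most engines, implements only simple folding; the `tr_fold` helper is a hypothetical sketch of the application-level workaround described above, not anything ripgrep does:)

```python
import re

# Simple (1-to-1) case folding, which is what Python's re -- like most
# engines -- implements, cannot match 'ß' against the two codepoints 'SS':
print(re.fullmatch(r"(?i)ß", "SS"))  # None
# Full case folding would map 'ß' to 'ss':
print("ß".casefold())                # 'ss'

# Locale-unaware case mapping pairs I <-> i, so the Turkish pairs don't
# round-trip: dotless 'ı' uppercases to plain 'I', which lowercases to 'i'.
print("ı".upper())                   # 'I'
print("I".lower())                   # 'i', not 'ı'

# Hypothetical application-level workaround: remap the Turkish forms
# on both needle and haystack before handing them to the regex engine.
def tr_fold(s: str) -> str:
    return s.replace("İ", "i").replace("I", "ı").casefold()
```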
The same thing applies to normalization. [3] Regex engines never (I'm not aware of any that do) take Unicode normal forms into account. Instead, the application needs to handle that sort of stuff. So never mind Turkish special cases; you might not find an 'é' when you search for an 'é':
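(The shell session that originally followed here didn't survive either; a minimal Python sketch of the same footgun, assuming the NFC and NFD encodings of 'é':)

```python
import re
import unicodedata

nfc = "\u00e9"   # 'é' as one codepoint (NFC form)
nfd = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

print(nfc == nfd)           # False: visually identical, different codepoints
print(re.search(nfc, nfd))  # None: the regex engine compares codepoints
# The application has to normalize both sides before searching:
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
```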
Unicode is hard. Tooling is littered with footguns. Sometimes you just have to work to find them. The Turkish dotless-i just happens to be a fan favorite example.