Whoops, I overlooked 'range', exemplified by a..z. For example, (0xC00..0xC7F) will serve to denote the characters of the Telugu block. That deals with the problem I wrongly highlighted in the second paragraph.
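To make the range concrete outside the grammar itself, here is a minimal Python sketch of what a codepoint range such as 0xC00..0xC7F accepts. The names `in_telugu_block` and `telugu_prefix_length` are made up for illustration and are not part of any tool; the only fact relied on is that the Telugu block spans U+0C00 to U+0C7F.

```python
# Bounds of the Telugu block, matching the range 0xC00..0xC7F above.
TELUGU_FIRST = 0x0C00
TELUGU_LAST = 0x0C7F


def in_telugu_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+0C00..U+0C7F."""
    return TELUGU_FIRST <= ord(ch) <= TELUGU_LAST


def telugu_prefix_length(text: str) -> int:
    """Count how many leading characters of text fall in the Telugu block.

    This mimics what a machine built from (0xC00..0xC7F)* would consume
    when the alphabet is one symbol per codepoint.
    """
    count = 0
    for ch in text:
        if not in_telugu_block(ch):
            break
        count += 1
    return count
```

This only holds when each input symbol is a whole codepoint. With a UTF-8 or UTF-16 alphabet the same block would have to be expressed as byte or code-unit sequences instead.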
How is one meant to include Unicode characters? Are they only supported as an arbitrary 32-bit alphabet as opposed to a 16- or 8-bit alphabet, or is there special support for them?
Is there some syntax for including Unicode characters in regular expressions? Mostly one can just include them as machines recognising a single character, e.g. /a/ 0x0302 /.*/ rather than writing the NFD form /â.*/ directly, but I can't work out how to specify a range of characters by codepoint, other than by exhaustively listing the entire range.
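For anyone unsure why 0x0302 turns up there: the NFD form of â is the letter a followed by U+0302 (combining circumflex accent). A small Python check using the standard `unicodedata` module shows the decomposition; the helper name `codepoints` is just for illustration.

```python
import unicodedata
from typing import List


def codepoints(text: str) -> List[int]:
    """Return the codepoints of text, one integer per character."""
    return [ord(ch) for ch in text]


# Precomposed form: U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX.
composed = "\u00E2"

# Canonical decomposition (NFD) splits it into base letter plus combining mark.
decomposed = unicodedata.normalize("NFD", composed)

# Recomposing (NFC) gets back to the single precomposed codepoint.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Here `codepoints(decomposed)` gives the two values 0x61 and 0x302, which is the sequence the expression /a/ 0x0302 spells out by hand.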
Is there any sanity-preserving way of combining Unicode and semantic conditions? Unicode actually only needs 21 bits, so there are 11 bits left over for use in semantic conditions. One could use UTF-8 or UTF-16, but UTF-8 is unpleasant and UTF-16 tends to preserve obscure bugs with lone surrogates.