1 file changed, +11 -1 lines changed

@@ -335,7 +335,17 @@ are in the character. */
 /* A continuation byte in a UTF-8 encoded sequence contributes this number of
  * low-order bits to the specification of the code point. In the bit
  * maps above, you see that the first 2 bits are a constant '10', leaving 6 of
- * real information */
+ * real information. (If you're really curious, the only two numbers that work
+ * out for this on an 8-bit byte are 5 and 6. Since the first two bits are
+ * already taken, a maximum of 6 are available for anything else. If 6 is
+ * used, there are 64 possible continuations 80-BF. With 5, there are 32,
+ * A0-BF. And with 4 there would be 0 continuations possible; an
+ * impossibility. So 5 is the minimum. UTF-EBCDIC I8 (Intermediate 8) is just
+ * setting this to 5. We could have a UTF-8 encoding that is based on ASCII,
+ * but uses just 5 bits of payload per continuation byte. The reason someone
+ * might want to do this is to extend the set of characters that occupy a
+ * single byte when encoded in this hypothetical UTF-8 to additionally include
+ * the C1 controls.) */
 # define UTF_CONTINUATION_BYTE_INFO_BITS 6

 /* ^? is defined to be DEL on ASCII systems. See the definition of toCTRL()
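
The arithmetic in the new comment can be checked with a small sketch. The helpers below are hypothetical and are not part of this header. They assume, as the two ranges quoted in the comment (80-BF and A0-BF) suggest, that the continuation range always ends at 0xBF and that every payload bit pattern is a valid continuation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; none of these names exist
 * in the header being patched. */

/* Mask selecting the payload bits of a continuation byte:
 * 0x3F for 6 info bits, 0x1F for 5. */
static uint8_t continuation_mask(unsigned info_bits)
{
    assert(info_bits >= 1 && info_bits <= 6);
    return (uint8_t)((1u << info_bits) - 1);
}

/* Number of distinct continuation bytes: one per payload bit pattern. */
static unsigned continuation_count(unsigned info_bits)
{
    assert(info_bits >= 1 && info_bits <= 6);
    return 1u << info_bits;
}

/* Lowest continuation byte, ASSUMING the range ends at 0xBF. */
static uint8_t continuation_first(unsigned info_bits)
{
    return (uint8_t)(0xBF - continuation_mask(info_bits));
}

/* Fold one continuation byte into a partially decoded code point:
 * shift the accumulated value left and append the payload bits. */
static uint32_t accumulate(uint32_t cp, uint8_t byte, unsigned info_bits)
{
    return (cp << info_bits) | (uint32_t)(byte & continuation_mask(info_bits));
}
```

With 6 info bits this gives 64 continuations starting at 0x80, and with 5 it gives 32 starting at 0xA0, matching the figures in the comment. As a usage example, the UTF-8 sequence C3 A9 decodes by taking the start byte's payload (0xC3 & 0x1F = 0x03) and calling `accumulate(0x03, 0xA9, 6)`, which yields 0xE9.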