Commit f89304f

utf8.h: Add comment for the idly curious
1 parent 4a230ea commit f89304f

1 file changed, 11 additions, 1 deletion

utf8.h

Lines changed: 11 additions & 1 deletion
@@ -335,7 +335,17 @@ are in the character. */
 /* A continuation byte in a UTF-8 encoded sequence contributes this number of
  * low-order bits to the specification of the code point. In the bit
  * maps above, you see that the first 2 bits are a constant '10', leaving 6 of
- * real information */
+ * real information. (If you're really curious, the only two numbers that work
+ * out for this on an 8-bit byte are 5 and 6. Since the first two bits are
+ * already taken, a maximum of 6 are available for anything else. If 6 is
+ * used, there are 64 possible continuations 80-BF. With 5, there are 32,
+ * A0-BF. And with 4 there would be 0 continuations possible; an
+ * impossibility. So 5 is the minimum. UTF-EBCDIC I8 (Intermediate 8) is just
+ * setting this to 5. We could have a UTF-8 encoding that is based on ASCII,
+ * but uses just 5 bits of payload per continuation byte. The reason someone
+ * might want to do this is to extend the set of characters that occupy a
+ * single byte when encoded in this hypothetical UTF-8 to additionally include
+ * the C1 controls.) */
 # define UTF_CONTINUATION_BYTE_INFO_BITS 6
 
 /* ^? is defined to be DEL on ASCII systems. See the definition of toCTRL()
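
The arithmetic in the new comment can be checked with a small sketch. The helper names below are hypothetical and are not part of utf8.h. The sketch assumes that continuation bytes always occupy the range ending at 0xBF and that every byte value below the first continuation byte encodes a character by itself. Only the two payload widths the comment discusses, 5 and 6, are accepted.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; none of these exist in utf8.h. */

/* Number of distinct continuation byte values when each continuation byte
 * carries info_bits bits of payload. */
static unsigned continuation_count(unsigned info_bits)
{
    /* The comment in the commit only covers 5 and 6. */
    assert(info_bits == 5 || info_bits == 6);
    return 1u << info_bits;
}

/* Lowest continuation byte, assuming the continuation range ends at 0xBF. */
static unsigned first_continuation(unsigned info_bits)
{
    return 0xC0u - continuation_count(info_bits);
}

/* Highest byte value that stands for a character on its own, assuming all
 * bytes below the first continuation byte are single-byte characters. */
static unsigned last_single_byte(unsigned info_bits)
{
    return first_continuation(info_bits) - 1u;
}
```

Under these assumptions, 6 bits gives 64 continuations (80-BF) with single-byte characters ending at 7F, while 5 bits gives 32 continuations (A0-BF) with single-byte characters ending at 9F, which takes in the C1 controls at 80-9F.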
