Commit 907f1f9

utf8.h: Reword comment
This comment previously stated that it didn't know the derivation of a constant, but the answer recently came to me. A different derivation makes more sense. This commit switches to the more natural derivation and updates the comment to match. There is no difference in the macro's resulting expansion.
1 parent f89304f commit 907f1f9

1 file changed, 11 additions and 6 deletions

utf8.h

Lines changed: 11 additions & 6 deletions
@@ -376,12 +376,17 @@ are in the character. */
 #define UTF_IS_CONTINUATION_MASK \
 ((U8) ((0xFF << UTF_ACCUMULATION_SHIFT) & 0xFF))
 
-/* This defines the bits that are to be in the continuation bytes of a
- * multi-byte UTF-8 encoded character that mark it is a continuation byte.
- * This turns out to be 0x80 in UTF-8, 0xA0 in UTF-EBCDIC. (khw doesn't know
- * the underlying reason that B0 works here, except it just happens to work.
- * One could solve for two linear equations and come up with it.) */
-#define UTF_CONTINUATION_MARK (UTF_IS_CONTINUATION_MASK & 0xB0)
+/* This defines the bits that mark a byte in a multi-byte UTF-8 encoded
+ * character as being a continuation byte. A MASK clears the bits you don't
+ * want, using a binary '&'; and a MARK sets the ones you do want, using a
+ * binary '|'. As stated earlier, the fundamental difference between UTF-8 and
+ * UTF-EBCDIC is that the former has the upper 2 bits of a continuation byte be
+ * '10', and the latter has the upper 3 bits be '101', leaving 6 and 5 bits
+ * respectively in which to store information. This is equivalent to "All bits
+ * are 1 except those that store information (which vary) plus the bit that is
+ * required to be 0". This yields 1000 0000 (0x80) for ASCII, and 1010 0000
+ * (0xA0) for UTF-EBCDIC. */
+#define UTF_CONTINUATION_MARK (~(0x40 | UTF_CONTINUATION_MASK) & 0xff)
 
 /* These values are clearer in some contexts; still apply to UTF, not UTF-8 */
 #define UTF_MIN_CONTINUATION_BYTE UTF_CONTINUATION_MARK
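
The commit message says the old and new definitions expand to the same value. The following is a minimal sketch for checking that claim outside of Perl, not code from utf8.h. It rests on two assumptions that do not appear in the diff: that UTF_ACCUMULATION_SHIFT is 6 for UTF-8 and 5 for UTF-EBCDIC (matching the 6 and 5 information bits the comment describes), and that UTF_CONTINUATION_MASK is the low UTF_ACCUMULATION_SHIFT bits set. The helper names are hypothetical, and the shift is passed as a parameter so both encodings can be checked in one build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of utf8.h.  Assumption: the accumulation
 * shift is 6 for UTF-8 and 5 for UTF-EBCDIC. */

/* Same form as UTF_IS_CONTINUATION_MASK in the diff context. */
static uint8_t is_continuation_mask(unsigned shift)
{
    return (uint8_t) ((0xFFu << shift) & 0xFFu);
}

/* Assumed definition of UTF_CONTINUATION_MASK: the low 'shift' bits set
 * (the information bits of a continuation byte). */
static uint8_t continuation_mask(unsigned shift)
{
    return (uint8_t) ((1u << shift) - 1u);
}

/* Old derivation: UTF_IS_CONTINUATION_MASK & 0xB0 */
static uint8_t old_continuation_mark(unsigned shift)
{
    return (uint8_t) (is_continuation_mask(shift) & 0xB0u);
}

/* New derivation: ~(0x40 | UTF_CONTINUATION_MASK) & 0xff, i.e. every bit
 * set except the information bits and the bit required to be 0. */
static uint8_t new_continuation_mark(unsigned shift)
{
    return (uint8_t) (~(0x40u | continuation_mask(shift)) & 0xFFu);
}

/* Asserts that both derivations agree for the two assumed shifts. */
static void check_marks_agree(void)
{
    assert(old_continuation_mark(6) == new_continuation_mark(6));
    assert(old_continuation_mark(5) == new_continuation_mark(5));
}
```

Under these assumptions, calling check_marks_agree() from a test harness should pass, with both derivations giving 0x80 for a shift of 6 and 0xA0 for a shift of 5. The two forms agree only for these two shifts, which is consistent with the old comment's remark that 0xB0 "just happens to work".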
