@@ -1790,14 +1790,14 @@ def convert_i32_to_char(cx, i):
```

Strings are loaded from two `i32` values: a pointer (offset in linear memory)
- and a number of bytes. There are three supported string encodings in [`canonopt`]:
- [UTF-8], [UTF-16] and `latin1+utf16`. This last option allows a *dynamic*
- choice between [Latin-1] and UTF-16, indicated by the high bit of the second
- `i32`. String values include their original encoding and byte length as a
- "hint" that enables `store_string` (defined below) to make better up-front
- allocation size choices in many cases. Thus, the value produced by
- `load_string` isn't simply a Python `str`, but a *tuple* containing a `str`,
- the original encoding and the original byte length.
+ and a number of [code units]. There are three supported string encodings in
+ [`canonopt`]: [UTF-8], [UTF-16] and `latin1+utf16`. This last option allows a
+ *dynamic* choice between [Latin-1] and UTF-16, indicated by the high bit of the
+ second `i32`. String values include their original encoding and length in
+ tagged code units as a "hint" that enables `store_string` (defined below) to
+ make better up-front allocation size choices in many cases. Thus, the value
+ produced by `load_string` isn't simply a Python `str`, but a *tuple* containing
+ a `str`, the original encoding and the number of source code units.
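As a rough illustration of the paragraph above, the second `i32` can be split into an effective encoding, a code unit count and a byte length. This is only a sketch: the helper name `split_tagged_code_units`, the encoding strings `'utf8'`/`'utf16'`/`'latin1'`, and the value `1 << 31` for `UTF16_TAG` are assumptions made for the example, not definitions taken from this change.

```python
UTF16_TAG = 1 << 31  # assumption: the tag is the high bit of the second i32

def split_tagged_code_units(encoding: str, tagged_code_units: int) -> tuple[str, int, int]:
  """Hypothetical helper: return (effective encoding, code units, byte length)."""
  if encoding == 'utf8':
    # UTF-8 code units are single bytes.
    return ('utf8', tagged_code_units, tagged_code_units)
  if encoding == 'utf16':
    # UTF-16 code units are two bytes each.
    return ('utf16', tagged_code_units, 2 * tagged_code_units)
  if encoding == 'latin1+utf16':
    if tagged_code_units & UTF16_TAG:
      # High bit set: the string is UTF-16; clear the tag to get the count.
      code_units = tagged_code_units ^ UTF16_TAG
      return ('utf16', code_units, 2 * code_units)
    # High bit clear: the string is Latin-1, one byte per code unit.
    return ('latin1', tagged_code_units, tagged_code_units)
  raise ValueError(f'unknown encoding: {encoding}')
```

Counting in code units rather than bytes means the same number describes a three-unit Latin-1 string (3 bytes) and a three-unit UTF-16 string (6 bytes); the byte length is derived from the encoding.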
```python
String = tuple[str, str, int]
@@ -2091,12 +2091,12 @@ approach to update the allocation size during the single copy. A blind
`realloc` approach would normally suffer from multiple reallocations per string
(e.g., using the standard doubling-growth strategy). However, as already shown
in `load_string` above, string values come with two useful hints: their
- original encoding and byte length. From this hint data, `store_string` can do a
- much better job minimizing the number of reallocations.
+ original encoding and number of source [code units]. From this hint data,
+ `store_string` can do a much better job minimizing the number of reallocations.
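To make the cost of a blind doubling strategy concrete, the sketch below counts how many times a buffer would have to grow, and contrasts that with an upper bound on the UTF-8 output size that follows directly from the two hints. Both function names are hypothetical, and this is not the allocation strategy `store_string` itself uses; it only illustrates why knowing the source encoding and code unit count is valuable.

```python
def doubling_reallocs(initial_capacity: int, final_size: int) -> int:
  """Count the doublings needed before a buffer can hold final_size bytes."""
  assert initial_capacity > 0
  capacity, reallocs = initial_capacity, 0
  while capacity < final_size:
    capacity *= 2
    reallocs += 1
  return reallocs

def worst_case_utf8_bytes(src_encoding: str, src_code_units: int) -> int:
  """Upper bound on UTF-8 bytes needed, derived only from the two hints."""
  # Bytes of UTF-8 per source code unit in the worst case:
  # a Latin-1 byte >= 0x80 becomes 2 bytes; a UTF-16 unit becomes at most 3
  # (a surrogate pair is 2 units and encodes to 4 bytes).
  per_unit = {'utf8': 1, 'latin1': 2, 'utf16': 3}[src_encoding]
  return per_unit * src_code_units
```

For example, growing from 16 bytes to 1000 bytes by doubling takes 6 reallocations, whereas a bound computed from the hints can be requested in a single step.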

We start with a case analysis to enumerate all the meaningful encoding
combinations, subdividing the `latin1+utf16` encoding into either `latin1` or
- `utf16` based on the `UTF16_BIT` flag set by `load_string`:
+ `utf16` based on the `UTF16_TAG` flag set by `load_string`:
```python
def store_string(cx, v: String, ptr):
  begin, tagged_code_units = store_string_into_range(cx, v)
@@ -2156,7 +2156,7 @@ def store_string_copy(cx, src, src_code_units, dst_code_unit_size, dst_alignment
  return (ptr, src_code_units)
```
The choice of the `MAX_STRING_BYTE_LENGTH` constant ensures that the high bit of a
- string's byte length is never set, keeping it clear for `UTF16_BIT`.
+ string's number of code units is never set, keeping it clear for `UTF16_TAG`.
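The invariant can be checked with a small sketch. The concrete values below (`UTF16_TAG = 1 << 31` and `MAX_STRING_BYTE_LENGTH = (1 << 31) - 1`) are assumptions for the example, since neither definition appears in this hunk, and `tag_utf16`/`untag` are hypothetical helpers.

```python
UTF16_TAG = 1 << 31                      # assumed value: high bit of an i32
MAX_STRING_BYTE_LENGTH = (1 << 31) - 1   # assumed value: largest untagged count

def tag_utf16(code_units: int) -> int:
  """Mark a code unit count as describing UTF-16 data."""
  assert 0 <= code_units <= MAX_STRING_BYTE_LENGTH
  # Within the bound, the high bit is guaranteed to be free for the tag.
  assert (code_units & UTF16_TAG) == 0
  return code_units | UTF16_TAG

def untag(tagged_code_units: int) -> tuple[bool, int]:
  """Return (is_utf16, code_units) for a possibly tagged count."""
  return (bool(tagged_code_units & UTF16_TAG), tagged_code_units & ~UTF16_TAG)
```

Because every valid count fits in the low 31 bits, tagging and untagging round-trip without losing information.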

The 2 cases of transcoding into UTF-8 share an algorithm that starts by
optimistically assuming that each code unit of the source string fits in a
@@ -4183,6 +4183,7 @@ def canon_thread_available_parallelism():
[Latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
[Unicode Scalar Value]: https://unicode.org/glossary/#unicode_scalar_value
[Unicode Code Point]: https://unicode.org/glossary/#code_point
+ [Code Units]: https://www.unicode.org/glossary/#code_unit
[Surrogate]: https://unicode.org/faq/utf_bom.html#utf16-2
[Name Mangling]: https://en.wikipedia.org/wiki/Name_mangling
[Fibers]: https://en.wikipedia.org/wiki/Fiber_(computer_science)