Commit 57510bc

Update CanonicalABI.md description to match Python defs regarding bytes vs. code units (#510)
* Update CanonicalABI.md prose to match Python regarding code units

  Closes #509

* Also fix UTF16_BIT/TAG in the text
1 parent 7fd71ac commit 57510bc

1 file changed: +13 -12 lines changed

design/mvp/CanonicalABI.md

Lines changed: 13 additions & 12 deletions
@@ -1790,14 +1790,14 @@ def convert_i32_to_char(cx, i):
 ```

 Strings are loaded from two `i32` values: a pointer (offset in linear memory)
-and a number of bytes. There are three supported string encodings in [`canonopt`]:
-[UTF-8], [UTF-16] and `latin1+utf16`. This last options allows a *dynamic*
-choice between [Latin-1] and UTF-16, indicated by the high bit of the second
-`i32`. String values include their original encoding and byte length as a
-"hint" that enables `store_string` (defined below) to make better up-front
-allocation size choices in many cases. Thus, the value produced by
-`load_string` isn't simply a Python `str`, but a *tuple* containing a `str`,
-the original encoding and the original byte length.
+and a number of [code units]. There are three supported string encodings in
+[`canonopt`]: [UTF-8], [UTF-16] and `latin1+utf16`. This last options allows a
+*dynamic* choice between [Latin-1] and UTF-16, indicated by the high bit of the
+second `i32`. String values include their original encoding and length in
+tagged code units as a "hint" that enables `store_string` (defined below) to
+make better up-front allocation size choices in many cases. Thus, the value
+produced by `load_string` isn't simply a Python `str`, but a *tuple* containing
+a `str`, the original encoding and the number of source code units.
 ```python
 String = tuple[str, str, int]

@@ -2091,12 +2091,12 @@ approach to update the allocation size during the single copy. A blind
 `realloc` approach would normally suffer from multiple reallocations per string
 (e.g., using the standard doubling-growth strategy). However, as already shown
 in `load_string` above, string values come with two useful hints: their
-original encoding and byte length. From this hint data, `store_string` can do a
-much better job minimizing the number of reallocations.
+original encoding and number of source [code units]. From this hint data,
+`store_string` can do a much better job minimizing the number of reallocations.

 We start with a case analysis to enumerate all the meaningful encoding
 combinations, subdividing the `latin1+utf16` encoding into either `latin1` or
-`utf16` based on the `UTF16_BIT` flag set by `load_string`:
+`utf16` based on the `UTF16_TAG` flag set by `load_string`:
 ```python
 def store_string(cx, v: String, ptr):
   begin, tagged_code_units = store_string_into_range(cx, v)
@@ -2156,7 +2156,7 @@ def store_string_copy(cx, src, src_code_units, dst_code_unit_size, dst_alignment
   return (ptr, src_code_units)
 ```
 The choice of `MAX_STRING_BYTE_LENGTH` constant ensures that the high bit of a
-string's byte length is never set, keeping it clear for `UTF16_BIT`.
+string's number of code units is never set, keeping it clear for `UTF16_TAG`.

 The 2 cases of transcoding into UTF-8 share an algorithm that starts by
 optimistically assuming that each code unit of the source string fits in a
@@ -4183,6 +4183,7 @@ def canon_thread_available_parallelism():
 [Latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
 [Unicode Scalar Value]: https://unicode.org/glossary/#unicode_scalar_value
 [Unicode Code Point]: https://unicode.org/glossary/#code_point
+[Code Units]: https://www.unicode.org/glossary/#code_unit
 [Surrogate]: https://unicode.org/faq/utf_bom.html#utf16-2
 [Name Mangling]: https://en.wikipedia.org/wiki/Name_mangling
 [Fibers]: https://en.wikipedia.org/wiki/Fiber_(computer_science)
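
To make the terminology change concrete, here is a minimal sketch of how the second `i32` could be read as a count of tagged code units rather than bytes. The helper name `decode_tagged_code_units` is hypothetical, and the value `UTF16_TAG = 1 << 31` is an assumption based on the prose describing the tag as the high bit of the second `i32`. Neither is taken from this diff.

```python
# Assumption: the tag is the high bit of a 32-bit length word.
UTF16_TAG = 1 << 31

def decode_tagged_code_units(string_encoding: str, tagged_code_units: int) -> tuple[str, int, int]:
  """Hypothetical helper: return (python_codec, code_units, byte_length)."""
  if string_encoding == 'utf8':
    # One UTF-8 code unit is one byte.
    return ('utf-8', tagged_code_units, tagged_code_units)
  if string_encoding == 'utf16':
    # One UTF-16 code unit is two bytes.
    return ('utf-16-le', tagged_code_units, 2 * tagged_code_units)
  if string_encoding == 'latin1+utf16':
    if tagged_code_units & UTF16_TAG:
      # Tag set: the remaining bits count UTF-16 code units.
      code_units = tagged_code_units ^ UTF16_TAG
      return ('utf-16-le', code_units, 2 * code_units)
    # Tag clear: the value counts Latin-1 code units (one byte each).
    return ('latin-1', tagged_code_units, tagged_code_units)
  raise ValueError(f'unknown string encoding: {string_encoding}')
```

Under these assumptions, a count of 5 means 5 bytes for `utf8` but 10 bytes for `utf16`, which is why describing the second `i32` as "a number of bytes" was inaccurate for the two-byte encodings.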
