UTF-8 Offsets into Encoding as VAL_INDEX() Invalidated Via Other References #2656

Good to see you're moving to "UTF-8 Everywhere"...I've found it to be undeniably the right answer.  No regrets!

However: you are storing byte offsets inside Cells as the VAL_INDEX().  If the data is modified through another reference, a stored offset can end up pointing into the middle of a multi-byte codepoint, where it is no longer a valid codepoint boundary.

In your [utf8-squash build](https://github.com/Oldes/Rebol3/commit/daa5e143c36551a18ce73d12c1830de7f216c370):

    >> str: "abc"
    == "abc"

    >> str2: next str
    == "bc"

    >> change str "😺"
    == "bc"

    >> str
    == "😺bc"

    >> str2
    ==
    [process exited with code 3221225477 (0xc0000005)]
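
To make the failure concrete, here is a minimal C sketch of the boundary check that the `str2` offset fails after the change. The helper names are hypothetical and not taken from either codebase. 😺 (U+1F63A) encodes as the four bytes `F0 9F 98 BA`, so the byte offset 1 that used to point at `b` now points at the continuation byte `9F`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte is a UTF-8 continuation byte,
 * i.e. it has the bit pattern 10xxxxxx. */
static bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: true if `offset` is the start of a codepoint in
 * `utf8`, or is exactly the end of the data.  Assumes valid UTF-8. */
static bool is_codepoint_boundary(const char *utf8, size_t len, size_t offset) {
    if (offset > len)
        return false;  /* past the end of the data */
    if (offset == len)
        return true;   /* the tail position is always a boundary */
    return !is_continuation_byte((unsigned char)utf8[offset]);
}
```

With `"abc"` the offset 1 passes this check; with the bytes of `"😺bc"` it does not, while offset 4 (where `"bc"` now starts) does.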

This means that there'd have to be some kind of reconciliation of VAL_INDEX() when you notice you're on a continuation byte.  I guess the options are to just sync up to some nearby codepoint boundary (an arbitrary new position, as far as the user is concerned), or perhaps raise an error...
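
A rough sketch of what those two options could look like in C follows. Everything here is an assumption for illustration: the `ReconcilePolicy` enum and `reconcile_offset()` are hypothetical names, "sync up" is interpreted as backing up to the start of the codepoint the offset landed in, and clamping a past-the-end offset to the tail is an extra choice the text above doesn't cover.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy for an offset that is no longer on a boundary. */
typedef enum {
    RECONCILE_SNAP_BACK,  /* move back to the start of the codepoint */
    RECONCILE_ERROR       /* refuse, so the caller can raise an error */
} ReconcilePolicy;

/* Returns true and writes a usable offset to *out, or returns false if
 * the offset is invalid and the policy is RECONCILE_ERROR.
 * Assumes `utf8` holds valid UTF-8 of `len` bytes. */
static bool reconcile_offset(
    const unsigned char *utf8, size_t len, size_t offset,
    ReconcilePolicy policy, size_t *out
) {
    if (offset > len) {  /* series shrank underneath this reference */
        if (policy == RECONCILE_ERROR)
            return false;
        offset = len;    /* assumption: clamp to the tail */
    }
    if (offset < len && (utf8[offset] & 0xC0) == 0x80) {
        if (policy == RECONCILE_ERROR)
            return false;
        /* Walk back over continuation bytes to the lead byte. */
        while (offset > 0 && (utf8[offset] & 0xC0) == 0x80)
            offset--;
    }
    *out = offset;
    return true;
}
```

For the `str2` case above, snapping back would yield offset 0, so `str2` would silently become `"😺bc"`. That is memory-safe, but it shows why the new position is arbitrary from the user's point of view.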

Admittedly, [Rebol's invariants have never been great when you modify a series reference](https://rebol.metaeducation.com/t/where-the-series-ends-simplifying-out-of-bounds-rules/1141)... in terms of what happens to the other references.  But I think the small anchor of being able to conceptually model a series index living in a Cell as at least "no worse than an integer" offers some benefit.

So if you are curious: Ren-C uses the character (codepoint) index as the VAL_INDEX() for strings, and bears the cost of calculating the byte offset from that.  In order to make this tolerable, there are "Bookmarks" on the string (currently at most one at a time) which map indices to offsets.  There are also optimizations (all-ASCII strings can do O(1) jumps, short strings don't pay for the bookmarks, etc.)

  https://github.com/metaeducation/ren-c/blob/45806b2fb6f56e8aaefcf718d77cae2c6a1241fb/src/core/types/t-string.c#L60
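
For a feel of the idea, here is a heavily simplified C sketch of a single bookmark. This is not Ren-C's actual code (the link above is the real thing): the struct, field, and function names are hypothetical, the walk only goes forward, and it assumes valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical string with one cached (codepoint index -> byte offset)
 * pair.  A fresh string starts with the bookmark at (0, 0). */
typedef struct {
    const unsigned char *bytes;  /* UTF-8 data */
    size_t size;                 /* length in bytes */
    size_t bm_index;             /* bookmark: codepoint index */
    size_t bm_offset;            /* bookmark: matching byte offset */
} BookmarkedString;

/* Translate a codepoint index into a byte offset, starting the walk
 * from the bookmark when it is not past the target, then moving the
 * bookmark to where the walk stopped.  An index past the end of the
 * string yields the tail offset. */
static size_t offset_for_index(BookmarkedString *s, size_t index) {
    size_t i = 0;
    size_t off = 0;
    if (s->bm_index <= index) {  /* bookmark is usable as a head start */
        i = s->bm_index;
        off = s->bm_offset;
    }
    while (i < index && off < s->size) {
        off++;  /* step over the lead byte... */
        while (off < s->size && (s->bytes[off] & 0xC0) == 0x80)
            off++;  /* ...and any continuation bytes after it */
        i++;
    }
    s->bm_index = i;
    s->bm_offset = off;
    return off;
}
```

With this scheme the `str2` reference from the example holds index 1 rather than byte offset 1, so after the change it resolves to byte offset 4 and still reads `"bc"`. The price is that any mutation has to update or discard the bookmark, since the cached offset would otherwise go stale in the same way the raw byte offset did.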
