
UTF-8 Offsets into Encoding as VAL_INDEX() Invalidated Via Other References #2656

@hostilefork


Good to see you're moving to "UTF-8 Everywhere"...I've found it to be undeniably the right answer. No regrets!

However: you are using byte offsets inside Cells as the VAL_INDEX(). This means an offset can end up at a place that is no longer a valid codepoint boundary if the data is modified through another reference.

In your utf8-squash build:

>> str: "abc"
== "abc"

>> str2: next str
== "bc"

>> change str "😺"
== "bc"

>> str
== "😺bc"

>> str2
==
[process exited with code 3221225477 (0xc0000005)]

Here str2's byte offset of 1 now lands in the middle of the four-byte encoding of 😺. This means that there'd have to be some kind of reconciliation of VAL_INDEX() when you notice you're on a continuation byte. I guess the options are to just sync up to an arbitrary new position, or perhaps raise an error...
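To make the "sync up" option concrete, here is a minimal sketch of how a stale byte offset could be detected and pulled back to the nearest codepoint boundary. The helper names `is_continuation_byte` and `sync_to_boundary` are hypothetical and are not taken from either codebase; the sketch assumes the buffer itself holds valid UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: UTF-8 continuation bytes have the form 10xxxxxx. */
static bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a possibly stale byte offset, return the
 * nearest codepoint boundary at or before it. Offsets past the end of
 * the data are clamped to the end.
 */
static size_t sync_to_boundary(const unsigned char *data, size_t len,
                               size_t offset) {
    if (offset >= len)
        return len;
    while (offset > 0 && is_continuation_byte(data[offset]))
        --offset;  /* step back until we reach a lead byte or ASCII byte */
    return offset;
}
```

With the bytes of "😺bc" (F0 9F 98 BA 62 63), a stale offset of 1 would sync back to 0. Raising an error instead would just mean testing `is_continuation_byte(data[offset])` and failing rather than looping.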

Admittedly, Rebol's invariants have never been great when you modify a series reference... in terms of what happens to the other references. But I think the small anchor of being able to conceptually model a series index living in a Cell as at least "no worse than an integer" offers some benefit.

So if you are curious: for strings, Ren-C stores a character (codepoint) index rather than a byte offset, and bears the cost of calculating the byte offset from that. In order to make this tolerable, there are "Bookmarks" on the string (currently at most one at a time) which map indices to offsets. There are optimizations (all-ASCII strings can do O(1) jumps, short strings don't pay for the bookmarks, etc.)

https://github.com/metaeducation/ren-c/blob/45806b2fb6f56e8aaefcf718d77cae2c6a1241fb/src/core/types/t-string.c#L60
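As a rough illustration of the bookmark idea (not Ren-C's actual code; see the link above for that), here is a sketch of a string that caches a single index-to-offset mapping. The names `Bookmark`, `Utf8String`, and `offset_for_index` are made up for this example, and it assumes valid UTF-8 and a bookmark that is reset whenever the string is modified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct {
    size_t index;   /* codepoint index */
    size_t offset;  /* byte offset where that codepoint starts */
} Bookmark;

typedef struct {
    const unsigned char *data;  /* assumed to be valid UTF-8 */
    size_t size;                /* length in bytes */
    size_t length;              /* length in codepoints */
    bool all_ascii;             /* if true, index == offset */
    Bookmark mark;              /* single cached mapping, starts at {0, 0} */
} Utf8String;

static bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Map a codepoint index to a byte offset, walking from the closer of
 * the string head or the bookmark, then move the bookmark there. */
static size_t offset_for_index(Utf8String *s, size_t index) {
    if (s->all_ascii)
        return index;  /* O(1) jump: one byte per codepoint */
    if (index > s->length)
        index = s->length;

    size_t i = s->mark.index;
    size_t off = s->mark.offset;
    size_t dist = index > i ? index - i : i - index;
    if (index < dist) {  /* head is closer than the bookmark */
        i = 0;
        off = 0;
    }

    while (i < index) {  /* walk forward one codepoint at a time */
        ++off;
        while (off < s->size && is_continuation(s->data[off]))
            ++off;
        ++i;
    }
    while (i > index) {  /* walk backward one codepoint at a time */
        --off;
        while (off > 0 && is_continuation(s->data[off]))
            --off;
        --i;
    }

    s->mark.index = i;
    s->mark.offset = off;
    return off;
}
```

The point of the design is that a Cell's index stays a plain integer count of codepoints, so it can never land mid-codepoint; the cost is the walk, which the bookmark keeps short when accesses are near each other.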
