Skip to content

Incorporate UTF-32 changes #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 51 commits into from
Sep 23, 2024

Conversation

ridiculousfish
Copy link
Member

This merges in the proposed UTF32 changes from PR 38, with some cleanup to work on the latest release of rust-pcre2.

BurntSushi and others added 30 commits June 18, 2023 10:06
I fucking hate submodules. So I'm going to switch to a lower-tech
solution that will just copy the files into the repository.

Every time I come into this repo to update PCRE2, I have to do a deep
dive on submodules to figure out how to do a simple update. No more. I'm
done.

This commit breaks the PCRE2_SYS_STATIC=1 build option, but we'll fix it
up in a subsequent commit.
We still retain the common/default case of "dynamically link to PCRE2
system library" if possible, but otherwise update the build to be a bit
simpler for the static build. We also add a shell script for pulling in
a PCRE2 release into this repository (which we will add in a subsequent
commit).

This infrastructure was copied from the (yet to be released) rebar
project.

Fixes BurntSushi#24 (hopefully)
This also updates the binding script to fix churn produced by renaming
whitelist/blacklist to allowlist/blocklist.
Apparently it isn't building on these targets?

My best guess is that there is something wrong I'm doing in the build
script when compiling for PCRE2. And probably that should be fixed. I'm
pretty sure, for example, PCRE2's JIT supports arm. But this seems like
a better choice to make the build succeed.

Closes BurntSushi#25, Closes BurntSushi#29, Closes BurntSushi#30
... and remove uses of PCRE2_NO_UTF_CHECK.

We now mark the "disable UTF check" method as deprecated. And also
remove the `unsafe` annotation since it's now a no-op.

Fixes BurntSushi#5
It's kind of amazing this bug has persisted without anyone noticing.
Suggests perhaps that not many are using this crate, and ripgrep
probably just uses the match offsets which were unaffected by this bug.

Closes BurntSushi#22
This adds a routine for escaping a pattern such that it is guaranteed to
match literally. We basically copy the routine from regex-syntax while
ensuring the meta characters match what is documented in the
`pcre2pattern` man page.
Dependencies are using features that require a newer Rust. Since this
crate doesn't have an MSRV policy, I'm just going to bump the pinned
version in CI to `{latest} - 2` and move on with my life.
Instead, we copy the 'Pool' implementation from the regex crate.

The impetus for this is that this was the last crate using
'thread_local' in ripgrep.
Instead of duplicating each manually, use the config.h from the
sources to simplify future upgrades.

PR BurntSushi#35
Note:

    $ rustup target list | grep musleabi
    arm-unknown-linux-musleabi
    arm-unknown-linux-musleabihf
    armv5te-unknown-linux-musleabi
    armv7-unknown-linux-musleabi
    armv7-unknown-linux-musleabihf

Ref: BurntSushi/ripgrep#2698
`cargo install bindgen-cli` is how it's done these days.
The PCRE2 jit is disabled dependent on the platform in build.rs. If it
is disabled, then tests which assume the jit is available will fail.

Fix these tests by switching to jit_if_available. This fixes the static
build on macOS.
This moves the "bytes" module into regex_impl, and equips it with a
trait, in preparation for UTF-32 matching.
This adds a module `utf32` which mirrors the module in `bytes`.
It uses `CodeUnitWidth32` to provide the implementation.
ridiculousfish and others added 21 commits February 3, 2024 19:30
This adds a new crate feature `jit` to enable the JIT. It is on by
default.
By default, PCRE2 enables the strange sequence "(*UTF)" which turns on UTF
validity checking for both patterns and subjects. This is hinted at as a
potential security concern in the man page:

"If the data string is very long, such a check might use sufficiently many
resources as to cause your application to lose performance"

For this reason, pcre2 provides a flag to avoid interpreting this
sequence. Re-expose that in rust-pcre2, under the clearer name
`block_utf_pattern_directive`.
Prior to this commit, rust-pcre2 would wrap pcre2's error messages with
a prefix like "PCRE2: error compiling pattern:". However some clients
want the raw error message as returned by pcre2. Allow access to this.
This both respects the PCRE2 API better and allows us to compile without
UTF8 support.
This adds a new regex function `capture`, which captures matching
substrings, using a new type `Captures`.

It also adds new functions `replace` and `replace_all`, allowing
substring replacement.
This includes the bump to PCRE2 10.43.
Since [bindgen 0.61.0][changelog], the `bindgen`
CLI binary must be installed via the `bindgen-cli`
crate.

[changelog]: https://github.com/rust-lang/rust-bindgen/blob/main/CHANGELOG.md#0610

PR BurntSushi#39
This commit uses `jit_if_available` instead of `jit` in the tests, allowing them
to fall back to non-JIT compilation on targets that don't support JIT.

PR BurntSushi#41
To work around likely bugs in (older versions of) PCRE2. Namely, at one
point, PCRE2 would dereference the haystack pointer even when the length
was zero.

This was reported in BurntSushi#10 and we worked around this in BurntSushi#11 by passing a
pointer to a const `&[]`, with the (erroneous) presumption that this
would be a valid pointer to dereference. In retrospect though, this was
a little silly, because you should never be dereferencing a pointer to
an empty slice. It's not valid. Alas, at that time, Rust did actually
hand you a valid pointer that could be dereferenced. But [this
PR][rust-pull] changed that. And thus, we're back to where we started:
handing buggy versions of PCRE2 a zero length haystack with a dangling
pointer.

So we fix this once and for all by passing a slice of length 1, but with
a haystack length of 0, to the PCRE2 search routine when searching an
empty haystack. This will guarantee the provision of a dereferencable
pointer should PCRE2 decide to dereference it.

Fixes BurntSushi#42

[rust-pull]: rust-lang/rust#123936
This feature does not exist in rust-pcre2.
This is an "ours" merge which ignores the changes from master. This is
because this utf32-update branch should supersede the old development
from master, and this will allow merging our branch into master.
@ridiculousfish ridiculousfish merged commit d7d8e75 into fish-shell:master Sep 23, 2024
1 of 8 checks passed
@ridiculousfish ridiculousfish deleted the utf32-update branch September 23, 2024 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants