forked from ridiculousfish/rust-pcre2
-
Notifications
You must be signed in to change notification settings - Fork 3
Incorporate UTF-32 changes #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
ridiculousfish
merged 51 commits into
fish-shell:master
from
ridiculousfish:utf32-update
Sep 23, 2024
Merged
Incorporate UTF-32 changes #3
ridiculousfish
merged 51 commits into
fish-shell:master
from
ridiculousfish:utf32-update
Sep 23, 2024
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I fucking hate submodules. So I'm going to switch to a lower-tech solution that will just copy the files into the repository. Every time I come into this repo to update PCRE2, I have to do a deep dive on submodules to figure out how to do a simple update. No more. I'm done. This commit breaks the PCRE2_SYS_STATIC=1 build option, but we'll fix it up in a subsequent commit.
We still retain the common/default case of "dynamically link to PCRE2 system library" if possible, but otherwise update the build to be a bit simpler for the static build. We also add a shell script for pulling in a PCRE2 release into this repository (which we will add in a subsequent commit). This infrastructure was copied from the (yet to be released) rebar project. Fixes BurntSushi#24 (hopefully)
This also updates the binding script to fix churn produced by renaming whitelist/blacklist to allowlist/blocklist.
Apparently it isn't building on these targets? My best guess is that there is something wrong I'm doing in the build script when compiling for PCRE2. And probably that should be fixed. I'm pretty sure, for example, PCRE2's JIT supports arm. But this seems like a better choice to make the build succeed. Closes BurntSushi#25, Closes BurntSushi#29, Closes BurntSushi#30
... and remove uses of PCRE2_NO_UTF_CHECK. We now mark the "disable UTF check" method as deprecated. And also remove the `unsafe` annotation since it's now a no-op. Fixes BurntSushi#5
It's kind of amazing this bug has persisted without anyone noticing. Suggests perhaps that not many are using this crate, and ripgrep probably just uses the match offsets which were unaffected by this bug. Closes BurntSushi#22
This adds a routine for escaping a pattern such that it is guaranteed to match literally. We basically copy the routine from regex-syntax while ensuring the meta characters match what is documented in the `pcre2pattern` man page.
Dependencies are using features that require a newer Rust. Since this crate doesn't have an MSRV policy, I'm just going to bump the pinned version in CI to `{latest} - 2` and move on with my life.
Instead, we copy the 'Pool' implementation from the regex crate. The impetus for this is that this was the last crate using 'thread_local' in ripgrep.
Instead of duplicating each manually, use the config.h from the sources to simplify future upgrades. PR BurntSushi#35
Note: $ rustup target list | grep musleabi arm-unknown-linux-musleabi arm-unknown-linux-musleabihf armv5te-unknown-linux-musleabi armv7-unknown-linux-musleabi armv7-unknown-linux-musleabihf Ref: BurntSushi/ripgrep#2698
`cargo install bindgen-cli` is how it's done these days.
The PCRE2 jit is disabled dependent on the platform in build.rs. If it is disabled, then tests which assume the jit is available will fail. Fix these tests by switching to jit_if_available. This fixes the static build on macOS.
This moves the "bytes" module into regex_impl, and equips it with a trait, in preparation for UTF-32 matching.
This adds a module `utf32` which mirrors the module in `bytes`. It uses `CodeUnitWidth32` to provide the implementation.
This adds a new crate feature `jit` to enable the JIT. It is on by default.
By default, PCRE2 enables the strange sequence "(*UTF)" which turns on UTF validity checking for both patterns and subjects. This is hinted at as a potential security concern in the man page: "If the data string is very long, such a check might use sufficiently many resources as to cause your application to lose performance" For this reason, pcre2 provides a flag to avoid interpreting this sequence. Re-expose that in rust-pcre2, under the clearer name `block_utf_pattern_directive`.
Prior to this commit, rust-pcre2 would wrap pcre2's error messages with a prefix like "PCRE2: error compiling pattern:". However some clients want the raw error message as returned by pcre2. Allow access to this.
This both respects the PCRE2 API better and allows us to compile without UTF8 support.
This adds a new regex function `capture`, which captures matching substrings, using a new type `Captures`. It also adds new functions `replace` and `replace_all`, allowing substring replacement.
This includes the bump to PCRE2 10.43.
Since [bindgen 0.61.0][changelog], the `bindgen` CLI binary must be installed via the `bindgen-cli` crate. [changelog]: https://github.com/rust-lang/rust-bindgen/blob/main/CHANGELOG.md#0610 PR BurntSushi#39
This commit uses `jit_if_available` instead of `jit` in the tests, allowing them to fall back to non-JIT compilation on targets that don't support JIT. PR BurntSushi#41
To work around likely bugs in (older versions of) PCRE2. Namely, at one point, PCRE2 would dereference the haystack pointer even when the length was zero. This was reported in BurntSushi#10 and we worked around this in BurntSushi#11 by passing a pointer to a const `&[]`, with the (erroneous) presumption that this would be a valid pointer to dereference. In retrospect though, this was a little silly, because you should never be dereferencing a pointer to an empty slice. It's not valid. Alas, at that time, Rust did actually hand you a valid pointer that could be dereferenced. But [this PR][rust-pull] changed that. And thus, we're back to where we started: handing buggy versions of PCRE2 a zero length haystack with a dangling pointer. So we fix this once and for all by passing a slice of length 1, but with a haystack length of 0, to the PCRE2 search routine when searching an empty haystack. This will guarantee the provision of a dereferencable pointer should PCRE2 decide to dereference it. Fixes BurntSushi#42 [rust-pull]: rust-lang/rust#123936
This feature does not exist in rust-pcre2.
This is an "ours" merge which ignores the changes from master. This is because this utf32-update branch should supersede the old development from master, and this will allow merging our branch into master.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This merges in the proposed UTF32 changes from PR 38, with some cleanup to work on the latest release of rust-pcre2.