Skip to content

Commit e10b004

Browse files
committed
Merge Searcher and Consumer.
Explained why Pattern cannot be merged into Searcher. Block on RFC 1672.
1 parent d0782c8 commit e10b004

File tree

1 file changed

+195
-37
lines changed

1 file changed

+195
-37
lines changed

text/0000-pattern-3.md

Lines changed: 195 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ Generalize the pattern API to support `&str`, `&mut str`, `&[T]`, `&mut [T]`, `V
2424
- [Principles](#principles)
2525
- [Design rationales](#design-rationales)
2626
- [Miscellaneous decisions](#miscellaneous-decisions)
27+
- [Alternatives](#alternatives)
2728
- [Prior art](#prior-art)
2829
- [Previous attempts](#previous-attempts)
2930
- [Haskell](#haskell)
@@ -245,21 +246,27 @@ to recover the indices in the middle (`5 == 3 + 2` and `7 == 3 + 4`).
245246

246247
### Searcher
247248

248-
A searcher only provides a single method: `.search()`. It takes a span as input,
249-
and returns the first sub-range where the given pattern is found.
249+
A searcher has two required methods `.search()` and `.consume()`,
250+
and an optional method `.trim_start()`.
250251

251252
```rust
252253
pub unsafe trait Searcher<A: Hay + ?Sized> {
253254
fn search(&mut self, span: Span<&A>) -> Option<Range<A::Index>>;
255+
fn consume(&mut self, span: Span<&A>) -> Option<A::Index>;
256+
fn trim_start(&mut self, hay: &A) -> A::Index { ... }
254257
}
258+
255259
pub unsafe trait ReverseSearcher<A: Hay + ?Sized>: Searcher<A> {
256260
fn rsearch(&mut self, span: Span<&A>) -> Option<Range<A::Index>>;
261+
fn rconsume(&mut self, span: Span<&A>) -> Option<A::Index>;
262+
fn trim_end(&mut self, hay: &A) -> A::Index { ... }
257263
}
264+
258265
pub unsafe trait DoubleEndedSearcher<A: Hay + ?Sized>: ReverseSearcher<A> {}
259266
```
260267

261-
The `.search()` function is safe because there is no safe ways to construct a `Span<&A>`
262-
with invalid ranges. Implementations of `.search()` often start with:
268+
`.search()` and `.consume()` are safe because there is no safe ways to construct a `Span<&A>`
269+
with invalid ranges. Implementations of these methods often start with:
263270

264271
```rust
265272
fn search(&mut self, span: SharedSpan<&A>) -> Option<Range<A::Index>> {
@@ -270,36 +277,62 @@ with invalid ranges. Implementations of `.search()` often start with:
270277

271278
The trait is unsafe to implement because it needs to guarantee the returned range is valid.
272279

273-
### Consumer
274-
275-
A consumer provides the `.consume()` method to implement `starts_with()` and `trim_start()`. It
276-
takes a span as input, and if the beginning matches the pattern, returns the end index of the match.
277-
278-
The trait also provides a `.trim_start()` method in case a faster specialization exists.
280+
The `.search()` method will look for the first slice matching the searcher's pattern in the span,
281+
and returns the range where the slice is found (relative to the hay's start index).
282+
The `.consume()` method will is similar, but anchored to the start of the span.
279283

280284
```rust
281-
pub unsafe trait Consumer<A: Hay + ?Sized> {
282-
fn consume(&mut self, span: Span<&A>) -> Option<A::Index>;
283-
fn trim_start(&mut self, hay: &A) -> A::Index { ... }
284-
}
285-
pub unsafe trait ReverseConsumer<A: Hay + ?Sized>: Consumer<A> {
286-
fn rconsume(&mut self, span: Span<&A>) -> Option<A::Index>;
287-
fn trim_end(&mut self, hay: &A) -> A::Index { ... }
288-
}
289-
pub unsafe trait DoubleEndedConsumer<A: Hay + ?Sized>: ReverseConsumer<A> {}
285+
let span = unsafe { Span::from_parts("CDEFG", 3..8) };
286+
// we can find "CD" at the start of the span.
287+
assert_eq!("CD".into_searcher().search(span.clone()), Some(3..5));
288+
assert_eq!("CD".into_searcher().consume(span.clone()), Some(5));
289+
// we can only find "EF" in the middle of the span.
290+
assert_eq!("EF".into_searcher().search(span.clone()), Some(5..7));
291+
assert_eq!("EF".into_searcher().consume(span.clone()), None);
292+
// we cannot find "GH" in the span.
293+
assert_eq!("GH".into_searcher().search(span.clone()), None);
294+
assert_eq!("GH".into_searcher().consume(span.clone()), None);
290295
```
291296

297+
The trait also provides a `.trim_start()` method in case a faster specialization exists.
298+
292299
### Pattern
293300

294301
A pattern is simply a "factory" of a searcher and consumer.
295302

296303
```rust
297304
trait Pattern<H: Haystack>: Sized {
298305
type Searcher: Searcher<H::Target>;
299-
type Consumer: Consumer<H::Target>;
300306

301307
fn into_searcher(self) -> Self::Searcher;
302-
fn into_consumer(self) -> Self::Consumer;
308+
fn into_consumer(self) -> Self::Searcher { self.into_searcher() }
309+
}
310+
```
311+
312+
Patterns are the types where users used to supply into the algorithms.
313+
Patterns are usually immutable (stateless), while searchers sometimes require pre-computation and
314+
mutable state when implementing some more sophisticated string searching algorithms.
315+
316+
The relation between `Pattern` and `Searcher` is thus like `IntoIterator` and `Iterator`.
317+
318+
There is a required method `.into_searcher()` as well as an optional method `.into_consumer()`.
319+
In some patterns (e.g. substring search), checking if a prefix match will require much less
320+
pre-computation than checking if any substring match. Therefore, if an algorithm can declare that
321+
it will only call `.consume()`, the searcher could use a more efficient structure.
322+
323+
```rust
324+
impl<H: Haystack<Target = str>> Pattern<H> for &'p str {
325+
type Searcher = SliceSearcher<'p, [u8]>;
326+
#[inline]
327+
fn into_searcher(self) -> Self::Searcher {
328+
// create a searcher based on Two-Way algorithm.
329+
SliceSearcher::new_searcher(self)
330+
}
331+
#[inline]
332+
fn into_consumer(self) -> Self::Searcher {
333+
// create a searcher based on naive search (which requires no pre-computation)
334+
SliceSearcher::new_consumer(self)
335+
}
303336
}
304337
```
305338

@@ -320,7 +353,7 @@ where
320353
pub fn ends_with<H, P>(haystack: H, pattern: P) -> bool
321354
where
322355
H: Haystack,
323-
P: Pattern<H, Consumer: ReverseConsumer<H::Target>>;
356+
P: Pattern<H, Searcher: ReverseSearcher<H::Target>>;
324357
```
325358

326359
**Trim**
@@ -334,12 +367,12 @@ where
334367
pub fn trim_end<H, P>(haystack: H, pattern: P) -> H
335368
where
336369
H: Haystack,
337-
P: Pattern<H, Consumer: ReverseConsumer<H::Target>>;
370+
P: Pattern<H, Searcher: ReverseSearcher<H::Target>>;
338371

339372
pub fn trim<H, P>(haystack: H, pattern: P) -> H
340373
where
341374
H: Haystack,
342-
P: Pattern<H, Consumer: DoubleEndedConsumer<H::Target>>;
375+
P: Pattern<H, Searcher: DoubleEndedSearcher<H::Target>>;
343376
```
344377

345378
**Matches**
@@ -625,7 +658,7 @@ The main performance improvement comes from `trim()`. In v1.0, `trim()` depends
625658
the `Searcher::next_reject()` method, which requires initializing a searcher and compute
626659
the critical constants for the Two-Way search algorithm. Search algorithms mostly concern about
627660
quickly skip through mismatches, but the purpose of `.next_reject()` is to find mismatches, so a
628-
searcher would be a job mismatch for `trim()`. This justifies the `Consumer` trait in v3.0.
661+
searcher would be a job mismatch for `trim()`. This justifies the `.into_consumer()` method in v3.0.
629662

630663
<details><summary>Summary of benchmark</summary>
631664

@@ -677,8 +710,72 @@ searcher would be a job mismatch for `trim()`. This justifies the `Consumer` tra
677710

678711
[suffix table]: https://docs.rs/suffix/1.0.0/suffix/struct.SuffixTable.html#method.positions
679712

680-
2. Patterns are still moved when converting to a Searcher or Consumer.
681-
Taking the entire ownership might prevent some use cases... ?
713+
2. Patterns are still moved when converting to a Searcher.
714+
Taking the entire ownership of the pattern might prevent some use cases... ?
715+
716+
* Stabilization of this RFC is blocked by [RFC 1672] \(disjointness based on associated types)
717+
which is postponed.
718+
719+
The default Pattern implementation currently uses an impl that covers all haystacks
720+
(`impl<H: Haystack<Target = A>> Pattern<H> for Pat`) for some types, and several impls for
721+
individual types for others (`impl<'h> Pattern<&'h A> for Pat`). Ideally *every* such impl
722+
should use the blanket impl.
723+
Unfortunately, due to lack of RFC 1672, there would be conflict between these impls:
724+
725+
```rust
726+
// 1.
727+
impl<'p, H> Pattern<H> for &'p [char]
728+
where
729+
H: Haystack<Target = str>,
730+
{ ... }
731+
impl<'p, H> Pattern<H> for &'p [T] // `T` can be `char`
732+
where
733+
H: Haystack<Target = [T]>,
734+
T: PartialEq + 'p,
735+
{ ... }
736+
737+
// 2.
738+
impl<H, F> Pattern<H> for F
739+
where
740+
H: Haystack<Target = str>,
741+
F: FnMut(char) -> bool,
742+
{ ... }
743+
impl<T, H, F> Pattern<H> for F
744+
where
745+
H: Haystack<Target = [T]>,
746+
F: FnMut(&T) -> bool, // `F` can impl both `FnMut(char)->bool` and `FnMut(&T)->bool`.
747+
T: PartialEq,
748+
{ ... }
749+
750+
// 3.
751+
impl<'p, H> Pattern<H> for &'p str
752+
where
753+
H: Haystack<Target = str>,
754+
{ ... }
755+
impl<'p, H> Pattern<H> for &'p str
756+
where
757+
H: Haystack<Target = OsStr>,
758+
{ ... }
759+
```
760+
761+
We currently provide concrete impls like `impl<'h, 'p> Pattern<&'h OsStr> for &'p str`
762+
as workaround, but if we stabilize the `Pattern` trait before RFC 1672 is implemented,
763+
a third-party crate can sneak in an impl:
764+
765+
```rust
766+
struct MyOsString { ... };
767+
impl Deref for MyOsString {
768+
type Target = OsStr;
769+
...
770+
}
771+
impl Haystack for MyOsString { ... }
772+
773+
impl<'p> Pattern<MyOsString> for &'p str { ... }
774+
```
775+
776+
and causes the standard library not able to further generalize (this is a breaking change).
777+
778+
RFC 1672 is currently blocked by `chalk` integration before it could be reopened.
682779

683780
# Rationale and alternatives
684781
[alternatives]: #alternatives
@@ -983,6 +1080,26 @@ trait Consumer<A: Hay + ?Sized> {
9831080
Both `starts_with()` and `trim()` can be efficiently implemented in terms of `.consume()`,
9841081
though for some patterns a specialized `trim()` can be even faster, so we keep this default method.
9851082

1083+
During the RFC, after we have actually tried the API on third party code, we found that having
1084+
`Searcher` and `Consumer` as two distinct traits seldom have any advantages as most of the time they
1085+
are the same type anyway. Therefore, we *merge* the consumer methods into the `Searcher` trait,
1086+
while still keeping `Pattern::into_consumer()` so we could still choose the less expensive algorithm
1087+
at runtime.
1088+
1089+
```rust
1090+
// v3.0-alpha.8
1091+
trait Pattern<H: Haystack> {
1092+
type Searcher: Searcher<H::Target>;
1093+
fn into_searcher(self) -> Self::Searcher;
1094+
fn into_consumer(self) -> Self::Searcher { self.into_searcher() }
1095+
}
1096+
trait Searcher<A: Hay + ?Sized> {
1097+
fn search(&mut self, hay: Span<&A>) -> Option<Range<A::Index>>;
1098+
fn consume(&mut self, hay: Span<&A>) -> Option<A::Index>;
1099+
fn trim_start(&mut self, hay: &A) -> A::Index { /* default impl */ }
1100+
}
1101+
```
1102+
9861103
## Miscellaneous decisions
9871104

9881105
### `usize` as index instead of pointers
@@ -1059,12 +1176,13 @@ And thus the more general `Borrow` trait offers no advantage over `Deref`.
10591176

10601177
### Searcher makes Hay an input type instead of associated type
10611178

1062-
The `Searcher` and `Consumer` traits make the hay as input type.
1179+
The `Searcher` trait makes the hay as input type.
10631180
This makes any algorithm relying on a `ReverseSearcher` need to spell out the hay as well.
10641181

10651182
```rust
10661183
trait Searcher<A: Hay + ?Sized> {
10671184
fn search(&mut self, span: Span<&A>) -> Option<Range<A::Index>>;
1185+
...
10681186
}
10691187

10701188
fn rfind<H, P>(haystack: H, pattern: P) -> Option<H::Target::Index>
@@ -1080,6 +1198,7 @@ An alternative is to make Hay an associated type:
10801198
trait Searcher {
10811199
type Hay: Hay + ?Sized;
10821200
fn search(&mut self, span: Span<&Self::Hay>) -> Option<Range<Self::Hay::Index>>;
1201+
...
10831202
}
10841203

10851204
fn rfind<H, P>(haystack: H, pattern: P) -> Option<H::Target::Index>
@@ -1125,6 +1244,47 @@ With specialization, this dilemma can be easily fixed: we will fallback to an al
11251244
which only requires `T: PartialEq` (e.g. [`galil-seiferas`] or even naive search),
11261245
and use the faster Two-Way algorithm when `T: Ord`.
11271246

1247+
### Not having default implementations for `Searcher::{search, consume}`
1248+
1249+
In the `Searcher` trait, `.search()` and `.consume()` can be implemented in terms of each other:
1250+
1251+
```rust
1252+
trait Searcher<A: Hay + ?Sized> {
1253+
fn search(&mut self, span: Span<&A>) -> Option<Range<A::Index>> {
1254+
// we can implement `search` in terms of `consume`
1255+
let (hay, range) = span.into_parts();
1256+
loop {
1257+
unsafe {
1258+
if let Some(end) = self.consume(Span::from_span(hay, range.clone())) {
1259+
return Some(range.start..end);
1260+
}
1261+
if range.start == range.end {
1262+
return None;
1263+
}
1264+
range.start = hay.next_index(range.start);
1265+
}
1266+
}
1267+
}
1268+
1269+
fn consume(&mut self, span: Span<&A>) -> Option<A::Index> {
1270+
// we can implement `consume` in terms of `search`
1271+
let start = span.original_range().start;
1272+
let range = self.search(span)?;
1273+
if range.start == start {
1274+
Some(range.end)
1275+
} else {
1276+
None
1277+
}
1278+
}
1279+
1280+
...
1281+
}
1282+
```
1283+
1284+
These fallbacks should only be used when the pattern does not allow more efficient implementations,
1285+
which is often not the case. To encourage pattern implementations to support both primitives,
1286+
where they should have full control of the details, we keep them as required methods.
1287+
11281288
### Names of everything
11291289

11301290
* **Haystack**. Inherited from the v1.0 method `Searcher::haystack()`. v2.0 called it
@@ -1141,19 +1301,12 @@ and use the faster Two-Way algorithm when `T: Ord`.
11411301
`.next_match()` since it needs to take a span as input and thus no longer iterator-like.
11421302
It is renamed to `.search()` as a shorter verb and also consistent with the trait name.
11431303

1144-
* **Consumer::consume()**. The name is almost randomly chosen as there's no good name for
1304+
* **Searcher::consume()**. The name is almost randomly chosen as there's no good name for
11451305
this operation. This name is taken from the same function in the [`re2` library][re2-consume].
11461306

1147-
* `Consumer` is totally different from `Searcher`. Calling it `PrefixSearcher` or
1148-
`AnchoredSearcher` would imply a non-existing sub-classing relationship.
1149-
11501307
* We would also like a name which is only a single word.
11511308

1152-
* We want the name *not* start with the letter **S**
1153-
so we could easily distinguish between this and `Searcher` when quick-scanning the code,
1154-
in particular when `ReverseXxxer` is involved.
1155-
1156-
* "Matcher" (using name from Python) is incompatible with the existing `.matches()` method.
1309+
* "match" (using name from Python) is incompatible with the existing `.matches()` method.
11571310
Besides, the meaning of "match" is very ambiguous among other libraries.
11581311

11591312
<details><summary>Names from other languages and libraries</summary>
@@ -1215,6 +1368,10 @@ and use the faster Two-Way algorithm when `T: Ord`.
12151368

12161369
* **Span**. The name is taken from the rustc compiler.
12171370

1371+
## Alternatives
1372+
1373+
* The names of everything except `Searcher`, `Pattern` and `Haystack` are not finalized.
1374+
12181375
# Prior art
12191376

12201377
## Previous attempts
@@ -1474,6 +1631,7 @@ Unlike this RFC, the `Extract` class is much simpler.
14741631

14751632
[RFC 528]: https://github.com/rust-lang/rfcs/pull/528
14761633
[RFC 1309]: https://github.com/rust-lang/rfcs/pull/1309
1634+
[RFC 1672]: https://github.com/rust-lang/rfcs/pull/1672
14771635
[RFC 2089]: https://github.com/rust-lang/rfcs/pull/2089
14781636
[RFC 2289]: https://github.com/rust-lang/rfcs/pull/2289
14791637
[RFC 2295]: https://github.com/rust-lang/rfcs/pull/2295

0 commit comments

Comments
 (0)