Commit c618ed5

MaxSagebaum and hsutter authored
Documentation for regular expressions. (#1195)
* Documentation for regular expressions.
* Update for regex documentation and improved matching detection of regex names.
* Fixes for line endings in msvc-2022.
* Pass through regex docs

  Mainly turn code lists into tables
* Small fixes after update from Herb.
* Fix for regression tests.
* Quote tweak

  Signed-off-by: Herb Sutter <herb.sutter@gmail.com>

---------

Signed-off-by: Herb Sutter <herb.sutter@gmail.com>
Co-authored-by: Herb Sutter <herb.sutter@gmail.com>
1 parent 630e61a commit c618ed5

6 files changed: +330 -34 lines changed

docs/cpp2/metafunctions.md

Lines changed: 82 additions & 0 deletions
@@ -360,6 +360,88 @@ main: () = {
```

### For computational and functional types

#### `regex`

A `regex` type has data members that are regular expression objects. This metafunction replaces all of the type's data members named `regex` or `regex_*` with regular expression objects of the same type. For example:

``` cpp title="Regular expression example" hl_lines="1 3 4 16 17 19 27 30 31"
name_matcher: @regex type
= {
    regex         := R"((\w+) (\w+))";  // for example: Margaret Hamilton
    regex_no_case := R"(/(ab)+/i)";     // case insensitive match of "ab"+
}

main: (args) = {
    m: name_matcher = ();

    data: std::string = "Donald Duck";
    if args.ssize() >= 2 {
        data = args[1];
    }

    // regex.match requires matches to match the entire string, from start to end
    result := m.regex.match(data);
    if result.matched {
        // We found a match; reverse the order of the substrings
        std::cout << "Hello (result.group(2))$, (result.group(1))$!\n";
    }
    else {
        std::cout << "I only know names of the form: <name> <family name>.\n";
    }

    // regex.search finds a match anywhere within the target string
    std::cout << "Case insensitive match: "
        "(m.regex_no_case.search(\"blubabABblah\").group(0))$\n";
}
//  Prints:
//      Hello Duck, Donald!
//      Case insensitive match: abAB
```

The `@regex` metafunction currently supports most of [Perl regex syntax](https://perldoc.perl.org/perlre), except for Unicode characters and the syntax tokens associated with them. See [Supported regular expression features](../notes/regex_status.md) for a list of regex options.

Each regex object has the type `cpp2::regex::regular_expression`, which is defined in `include/cpp2regex.h2`. The member functions are:

``` cpp title="Member functions for regular expressions"
// .match() requires matches to match the entire string, from start to end
// .search() finds a match anywhere within the target string

match : (this, str: std::string_view) -> search_return;
search: (this, str: std::string_view) -> search_return;

match : (this, str: std::string_view, start) -> search_return;
search: (this, str: std::string_view, start) -> search_return;

match : (this, str: std::string_view, start, length) -> search_return;
search: (this, str: std::string_view, start, length) -> search_return;

match : <Iter> (this, start: Iter, end: Iter) -> search_return;
search: <Iter> (this, start: Iter, end: Iter) -> search_return;
```
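
The difference between whole-string matching and searching can be illustrated outside Cpp2 as well. The following is a minimal sketch that uses the C++ standard library's `std::regex` (ECMAScript grammar) as a stand-in; it is not the `cpp2::regex` implementation. The helper names `name_pattern`, `matches_whole`, and `first_match` are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the same pattern as `regex` in the example above.
const std::regex& name_pattern() {
    static const std::regex re(R"((\w+) (\w+))");
    return re;
}

// Whole-string semantics, analogous to what is described for .match():
// succeeds only if the pattern covers the text from start to end.
bool matches_whole(const std::string& text, const std::regex& re) {
    return std::regex_match(text, re);
}

// Search semantics, analogous to what is described for .search():
// returns the first matching substring, or an empty string if none.
std::string first_match(const std::string& text, const std::regex& re) {
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return {};
}
```

With these helpers, `matches_whole("Hello, Donald Duck!", name_pattern())` is false because of the surrounding text, while `first_match` on the same input still finds `Donald Duck`.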

The return type `search_return` is defined in `cpp2::regex::regular_expression`. It has these members:

``` cpp title="Members of a regular expression result"
matched: bool;
pos:     int;

// Functions to access groups by number
group_number: (this) -> size_t;
group:        (this, g: int) -> std::string;
group_start:  (this, g: int) -> int;
group_end:    (this, g: int) -> int;

// Functions to access groups by name
group:        (this, g: bstring<CharT>) -> std::string;
group_start:  (this, g: bstring<CharT>) -> int;
group_end:    (this, g: bstring<CharT>) -> int;
```
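
To make the idea of per-group text and offsets concrete, here is a rough analogy using `std::smatch` from the C++ standard library rather than `search_return`. The struct `group_span` and the function `find_group` are hypothetical names for this sketch. In the sketch, `end` is one past the last character of the group; whether `group_end` follows the same convention is not stated above, so treat that as an assumption.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this sketch; not part of cpp2regex.h2.
struct group_span {
    std::string    text;   // matched text of the group
    std::ptrdiff_t start;  // offset of the group's first character, or -1
    std::ptrdiff_t end;    // one past the group's last character, or -1
};

// Search `text` and report group `g` of the first match (0 = whole match).
group_span find_group(const std::string& text, const std::regex& re, std::size_t g) {
    std::smatch m;
    if (!std::regex_search(text, m, re) || g >= m.size()) {
        return {"", -1, -1};
    }
    return {m.str(g), m.position(g), m.position(g) + m.length(g)};
}
```

For the input `Hello, Donald Duck!` and the pattern `(\w+) (\w+)`, group 2 is `Duck`, spanning offsets 14 to 18 in this sketch.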


### Helpers and utilities

docs/notes/regex_status.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
# Supported regular expression features

The listings are taken from the [Perl regex docs](https://perldoc.perl.org/perlre). Regular expressions are applied via the [`regex` metafunction](../cpp2/metafunctions.md#regex).


## Currently supported or planned features


### Modifiers

| Modifier | Notes | Status |
| --- | --- | --- |
| **`i`** | Do case-insensitive pattern matching. For example, "A" will match "a" under `/i`. | <span style="color:green">Supported</span> |
| **`m`** | Treat the string being matched against as multiple lines. That is, change `^` and `$` from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string. | <span style="color:green">Supported</span> |
| **`s`** | Treat the string as single line. That is, change `.` to match any character whatsoever, even a newline, which normally it would not match. | <span style="color:green">Supported</span> |
| **`x` and `xx`** | Extend your pattern's legibility by permitting whitespace and comments. For details see: [Perl regex docs: `/x` and `/xx`](https://perldoc.perl.org/perlre#/x-and-/xx). | <span style="color:green">Supported</span> |
| **`n`** | Prevent the grouping metacharacters `(` and `)` from capturing. This modifier will stop `$1`, `$2`, etc. from being filled in. | <span style="color:green">Supported</span> |
| **`c`** | Keep the current position during repeated matching. | <span style="color:gray">Planned</span> |


### Escape sequences __(Complete)__

| Escape sequence | Notes | Status |
| --- | --- | --- |
| **`\t`** | Tab (HT, TAB) | <span style="color:green">Supported</span> |
| **`\n`** | Newline (LF, NL) | <span style="color:green">Supported</span> |
| **`\r`** | Return (CR) | <span style="color:green">Supported</span> |
| **`\f`** | Form feed (FF) | <span style="color:green">Supported</span> |
| **`\a`** | Alarm (bell) (BEL) | <span style="color:green">Supported</span> |
| **`\e`** | Escape (think troff) (ESC) | <span style="color:green">Supported</span> |
| **`\x{}`, `\x00`** | Character whose ordinal is the given hexadecimal number | <span style="color:green">Supported</span> |
| **`\o{}`, `\000`** | Character whose ordinal is the given octal number | <span style="color:green">Supported</span> |


### Quantifiers __(Complete)__

| Quantifier | Notes | Status |
| --- | --- | --- |
| **`*`** | Match 0 or more times | <span style="color:green">Supported</span> |
| **`+`** | Match 1 or more times | <span style="color:green">Supported</span> |
| **`?`** | Match 1 or 0 times | <span style="color:green">Supported</span> |
| **`{n}`** | Match exactly n times | <span style="color:green">Supported</span> |
| **`{n,}`** | Match at least n times | <span style="color:green">Supported</span> |
| **`{,n}`** | Match at most n times | <span style="color:green">Supported</span> |
| **`{n,m}`** | Match at least n but not more than m times | <span style="color:green">Supported</span> |
| | | |
| **`*?`** | Match 0 or more times, not greedily | <span style="color:green">Supported</span> |
| **`+?`** | Match 1 or more times, not greedily | <span style="color:green">Supported</span> |
| **`??`** | Match 0 or 1 time, not greedily | <span style="color:green">Supported</span> |
| **`{n}?`** | Match exactly n times, not greedily (redundant) | <span style="color:green">Supported</span> |
| **`{n,}?`** | Match at least n times, not greedily | <span style="color:green">Supported</span> |
| **`{,n}?`** | Match at most n times, not greedily | <span style="color:green">Supported</span> |
| **`{n,m}?`** | Match at least n but not more than m times, not greedily | <span style="color:green">Supported</span> |
| | | |
| **`*+`** | Match 0 or more times and give nothing back | <span style="color:green">Supported</span> |
| **`++`** | Match 1 or more times and give nothing back | <span style="color:green">Supported</span> |
| **`?+`** | Match 0 or 1 time and give nothing back | <span style="color:green">Supported</span> |
| **`{n}+`** | Match exactly n times and give nothing back (redundant) | <span style="color:green">Supported</span> |
| **`{n,}+`** | Match at least n times and give nothing back | <span style="color:green">Supported</span> |
| **`{,n}+`** | Match at most n times and give nothing back | <span style="color:green">Supported</span> |
| **`{n,m}+`** | Match at least n but not more than m times and give nothing back | <span style="color:green">Supported</span> |
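
The difference between greedy and non-greedy quantifiers is easiest to see on a small input. The sketch below uses the C++ standard library's `std::regex` as a stand-in, not the `@regex` implementation; the function name `first_tag` is hypothetical. The possessive forms (`*+`, `++`, and so on) are not shown because the ECMAScript grammar used by `std::regex` does not provide them.

```cpp
#include <regex>
#include <string>

// Return the first "<...>" match in `text`, using either the greedy
// quantifier `+` or the non-greedy quantifier `+?`.
std::string first_tag(const std::string& text, bool greedy) {
    const std::regex re(greedy ? "<.+>" : "<.+?>");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return {};
}
```

On `<b>bold</b>`, the greedy form consumes as much as possible and returns the whole string, while the non-greedy form stops at the first `>` and returns `<b>`.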


### Character Classes and other Special Escapes __(Complete)__

| Feature | Notes | Status |
| --- | --- | --- |
| **`[`...`]`** | Match a character according to the rules of the bracketed character class defined by the "...". Example: `[a-z]` matches "a" or "b" or "c" ... or "z" | <span style="color:green">Supported</span> |
| **`[[:`...`:]]`** | Match a character according to the rules of the POSIX character class "..." within the outer bracketed character class. Example: `[[:upper:]]` matches any uppercase character. | <span style="color:green">Supported</span> |
| **`\g1`** or **`\g{-1}`** | Backreference to a specific or previous group. The number may be negative indicating a relative previous group and may optionally be wrapped in curly brackets for safer parsing. | <span style="color:green">Supported</span> |
| **`\g{name}`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k<name>`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k'name'`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k{name}`** | Named backreference | <span style="color:green">Supported</span> |
| **`\w`** | Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks) | <span style="color:green">Supported</span> |
| **`\W`** | Match a non-"word" character | <span style="color:green">Supported</span> |
| **`\s`** | Match a whitespace character | <span style="color:green">Supported</span> |
| **`\S`** | Match a non-whitespace character | <span style="color:green">Supported</span> |
| **`\d`** | Match a decimal digit character | <span style="color:green">Supported</span> |
| **`\D`** | Match a non-digit character | <span style="color:green">Supported</span> |
| **`\v`** | Vertical whitespace | <span style="color:green">Supported</span> |
| **`\V`** | Not vertical whitespace | <span style="color:green">Supported</span> |
| **`\h`** | Horizontal whitespace | <span style="color:green">Supported</span> |
| **`\H`** | Not horizontal whitespace | <span style="color:green">Supported</span> |
| **`\1`** | Backreference to a specific capture group or buffer. '1' may actually be any positive integer. | <span style="color:green">Supported</span> |
| **`\N`** | Any character but \n. Not affected by /s modifier | <span style="color:green">Supported</span> |
| **`\K`** | Keep the stuff left of the \K, don't include it in $& | <span style="color:green">Supported</span> |
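
A numbered backreference such as `\1` matches the same text that the referenced group captured. The following sketch shows the idea with the C++ standard library's `std::regex` as a stand-in, not the `@regex` implementation; the function name `doubled_word` is hypothetical. The `\g` and `\k` forms are not shown because `std::regex` does not provide them.

```cpp
#include <regex>
#include <string>

// Find the first word that is immediately repeated (e.g. "is is")
// and return it, or an empty string if there is no such word.
std::string doubled_word(const std::string& text) {
    // \b anchors at word boundaries; \1 must equal what group 1 captured.
    static const std::regex re(R"(\b(\w+) \1\b)");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(1);
    }
    return {};
}
```

For `this is is a test`, the leading `\b` prevents a match starting inside `this`, so the first hit is the repeated word `is`.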


### Assertions

| Assertion | Notes | Status |
| --- | --- | --- |
| **`\b`** | Match a \w\W or \W\w boundary | <span style="color:green">Supported</span> |
| **`\B`** | Match except at a \w\W or \W\w boundary | <span style="color:green">Supported</span> |
| **`\A`** | Match only at beginning of string | <span style="color:green">Supported</span> |
| **`\Z`** | Match only at end of string, or before newline at the end | <span style="color:green">Supported</span> |
| **`\z`** | Match only at end of string | <span style="color:green">Supported</span> |
| **`\G`** | Match only at pos() (e.g. at the end-of-match position of prior m//g) | <span style="color:gray">Planned</span> |


### Capture groups __(Complete)__

| Feature | Status |
| --- | --- |
| **`(`...`)`** | <span style="color:green">Supported</span> |


### Quoting metacharacters __(Complete)__

| Feature | Status |
| --- | --- |
| **For `^.[]$()*{}?+|\`** | <span style="color:green">Supported</span> |


### Extended Patterns

| Extended pattern | Notes | Status |
| --- | --- | --- |
| **`(?<NAME>pattern)`** | Named capture group | <span style="color:green">Supported</span> |
| **`(?#text)`** | Comments | <span style="color:green">Supported</span> |
| **`(?adlupimnsx-imnsx)`** | Modification for surrounding context | <span style="color:green">Supported</span> |
| **`(?^alupimnsx)`** | Modification for surrounding context | <span style="color:green">Supported</span> |
| **`(?:pattern)`** | Clustering, does not generate a group index. | <span style="color:green">Supported</span> |
| **`(?adluimnsx-imnsx:pattern)`** | Clustering, does not generate a group index and modifications for the cluster. | <span style="color:green">Supported</span> |
| **`(?^aluimnsx:pattern)`** | Clustering, does not generate a group index and modifications for the cluster. | <span style="color:green">Supported</span> |
| **`(?`<code>&#124;</code>`pattern)`** | Branch reset | <span style="color:green">Supported</span> |
| **`(?'NAME'pattern)`** | Named capture group | <span style="color:green">Supported</span> |
| **`(?(condition)yes-pattern`<code>&#124;</code>`no-pattern)`** | Conditional patterns. | <span style="color:gray">Planned</span> |
| **`(?(condition)yes-pattern)`** | Conditional patterns. | <span style="color:gray">Planned</span> |
| **`(?>pattern)`** | Atomic patterns. (Disable backtrack.) | <span style="color:gray">Planned</span> |
| **`(*atomic:pattern)`** | Atomic patterns. (Disable backtrack.) | <span style="color:gray">Planned</span> |
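
"Does not generate a group index" means that a clustering group `(?:pattern)` groups without adding a numbered capture. The sketch below counts capture groups with the C++ standard library's `std::regex::mark_count()` as a stand-in, not the `@regex` implementation; the function name `capture_count` is hypothetical. Named groups and branch reset are not shown because `std::regex` does not provide them.

```cpp
#include <regex>
#include <string>

// Number of capturing groups in `pattern`, as counted by std::regex.
unsigned capture_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

Here `(ab)+(c)` has two capturing groups, while `(?:ab)+(c)` has one, so `c` moves from group 2 to group 1.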


### Lookaround Assertions

| Lookaround assertion | Notes | Status |
| --- | --- | --- |
| **`(?=pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(*pla:pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(*positive_lookahead:pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(?!pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(*nla:pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(*negative_lookahead:pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(?<=pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(*plb:pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(*positive_lookbehind:pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(?<!pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
| **`(*nlb:pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
| **`(*negative_lookbehind:pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
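
A lookahead checks what follows the current position without including it in the match. The sketch below demonstrates the `(?=pattern)` and `(?!pattern)` forms with the C++ standard library's `std::regex` as a stand-in, not the `@regex` implementation. The function names are hypothetical, and the `(*pla:...)` style spellings are not shown because `std::regex` does not provide them.

```cpp
#include <regex>
#include <string>

// First whole number that is followed by " USD"; the " USD" itself
// is required but is not part of the returned match.
std::string amount_in_usd(const std::string& text) {
    static const std::regex re(R"(\b\d+\b(?= USD))");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return {};
}

// First whole number that is NOT followed by " USD".
std::string amount_not_in_usd(const std::string& text) {
    static const std::regex re(R"(\b\d+\b(?! USD))");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return {};
}
```

The `\b` anchors keep the negative lookahead from succeeding on a partial number such as the `4` in `42`.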


### Special Backtracking Control Verbs

| Backtracking control verb | Notes | Status |
| --- | --- | --- |
| **`(*SKIP) (*SKIP:NAME)`** | Start next search here. | <span style="color:gray">Planned</span> |
| **`(*PRUNE) (*PRUNE:NAME)`** | No backtracking over this point. | <span style="color:gray">Planned</span> |
| **`(*MARK:NAME) (*:NAME)`** | Place a named mark. | <span style="color:gray">Planned</span> |
| **`(*THEN) (*THEN:NAME)`** | Like PRUNE. | <span style="color:gray">Planned</span> |
| **`(*COMMIT) (*COMMIT:arg)`** | Stop searching. | <span style="color:gray">Planned</span> |
| **`(*FAIL) (*F) (*FAIL:arg)`** | Fail the pattern/branch. | <span style="color:gray">Planned</span> |
| **`(*ACCEPT) (*ACCEPT:arg)`** | Accept the pattern/subpattern. | <span style="color:gray">Planned</span> |


## Not planned (mainly because of Unicode or Perl specifics)


### Modifiers

| Modifier | Notes | Status |
| --- | --- | --- |
| `p` | Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching. | <span style="color:darkred">Not planned</span> |
| `a`, `d`, `l`, and `u` | These modifiers affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers". | <span style="color:darkred">Not planned</span> |
| `g` | globally match the pattern repeatedly in the string | <span style="color:darkred">Not planned</span> |
| `e` | evaluate the right-hand side as an expression | <span style="color:darkred">Not planned</span> |
| `ee` | evaluate the right side as a string then eval the result | <span style="color:darkred">Not planned</span> |
| `o` | pretend to optimize your code, but actually introduce bugs | <span style="color:darkred">Not planned</span> |
| `r` | perform non-destructive substitution and return the new value | <span style="color:darkred">Not planned</span> |


### Escape sequences

| Escape sequence | Notes | Status |
| --- | --- | --- |
| `\cK` | control char (example: VT) | <span style="color:darkred">Not planned</span> |
| `\N{name}` | named Unicode character or character sequence | <span style="color:darkred">Not planned</span> |
| `\N{U+263D}` | Unicode character (example: FIRST QUARTER MOON) | <span style="color:darkred">Not planned</span> |
| `\l` | lowercase next char (think vi) | <span style="color:darkred">Not planned</span> |
| `\u` | uppercase next char (think vi) | <span style="color:darkred">Not planned</span> |
| `\L` | lowercase until \E (think vi) | <span style="color:darkred">Not planned</span> |
| `\U` | uppercase until \E (think vi) | <span style="color:darkred">Not planned</span> |
| `\Q` | quote (disable) pattern metacharacters until \E | <span style="color:darkred">Not planned</span> |
| `\E` | end either case modification or quoted section, think vi | <span style="color:darkred">Not planned</span> |


### Character Classes and other Special Escapes

| Character class or escape | Notes | Status |
| --- | --- | --- |
| `(?[...])` | Extended bracketed character class | <span style="color:darkred">Not planned</span> |
| `\pP` | Match P, named property. Use \p{Prop} for longer names | <span style="color:darkred">Not planned</span> |
| `\PP` | Match non-P | <span style="color:darkred">Not planned</span> |
| `\X` | Match Unicode "eXtended grapheme cluster" | <span style="color:darkred">Not planned</span> |
| `\R` | Linebreak | <span style="color:darkred">Not planned</span> |


### Assertions

| Assertion | Notes | Status |
| --- | --- | --- |
| `\b{}` | Match at Unicode boundary of specified type | <span style="color:darkred">Not planned</span> |
| `\B{}` | Match where corresponding \b{} doesn't match | <span style="color:darkred">Not planned</span> |


### Extended Patterns

| Extended pattern | Notes | Status |
| --- | --- | --- |
| `(?{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(*{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(??{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(?PARNO)` `(?-PARNO)` `(?+PARNO)` `(?R)` `(?0)` | Recursive subpattern. | <span style="color:darkred">Not planned</span> |
| `(?&NAME)` | Recursive subpattern. | <span style="color:darkred">Not planned</span> |


### Script runs

| Script runs | Notes | Status |
| --- | --- | --- |
| `(*script_run:pattern)` | All chars in pattern need to be of the same script. | <span style="color:darkred">Not planned</span> |
| `(*sr:pattern)` | All chars in pattern need to be of the same script. | <span style="color:darkred">Not planned</span> |
| `(*atomic_script_run:pattern)` | Without backtracking. | <span style="color:darkred">Not planned</span> |
| `(*asr:pattern)` | Without backtracking. | <span style="color:darkred">Not planned</span> |
