-
Notifications
You must be signed in to change notification settings - Fork 18
re2 regular expressions
Regular expressions are powerful tools for parsing loosely structured text data. Russ Cox’s ”Regular Expression Matching Can Be Simple And Fast” explains that due to backreference support, several common regular expression engines have a time complexity which is exponential in pattern size (including PCRE which is used by R). One way to achieve polynomial time complexity is to drop backreference support and use the re2 C++ library.
- Base R functions such as
regexpr
use PCRE when given the perl=TRUE argument. PCRE includes many useful features such as named capture, but has an exponential time complexity. - Base R functions such as
regexpr
use TRE when given the perl=FALSE argument. TRE has a polynomial time complexity but does not include named capture groups. -
stringr::str_match
andstringi::stri_match
use the ICU library, which has an exponential time complexity. The stringi package does not support named capture.
The GSOC student should write an R package interface to the re2 C++ code, providing the R community with the first regular expression package with both named capture and polynomial time complexity.
R function | library | named capture | complexity |
---|---|---|---|
regexpr(perl=FALSE) | TRE | no | polynomial |
stringi::stri_match() | ICU | no | exponential |
regexpr(perl=TRUE) | PCRE | yes | exponential |
str_match_re2() | re2 | yes | polynomial |
- Use the namedCapture package as a model for the re2 package API:
namedCapture function | re2 function | finds | returns |
---|---|---|---|
str_match_named | str_match_re2 | first match | character matrix |
str_match_all_named | str_match_all_re2 | several matches | list of character matrices |
- Copy and modify the tests in the namedCapture package.
- Write an R interface that compiles the pattern and then uses the RE2::PartialMatch C++ function.
- Write a vignette with benchmarks.
Please get in touch with Toby Dylan Hocking <tdhock5@gmail.com> as soon as possible.
Do one or several — doing more hard tests makes you more likely to be selected.
- Easy: add the
stringr::str_match
function to the benchmark, and re-compute the timings. - Medium: make a package with a vignette that has the benchmark timings figures.
- Hard: demonstrate that you know how to interface C++ code with R by writing a package with a function that calls some custom C++ code.
None yet.