Skip to content

re2 regular expressions

Marek Gagolewski edited this page Oct 9, 2015 · 16 revisions

Background

Regular expressions are powerful tools for parsing loosely structured text data. Russ Cox’s ”Regular Expression Matching Can Be Simple And Fast” explains that due to backreference support, several common regular expression engines have a time complexity which is exponential in pattern size (including PCRE which is used by R). One way to achieve polynomial time complexity is to drop backreference support and use the re2 C++ library.

Related work

  • Base R functions such as regexpr use PCRE when given the `perl=TRUE` argument. PCRE includes many

useful features, such as named capture, but has an exponential time complexity.

  • Base R functions such as regexpr use TRE when given the `perl=FALSE` argument.

TRE has a polynomial time complexity but does not include named capture groups.

  • stringr::str_match and stringi::stri_match use the regex engine from the ICU library,

which has an exponential time complexity. The stringi package does not support named capture yet as such a feature set is still considered as experimental in ICU.

The figures below show timings of these libraries for matching the pattern a?a?a?aaa against the subject aaa (N=3), for several different values of N. Source.

Coding project: library(re2)

The GSOC student should write an R package interface to the re2 C++ code, providing the R community with the first regular expression package with both named capture and polynomial time complexity.

R function library named capture complexity
`regexpr(perl=FALSE)` TRE no polynomial
`stringi::stri_match()` ICU no exponential
`regexpr(perl=TRUE)` PCRE yes exponential
`str_match_re2()` re2 yes polynomial
  • Use the namedCapture package as a model for the re2 package API:
namedCapture function re2 function finds returns
str_match_named str_match_re2 first match character matrix
str_match_all_named str_match_all_re2 several matches list of character matrices
  • Copy and modify the tests in the namedCapture package.
  • Write an R interface that compiles the pattern and then uses the RE2::PartialMatch C++ function.
  • Write a vignette with extensive benchmarks, covering also real-world text data samples and popular regexes like email, www search, etc.

Mentors

Please get in touch with Toby Dylan Hocking <tdhock5@gmail.com> or Marek Gagolewski <gagolewski@rexamine.com> as soon as possible.

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

  • Easy: add the stringr::str_match function to the benchmark, and re-compute the timings.
  • Medium: make a package with a vignette that has the benchmark timings figures.
  • Hard: demonstrate that you know how to interface C++ code with R by writing a package with a function that calls some custom C++ code (e.g., try to call `stri_enc_toutf8` from the latest [https://github.com/Rexamine/stringi/](stringi-devel) C API from within C++ code).

Solutions of tests

None yet.

Clone this wiki locally