
ReppTop


Overview

This page discusses the Regular Expression Pre-Processor (REPP), a relatively simple finite-state device used to prepare textual input for 'deep' parsing (using DELPH-IN grammars).

This page was predominantly authored by StephanOepen, who is the original REPP designer and current maintainer. Please do not make substantial changes unless you (a) are quite certain of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns for the REPP machinery, and the intentions of this page.

An Example

Following are a few sample REPP rules, taken from the [http://svn.delph-in.net/erg/tags/0902/tokenizer.rpp REPP tokenizer] of the ERG. Note that, for display purposes, we show space characters in REPP rules as the '▁' symbol (a lower one eighth bar), and tabulator characters as '→' (a rightwards arrow).

  ;;
  ;; preprocessor rules versioning; auto-maintained upon SVN check-in.
  ;; 
  @$Date: 2009-01-29 09:10:22 +0100 (tor, 29 jan 2009) $

  ;;
  ;; tokenization pattern: after normalization, the string will be broken up at
  ;; each occurrence of this pattern; the pattern match itself is deleted.
  ;;
  :[▁\t]+

  ;;
  ;; pad the full string with trailing and leading whitespace; makes matches for
  ;; word boundaries a little easier down the road; also, squash multiple spaces
  ;; and replace tabulators with a space.
  ;;
  !^(.+)$→      →       →       →       →       →       →       ▁\1▁
  !▁+→   →      →       →       →       →       →       →       ▁
  !\t→   →      →       →       →       →       →       →       ▁

The example makes use of three REPP operators, i.e. the character in column zero of each actual rule: the at sign '@', the colon ':', and the exclamation point '!'. It also shows the convention used for comments (lines starting with a semicolon ';' in column zero); furthermore, empty lines are ignored by the REPP reader. Besides these three operators (not counting the comment marker), the following are valid REPP operators and are discussed in more detail below: the hash mark '#' (for group formation), the left angle bracket '<' (for file inclusion), and the right angle bracket '>' (for group calls).

REPP Syntax

In the following, we will refer to a collection of pre-processing rules as a REPP instance, or sometimes simply as a REPP. In the example above, there are three rules, one for each line starting with the exclamation point operator '!' in column zero. The colon operator ':' does not specify a string re-writing rule, but rather the pattern used to break up the string (after re-writing) into a sequence of tokens; there cannot be multiple usages of ':' within one REPP. Finally, the at sign operator '@' records REPP meta-information, exclusively used for versioning. Like ':', there can be at most one '@' declaration, but whether it is provided or not has no bearing on actual processing.

REPP files use a simple, line-oriented format with relatively rigid syntax. Non-empty lines must start with a valid REPP operator in column zero. The operator is immediately followed by one or more operands, i.e. the first operand starts from column 1. The exclamation point operator '!' is the central element of string-level rewriting. Each '!' rule provides two operands, a search pattern and a replacement pattern. Both patterns are regular expressions (REs; see below for the exact RE syntax used in REPP), and the '!' operator is the REPP equivalent of the 's' command in text processing tools like sed(1) and perl(1). The first operand seeks to match a sub-string of input; when matching succeeds, the corresponding sub-string is deleted from the input, and the replacement operand of the rule is inserted in its place. In other words, REPP '!' rules are string rewrite rules, and we can think of the search operand as the left-hand (or input) side of each rule, and of the replacement operand as the right-hand (or output) side. Operands in REPP rules are separated by one or more tabulator characters, i.e. the REPP reader applies the regular expression /\t+/ (where, conventionally, we use slashes to delimit regular expressions in running text) to the string following the operator (everything starting from column 1) to extract multiple operands. For the first '!' rule in our example above, thus, the left-hand side RE is /^(.+)$/, and the right-hand side /▁\1▁/.
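
Since the LKB implementation of REPP builds on the CL-PPCRE regular expression library (see below), the effect of a single '!' rule can be approximated in a few lines of Common Lisp. The following is an illustrative sketch only, not LKB code, and the helper name apply-repp-rule() is made up for the occasion:

  ;; illustration only: assumes the CL-PPCRE library has been loaded
  (defun apply-repp-rule (search replace string)
    ;; apply a single '!' rule: substitute all matches of SEARCH in
    ;; STRING by REPLACE, i.e. one global s/.../.../ in sed(1) terms
    (ppcre:regex-replace-all search string replace))

  ;; the first '!' rule from the example: pad with leading and trailing space
  (apply-repp-rule "^(.+)$" " \\1 " "down the road")
  ;; => " down the road "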

There is a strict sequential order to the application of REPP rules, determined by the order of appearance in the REPP definition file. Unless requested specifically (see below for iterative groups), the REPP processor makes a single pass through the rule set, i.e. each rule is invoked exactly once. However, the left-hand side of a rule can of course match multiple times against the current input, and in such cases all matches are substituted in a single rule application. There is 'feeding and bleeding' among the rules, in the sense that the output of one rule is given as input to the next rule. Therefore, at each point in time, there is only one current string as the target for rule applications. The original input string is re-written in a piecemeal fashion, and once REPP processing has completed (i.e. the last of the substitution rules has been invoked), the final string is broken up into tokens according to the pattern specified as the argument of the colon operator ':'.
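
To make this control regime concrete, the following sketch (again with made-up helper names, building on apply-repp-rule() from above) performs a single pass over an ordered list of rules, feeding the output of each rule into the next, and finally breaks the resulting string up at the tokenization pattern:

  ;; hypothetical sketch of one REPP pass: RULES is an ordered list of
  ;; (search . replace) pairs, each applied exactly once, with the output
  ;; of one rule feeding into the next
  (defun apply-rules (rules string)
    (dolist (rule rules string)
      (setq string (apply-repp-rule (car rule) (cdr rule) string))))

  ;; after the last rule, the ':' pattern breaks the string into tokens
  (defun repp-pass (rules pattern string)
    (ppcre:split pattern (apply-rules rules string)))

  (repp-pass '(("  +" . " ")) " +" "too  many   spaces")
  ;; => ("too" "many" "spaces")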

File Inclusion and External Group Calls

There are two mechanisms that facilitate modularization of REPP rule sets. The left angle bracket operator '<' requests textual inclusion of a file; for example:

  <muri.rpp

When reading the line above, the contents of the file muri.rpp will be inserted into the current REPP (at the current position).

A more flexible means of modularization is provided by the right angle bracket operator '>', which we will refer to as an external group call. The current ERG pre-processor, near its top, contains the following lines:

  ;;
  ;; a set of `mark-up modules', often replacing mark-up character entities
  ;; with actual UniCode characters (e.g. |&mdash;| or |---|), or just ditching
  ;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
  ;; these modules can be activated selectively by name in the REPP environment
  ;; or the top-level call into REPP.
  ;;
  >xml
  >latex
  >ascii
  >wiki

When reading a '>' rule, a separate REPP is expected to exist, under the name given as the sole operand to the '>' operator (see below for the conventions used in assigning names to REPPs). While processing a REPP Q that contains an external group call to REPP R, the REPP processor will apply all rules from R at the point in sequential processing where the '>' call occurs in Q. In this respect, the left and right angle bracket operators '<' and '>' have very similar effects. However, the REPP processor provides a configuration mechanism to selectively enable or disable external groups. Thus, with external group calls, a single REPP definition can be used with multiple distinct configurations, each activating or leaving out a specific sub-set of external group calls. In the ERG setup, for example, the external XML and wiki groups might be combined, but it will hardly be desirable to activate both the XML and LaTeX groups at the same time.
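
Schematically, the selective activation of external groups could be pictured as follows (a made-up sketch re-using the apply-rules() helper from above, not the actual LKB implementation): a '>' call into a group that is not in the set of active calls simply leaves the current string untouched.

  ;; hypothetical sketch: REPPS maps group names to rule lists, and CALLS
  ;; is the set of currently active external groups.  a '>' call into an
  ;; inactive group is a no-op on the current string
  (defun external-group-call (name repps calls string)
    (if (member name calls)
        (apply-rules (gethash name repps) string)
        string))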

Iterative, Internal Groups

Up to this point, all REPP processing was strictly sequential, and typically no rule was invoked more than once (unless it was part of an external group that is literally called multiple times). In some cases, however, it can be necessary to iterate one or more rules, for example when a rule is conditioned on a specific string-level property, and the result of successfully applying the rule may give rise to additional instances of that specific property. The ERG tokenizer (as of early 2009) includes an example of this requirement. In splitting off punctuation marks (following the so-called PTB conventions, even though this token-level separation is ultimately undesirable), it is important to only split off punctuation marks that occur in the right or left periphery of a token. Colons, for example, can occur as part of web addresses, and clearly it is undesirable to break up a string like 'http://lingo.stanford.edu/' into two tokens (or even six, if slashes were to be tokenized off too). Hence, the condition for splitting off punctuation marks is adjacency to a token boundary, i.e. preceding or following whitespace. However, quote marks and parentheses (for example) often lead to 'clusters' of punctuation marks, and for a string like '(42%),' there are three trailing punctuation marks that should ultimately become separate tokens each. Such examples require iterative rule application, and REPP provides the facility of internal group calls for this purpose. Consider the ERG example:

  #1
  !([^▁])([][(){}?!,;:@#$€¢£¥%&“”"‘’'])▁([^▁]|$)→       →       \1▁\2▁\3
  !([^▁])\.▁([])}”"’'▁]*)$→     →       →       →       →       \1▁.▁\2
  !(^|[^▁])▁([][(){}?!,;:@#$€¢£¥%&“”"‘’'])([^▁])→       →       \1▁\2▁\3
  #
  >1

This example introduces an additional operator: the hash mark '#'. The rules enclosed between the pair of hash marks '#1' ... '#' define a numbered internal group. Thus, the group-opening '#' has to be followed by a group identifier, which is required to take the form of an integer. A hash mark not followed by an identifier is a group-closing operator, and it applies to the most recently opened group; thus, it is possible for internal groups to nest inside each other. The right angle bracket operator '>', when followed by a numeric group identifier, corresponds to a group call, much like the invocation of external groups. However, in processing internal groups, the call will be iterated until processing reaches a fix-point, i.e. until none of the rules in the group can be applied any longer. In this way, the above example will split off one punctuation mark at a time for each call to the numbered group. Assuming our earlier '(42%),' example, the first time the group is invoked, both its first and third rules will apply, corresponding to suffix and prefix punctuation, respectively. Because at least one of the rules in the group fired, the group will be called again. With the trailing comma tokenized off now, the suffix punctuation rule will fire another time, this time splitting off the closing parenthesis. Another call to the same group will tokenize off the percent sign next, until a final group call no longer matches any of the rules successfully. At that point, processing of the internal group call is complete, and the REPP machinery moves on to the next rule.
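
The fix-point behavior of an internal group call could be sketched as follows (again made-up code, re-using the apply-rules() helper from above): the group's rules are re-applied until a complete pass leaves the current string unchanged.

  ;; hypothetical sketch of an internal ('#'-defined) group call: iterate
  ;; the group's rules until a full pass no longer changes the string
  (defun internal-group-call (rules string)
    (loop for previous = string then current
          for current = (apply-rules rules previous)
          until (string= previous current)
          finally (return current)))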

REPP Support in the LKB and [incr tsdb()]

An implementation of the REPP machinery is available as part of the LKB as of early 2009; to use REPP, please make sure that your build is dated February 1, 2009, or newer. An LKB grammar can load any number of REPP instances by means of the read-repp() directive. The January 2009 release of the ERG, for example, includes the following in its script file:

  ;;;
  ;;; as of September 2008, REPP supports `ensembles' of rule sets, where select
  ;;; modules (XML or LaTeX markup normalization, for example) can be activated
  ;;; in the REPP environment or top-level repp() call.  by default, turn on the
  ;;; XML and ASCII modules.
  ;;;
  (read-repp (lkb-pathname (parent-directory) "xml.rpp"))
  (read-repp (lkb-pathname (parent-directory) "latex.rpp"))
  (read-repp (lkb-pathname (parent-directory) "ascii.rpp"))
  (read-repp (lkb-pathname (parent-directory) "wiki.rpp"))
  (read-repp (lkb-pathname (parent-directory) "erg.rpp"))
  (read-repp (lkb-pathname (parent-directory) "tokenizer.rpp"))
  (setf *repp-calls* '(:xml :ascii))
  (setf *repp-interactive* '(:tokenizer :xml :ascii :erg))
  (setf *repp-characterize-p* t)

Besides loading no fewer than six distinct REPP instances (most of which are intended for use in external group calls), the parameter *repp-calls* controls the default set of active external groups. By default, each REPP is named after the basename of its file, i.e. the first of the above corresponds to the XML mark-up module that we saw in our earlier example already. If, for whatever reason, it were desirable to give a different name to a REPP instance, it is possible to provide an :id keyword argument in the read-repp() call.
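
For example, the first of the read-repp() calls above could be given an explicit name like this (a hypothetical invocation; the name :markup is made up for illustration):

  ;; hypothetical: register xml.rpp under the name :markup instead of :xml
  (read-repp (lkb-pathname (parent-directory) "xml.rpp") :id :markup)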

The programmatic interface to REPP is through the repp() and repp-for-pet() functions. These can be used in debugging REPP rules, or in preparing input to another parser, specifically PET (see the PetInput page for background). Both functions take optional keyword arguments :repp and :calls, which determine the top-level ('master') REPP to be used and the set of active external groups, respectively. The value of :calls defaults to the current value of *repp-calls*, and the default master REPP (the value of the :repp parameter) is the last REPP instance that was loaded (hence, in the ERG example above, it would be the REPP called tokenizer). Following is an example of debugging the ERG tokenizer (the default REPP), including its XML and wiki external groups:

  (repp 
   "Wikipedia [[wikimedia markup|mark-up]] is ''relatively'' straightforward."
   :repp :tokenizer :calls '(:xml :wiki) :verbose t :format :string)

A non-nil value for the (optional) :verbose argument will cause a trace like the following:

  (42) [0:1] |Wikipedia|
  (43) [1:2] |mark-up|
  (44) [2:3] |is|
  (45) [3:4] |¦i|
  (46) [4:5] |relatively|
  (47) [5:6] |i¦|
  (48) [6:7] |straightforward|
  (49) [7:8] |.|

Furthermore, the :format argument selects one of several available output formats, in the example above a plain string (where the final sequence of tokens is concatenated, with whitespace inserted at token boundaries):

  "Wikipedia mark-up is ¦i relatively i¦ straightforward ."

Other available output formats include :pet (the default, returning the so-called YY format; see the PetInput page); :sppp (an LKB-internal format, see the LkbSppp page); and :raw (a list of token structures, providing all available information). The last of these is the most generic output option; it could be used to wrap an XML serialization around the REPP core (e.g. if one were to emulate the (S)MAF output option of the older FSPP implementation; see below and the SmafTop page).

When using the LKB parser with a grammar that provides one or more REPP instances, the parameter *repp-interactive* determines the specific REPP configuration that is applied prior to parsing. In the ERG case, the above example activates the top-level tokenizer and several of the external groups, including a final 'patch-up' rule set, essentially a stand-in for the token mapping facilities that mediate between the tokenization assumptions made in pre-processing and those used grammar-internally.

Finally, the function repp-for-pet() is a wrapper around repp(), suitable as a :preprocessor hook in itsdb cpu definitions. The LOGON tree (see the LogonTop page), for example, includes the following two [http://svn.emmtee.net/trunk/dot.tsdbrc cpu definitions] for the ERG (reproduced here in simplified form):

  (make-cpu 
   :spawn cheap
   :options (list "-t" "-tsdb" "-yy" "-packing"
                  "-chart-mapping" "-default-les=all"
                  (format nil "~a/lingo/terg/english.grm" root))
   :preprocessor '("lkb::repp-for-pet"
                   :repp :tokenizer :calls (:xml :ascii :latex))
   :reader "tsdb::yy-read-input"
   :class :terg :task '(:parse) ...)
  (make-cpu 
   :spawn cheap
   :options (list "-t" "-tsdb" "-yy" "-packing"
                  "-chart-mapping" "-default-les=all"
                  (format nil "~a/lingo/terg/english.grm" root))
   :preprocessor '("lkb::repp-for-pet"
                   :repp :tokenizer :calls (:xml :ascii :latex))
   :tagger (list
            :tnt
            (format
             nil
             "~a/bin/tnt -z100 ~a/coli/tnt/models/wsj -"
             root root)
            :n 2)
   :reader "tsdb::yy-read-input"
   :class :terg+tnt :task '(:parse) ...)

In both definitions, the :preprocessor component includes a set of keyword arguments to repp-for-pet(), analogous to our discussion of the basic repp() interface above. All itsdb pre-processor hooks take two obligatory arguments, (a) the string to be processed, and (b) the definition of a procedure (which can be an external process) for PoS tagging. The latter can optionally be provided by the :tagger component in cpu definitions, or can be (implicitly) specified as nil for no PoS tagging. Thus, the internal pre-processing effects of the first cpu definition can be simulated by a Lisp function call like the following:

  (repp-for-pet
   "Tokenization, a non-trivial exercise, foozed oe@yy.com." nil
   :repp :tokenizer :calls '(:xml :ascii :latex) :stream "sample.yy")

The additional :stream argument will re-direct the result of pre-processing into a file (overwriting an existing file by that name, if need be), in this case the file sample.yy. See PetInput for further discussion on how REPP outputs can be processed interactively in PET, for example for in-depth debugging.

REPP Regular Expression Syntax

The LKB implementation of REPP is built using the [http://weitz.de/cl-ppcre/ Portable Perl-Compatible Regular Expressions] library for Common Lisp (PPCRE). PPCRE supports most of the regex syntax of Perl 5.8 (as described in man perlre), including extended features like non-greedy repetitions, positive and negative look-ahead and look-behind assertions, 'standalone' sub-expressions, and conditional sub-patterns. However, PPCRE does not support Posix character classes like, say, /[[:alnum:]]+/, and the version of PPCRE included with the LKB as of early 2009 has no built-in support for testing so-called UniCode properties (using a syntax like /\p{Script:Latin}+/) either. We expect to update the PPCRE bundle in the near future, but note that UniCode support is otherwise already complete.
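
For illustration, the following (made-up) CL-PPCRE calls contrast a supported Perl 5.8 feature, a positive look-ahead assertion, with the Perl-style character class that has to be used in place of a Posix class:

  ;; a positive look-ahead assertion: match a word only when it is
  ;; followed by a comma or period, without consuming that character
  (ppcre:scan-to-strings "\\w+(?=[,.])" "Oslo, Norway.")
  ;; => "Oslo"

  ;; Posix classes like [[:alnum:]] are not available; spell the class
  ;; out in Perl style instead
  (ppcre:scan-to-strings "[a-zA-Z0-9]+" "(42%)")
  ;; => "42"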

A native REPP implementation in PET is being prepared as of early 2009; it will build on the [http://www.boost.org/doc/libs/release/libs/regex/ Boost Regular Expression] library, which in turn is Perl-compatible (including Posix character classes and tests of UniCode properties).

History and Outlook

REPP has evolved from a similar but undocumented device that has been available in the LKB since around 2003: the Finite-State Pre-Processor (FSPP). Since its inception, FSPP has been available in two versions. FSPP 1.0 was originally designed and (partially) implemented (as part of the LKB) by StephanOepen in early 2003, based on earlier experience at the YY Corporation. In 2005, BenWaldron (who was part of the YY team too) created a parallel FSPP implementation in the LKB, which we will refer to as FSPP 2.0; this version added new functionality (notably characterization and (S)MAF support; see the SafFspp page), albeit at the expense of backwards compatibility with FSPP 1.0. Ben took over FSPP maintenance in 2006 and enabled compiling FSPP into PET, adding (partial) (S)MAF support to PET at the same time. However, due to a limitation in the Embedded Common Lisp (ECL) system used in this integration, FSPP 2.0 in PET provides no support for international characters (aka UniCode), which nowadays limits its utility (in PET) severely. Since sometime in 2007, this line of development has been unsupported.

In a sense, REPP can be viewed as FSPP 3.0, but we prefer to coin a new name to emphasize the conceptual and technological revisions. REPP re-energizes some aspects of the original FSPP design (which were not implemented at the time) and incorporates characterization from FSPP 2.0. At the same time, REPP drops some earlier FSPP facilities, making a number of simplifying assumptions that reflect the intervening years of experience with the integration of 'deep' and 'shallow' processing approaches. Most importantly, REPP can be conceptualized as pure string-level rewriting, in that its output is a flat sequence of tokens (which could be concatenated into a whitespace-separated string format). There is no provision for token-level ambiguity in the REPP design, not because we doubt the utility of such ambiguity (on the contrary), but because in late 2008 a new and more adequate mechanism became available (albeit initially only in the PET parser): the so-called chart mapping approach of [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs et al. (2008)].

Even though the transition from FSPP to REPP will require a moderate amount of adaptation to existing pre-processor rule sets, we encourage everyone to make the transition sooner rather than later. Future development is expected to focus on REPP, including a native implementation (with UniCode support) in PET. StephanOepen will be happy to hear from current FSPP users, especially those making use of FSPP facilities that are not part of the REPP design: token-level rules and tokenization ambiguity. In principle, the combination of REPP and chart mapping should make it possible to obtain similar (or better) behavior regarding token-level processing, but at least in early 2009, chart mapping is only available in PET. Hence, in some cases it may be necessary to jointly define bridging solutions, possibly even exposing some of the FSPP legacy support in the current REPP implementation.
