ErgTokenization

Overview

Aiming for a balance of linguistic precision and broad coverage, the [http://www.delph-in.net/erg English Resource Grammar] (ERG) includes detailed analyses of punctuation and of a wide variety of 'text-level' phenomena (e.g. various formats for temporal and numeric expressions). The grammar makes specific assumptions about tokenization, and for the successful application of the grammar it is important to understand and respect these assumptions. In early 2009, the ERG approach to tokenization underwent a major revision; this page aims to spell out some of the basic assumptions, the specific decisions made, and the technology used in preparing input text for parsing with the ERG.

This page was predominantly authored by StephanOepen, who, jointly with DanFlickinger, developed the current ERG approach to tokenization. As of early 2009, Stephan is the maintainer of the ERG tokenizer and token mapping rules. Please do not make substantial changes to this page unless you (a) are reasonably sure of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns for the ERG, and of course with the goals of this page.

String-Level Pre-Processing and Initial Tokenization

This section documents tokenization and a handful of other surface-level decisions. Technically speaking, when parsing with the ERG and PET (which is the reference setup for production use), the parser takes as its input a lattice of tokens, each a structured object (aka a typed feature structure). Please see the PetInput page for additional background. In this view, string-level pre-processing and initial tokenization together constitute the process of mapping a 'flat' string into a token lattice.

In the standard setup for the ERG, this task is solved by means of so-called REPP (Regular Expression Pre-Processor) modules, which are included with the ERG sources (in the rpp/ subdirectory); for general background on the technology, please see the ReppTop page. The REPP modules provided by the ERG can be configured in various ways, to accommodate different input conventions, i.e. variation in the punctuation and markup used in texts from different sources. As of mid-2010, these REPP modules have stabilized to a certain degree but remain to be documented (beyond the generous use of comments in the REPP source files). In the following, we document the normalized result of string-level pre-processing, i.e. the expected input to the ERG (and the result of applying a set of REPP modules).

General Principles

For compatibility with existing tools, specifically taggers trained on the Penn Treebank (PTB), we assume a PTB-like tokenization in pre-processing. The ERG internally (still) analyzes most punctuation marks as pseudo-affixes (rather than as separate tokens, as in the PTB). To accommodate any discrepancies, the grammar includes token mapping rules to adjust (i.e. correct) externally supplied tokenization (see the ChartMapping page for general background); specifically, punctuation marks will be re-combined with preceding or following tokens, reflecting standard orthographic convention.
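
For concreteness, here is a minimal Python sketch (purely illustrative; it implements neither the grammar's REPP rules nor its token mapping rules) of the two views: punctuation is first split off PTB-style, then glued back onto the preceding token, mimicking the re-combination that restores the pseudo-affix analysis:

  import re

  def ptb_like_split(text):
      # crude stand-in for PTB-style tokenization: detach commas and
      # sentence-final punctuation as tokens in their own right
      text = re.sub(r"([,;:])", r" \1 ", text)
      text = re.sub(r"([.!?])\s*$", r" \1", text)
      return text.split()

  def reattach_punctuation(tokens):
      # crude stand-in for token mapping: glue punctuation marks back
      # onto the preceding token, as in standard orthography
      out = []
      for token in tokens:
          if out and re.fullmatch(r"[,;:.!?]", token):
              out[-1] += token
          else:
              out.append(token)
      return out

  tokens = ptb_like_split("The shipment arrived, finally.")
  print(tokens)                        # ['The', 'shipment', 'arrived', ',', 'finally', '.']
  print(reattach_punctuation(tokens))  # ['The', 'shipment', 'arrived,', 'finally.']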

The REPP pre-processing modules included with the ERG are inspired by the PTB tokenizer.sed script and by and large yield quite similar results (with a number of extensions going beyond 7-bit ASCII strings, as discussed below). To actually tokenize (following PTB principles), we need to do more than just break at whitespace. Some punctuation marks give rise to token boundaries, but not all. Also, inputs (in the 21st century) may contain some amount of mark-up; XML character references, for example, have become relatively common. Full Unicode support in the toolchain now makes it possible to represent a much larger range of characters, e.g. various types of quotes and dashes. In general, we aim to map mark-up to corresponding Unicode characters, where appropriate, and typically analyze those in parsing.
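
As a minimal illustration of the mark-up normalization (in the actual setup this is expressed as REPP rewrite rules, not Python), character references are resolved to the corresponding Unicode characters before tokenization:

  import html

  # illustration only: XML/HTML character references resolved to the
  # corresponding Unicode characters, which the grammar then analyzes
  raw = "The shipment, &#8216;chairs&#8217;, arrived &amp; was signed for."
  print(html.unescape(raw))
  # The shipment, ‘chairs’, arrived & was signed for.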

However, the original tokenizer.sed does not always yield the exact tokenization found in the PTB. For example, the script unconditionally separates a set of punctuation and other non-alphanumeric characters (e.g. & and !) that may be part of a single token (say, in acronyms like AT&T or in URLs). We aim to do better than the original script here, conditioning token boundaries on (transitively) adjacent whitespace. See the examples discussed below for details.
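
The following sketch contrasts unconditional splitting (in the spirit of tokenizer.sed) with a whitespace-conditioned rule; the function names and regular expressions are invented for illustration, and in particular the sketch does not model the transitive aspect (where a boundary created by one rule licenses another), so it does not reproduce the actual REPP rules:

  import re

  def naive_split(text):
      # like the original tokenizer.sed: always detach '&' and '!'
      return re.sub(r"([&!])", r" \1 ", text).split()

  def edge_split(text):
      # whitespace-conditioned variant: detach '&' or '!' only where
      # whitespace (or the string boundary) is adjacent
      text = re.sub(r"([&!])(?=\s|$)", r" \1", text)
      text = re.sub(r"(?:(?<=\s)|^)([&!])", r"\1 ", text)
      return text.split()

  print(naive_split("AT&T & Sprint merged!"))
  # ['AT', '&', 'T', '&', 'Sprint', 'merged', '!']
  print(edge_split("AT&T & Sprint merged!"))
  # ['AT&T', '&', 'Sprint', 'merged', '!']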

A Running Example

To exemplify the above basic principles, consider the following sample input:

  The shipment, 'chairs', arrived.

This will be tokenized into a total of nine tokens, i.e. each of the punctuation marks will form a token in its own right. In REPP, each token is annotated with so-called 'characterization', i.e. a range of character indices into the original string (allowing one to recover the distinction between immediate adjacency of two consecutive tokens and intervening whitespace). Thus, in the so-called YY input format to PET (see the PetInput page for background), our example would be represented as follows:

  (42, 0, 1, <0:3>, 1, "The", 0, "null")
  (43, 1, 2, <4:12>, 1, "shipment", 0, "null")
  (44, 2, 3, <12:13>, 1, ",", 0, "null")
  (45, 3, 4, <14:15>, 1, "‘", 0, "null")
  (46, 4, 5, <15:21>, 1, "chairs", 0, "null")
  (47, 5, 6, <21:22>, 1, "’", 0, "null")
  (48, 6, 7, <22:23>, 1, ",", 0, "null")
  (49, 7, 8, <24:31>, 1, "arrived", 0, "null")
  (50, 8, 9, <31:32>, 1, ".", 0, "null")
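
For illustration only (this is not part of the ERG toolchain), the following Python sketch recomputes tokens and characterization ranges for the pre-processed example string and prints them in the YY-like layout shown above; the remaining fields (the identifier offset 42 and the constant 1, 0, and "null" values) are simply copied from the example and not interpreted here:

  import re

  def tokens_with_characterization(text):
      # a token is either a run of characters other than whitespace and
      # the punctuation marks handled here, or a single such punctuation
      # mark; start/end are character offsets into the original string
      for m in re.finditer(r"[^\s,.'‘’]+|[,.'‘’]", text):
          yield m.group(0), m.start(), m.end()

  text = "The shipment, ‘chairs’, arrived."
  for i, (form, start, end) in enumerate(tokens_with_characterization(text)):
      print(f'({42 + i}, {i}, {i + 1}, <{start}:{end}>, 1, "{form}", 0, "null")')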

Quotation Marks

In naturally occurring texts, there is a wide variety of conventions for representing quotation marks. Much as in the PTB, the ERG expects its inputs (after pre-processing) to distinguish between opening (aka left) and closing (aka right) quotes. Note how REPP in the above example turns the straight single quotes (so-called 'typewriter quotes') into directional Unicode characters, i.e. ‘ and ’.
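
A minimal sketch of this kind of context-sensitive rewriting (the actual REPP rules are more elaborate, e.g. they also need to handle double quotes and apostrophes) treats a straight quote after whitespace or at the start of the string as opening, and any other straight quote as closing:

  import re

  def directional_quotes(text):
      # opening quote after whitespace or at the start of the string ...
      text = re.sub(r"(?:(?<=\s)|^)'", "‘", text)
      # ... and any remaining straight quote is treated as closing
      return re.sub(r"'", "’", text)

  print(directional_quotes("The shipment, 'chairs', arrived."))
  # The shipment, ‘chairs’, arrived.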

Token Mapping

Unknown Word Handling
