PetInput

Overview

This page discusses the input formats available to the PET parser cheap. The order of presentation largely reflects the historical order of PET development, but also corresponds to increasing complexity (and, thus, control over system behavior).

Textual, Line-Oriented Input

For pure textual input, punctuation characters, as specified in the settings file, are ignored by PET (i.e. removed from the input chart).

Here is an example of the punctuation characters found in pet/japanese.set:

  punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。？…．，　○●◎＊".

Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp) and that, in recent years, grammars are moving towards inclusion of punctuation marks in the syntactic analysis.

Punctuation characters are not removed in the other input modes (YY mode, PET Input Chart, or MAF). Rather, in these modes they should be removed (or treated otherwise, as appropriate) by the preprocessor that created the token lattice (in whatever syntax) provided to PET.
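For the lattice-based modes, one possible preprocessor policy is to drop tokens that consist solely of punctuation characters before the lattice is built. The following is only a minimal sketch in Python under that assumption: the PUNCTUATION set is a small illustrative subset, and drop_punctuation_tokens is a hypothetical helper, not part of PET.

```python
# Illustrative subset of punctuation characters; a real preprocessor would
# take these from its own configuration (assumption, not read from PET).
PUNCTUATION = set("\"!&'()*+,-./;<=>?@[\\]^_`{|}~")


def drop_punctuation_tokens(tokens):
    """Return the tokens that contain at least one non-punctuation character."""
    return [tok for tok in tokens if any(ch not in PUNCTUATION for ch in tok)]


if __name__ == "__main__":
    print(drop_punctuation_tokens(["Tokenization", ",", "a", "non-trivial", "."]))
```

Note that a token such as non-trivial survives, since only tokens made up entirely of punctuation characters are dropped; whether that is the right policy depends on the grammar.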

YY Input Mode

YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of POS tags on input tokens to enable (better) unknown word handling, and generally feeding a word graph (as, for example, obtained from a speech recognizer) into the parser.

Following is a discussion of the YY [http://svn.delph-in.net/erg/trunk/pet/sample.yy input example] provided with the ERG (as of early 2009). In this example, the words are shown on separate lines for clarity. In the actual input given to PET, all YY tokens must appear as a single line (terminated by newline), as each line of input is processed as a separate utterance.

  (42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
  (43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
  (44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
  (45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
  (46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
  (47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
  (48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
  (49, 7, 8, <45:57>, 1, "oe@ifi.uio.no", 0, "null", "NN" 0.7342 "JJ" 0.2096)
  (50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)

An input in this form can be processed by PET as follows:

  cat sample.yy | cheap -yy -packing -verbose=4 -mrs -chart-mapping -default-les=all english.grm

where -yy (a shorthand for -tok=yy) turns on YY partial chart input mode and we request ambiguity packing (which is always a good idea), some verbosity of tracing, and the output of MRSs. The additional options enable chart mapping (see [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)]) and turn the unknown word machinery into 2008 mode (see the section Unknown Word Handling below). Note that these options, as of early 2009, are only supported in the so-called chart mapping [https://pet.opendfki.de/repos/pet/branches/cm branch] of the PET code base (corresponding pre-compiled binaries are available in the LOGON tree; see the LogonTop page).

Each token in the above example has the following format:

(id, start, end, [link,] path⁺, form [surface], ipos, lrule⁺[, {pos p}⁺])

i.e. each token has a unique identifier and start and end vertex. Optionally, tokens can be annotated with a surface link, an indication of underlying string positions in the original document; currently (as of January 2009), link information is only supported as character positions, in the format <from:to> (but in principle, link could have other forms, with from and to being arbitrary strings, e.g. stand-off pointers in whatever underlying markup). We will ignore the path component (membership in one or more paths through a word lattice) for our purposes.
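To make the field layout concrete, here is a small Python reader for the restricted token shape used in the sample: an optional <from:to> link, a single path, no surface field, one lrule value, and any trailing tag and probability pairs. It is a sketch under those assumptions, not the parser PET itself uses; parse_yy and the dictionary keys are hypothetical names.

```python
import re

# Restricted token shape only (see the assumptions stated above).
TOKEN_RE = re.compile(r'''
    \(\s*(?P<id>\d+),\s*(?P<start>\d+),\s*(?P<end>\d+),\s*
    (?:<(?P<cfrom>\d+):(?P<cto>\d+)>,\s*)?
    (?P<path>\d+),\s*
    "(?P<form>[^"]*)",\s*
    (?P<ipos>\d+),\s*
    "(?P<lrule>[^"]*)"
    (?:,\s*(?P<tags>(?:"[^"]*"\s+[0-9.]+\s*)+))?
    \)
''', re.VERBOSE)
TAG_RE = re.compile(r'"([^"]*)"\s+([0-9.]+)')


def parse_yy(line):
    """Return one dictionary per YY token found on a single input line."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        link = None
        if m.group("cfrom") is not None:
            link = (int(m.group("cfrom")), int(m.group("cto")))
        tags = [(t, float(p)) for t, p in TAG_RE.findall(m.group("tags") or "")]
        tokens.append({
            "id": int(m.group("id")),
            "start": int(m.group("start")),
            "end": int(m.group("end")),
            "link": link,
            "form": m.group("form"),
            "lrule": m.group("lrule"),
            "tags": tags,
        })
    return tokens


if __name__ == "__main__":
    sample = '(44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)'
    print(parse_yy(sample))
```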

The actual token string is provided by the form field, and this is what PET uses for morphological analysis and lexical look-up. In case the form does not correspond to the original string in the document, e.g. because some textual normalization was already applied prior to the creation of YY tokens, the optional surface field can be used to record the original string. Until early 2009, the ERG had inherited a mechanism called ersatzing, where a set of regular expressions was applied prior to parsing, associating for example a form value of EmailErsatz with a surface value of oe@yy.com. In the newer, chart mapping universe, the ERG no longer makes use of this facility and instead makes it a policy to never 'mess' with the actual token string (but use other token properties instead).

YY mode can be used in two variants regarding morphological analysis. Our example above leaves morphological analysis to PET, i.e. using the lexical rules and orthographemic annotation provided by the grammar. This built-in morphology mode is activated by an lrule value of "null", in which case the ipos field is ignored (but still has to be given, conventionally as 0). The other option is to provide information about morphological segmentation as part of the input tokens, in which case ipos specifies the position to which orthographemic rules apply, and one or more lrule values (as strings) name lexical rules provided by the grammar.

Finally, each token can be annotated with an optional sequence of tag plus probability pairs. The ERG, for example, includes a set of underspecified generic lexical entries which can be activated on the basis of PoS information, obtained for example from running a PoS tagger prior to parsing. We used to include the probabilities in (heuristic) parse ranking, but since sometime in 2002 (when MaxEnt parse selection became available in PET) they are just ignored.
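As an illustration of how a preprocessor might emit tokens in this format, the following Python sketch serializes one token per call and joins the results into a single input line. The function names yy_token and yy_line are hypothetical. The sketch also assumes that multiple path or lrule values, and the tokens on a line, are separated by single spaces; it omits the optional surface field and does not escape double quotes inside the form.

```python
def yy_token(ident, start, end, form, link=None, paths=(1,), ipos=0,
             lrules=("null",), tags=()):
    """Format one token as (id, start, end, [link,] path+, form, ipos, lrule+[, {pos p}+])."""
    fields = [str(ident), str(start), str(end)]
    if link is not None:
        # Character positions in the <from:to> notation.
        fields.append("<%d:%d>" % tuple(link))
    fields.append(" ".join(str(p) for p in paths))
    fields.append('"%s"' % form)
    fields.append(str(ipos))
    fields.append(" ".join('"%s"' % rule for rule in lrules))
    if tags:
        fields.append(" ".join('"%s" %.4f' % (tag, prob) for tag, prob in tags))
    return "(" + ", ".join(fields) + ")"


def yy_line(tokens):
    """Join formatted tokens into one line, i.e. one utterance."""
    return " ".join(tokens)


if __name__ == "__main__":
    print(yy_line([
        yy_token(42, 0, 1, "Tokenization", link=(0, 11),
                 tags=[("NNP", 0.7677), ("NN", 0.2323)]),
        yy_token(43, 1, 2, ",", link=(12, 12), tags=[(",", 1.0)]),
    ]))
```

With the default lrules value of "null" and ipos of 0, the tokens request the built-in morphology mode.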

YY input mode supports a genuine token lattice, i.e. it is legitimate to have multiple tokens for an input position, or tokens spanning multiple positions.
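The lattice idea can be pictured with a small Python sketch that enumerates all token sequences leading from the first to the last vertex. The (start, end, form) triples and the lattice_paths helper are hypothetical, and the "New York" tokens are an invented illustration, not taken from any grammar. The sketch assumes every token ends at a higher vertex than it starts at.

```python
from collections import defaultdict


def lattice_paths(tokens):
    """Enumerate token sequences from the lowest to the highest vertex."""
    by_start = defaultdict(list)
    for tok in tokens:
        by_start[tok[0]].append(tok)
    first = min(tok[0] for tok in tokens)
    last = max(tok[1] for tok in tokens)

    def extend(vertex):
        if vertex == last:
            yield []
            return
        for _start, end, form in by_start[vertex]:
            for rest in extend(end):
                yield [form] + rest

    return list(extend(first))


if __name__ == "__main__":
    lattice = [
        (0, 1, "New"),        # two single-word tokens ...
        (1, 2, "York"),
        (0, 2, "New York"),   # ... and one token spanning both positions
        (2, 3, "rocks"),
    ]
    for path in lattice_paths(lattice):
        print(path)
```

Here vertex 0 has two outgoing tokens, and one of them spans two positions, which is exactly the kind of ambiguity a plain token sequence cannot express.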

Unknown Word Handling

If you look at pet/english.set in the ERG distribution, you will find some settings that determine the treatment of unknown words:

  posmapping :=
    UpperAndLowerCase $genericname
    UpperAndLowerCaseInitial $genericname
    JJ $generic_adj
    JJR $generic_adj_compar
    JJS $generic_adj_superl
    NN $generic_mass_count_noun
    NNS $generic_pl_noun
    NNPS $generic_pl_noun
    NNP $genericname
    FW $generic_mass_noun
    RB $generic_adverb
    VB $generic_trans_verb_bse
    VBD $generic_trans_verb_past
    VBG $generic_trans_verb_prp
    VBN $generic_trans_verb_psp
    VBP $generic_trans_verb_presn3sg
    VBZ $generic_trans_verb_pres3sg
  .

which determines what happens for unknown words, i.e. tokens whose form is not found in the native lexicon. The top part of the mapping (which is commented out in the current release version) is for PTB tags, the lower part for CLAWS tags.
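Conceptually, the mapping is a lookup from PoS tags to generic lexical entries. The Python sketch below mirrors part of the posmapping table shown above and assumes that every tag on an unknown token independently activates its mapped entry; how PET actually consults the table is not spelled out here, and generic_entries is a hypothetical helper.

```python
# A subset of the posmapping entries, copied from the setting above.
POSMAPPING = {
    "JJ": "$generic_adj",
    "NN": "$generic_mass_count_noun",
    "NNS": "$generic_pl_noun",
    "NNP": "$genericname",
    "RB": "$generic_adverb",
    "VB": "$generic_trans_verb_bse",
    "VBD": "$generic_trans_verb_past",
    "VBN": "$generic_trans_verb_psp",
}


def generic_entries(tags):
    """Return the generic entries for the tags of one unknown token."""
    return [POSMAPPING[tag] for tag in tags if tag in POSMAPPING]


if __name__ == "__main__":
    # Tags of the unknown token "bazed" in the YY sample.
    print(generic_entries(["VBD", "VBN"]))
```

Tags without an entry in the table (punctuation tags, for instance) simply yield no generic entry in this sketch.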

I suspect both the mapping and the constraints on generic entries will need some fine-tuning. Consider our initial example: FAQ is not in the ERG lexicon. RASP (wrongly, I think) tags it as a proper noun, thus we use the $genericname lexical entry. When Dan did these generic entries, we did not have a tagger (i.e. always threw in all of them), hence he made these entries fairly constrained with respect to their combinatorics: in this case, $genericname does not allow combination with a specifier, hence the above still fails to parse. Changing its tag to NN or NN1, we get nine readings, the first of which looks plausible.

History and Alternate Lattice-Based Input Modes

YY input mode was first developed in 2000 and has undergone three revisions since. YY input mode revision 0.0 was a purely internal version that is no longer supported. Since 2001, YY 1.0 has been in active use and is still fully supported. The format described above, and the example given from the ERG, use YY 2.0, a conservative, backwards-compatible extension made in January 2009. Compared to YY 1.0, only the optional link field is new, i.e. the ability to provide information about external surface positions.

Alternate, lattice-based input modes are available using XML markup to encode the parser input. See the PetInputChart and SmafTop pages for the so-called PIC and SMAF modes, respectively.
