PetInput

Overview

This page discusses the input formats available to the PET parser cheap. The order of presentation largely reflects the historical order of PET development, but also corresponds to increasing complexity (and, thus, increasing control over system behavior).

Textual, Line-Oriented Input

Punctuation characters, as specified in the settings file, are ignored by PET (i.e. removed from the input chart) for pure textual input.

Here is an example of the punctuation characters found in pet/japanese.set:

  punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".
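With this setting in place, a pure textual input line like

  Dogs bark, cats miaow.

enters the chart as just the four tokens Dogs, bark, cats, and miaow, since the comma and the full stop are among the punctuation characters listed above.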

Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp) and that, in recent years, grammars are moving towards inclusion of punctuation marks in the syntactic analysis.

Punctuation characters are not removed from the other input modes (YY mode, PET Input Char, or MAF). Rather, in these modes they should be removed (or treated otherwise, as appropriate) by the preprocessor that created the token lattice (in whatever syntax) provided to PET.

YY Input Mode

YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of POS tags on input tokens to enable (better) unknown word handling, and, generally, feeding a word graph (as, for example, obtained from a speech recognizer) into the parser.

Following is a discussion of the YY [http://svn.delph-in.net/erg/trunk/pet/sample.yy input example] provided with the ERG (as of early 2009). In this example, the tokens are shown on separate lines for clarity; in the actual input given to PET, all YY tokens must appear on a single line (terminated by a newline), as each line of input is processed as a separate utterance.

  (42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
  (43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
  (44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
  (45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
  (46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
  (47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
  (48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
  (49, 7, 8, <45:57>, 1, "oe@ifi.uio.no", 0, "null", "NN" 0.7342 "JJ" 0.2096)
  (50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)

An input in this form can be processed by PET as follows:

  cat sample.yy | cheap -yy -packing -verbose=4 -mrs -chart-mapping -default-les=all english.grm 

where -yy (a shorthand for -tok=yy) turns on YY partial chart input mode and we request ambiguity packing (which is always a good idea), some verbosity of tracing, and the output of MRSs. The additional options enable chart mapping (see [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)]) and turn the unknown word machinery into 2008 mode (see the section Unknown Word Handling below). Note that these options, as of early 2009, are only supported in the so-called chart mapping [https://pet.opendfki.de/repos/pet/branches/cm branch] of the PET code base (corresponding pre-compiled binaries are available in the LOGON tree; see the LogonTop page).

Each token in the above example has the following format:

  • (id, start, end, [link,] path+, form [surface], ipos, lrule+[, {pos p}+])

i.e. each token has a unique identifier, a start vertex, and an end vertex. Optionally, tokens can be annotated with a surface link, an indication of the underlying string positions in the original document; currently (as of January 2009), link information is only supported as character positions, in the format <from:to> (but in principle, links could take other forms, with from and to being arbitrary strings, e.g. stand-off pointers into whatever underlying markup). We will ignore the path component (membership in one or more paths through a word lattice) for our purposes.

The actual token string is provided by the form field, and this is what PET uses for morphological analysis and lexical look-up. In case the form does not correspond to the original string in the document, e.g. because some textual normalization took place prior to the creation of YY tokens, the optional surface field can be used to record the original string. Until early 2009, the ERG had inherited a mechanism called ersatzing, where a set of regular expressions was applied prior to parsing, associating for example a form value of EmailErsatz with a surface value of oe@yy.com. In the newer, chart mapping universe, the ERG no longer makes use of this facility and instead makes it a policy to never 'mess' with the actual token string (but to use other token properties instead).
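In YY syntax, such an ersatzed token would pair the normalized form with the original surface string, along these lines (a constructed illustration, not part of the sample file):

  (1, 0, 1, <0:8>, 1, "EmailErsatz" "oe@yy.com", 0, "null")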

YY mode can be used in two variants with regard to morphological analysis. Our example above leaves morphological analysis to PET, i.e. uses the lexical rules and orthographemic annotation provided by the grammar. This built-in morphology mode is activated by an lrule value of "null", and the ipos field is ignored (but still has to be given, conventionally as 0). The other option is to provide information about morphological segmentation as part of the input tokens, in which case ipos specifies the position to which orthographemic rules apply, and one or more lrule values (as strings) name lexical rules provided by the grammar.
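As a constructed illustration of the second variant (the rule name plur_noun_orule is made up for this sketch; actual rule names are grammar-specific), a token for the string dogs might supply the base form as form, the original string as surface, and name the applicable rule:

  (1, 0, 1, <0:3>, 1, "dog" "dogs", 0, "plur_noun_orule")

Here, cheap performs no orthographemic analysis of its own but builds a lexical item from the stem dog and the named lexical rule.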

Finally, each token can be annotated with an optional sequence of tag plus probability pairs. The ERG, for example, includes a set of underspecified generic lexical entries that can be activated on the basis of PoS information, obtained for example by running a PoS tagger prior to parsing. We used to include the probabilities in (heuristic) parse ranking, but since sometime in 2002 (when MaxEnt parse selection became available in PET) they have simply been ignored.

YY input mode supports a genuine token lattice, i.e. it is legitimate to have multiple tokens for one input position, or tokens spanning multiple positions.
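For example (again a constructed illustration), an ambiguous tokenization of New York could feed two single-position tokens plus one competing multi-position token into the chart:

  (1, 0, 1, <0:2>, 1, "New", 0, "null", "NNP" 1.0000)
  (2, 1, 2, <4:7>, 1, "York", 0, "null", "NNP" 1.0000)
  (3, 0, 2, <0:7>, 1, "New York", 0, "null", "NNP" 1.0000)

Token 3 spans vertices 0 to 2 and thus competes with the sequence of tokens 1 and 2 (and, as before, all three would have to appear on a single input line).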

Unknown Word Handling

If you look at pet/english.set in the ERG distribution, you will find some settings that determine the treatment of unknown words:

posmapping :=
  UpperAndLowerCase $genericname
  UpperAndLowerCaseInitial $genericname
  JJ $generic_adj
  JJR $generic_adj_compar
  JJS $generic_adj_superl
  NN $generic_mass_count_noun
  NNS $generic_pl_noun
  NNPS $generic_pl_noun
  NNP $genericname
  FW $generic_mass_noun
  RB $generic_adverb
  VB $generic_trans_verb_bse
  VBD $generic_trans_verb_past
  VBG $generic_trans_verb_prp
  VBN $generic_trans_verb_psp
  VBP $generic_trans_verb_presn3sg
  VBZ $generic_trans_verb_pres3sg
.

This mapping determines what happens for unknown words, i.e. tokens whose form is not found in the native lexicon. The top part of the mapping (which is commented out in the current release version) is for PTB tags, the lower part for CLAWS tags.

I suspect both the mapping and the constraints on generic entries will need some fine-tuning. Consider our initial example: FAQ is not in the ERG lexicon. RASP (wrongly, I think) tags it as a proper noun, thus we use the $genericname lexical entry. When Dan did these generic entries, we did not have a tagger (i.e. we always threw in all of them), hence he made these entries fairly constrained with respect to their combinatorics: in this case, $genericname does not allow combination with a specifier, hence the above still fails to parse. Changing its tag to NN or NN1, we get nine readings, the first of which looks plausible.
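In YY terms, re-tagging the unknown token as NN (again a constructed illustration) amounts to an input like:

  (1, 0, 1, <0:2>, 1, "FAQ", 0, "null", "NN" 1.0000)

which, via the posmapping above, activates the $generic_mass_count_noun entry.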

Pet Input Chart (XML Input)

XML input mode is very similar to YY input mode. It allows you either to specify only simple tokens that are analysed internally by cheap, or to put all kinds of preprocessing information that cheap can handle directly into the input, namely POS tags, morphology, lexicon lookup, and multi-component entries.

It extends YY mode in that it allows structured input tokens, providing a means to encode, say, named entities built from base tokens. It also allows specifying modifications to the feature structures (coming from lexicon entries).

It is called with -tok=pic_counts and can be used in combination with -default-les to trigger unknown words with POS tags, much like in YY mode.

Examples

A typical way of calling it (with XML input and the best-ranked RMRS as XML output) would be:

  cat input.xml | cheap -tok=pic_counts -default-les -packing -mrs=rmrx -results=1 grammar.grm

A simple example input is given below:

<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
 SYSTEM "/usr/local/lib/xml/pic.dtd">
<pet-input-chart>
<!-- This FAQ is short -->
  <w id="W1" cstart="1" cend="5">
    <surface>This</surface>
    <pos tag="DD1" prio = "1.0" />
  </w>
  <w id="W2" cstart="7" cend="9">
    <surface>FAQ</surface>
    <pos tag="NP1" prio = "1.0" />
  </w>
   <w id="W2" cstart="7" cend="9" constant="yes">
    <surface>FAQ</surface>
    <typeinfo id="n_-_pn_le" baseform="no" prio="1.0">
      <stem>$genericname</stem>
      <fsmod path="SYNSEM.LKEYS.KEYREL.CARG" value="F.A.Q."/>
      </typeinfo>
  </w>
  <w id="W3" cstart="11" cend="12">
    <surface>is</surface>
    <pos tag="BE" prio = "1.0" />
  </w>
  <w id="W4" cstart="14" cend="18">
    <surface>short</surface>
    <pos tag="JJ" prio = "1.0" />
  </w>
 </pet-input-chart>

[note: the two empty lines at the end of the input file appear necessary when piping data into PET using the above command]

The input is broken up into tokens <w>...</w>, which must have unique ids. Each token gives its start (cstart) and end (cend) character positions (both inclusive). It can also include a pos element, with the tag and its confidence (priority).

It also allows more detailed specifications (named entities, modified feature structures, ...).

You can only enter a single pet-input-chart in a stream, and it must start with the xml declaration and finish with at least two empty lines. Alternatively, you can give the name of a file consisting of a single pet-input-chart, or a list of such filenames, one on each line.
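For example (a sketch; the file names are made up, and we assume the names are read from standard input, one per line, as described above):

  printf 'pic1.xml\npic2.xml\n' | cheap -tok=pic_counts -default-les -packing -mrs=rmrx -results=1 grammar.grm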

The example given below illustrates most of the available features. Tokens W0 and W1 are not analysed at all by cheap because the (boolean) constant attribute is yes.

The default value of this attribute is no, which means that the token W3 will be analysed by all of the activated preprocessing modules in cheap.

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE pet-input-chart
  SYSTEM "/path/to/src/pet/doc/pic.dtd">
<pet-input-chart>
  <w id="W0" cstart="1" cend="3" constant="yes">
    <surface>Kim</surface>
  </w>
  <w id="W1" cstart="5" cend="9" constant="yes">
    <surface>Novak</surface>
  </w>
  <ne id="NE0" prio="1.0">
    <ref dtr="W0">
    <ref dtr="W1">
    <pos tag="PN" prio="1.0">
    <typeinfo id="TNE0" baseform="no">
      <stem>$generic_name</stem>
      <fsmod path="SYNSEM.LOCAL.HEAD.FORM" value="Kim Novak"/>
    </typeinfo>
  </ne>
  <w id="W2" cstart="11" cend="16" constant="yes">
    <surface>sleeps</surface>
    <pos tag="VVFIN" prio="7.80000e-1"/>
    <pos tag="NN" prio="2.30000e-2"/>
    <typeinfo id="W1A1">
      <stem>sleep</stem>
      <infl name="$third_sg_fin_verb_infl_rule"/>
    </typeinfo>
    <typeinfo id="W1A2">
      <stem>sleep</stem>
      <infl name="$plur_noun_infl_rule"/>
    </typeinfo>
  </w>
  <w id="W3" cstart="18" cend="22">
    <surface>badly</surface>
    <pos tag="ADV" prio="1.00000e+1"/>
  </w>
</pet-input-chart>

Token NE0 is an example of a complex token referencing a sequence of two base tokens. Its typeinfo directly gives the name of the HPSG type whose feature structure should be used as the lexical item in cheap. While in YY mode this was triggered by a leading special character, in XML the attribute baseform decides whether the string enclosed by the <stem> tag is to be interpreted as a lexical base form or as a type name. The default value of baseform is yes. In this token, the surface string is unified into the feature structure under the path SYNSEM.LOCAL.HEAD.FORM, which is specified with the <fsmod> tag. The value of an <fsmod> may be an arbitrary string; cheap will add a dynamic symbol if the string is not a known type or symbol name.

Every <typeinfo> tag potentially generates a lexical item (if it leads to a valid lexical feature structure). Thus, there will be two readings for the token W2 (sleeps), whereas internal analysis of the surface form has been inhibited. This need not necessarily be so: it is possible to provide external analyses and have a <w> token also be analysed internally, if the constant flag is omitted or set to no.

The XML tag <surface> encloses the surface string; the <pos> and <path> tags are analogous to YY mode; multiple <infl> rules in a <typeinfo> are considered in order, from first to last.

XML input mode can be used in two different ways: either by specifying the name of a file containing the XML data (preferably with a correct XML header and DTD, or DTD URL, specification), or by giving the XML data directly.

If the XML data is put directly on the standard input, it must start with a valid XML header <?xml version="1.0" ... ?> with no leading whitespace, because recognition of the header triggers the reading of XML from standard input. The end of the data is marked by an empty line (two consecutive newline characters); therefore, the data itself, including any inline DTD, may not contain empty lines.

PIC (pet-input-chart) DTD

This is the pic.dtd from the [wiki:HeartofgoldTop Heart of Gold].

<!ELEMENT pet-input-chart ( w | ne )* >
  <!-- base input token -->
  <!ELEMENT w ( surface, path*, pos*, typeinfo* ) >
  <!ATTLIST w         id ID      #REQUIRED
                  cstart NMTOKEN #REQUIRED
                    cend NMTOKEN #REQUIRED
                    prio CDATA   #IMPLIED
                constant (yes | no) "no" >
  <!-- constant "yes" means: do not analyse, i.e., if the tag contains
       no typeinfo, no lexical item will be built from the token -->
 
  <!-- The surface string -->
  <!ELEMENT surface ( #PCDATA ) >

  <!-- numbers that encode valid paths through the input graph (optional) -->
  <!ELEMENT path EMPTY >
  <!ATTLIST path     num NMTOKEN #REQUIRED >
 
  <!-- every typeinfo generates a lexical token -->
  <!ELEMENT typeinfo ( stem, infl*, fsmod* ) >
  <!ATTLIST typeinfo   id ID     #REQUIRED
                     prio CDATA  #IMPLIED
                 baseform (yes | no) "yes" >
  <!-- Baseform yes: lexical base form; no: type name -->

  <!-- lexical base form or type name -->
  <!ELEMENT stem ( #PCDATA ) >

  <!-- type name of an inflection rule -->
  <!ELEMENT infl  EMPTY >
  <!ATTLIST infl    name CDATA   #REQUIRED >

  <!-- put type value under path into the lexical feature structure -->
  <!ELEMENT fsmod  EMPTY >
  <!ATTLIST fsmod   path CDATA   #REQUIRED
                   value CDATA   #REQUIRED >

  <!-- part-of-speech tags with priorities -->
  <!ELEMENT pos  EMPTY >
  <!ATTLIST pos      tag CDATA   #REQUIRED
                    prio CDATA   #IMPLIED >

  <!-- structured input items, mostly to encode named entities -->
  <!ELEMENT ne  ( ref+, pos*, typeinfo+ )  >
  <!ATTLIST ne        id ID      #REQUIRED
                    prio CDATA   #IMPLIED >
 
  <!-- reference to a base token -->
  <!ELEMENT ref  EMPTY >
  <!ATTLIST ref      dtr IDREF   #REQUIRED >

Encoding issues

By default, the XML parser used with cheap (libxerces) can handle iso-8859-1 and utf-8. To get other encodings, such as euc-jp, you need to link the XML parser against the icu libraries.

For Debian and derivatives this means:

  sudo apt-get install libxercesicu25 icu

rather than:

  sudo apt-get install libxerces25

SMAF

See SmafTop.
