PetInput
This page discusses the input formats available to the PET parser cheap. The order of presentation largely reflects the historical order of PET development, but also corresponds to increasing complexity (and, thus, control over system behavior).
Punctuation characters, as specified in the settings file, are ignored by PET (removed from the input chart) for pure textual input.
Here is an example of the punctuation characters found in pet/japanese.set:
punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".
Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp) and that, in recent years, grammars are moving towards inclusion of punctuation marks in the syntactic analysis.
Punctuation characters are not removed from the other input modes (YY mode, PET Input Char, or MAF). Rather, in these modes they should be removed (or treated otherwise, as appropriate) by the preprocessor that created the token lattice (in whatever syntax) provided to PET.
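Such a preprocessing step can be as simple as filtering the character set from the settings file before tokens are built. The following sketch is purely illustrative (the helper is hypothetical, not part of PET, and the character set is just the ASCII subset of the japanese.set example above):

```python
# Hypothetical preprocessor step: strip the grammar's punctuation
# characters before building the token lattice. The character set below
# is the ASCII subset of the punctuation-characters setting shown above.
PUNCTUATION = "\"!&'()*+,-./;<=>?@[\\]^_`{|}~"

def strip_punctuation(text: str) -> str:
    """Remove punctuation characters, as a settings-file-driven
    preprocessor might do before emitting tokens."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)

print(strip_punctuation("Hello, world!"))  # -> Hello world
```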
YY input mode (activated by the -yy option) supports parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode facilitates token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of POS tags on input tokens to enable (better) unknown word handling, and, generally, feeding a word graph (as obtained, for example, from a speech recognizer) into the parser.
Following is a discussion of the YY [http://svn.delph-in.net/erg/trunk/pet/sample.yy input example] provided with the ERG (as of early 2009). In this example, the words are shown on separate lines for clarity. In the actual input given to PET, all YY tokens must appear as a single line (terminated by newline), as each line of input is processed as a separate utterance.
(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
(43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
(44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
(45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
(46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
(47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
(48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
(49, 7, 8, <45:57>, 1, "oe@ifi.uio.no", 0, "null", "NN" 0.7342 "JJ" 0.2096)
(50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)
An input in this form can be processed by PET as follows:
cat sample.yy | cheap -yy -packing -verbose=4 -mrs -chart-mapping -default-les=all english.grm
where -yy (a shorthand for -tok=yy) turns on YY partial chart input mode and we request ambiguity packing (which is always a good idea), some verbosity of tracing, and the output of MRSs. The additional options enable chart mapping (see [http://www.lrec-conf.org/proceedings/lrec2008/summaries/349.html Adolphs, et al. (2008)]) and turn the unknown word machinery into 2008 mode (see the section Unknown Word Handling below). Note that these options, as of early 2009, are only supported in the so-called chart mapping [https://pet.opendfki.de/repos/pet/branches/cm branch] of the PET code base (corresponding pre-compiled binaries are available in the LOGON tree; see the LogonTop page).
Each token in the above example has the following format:
- (id, start, end, [link,] path+, form [surface], ipos, lrule+[, {pos p}+])
i.e. each token has a unique identifier and start and end vertex. Optionally, tokens can be annotated with a surface link, an indication of underlying string positions in the original document; currently (as of January 2009), link information is only supported as character positions, in the format <from:to> (but in principle, link could have other forms, with from and to being arbitrary strings, e.g. stand-off pointers in whatever underlying markup). We will ignore the path component (membership in one or more paths through a word lattice) for our purposes.
The actual token string is provided by the form field, and this is what PET uses for morphological analysis and lexical look-up. In case the form does not correspond to the original string in the document, e.g. because there was some textual normalization prior to creation of YY tokens already, the optional surface field can be used to record the original string. Until early 2009, the ERG had inherited a mechanism called ersatzing where a set of regular expressions were applied prior to parsing, associating for example a form value of EmailErsatz with a surface value of oe@yy.com. In the newer, chart mapping universe, the ERG no longer makes use of this facility and instead makes it a policy to never 'mess' with the actual token string (but use other token properties instead).
YY mode can be used in two variants regarding morphological analysis. Our example above leaves morphological analysis to PET, i.e. using the lexical rules and orthographemic annotation provided by the grammar. This built-in morphology mode is activated by an lrules value of "null", and the ipos field is ignored (but still has to be given, conventionally as 0). Another option is to provide information about morphological segmentation as part of the input tokens, in which case ipos specifies the position to which orthographemic rules apply, and one or more lrule values (as strings) name lexical rules provided by the grammar.
Finally, each token can be annotated with an optional sequence of tag plus probability pairs. The ERG, for example, includes a set of underspecified generic lexical entries which can be activated on the basis of PoS information, obtained for example from running a PoS tagger prior to parsing. We used to include the probabilities in (heuristic) parse ranking, but since sometime in 2002 (when MaxEnt parse selection became available in PET) they are just ignored.
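Because the field layout is fixed, YY lines are easy to generate or decompose mechanically. The following parser is an illustrative sketch only (it is not part of PET and handles just the common shape used in the example above: no surface field, a single path, built-in morphology):

```python
import re

# Illustrative parser for the common YY token shape shown above:
# (id, start, end, <from:to>, path, "form", ipos, "lrule", TAG P ...)
# A sketch, not part of PET; ignores surface fields and multiple paths.
YY_TOKEN = re.compile(
    r'\((?P<id>\d+),\s*(?P<start>\d+),\s*(?P<end>\d+),\s*'
    r'<(?P<cfrom>\d+):(?P<cto>\d+)>,\s*(?P<path>\d+),\s*'
    r'"(?P<form>[^"]*)",\s*(?P<ipos>\d+),\s*"(?P<lrule>[^"]*)"'
    r'(?P<tags>[^)]*)\)'
)

def parse_yy_token(line):
    m = YY_TOKEN.match(line.strip())
    if m is None:
        raise ValueError("not a YY token: " + line)
    d = m.groupdict()
    # optional trailing tag/probability pairs, e.g. "NNP" 0.7677 "NN" 0.2323
    pairs = re.findall(r'"([^"]+)"\s+([\d.]+)', d.pop("tags"))
    d["pos"] = [(tag, float(p)) for tag, p in pairs]
    return d

tok = parse_yy_token(
    '(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)'
)
print(tok["form"], tok["pos"])
```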
YY input mode supports a genuine token lattice, i.e. it is legitimate to have multiple tokens for an input position, or tokens spanning multiple positions.
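For instance, a tokenizer that is unsure whether to treat a hyphenated form as one token or two could emit all three alternatives over shared vertices (a constructed example, not from the ERG distribution; shown on separate lines for clarity):

(1, 0, 2, <0:10>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
(2, 0, 1, <0:2>, 1, "non", 0, "null", "JJ" 1.0000)
(3, 1, 2, <4:10>, 1, "trivial", 0, "null", "JJ" 1.0000)

Here the single token spans vertices 0 to 2, while the two-token alternative passes through the intermediate vertex 1, so both paths through the lattice are available to the parser.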
If you look at pet/english.set in the ERG distribution, you will find some settings that determine the treatment of unknown words:
posmapping :=
UpperAndLowerCase $genericname
UpperAndLowerCaseInitial $genericname
JJ $generic_adj
JJR $generic_adj_compar
JJS $generic_adj_superl
NN $generic_mass_count_noun
NNS $generic_pl_noun
NNPS $generic_pl_noun
NNP $genericname
FW $generic_mass_noun
RB $generic_adverb
VB $generic_trans_verb_bse
VBD $generic_trans_verb_past
VBG $generic_trans_verb_prp
VBN $generic_trans_verb_psp
VBP $generic_trans_verb_presn3sg
VBZ $generic_trans_verb_pres3sg
.
which determines what happens for unknown words, i.e. tokens whose form is not found in the native lexicon. The top part of the mapping (which is commented out in the current release version) is for PTB tags, the lower part for CLAWS tags.
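The effect of posmapping can be pictured as a simple lookup: if a token's form is absent from the native lexicon, its POS tag selects a generic lexical entry. The following sketch is illustrative only (PET's actual implementation also honours tag probabilities and the -default-les setting):

```python
# Illustrative sketch of the posmapping idea: tokens whose form is not
# in the native lexicon receive a generic lexical entry keyed off their
# POS tag. Not PET's actual implementation; a small excerpt of the
# mapping shown above.
POSMAPPING = {
    "JJ": "$generic_adj",
    "NN": "$generic_mass_count_noun",
    "NNS": "$generic_pl_noun",
    "NNP": "$genericname",
}

def lexical_entries(form, tag, lexicon):
    """Return native entries if the form is known, else the generic
    entry (if any) selected by the POS tag."""
    if form in lexicon:
        return lexicon[form]
    generic = POSMAPPING.get(tag)
    return [generic] if generic else []

print(lexical_entries("FAQ", "NNP", {"dog": ["n_-_c_le"]}))  # -> ['$genericname']
```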
I suspect both the mapping and the constraints on generic entries will need some fine-tuning. Consider our initial example: FAQ is not in the ERG lexicon. RASP (wrongly, I think) tags it as a proper noun, so we use the $genericname lexical entry. When Dan created these generic entries, we did not have a tagger (i.e. we always threw in all of them), hence he made the entries fairly constrained with respect to their combinatorics: in this case, $genericname does not allow combination with a specifier, so the example above still fails to parse. Changing its tag to NN or NN1 yields nine readings, the first of which looks plausible.
XML input mode is very similar to YY input mode. It allows you either to specify only simple tokens that are analysed internally by cheap, or to put all kinds of preprocessing information cheap can handle directly into the input, namely POS tags, morphology, lexicon lookup, and multi-component entries.
It extends YY mode in that it allows structured input tokens, providing a means to encode, say, named entities built from base tokens. It also allows specifying modifications to the feature structures coming from lexicon entries.
It is called with -tok=pic_counts and can be used in combination with -default-les to trigger unknown word entries from POS tags, much as in YY mode.
A typical way of calling it, with XML input and the best-ranked RMRS output in XML, would be:
cat input.xml | cheap -tok=pic_counts -default-les -packing -mrs=rmrx -results=1 grammar.grm
A simple example input is given below:
<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
SYSTEM "/usr/local/lib/xml/pic.dtd">
<pet-input-chart>
<!-- This FAQ is short -->
<w id="W1" cstart="1" cend="5">
<surface>This</surface>
<pos tag="DD1" prio = "1.0" />
</w>
<w id="W2" cstart="7" cend="9">
<surface>FAQ</surface>
<pos tag="NP1" prio = "1.0" />
</w>
<w id="W2a" cstart="7" cend="9" constant="yes">
<surface>FAQ</surface>
<typeinfo id="n_-_pn_le" baseform="no" prio="1.0">
<stem>$genericname</stem>
<fsmod path="SYNSEM.LKEYS.KEYREL.CARG" value="F.A.Q."/>
</typeinfo>
</w>
<w id="W3" cstart="11" cend="12">
<surface>is</surface>
<pos tag="BE" prio = "1.0" />
</w>
<w id="W4" cstart="14" cend="18">
<surface>short</surface>
<pos tag="JJ" prio = "1.0" />
</w>
</pet-input-chart>
[note: the two empty lines at the end of the input file appear necessary when piping data into PET using the above command]
The input is broken up into tokens <w>...</w>, which must have unique ids. Each token gives its start (cstart) and end (cend) (inclusive) character position. It can also include a pos element, with the tag and confidence (priority).
It also allows more detailed specifications (named entities, modified feature structures, ...).
You can only enter a single pet-input-chart in a stream, and it must start with the xml declaration and finish with at least two empty lines. Alternatively, you can give the name of a file consisting of a single pet-input-chart, or a list of such filenames, one on each line.
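A pet-input-chart can be assembled with any XML library. The sketch below is illustrative only (element and attribute names follow the pic.dtd shown further down this page); it builds a minimal one-token chart and appends the trailing blank lines needed when piping into cheap:

```python
import xml.etree.ElementTree as ET

# Sketch of building a minimal pet-input-chart; element and attribute
# names follow the pic.dtd shown on this page.
chart = ET.Element("pet-input-chart")
w = ET.SubElement(chart, "w", id="W1", cstart="1", cend="4")
ET.SubElement(w, "surface").text = "This"
ET.SubElement(w, "pos", tag="DD1", prio="1.0")

body = ET.tostring(chart, encoding="unicode")
# The XML header must come first with no leading whitespace, and the
# stream must end in blank lines so cheap detects the end of the data.
document = '<?xml version="1.0" encoding="utf-8" standalone="no" ?>\n' + body + "\n\n\n"
print(document)
```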
The example given below illustrates most of the available features. Tokens W0 and W1 are not analysed at all by cheap because the (boolean) constant attribute is yes.
The default value of this attribute is no, which means that the token W3 will be analysed by all of the activated preprocessing modules in cheap.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE pet-input-chart
SYSTEM "/path/to/src/pet/doc/pic.dtd">
<pet-input-chart>
<w id="W0" cstart="1" cend="3" constant="yes">
<surface>Kim</surface>
</w>
<w id="W1" cstart="5" cend="9" constant="yes">
<surface>Novak</surface>
</w>
<ne id="NE0" prio="1.0">
<ref dtr="W0"/>
<ref dtr="W1"/>
<pos tag="PN" prio="1.0"/>
<typeinfo id="TNE0" baseform="no">
<stem>$generic_name</stem>
<fsmod path="SYNSEM.LOCAL.HEAD.FORM" value="Kim Novak"/>
</typeinfo>
</ne>
<w id="W2" cstart="11" cend="16" constant="yes">
<surface>sleeps</surface>
<pos tag="VVFIN" prio="7.80000e-1"/>
<pos tag="NN" prio="2.30000e-2"/>
<typeinfo id="W1A1">
<stem>sleep</stem>
<infl name="$third_sg_fin_verb_infl_rule"/>
</typeinfo>
<typeinfo id="W1A2">
<stem>sleep</stem>
<infl name="$plur_noun_infl_rule"/>
</typeinfo>
</w>
<w id="W3" cstart="18" cend="22">
<surface>badly</surface>
<pos tag="ADV" prio="1.00000e+1"/>
</w>
</pet-input-chart>
Token NE0 is an example of a complex token referencing a sequence of two base tokens. Its typeinfo directly gives the HPSG type name whose feature structure should be used as lexical item in cheap. While in YY mode this was triggered by a leading special character, in XML the attribute baseform decides if the string enclosed by the <stem> tag is to be interpreted as lexical base form or as type name. The default value of baseform is yes. In this token, the surface string is unified into the feature structure under path SYNSEM.LOCAL.HEAD.FORM, which is specified with the <fsmod> tag. The value of an <fsmod> may be an arbitrary string. cheap will add a dynamic symbol if the string is not a known type or symbol name.
Every <typeinfo> tag potentially generates a lexical item (if it leads to a valid lexical feature structure). Thus, there will be two readings for the token W2 (sleeps), while internal analysis of the surface form has been inhibited. This need not necessarily be so: it is possible to provide external analyses and have a <w> token also analysed internally if the constant flag is omitted or set to no.
The XML tag <surface> encloses the surface string; the <pos> and <path> tags are analogous to YY mode. Multiple <infl> rules in a <typeinfo> are considered from first to last.
XML input mode can be used in two different ways, either by specifying a file name containing the XML data (preferably with correct XML header and DTD or DTD URL specification) or by giving the XML data directly.
If the XML data is put directly into the standard input, it must start with a valid XML header <?xml version="1.0" ... ?> with no leading whitespace, because recognition of the header triggers the reading of XML from standard input. The end of the data is marked by an empty line (two consecutive newline characters); therefore the data itself, including the DTD if one is given, may not contain empty lines.
This is the pic.dtd from the [wiki:HeartofgoldTop Heart of Gold].
<!ELEMENT pet-input-chart ( w | ne )* >
<!-- base input token -->
<!ELEMENT w ( surface, path*, pos*, typeinfo* ) >
<!ATTLIST w id ID #REQUIRED
cstart NMTOKEN #REQUIRED
cend NMTOKEN #REQUIRED
prio CDATA #IMPLIED
constant (yes | no) "no" >
<!-- constant "yes" means: do not analyse, i.e., if the tag contains
no typeinfo, no lexical item will be built from the token -->
<!-- The surface string -->
<!ELEMENT surface ( #PCDATA ) >
<!-- numbers that encode valid paths through the input graph (optional) -->
<!ELEMENT path EMPTY >
<!ATTLIST path num NMTOKEN #REQUIRED >
<!-- every typeinfo generates a lexical token -->
<!ELEMENT typeinfo ( stem, infl*, fsmod* ) >
<!ATTLIST typeinfo id ID #REQUIRED
prio CDATA #IMPLIED
baseform (yes | no) "yes" >
<!-- Baseform yes: lexical base form; no: type name -->
<!-- lexical base form or type name -->
<!ELEMENT stem ( #PCDATA ) >
<!-- type name of an inflection rule-->
<!ELEMENT infl EMPTY >
<!ATTLIST infl name CDATA #REQUIRED >
<!-- put type value under path into the lexical feature structure -->
<!ELEMENT fsmod EMPTY >
<!ATTLIST fsmod path CDATA #REQUIRED
value CDATA #REQUIRED >
<!-- part-of-speech tags with priorities -->
<!ELEMENT pos EMPTY >
<!ATTLIST pos tag CDATA #REQUIRED
prio CDATA #IMPLIED >
<!-- structured input items, mostly to encode named entities -->
<!ELEMENT ne ( ref+, pos*, typeinfo+ ) >
<!ATTLIST ne id ID #REQUIRED
prio CDATA #IMPLIED >
<!-- reference to a base token -->
<!ELEMENT ref EMPTY >
<!ATTLIST ref dtr IDREF #REQUIRED >
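The DTD's main structural constraints can also be checked programmatically. The helper below is hypothetical (not part of PET) and only verifies required attributes and ID uniqueness with the Python standard library; full validation should use a DTD-aware parser such as Xerces:

```python
import xml.etree.ElementTree as ET

# Required attributes per element, transcribed from the pic.dtd above.
REQUIRED = {"w": ("id", "cstart", "cend"), "typeinfo": ("id",),
            "infl": ("name",), "fsmod": ("path", "value"),
            "pos": ("tag",), "ne": ("id",), "ref": ("dtr",)}

def check_pic(xml_text):
    """Report violations of the pic.dtd attribute requirements and of
    ID uniqueness. A sketch only; not a substitute for DTD validation."""
    errors, seen_ids = [], set()
    root = ET.fromstring(xml_text)
    if root.tag != "pet-input-chart":
        errors.append("root must be pet-input-chart")
    for el in root.iter():
        for attr in REQUIRED.get(el.tag, ()):
            if attr not in el.attrib:
                errors.append(f"<{el.tag}> missing required attribute {attr!r}")
        if "id" in el.attrib:
            if el.attrib["id"] in seen_ids:
                errors.append("duplicate id " + el.attrib["id"])
            seen_ids.add(el.attrib["id"])
    return errors

print(check_pic('<pet-input-chart><w id="W1" cstart="1" cend="3">'
                '<surface>Kim</surface></w></pet-input-chart>'))  # -> []
```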
By default, the XML parser used with cheap (libxerces) can handle iso-8859-1 and utf-8. To get other encodings, such as euc-jp, you need to link the XML parser against the ICU libraries.
For Debian and derivatives, this means:
sudo apt-get install libxercesicu25 icu
rather than:
sudo apt-get install libxerces25
See SmafTop.