LkbSppp

StephanOepen edited this page Mar 25, 2005 · 17 revisions

Overview

The Simple PreProcessor Protocol (SPPP) provides a way of hooking up to an external tokenizer and morphological analyzer; optionally, SPPP also allows inclusion of part of speech (PoS) information in the preprocessor output, but as of March 2005 back-end support to take advantage of PoS assignments on input tokens (e.g. in the treatment of unknown words) is lacking. SPPP assumes that the preprocessor runs as an external process to the LKB that communicates with its caller through its standard input and output channels. Communication is in XML, using a designated control character for synchronization, and always in UTF-8 encoding (in other words, the encoding value in the XML document header is hard-wired for communication in both directions). SPPP allows ambiguous preprocessing output, for example alternative morphological analyses for one token. All references to LKB-internal parameters and functions in the following are within the lkb package, i.e. users need to make sure they are setting things up in the right package (aka namespace).

XML Examples

Below is an example of the SPPP representation that might be returned from the preprocessor for the input kim sleeps (assuming the inflectional rules as found in the ERG, http://www.delph-in.net/erg/). The top-level XML document is a segment object that contains a sequence of token entities. Each token encodes the string that is its surface form and the character start and end positions (from and to, respectively) corresponding to it. Preprocessing results are keyed off individual tokens in the form of one or more analysis entities embedded in token.

  <?xml version="1.0" encoding="utf-8"?>
  <segment>
    <token form="kim" from="0" to="3">
      <analysis inflection="zero" stem="kim" probability="1.0" tag="NNP"/>
    </token>
    <token form="sleeps" from="4" to="10">
      <analysis inflection="plur_noun" probability="0.1" stem="sleep" tag="NN"/>
      <analysis inflection="third_sg_fin_verb" probability="0.9" stem="sleep" tag="VBZ"/>
    </token>
  </segment>
  ^L

Each analysis needs to supply at least two bits of information, viz. the stem resulting from morphological analysis and the identifier of the inflection to be applied. The latter is encoded as the identifier of an orthographemic lexical rule in the LKB, where the rule name is obtained by concatenating the value of inflection in the preprocessor output with the value of the LKB parameter *lex-rule-suffix*. Thus, in order to make LKB grammars work well with SPPP, make sure to match the setting of *lex-rule-suffix* (typically in globals.lsp) to the naming conventions used for orthographemic rules. As of March 2005, the value of inflection is constrained to a single identifier, so there is no way of triggering chains of orthographemic rules from SPPP output; the special value zero can be used to indicate that no orthographemic rules shall be applied. The other, optional attributes of the analysis entity are probability and tag, which encode PoS tagger output, but the current LKB back-end processing leaves them unused.
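To make the rule-name lookup concrete, the following Python sketch reads a segment document and derives, for each analysis, the rule name that the concatenation described above would yield. It is an illustration only, not LKB code: the function name rule_names and the suffix value "_infl_rule" are assumptions made for this example, so substitute the actual value of *lex-rule-suffix* from your grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical suffix for illustration; use your grammar's *lex-rule-suffix*.
LEX_RULE_SUFFIX = "_infl_rule"


def rule_names(segment_xml, suffix=LEX_RULE_SUFFIX):
    """Return [(form, [(stem, rule-or-None), ...]), ...] for one segment.

    segment_xml is the UTF-8 encoded XML as bytes, without the formfeed.
    The special inflection value "zero" maps to None (no rule applied).
    """
    root = ET.fromstring(segment_xml)
    result = []
    for token in root.iter("token"):
        analyses = []
        for analysis in token.iter("analysis"):
            inflection = analysis.get("inflection")
            rule = None if inflection == "zero" else inflection + suffix
            analyses.append((analysis.get("stem"), rule))
        result.append((token.get("form"), analyses))
    return result
```

With the example document above and this assumed suffix, the two analyses of sleeps would map to plur_noun_infl_rule and third_sg_fin_verb_infl_rule, while kim would get no rule at all.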

SPPP Configuration

To enable SPPP-based preprocessing for the LKB, you will need to provide a binary of the preprocessor that the LKB can invoke as an external process. For each sentence being analyzed in the LKB, the system will send an XML document to the standard input of the preprocessor, terminate the input with a formfeed character (^L, i.e. ASCII code 0x0c), and then wait for the preprocessor output to be printed on the standard output of the external binary. Again, to allow synchronization between the two processes, the LKB will look for another formfeed character to terminate the XML document returned by the preprocessor. Following is an example input:

  <?xml version="1.0" encoding="utf-8"?>
  <text>kim sleeps</text>
  ^L

Note that the formfeed characters sent to the preprocessor input and required following its output are, strictly speaking, not part of the XML itself. They rather serve to simplify the protocol, in that each side of the SPPP protocol can rely on a formfeed to mark the end of a turn. Also, remember that SPPP communication is hard-wired to UTF-8, i.e. in generating XML output for a preprocessor, the external binary will need to make sure to use the UTF-8 encoding on its standard output stream.
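The turn-taking described above can be sketched as a small preprocessor loop in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names read_turn, analyze and serve are invented here, and the "analysis" is a placeholder that splits on whitespace and assigns every token a single zero analysis with its lower-cased form as stem.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

FORMFEED = b"\x0c"  # marks the end of a turn; not part of the XML itself


def read_turn(stream):
    """Read bytes up to (excluding) the next formfeed; None at end of input."""
    chunk = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            return None  # the caller closed the pipe
        if byte == FORMFEED:
            return bytes(chunk)
        chunk += byte


def analyze(text):
    """Placeholder analysis: whitespace tokens, one 'zero' analysis each."""
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<segment>"]
    for m in re.finditer(r"\S+", text):
        form = m.group()
        lines.append(f'  <token form={quoteattr(form)} '
                     f'from="{m.start()}" to="{m.end()}">')
        lines.append(f'    <analysis inflection="zero" '
                     f'stem={quoteattr(form.lower())}/>')
        lines.append("  </token>")
    lines.append("</segment>")
    return "\n".join(lines)


def serve(stdin, stdout):
    """Answer each formfeed-terminated <text> document with a <segment>."""
    while True:
        turn = read_turn(stdin)
        if turn is None:
            break
        if not turn.strip():
            continue  # ignore stray whitespace between turns
        text = ET.fromstring(turn.strip()).text or ""
        stdout.write(analyze(text).encode("utf-8") + b"\n" + FORMFEED)
        stdout.flush()  # without flushing, the caller may wait forever

# A real script would end with:
#   import sys; serve(sys.stdin.buffer, sys.stdout.buffer)
```

Working on the binary streams (sys.stdin.buffer and sys.stdout.buffer) keeps the encoding explicit, and flushing after every turn matters because output buffered inside the preprocessor never reaches the waiting process.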

Assuming a suitable preprocessor binary is available, its location in the file system needs to be declared to the LKB. Assuming the binary was called sppp and added to the platform-specific LKB binary directory, then the following would point the LKB to it:

  (setf *sppp-application* (namestring (merge-pathnames bin-dir "sppp")))

Alternatively, users could make sure to have the binary directory in their shell search path (and then just specify the name of the executable file itself, without directory information), or just use a fully-qualified, valid path name, e.g.

  (setf *sppp-application* "/usr/local/bin/sppp")

Once the external preprocessing binary is known to the LKB, the function call

  (initialize-sppp)

will activate external preprocessing. Try parsing a sentence and observe the results. Typically, the setting of *sppp-application* could be done as part of the globals.lsp in an LKB grammar, and calling initialize-sppp() might be done in the grammar load script. To stop using SPPP (and go back to the default LKB-internal processing), use the function shutdown-sppp().

SPPP Debugging

Given the asynchronous communication, it can be tedious to debug the initial creation of an SPPP interface. Once the communication works, it will likely be very stable, but getting both sides of the protocol to cooperate can be a bit tricky at first. Following are a few suggestions for debugging the initial SPPP set-up, in the unlikely event that it did not just work out of the box.

There is a sample shell script to implement a fake external preprocessor for the ERG. The script is called sppp and resides in the lkb/src/glue/ sub-directory of the DELPH-IN tree (assuming your installation includes source code for the LKB; see the LkbInstallation pages). Maybe start out by loading the ERG and making sure you understand the SPPP basics by activating the fake SPPP script; note that, no matter what input you parse, this script will always return the example XML shown above as the preprocessor output.

Once you know that the LKB is actually calling your own external binary, make sure the formfeed-based synchronization works. It may be worth augmenting your own binary with some code to log its activity to a separate file (note that logging must not occur on standard output, as that will be read by the LKB as preprocessor results). On the LKB side, try calling something like

  (sppp "kim sleeps")

Should the sppp() call not return, then most likely the preprocessor is not including a trailing formfeed to terminate its output. In case the function does return but reports errors, e.g. problems parsing the XML, then inspect the value of the global variable *sppp-input-buffer* to confirm the output from the external process actually adheres to the SPPP protocol. Should everything else fail, consult the source code in sppp.lsp, which is in the same directory as the fake sppp script.
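Before involving the LKB at all, it can help to capture one turn of preprocessor output in a file and check it independently. The following Python sketch does that; it is not part of the LKB, the function name check_sppp_output is invented for this example, and the checks only cover what this page states about the protocol (formfeed terminator, UTF-8, a segment of tokens with form, from and to, and at least one analysis carrying stem and inflection).

```python
import xml.etree.ElementTree as ET


def check_sppp_output(raw):
    """Return a list of problems found in one turn of output (bytes)."""
    problems = []
    if b"\x0c" not in raw:
        problems.append("no formfeed terminator found")
    # Everything before the first formfeed is the XML document.
    body = raw.split(b"\x0c", 1)[0].strip()
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as error:
        return problems + [f"not valid UTF-8: {error}"]
    try:
        root = ET.fromstring(body)
    except ET.ParseError as error:
        return problems + [f"XML not well-formed: {error}"]
    if root.tag != "segment":
        problems.append(f"root element is <{root.tag}>, expected <segment>")
    for token in root.iter("token"):
        form = token.get("form")
        for name in ("form", "from", "to"):
            if token.get(name) is None:
                problems.append(f"<token> {form!r} lacks '{name}'")
        analyses = token.findall("analysis")
        if not analyses:
            problems.append(f"<token> {form!r} has no <analysis>")
        for analysis in analyses:
            for name in ("stem", "inflection"):
                if analysis.get(name) is None:
                    problems.append(
                        f"<analysis> of token {form!r} lacks '{name}'")
    return problems
```

An empty result means the captured output passes these checks; it does not guarantee that the inflection values name rules that actually exist in the grammar.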
