LkbSppp

StephanOepen edited this page Mar 18, 2005 · 17 revisions

Overview

The Simple PreProcessor Protocol (SPPP) provides a way of hooking the LKB up to an external tokenizer and morphological analyzer. Optionally, SPPP also allows part-of-speech (PoS) information to be included in the preprocessor output; as of March 2005, however, back-end support for taking advantage of PoS assignments on input tokens (e.g. in the treatment of unknown words) is lacking. SPPP assumes that the preprocessor runs as a process external to the LKB and communicates with its caller through its standard input and output channels. Communication is in XML, uses a designated control character for synchronization, and is always in UTF-8 encoding; in other words, the encoding value in the XML document header is hard-wired for communication in both directions. SPPP allows ambiguous preprocessing output, for example alternative morphological analyses for one token.
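To make the framing concrete, here is a minimal Python sketch of how a preprocessor might read and write UTF-8 messages separated by a control character. This page does not say which control character SPPP uses, so `SYNC` below is a placeholder assumption (form feed), and the helper names `read_message` and `write_message` are hypothetical, not part of the LKB.

```python
import io

# Assumption: this page does not name the synchronization character;
# form feed is used here purely as a placeholder.
SYNC = b"\x0c"


def read_message(stream, sync=SYNC):
    """Read bytes up to (not including) the next sync byte; decode as UTF-8.

    Returns None if the stream ends before any byte is read.
    """
    chunk = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: return what was collected, or None if nothing.
            return chunk.decode("utf-8") if chunk else None
        if byte == sync:
            return chunk.decode("utf-8")
        chunk.extend(byte)


def write_message(stream, text, sync=SYNC):
    """Encode text as UTF-8, append the sync byte, and flush the stream."""
    stream.write(text.encode("utf-8") + sync)
    stream.flush()
```

In a real preprocessor the streams would presumably be the binary standard channels (`sys.stdin.buffer` and `sys.stdout.buffer`), so that the UTF-8 encoding is controlled explicitly rather than left to the locale.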

SPPP Configuration

XML Examples

<?xml version="1.0" encoding="utf-8"?>
<segment>
  <token form="la" from="0" path="1" to="2">
    <analysis cat="articolo" format="" inflection="fem_sing"
              probability="0.0" stem="det_art" tag="articolo"/>
    <analysis cat="clitico" format="" inflection="fem_sing"
              probability="1.0" stem="la_cli" tag="clitico"/>
  </token>
  <token form="il" from="3" path="1" to="5">
    <analysis cat="articolo" format="" inflection="masc_sing"
              probability="1.0" stem="det_art" tag="articolo"/>
  </token>
  <token form="letto" from="6" path="1" to="11">
    <analysis cat="nome-maschile" format="" inflection="masc_sing"
              probability="1.0" stem="letto_nm" tag="nome-masc-sing"/>
    <analysis cat="verbo" format="" inflection="partic_passato_masc_sing"
              probability="0.0" stem="leggere_v" tag="verbo-partpass"/>
  </token>
  <token form="cigola" from="12" path="1" to="18">
    <analysis cat="verbo" format="" inflection="indic_pres_third_sing"
              probability="1.0" stem="cigolare_v" tag="verbo-fin"/>
  </token>
</segment>
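
As an illustration of how a consumer might handle the ambiguity in output like the above, the following Python sketch parses an abbreviated copy of the sample and keeps, for each token, the analysis with the highest `probability` value. The function name `best_analyses` is hypothetical, and treating `probability` as a ranking score and `from`/`to` as integer positions are assumptions based only on the sample; the LKB itself may treat these attributes differently.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample above (two of the four tokens).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<segment>
  <token form="la" from="0" path="1" to="2">
    <analysis cat="articolo" format="" inflection="fem_sing"
              probability="0.0" stem="det_art" tag="articolo"/>
    <analysis cat="clitico" format="" inflection="fem_sing"
              probability="1.0" stem="la_cli" tag="clitico"/>
  </token>
  <token form="letto" from="6" path="1" to="11">
    <analysis cat="nome-maschile" format="" inflection="masc_sing"
              probability="1.0" stem="letto_nm" tag="nome-masc-sing"/>
    <analysis cat="verbo" format="" inflection="partic_passato_masc_sing"
              probability="0.0" stem="leggere_v" tag="verbo-partpass"/>
  </token>
</segment>"""


def best_analyses(xml_bytes):
    """Return (form, from, to, stem) per token, using the top-scoring analysis.

    The stem is None for a token that carries no <analysis> children.
    """
    root = ET.fromstring(xml_bytes)
    result = []
    for token in root.iter("token"):
        # Assumption: a missing probability attribute counts as 0.
        best = max(
            token.findall("analysis"),
            key=lambda a: float(a.get("probability", "0")),
            default=None,
        )
        stem = best.get("stem") if best is not None else None
        result.append(
            (token.get("form"), int(token.get("from")), int(token.get("to")), stem)
        )
    return result


print(best_analyses(SAMPLE))
```

Passing bytes rather than a decoded string lets the parser honor the `encoding="utf-8"` declaration in the document header.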