LkbSppp
The Simple PreProcessor Protocol (SPPP) provides a way of hooking up the LKB to an external tokenizer and morphological analyzer. Optionally, SPPP also allows the preprocessor output to include part-of-speech (PoS) information; as of March 2005, however, back-end support for taking advantage of PoS assignments on input tokens (e.g. in the treatment of unknown words) is lacking. SPPP assumes that the preprocessor runs as a process external to the LKB and communicates with its caller through its standard input and output channels. Communication is in XML, uses a designated control character for synchronization, and is always UTF-8 encoded (in other words, the encoding value in the XML document header is hard-wired for communication in both directions). SPPP allows ambiguous preprocessing output, for example alternative morphological analyses for one token, as in the following example:
<?xml version="1.0" encoding="utf-8"?>
<segment>
  <token form="la" from="0" path="1" to="2">
    <analysis cat="articolo" format="" inflection="fem_sing"
              probability="0.0" stem="det_art" tag="articolo"/>
    <analysis cat="clitico" format="" inflection="fem_sing"
              probability="1.0" stem="la_cli" tag="clitico"/>
  </token>
  <token form="il" from="3" path="1" to="5">
    <analysis cat="articolo" format="" inflection="masc_sing"
              probability="1.0" stem="det_art" tag="articolo"/>
  </token>
  <token form="letto" from="6" path="1" to="11">
    <analysis cat="nome-maschile" format="" inflection="masc_sing"
              probability="1.0" stem="letto_nm" tag="nome-masc-sing"/>
    <analysis cat="verbo" format="" inflection="partic_passato_masc_sing"
              probability="0.0" stem="leggere_v" tag="verbo-partpass"/>
  </token>
  <token form="cigola" from="12" path="1" to="18">
    <analysis cat="verbo" format="" inflection="indic_pres_third_sing"
              probability="1.0" stem="cigolare_v" tag="verbo-fin"/>
  </token>
</segment>
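To make the shape of this output concrete, the following is a minimal Python sketch of how a caller might read one such `<segment>` into plain data structures. It is not part of the LKB: the names `Analysis`, `Token` and `parse_segment` are hypothetical, the attribute names are taken from the sample above, and picking the analysis with the highest `probability` value is only an illustration of how the ambiguity could be inspected, not a description of what the LKB does with it. The sketch also leaves out the synchronization control character and assumes that one complete UTF-8 message has already been read.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Analysis:
    # One <analysis> element (hypothetical container, not an LKB type).
    stem: str
    tag: str
    inflection: str
    probability: float


@dataclass
class Token:
    # One <token> element together with all of its alternative analyses.
    form: str
    start: int  # value of the "from" attribute
    end: int    # value of the "to" attribute
    analyses: List[Analysis]


def parse_segment(data: bytes) -> List[Token]:
    """Parse one UTF-8 encoded <segment> message into a list of tokens."""
    root = ET.fromstring(data)
    tokens = []
    for tok in root.iter("token"):
        analyses = [
            Analysis(
                stem=a.get("stem", ""),
                tag=a.get("tag", ""),
                inflection=a.get("inflection", ""),
                probability=float(a.get("probability", "0")),
            )
            for a in tok.findall("analysis")
        ]
        tokens.append(
            Token(
                form=tok.get("form", ""),
                start=int(tok.get("from", "0")),
                end=int(tok.get("to", "0")),
                analyses=analyses,
            )
        )
    return tokens


# A one-token excerpt of the sample above, as bytes (SPPP is always UTF-8).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<segment>
  <token form="letto" from="6" path="1" to="11">
    <analysis cat="nome-maschile" format="" inflection="masc_sing"
              probability="1.0" stem="letto_nm" tag="nome-masc-sing"/>
    <analysis cat="verbo" format="" inflection="partic_passato_masc_sing"
              probability="0.0" stem="leggere_v" tag="verbo-partpass"/>
  </token>
</segment>"""

if __name__ == "__main__":
    for token in parse_segment(SAMPLE):
        # Illustration only: show the analysis with the highest probability.
        best = max(token.analyses, key=lambda a: a.probability)
        print(token.form, token.start, token.end, best.stem)
```

Both analyses of the token are kept in `Token.analyses`, so a caller that wants to pass the full ambiguity on can do so, while one that needs a single reading can rank them as in the final loop.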