SmafTop
SMAF is the name given to the XML-input format for use with the DELPH-IN deep processors. A SMAF document describes a segment (generally, a sentence) of data packaged for input to a deep processor/parser such as the LKB or PET. SMAF is an amalgamation of ideas found in [http://atoll.inria.fr/perl/maf/mafhelp.html MAF], the HoG system (HeartofgoldTop), and [http://www.cl.cam.ac.uk/~bmw20/Papers/NLPXML06-SAF.pdf SAF], and incorporates RMRS XML.
(Aside: what does SMAF stand for? "Sentence MAF", "Silly MAF", "Something-like MAF", "SoMe Annotation Format" ??? My vote goes to "Segment/Sentence MAF" [bmw])
SMAF follows the principles of standoff annotation. This means:
-
the SMAF standoff document exists separately from the primary data document;
-
standoff pointers (in SMAF, character pointers) link annotations in the standoff document to regions of the primary data.
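As a minimal illustration of standoff pointers, the sketch below resolves a cfrom/cto pair against the primary data. It assumes zero-based character positions with an exclusive end, which is consistent with the token example later on this page (cfrom='0' cto='6' covering "Andrew") but is not stated by the page itself. The sample text and the helper name resolve_span are invented for illustration.

```python
def resolve_span(primary_text: str, cfrom: int, cto: int) -> str:
    """Return the region of the primary data that a standoff span points to.

    Assumption: cfrom/cto are zero-based character positions, cto exclusive.
    """
    if not 0 <= cfrom <= cto <= len(primary_text):
        raise ValueError(f"span {cfrom}-{cto} lies outside the primary data")
    return primary_text[cfrom:cto]


# Hypothetical primary data; the annotation itself only stores the offsets.
primary = "Andrew smiles."
surface = resolve_span(primary, 0, 6)
print(surface)
```

Because the annotation holds only offsets, the primary data is never modified, and several annotations can point at the same region.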
Each SMAF document describes a segment of the primary data for input to a deep parser (such a segment typically corresponds to the notion of a sentence). The following properties are global to a SMAF document:
-
either document (URL link to primary data) or text (embedded primary data, for convenience)... (or both, in which case document takes precedence)
-
OLAC-compatible metadata: document identifier, plus optional creator, created [timestamp],...
-
a global span (cfrom/cto)
-
a single global lattice, consisting of
-
specified init(ial) and final nodes
-
a set of edges, each describing an annotation over the primary data
-
Properties of each edge:
-
an identifier
-
a type (eg. token, pos, named-entity, morphosyntax, ...)
-
a source and a target node in lattice
-
[optional] a span (defined by character pointers cfrom/cto)
-
[optional] deps, a set of edge ids corresponding to edges on which the current edge has a dependency
-
plus the actual content of the annotation, consisting of a combination of the following elements:
-
slot elements: each consists of a name part (eg. surface, weight, tagset, tag, ...) and a value string
-
feature structure (fs) elements: these may be typed, and the format is compatible with the TEI/ISO standard (FSR)
-
rmrs elements: following the RMRS DTD.
-
On receiving a SMAF document as input, a deep parser will map the SMAF object into internal data structures. The format has been designed so that this mapping is reasonably straightforward for specific deep parser implementation + grammar combinations (but also general enough to abstract over the specifics of individual software components and grammars). Although many SMAF properties map fairly directly into the internal data structures of individual processors, a certain amount of configuration is required to make this go smoothly.
The lattice structure of the edges (source, target) and inter-edge dependencies (deps) can be mapped straightforwardly into internal data structures of a chart parser. The cfrom/cto properties of edges may be copied as is.
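The mapping from lattice edges to chart cells could be sketched as follows. This is only an illustration, not the LKB or PET implementation: the lattice wrapper element with init/final attributes, the whitespace-separated format of deps, and the names Edge, read_edges and build_chart are all assumptions made here. The edge elements themselves follow the token and pos examples on this page.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical fragment: the <lattice> wrapper is assumed, the edges are
# modelled on the examples given on this page.
SAMPLE = """
<lattice init='v0' final='v1'>
  <edge type='token' id='t1' cfrom='0' cto='6' source='v0' target='v1'>
    <slot name='surface'>Andrew</slot>
  </edge>
  <edge type='pos' id='p1' deps='t1' source='v0' target='v1'>
    <slot name='weight'>0.5</slot>
    <slot name='tagset'>CLAWS</slot>
    <slot name='tag'>NNP</slot>
  </edge>
</lattice>
"""


@dataclass
class Edge:
    id: str
    type: str
    source: str
    target: str
    cfrom: Optional[int] = None   # span is optional on an edge
    cto: Optional[int] = None
    deps: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)


def read_edges(xml_text: str) -> dict:
    """Parse every <edge> element into an Edge, keyed by edge id."""
    edges = {}
    for el in ET.fromstring(xml_text).iter("edge"):
        cfrom, cto = el.get("cfrom"), el.get("cto")
        edge = Edge(
            id=el.get("id"),
            type=el.get("type"),
            source=el.get("source"),
            target=el.get("target"),
            cfrom=int(cfrom) if cfrom is not None else None,
            cto=int(cto) if cto is not None else None,
            # Assumption: multiple deps are separated by whitespace.
            deps=(el.get("deps") or "").split(),
            slots={s.get("name"): (s.text or "") for s in el.findall("slot")},
        )
        edges[edge.id] = edge
    return edges


def build_chart(edges: dict) -> dict:
    """Group edges by (source, target) and check that every dep resolves."""
    chart = defaultdict(list)
    for edge in edges.values():
        missing = [d for d in edge.deps if d not in edges]
        if missing:
            raise ValueError(f"edge {edge.id} depends on unknown edges {missing}")
        chart[(edge.source, edge.target)].append(edge)
    return chart


edges = read_edges(SAMPLE)
chart = build_chart(edges)
print([e.id for e in chart[("v0", "v1")]])
```

Here cfrom/cto are carried over unchanged, while source/target become the chart cell key. A real processor would use its own node and edge structures.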
However, configuration is necessary to correctly map content (slots, fs's, rmrs's) into internal data structures. The edge type may be used to configure and constrain this mapping (eg. the content expected for a token edge will differ from that for a pos edge, which will in turn differ from that for a named-entity edge, etc.).
<edge type='token' id='t1' cfrom='0' cto='6' source='v0' target='v1'>
<slot name='surface'>Andrew</slot>
</edge>
Suitable content:
-
slot named surface
-
(slot named weight ???) [should in fact all edges allow this???]
Map to:
-
software-component-internal token edge value
<edge type='pos' id='p1' deps='t1' source='v0' target='v1'>
<slot name='weight'>0.5</slot>
<slot name='tagset'>CLAWS</slot>
<slot name='tag'>NNP</slot>
</edge>
Suitable content:
-
slot weight [real number]
-
slot tagset
-
slot tag
Map to:
-
weight to software-component-internal edge value
-
tag to grammar-specific type
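A minimal sketch of the pos mapping just described: the weight slot becomes a numeric internal value, and the (tagset, tag) pair is looked up in a grammar-side table. The table contents, the placeholder type name GRAMMAR_SPECIFIC_TYPE as a stand-in value, and the names POS_TYPE_MAP, MappedPos and map_pos_slots are hypothetical.

```python
from typing import NamedTuple, Optional

# Hypothetical grammar-side table: (tagset, tag) -> grammar-specific type.
POS_TYPE_MAP = {("CLAWS", "NNP"): "GRAMMAR_SPECIFIC_TYPE"}


class MappedPos(NamedTuple):
    weight: float                 # software-component-internal edge value
    grammar_type: Optional[str]   # None when the grammar has no mapping


def map_pos_slots(slots: dict, type_map: dict = POS_TYPE_MAP) -> MappedPos:
    """Map the slots of a pos edge to internal values."""
    weight = float(slots["weight"])
    grammar_type = type_map.get((slots["tagset"], slots["tag"]))
    return MappedPos(weight, grammar_type)


mapped = map_pos_slots({"weight": "0.5", "tagset": "CLAWS", "tag": "NNP"})
print(mapped)
```

Keying the table on the tagset as well as the tag keeps tags from different tagsets apart when they happen to share a name.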
<edge type='namedEntity' id='n1' cfrom='10' cto='20' source='v0' target='v1'>
<slot name='weight'>0.567</slot>
<slot name='surface'>1987 to 1997</slot>
<fs type='timespan'>
<f name='from'>
<fs type='point'>
<f name='year'>
<fs type='1987'/>
</f>
</fs>
</f>
<f name='to'>
<fs type='point'>
<f name='year'>
<fs type='1997'/>
</f>
</fs>
</f>
</fs>
<!-- OR: can we use RMRS in place of above FS? -->
</edge>
Suitable content:
-
slot weight [real number]
-
slot surface
-
single typed FS
Maps to:
-
weight, surface to software-component-internal data value
-
top type of FS to grammar-specific type
-
... + individual path-value pairs of SMAF FS to grammar-specific path-value pairs ???
-
... could FS above equally well be RMRS ???
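One way to obtain the "individual path-value pairs" of a SMAF FS is to walk the nested fs/f elements and record the type found at each path. The sketch below does this for the timespan structure from the namedEntity example. The dotted path notation is borrowed from the fs_path='from.year' example further down this page; using the empty string for the top-level FS and the function name fs_paths are assumptions.

```python
import xml.etree.ElementTree as ET

# The FS from the namedEntity example above.
FS_XML = """
<fs type='timespan'>
  <f name='from'><fs type='point'><f name='year'><fs type='1987'/></f></fs></f>
  <f name='to'><fs type='point'><f name='year'><fs type='1997'/></f></fs></f>
</fs>
"""


def fs_paths(fs, prefix=()):
    """Yield (dotted_path, type) for each typed <fs> node, top-level first.

    Assumption: the top-level FS is reported under the empty path ''.
    """
    yield ".".join(prefix), fs.get("type")
    for f in fs.findall("f"):
        for child in f.findall("fs"):
            yield from fs_paths(child, prefix + (f.get("name"),))


pairs = dict(fs_paths(ET.fromstring(FS_XML)))
for path, fs_type in pairs.items():
    print(repr(path), "->", fs_type)
```

The top type (timespan) and each path-value pair can then be mapped to grammar-specific types and paths independently.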
<edge type='morph' deps='t1' source='v0' target='v1'>
<slot name='weight'>0.5</slot>
<slot name='tagset'>morph</slot>
<slot name='reduced'>SMILE</slot>
<!-- plus... FS along lines of MAF? -->
<!-- or... RMRS encoding morpho info? -->
</edge>
Suitable content:
-
slot weight
-
slot tagset
-
slot reduced (reduced form, generally the lemma)
-
FS describing morphosyntactic features
-
... OR RMRS describing morphosyntactic features (if applicable)
Maps to:
-
weight, reduced to software-component-internal data values
-
tagset specifies how FS should be interpreted???
-
FS maps to grammar-specific type, according to tagset
-
... OR RMRS injected at grammar-specific internal FS path ?
...
Each deep processor implementing SMAF XML input implements a set of "closed" mappings, and a partially-configurable set of "open" mappings. The "closed" set of mappings is applicable to those aspects of SMAF hardwired into the SMAF spec (read DTD). The "open" set of mappings must be specified per edge type. Eg. [format=TYPE:SLOT] *.weight, *.surface, morph.reduced must map to specific internal values, independent of the grammar running.
Specify "open" mappings in a config file such as: smaf_config.lkb, smaf_config.pet, ...
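The TYPE.SLOT patterns mentioned above (*.weight, *.surface, morph.reduced) could be resolved with a simple lookup in which an exact edge type takes precedence over the * wildcard. This is a sketch under assumptions: the page does not define the config file syntax or the precedence rule, and the internal field names and the function slot_target are invented here.

```python
# Hypothetical processor-level table: TYPE.SLOT pattern -> internal field.
SLOT_MAPPINGS = {
    "*.weight": "internal_weight",
    "*.surface": "internal_surface",
    "morph.reduced": "internal_reduced",
}


def slot_target(edge_type: str, slot_name: str, mappings: dict = SLOT_MAPPINGS):
    """Return the internal field for a slot, or None if nothing is configured.

    Assumption: an exact TYPE.SLOT entry wins over the '*.SLOT' wildcard.
    """
    exact = mappings.get(f"{edge_type}.{slot_name}")
    if exact is not None:
        return exact
    return mappings.get(f"*.{slot_name}")


print(slot_target("pos", "weight"))
print(slot_target("morph", "reduced"))
print(slot_target("pos", "reduced"))
```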
Each grammar must specify mappings appropriate to the type system of an individual grammar. Eg.
-
pos.[tag='SPECIFIC_VALUE', tagset='SPECIFIC_TAGSET'] => GRAMMAR_SPECIFIC_TYPE
-
namedEntity.[fs_path=, fs_type='SPECIFIC_TYPE'] => GRAMMAR.SPECIFIC.PATH GRAMMAR_SPECIFIC_TYPE
-
namedEntity.[fs_path='from.year' fs_type=*1] => GRAMMAR.SPECIFIC.PATH2 *1
-
morph.[tagset='SPECIFIC_TAGSET', fs_path='SPECIFIC.PATH', fs_type='SPECIFIC_TYPE'] => grammar.specific.type GRAMMAR_SPECIFIC_TYPE
[expand me...]
Specify these mappings in a config file such as: smaf_config.erg, smaf_config.norsource, smaf_config.jacy, smaf_config.jap, ...
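To make the intent of the namedEntity rules above more concrete, the following sketch applies two such rules to the path-value pairs of a SMAF FS. It does not parse the rule notation shown above, which is still tentative. Rules are written directly as Python objects, *1 is read as "capture the FS type found at this path and reuse it on the right-hand side", and the rule with the empty fs_path is read as applying to the top-level FS. All of these readings, the use of timespan in place of SPECIFIC_TYPE, and the names Rule and apply_rules are assumptions.

```python
from dataclasses import dataclass

WILDCARD = "*1"  # assumed meaning: capture the type found at fs_path


@dataclass(frozen=True)
class Rule:
    edge_type: str
    fs_path: str       # '' is taken here to mean the top-level FS
    fs_type: str       # a specific type, or WILDCARD
    target_path: str   # grammar-specific path
    target_type: str   # a specific type, or WILDCARD to reuse the capture


def apply_rules(edge_type: str, fs_pairs: dict, rules: list) -> list:
    """Return (grammar path, grammar type) pairs for every rule that matches."""
    out = []
    for rule in rules:
        if rule.edge_type != edge_type or rule.fs_path not in fs_pairs:
            continue
        found = fs_pairs[rule.fs_path]
        if rule.fs_type not in (WILDCARD, found):
            continue
        target_type = found if rule.target_type == WILDCARD else rule.target_type
        out.append((rule.target_path, target_type))
    return out


RULES = [
    Rule("namedEntity", "", "timespan",
         "GRAMMAR.SPECIFIC.PATH", "GRAMMAR_SPECIFIC_TYPE"),
    Rule("namedEntity", "from.year", WILDCARD,
         "GRAMMAR.SPECIFIC.PATH2", WILDCARD),
]

# Path-value pairs of the timespan FS from the namedEntity example.
fs_pairs = {"": "timespan", "from": "point", "from.year": "1987",
            "to": "point", "to.year": "1997"}

result = apply_rules("namedEntity", fs_pairs, RULES)
print(result)
```

With these inputs the first rule contributes the grammar-specific type for the whole structure, and the second copies the year value 1987 to GRAMMAR.SPECIFIC.PATH2.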
See SmafDtd.
See SmafSample.