SmafTop

BenjaminWaldron edited this page Mar 2, 2006 · 20 revisions

Overview

SMAF is the name given to the XML-input format for use with the DELPH-IN deep processors. A SMAF document describes a segment (generally, a sentence) of data packaged for input to a deep processor/parser such as the LKB or PET. (Question: what does SMAF stand for? "Sentence MAF", "Silly MAF", "Something-like MAF", "SoMe Annotation Format" ??? My vote goes to "Sentence MAF" [bmw].)

SMAF follows the principles of standoff annotation. This means:

  • the SMAF standoff document exists separately from the primary data document;

  • standoff pointers (in SMAF, character pointers) link annotations in the standoff document to regions of the primary data.
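To make the character-pointer idea concrete, here is a minimal Python sketch. It assumes that cfrom/cto are zero-based character offsets with an exclusive end; this is consistent with the sample token edges on this page (cfrom='0' cto='6' for "Andrew") but is not stated explicitly anywhere in the spec. The primary text and the helper name `span_text` are illustrative only.

```python
# Primary data lives on its own; annotations only point into it.
PRIMARY_TEXT = "Andrew smiles"

# Standoff annotations as (edge id, cfrom, cto). Offsets are ASSUMED to be
# zero-based with an exclusive end, as the sample token edges suggest.
ANNOTATIONS = [("t1", 0, 6), ("t2", 7, 13)]


def span_text(text, cfrom, cto):
    """Return the region of the primary data that a span points at."""
    if not 0 <= cfrom <= cto <= len(text):
        raise ValueError(f"span {cfrom}-{cto} lies outside the primary data")
    return text[cfrom:cto]


# Resolve every annotation against the primary data.
resolved = {eid: span_text(PRIMARY_TEXT, cfrom, cto)
            for eid, cfrom, cto in ANNOTATIONS}
```

Because the annotations carry only offsets, the primary data can stay untouched while any number of standoff documents refer to it.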

Each SMAF document describes a segment of the primary data for input to a deep parser (such a segment typically corresponds to the notion of a sentence). The following properties are global to a SMAF document:

  • either document (URL link to primary data) or text (embedded primary data, for convenience)... (or both, in which case document takes precedence)

  • OLAC-compatible metadata: document identifier, plus optional creator, created [timestamp],...

  • a global span (cfrom/cto)

  • a single global lattice, consisting of

    • specified init(ial) and final nodes

    • a set of edges, each describing an annotation over the primary data

Properties of each edge:

  • an identifier

  • a type (eg. token, pos, named-entity, morphosyntax, ...)

  • a source and a target node in lattice

  • [optional] a span (defined by character pointers cfrom/cto)

  • [optional] deps, a set of edge ids corresponding to edges on which the current edge has a dependency

  • plus the actual content of the annotation, consisting of a combination of the following elements:

    • slot elements: each consists of a name part (eg. surface, weight, tagset, tag, ...) and a value string

    • feature structure (fs) elements: these may be typed, and the format is compatible with the TEI/ISO standard (FSR)

    • rmrs elements: following the RMRS DTD.
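As a rough illustration of the edge properties listed above, the following Python sketch reads one edge element into a plain record using the standard library's ElementTree. The class name `Edge` and the function `parse_edge` are hypothetical, and fs and rmrs content is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edge:
    """Hypothetical in-memory record for one SMAF edge (slots only)."""
    id: str
    type: str
    source: str
    target: str
    cfrom: Optional[int] = None   # optional span start
    cto: Optional[int] = None     # optional span end
    deps: List[str] = field(default_factory=list)
    slots: Dict[str, str] = field(default_factory=dict)


def parse_edge(elem):
    """Build an Edge from an <edge> element; fs/rmrs children are ignored."""
    def opt_int(name):
        value = elem.get(name)
        return int(value) if value is not None else None

    return Edge(
        id=elem.get("id"),
        type=elem.get("type"),
        source=elem.get("source"),
        target=elem.get("target"),
        cfrom=opt_int("cfrom"),
        cto=opt_int("cto"),
        deps=elem.get("deps", "").split(),  # deps is a space-separated id list
        slots={s.get("name"): (s.text or "") for s in elem.findall("slot")},
    )


POS_EDGE = """
<edge type='pos' id='p1' deps='t1' source='v0' target='v1'>
  <slot name='weight'>0.5</slot>
  <slot name='tagset'>CLAWS</slot>
  <slot name='tag'>NNP</slot>
</edge>
"""

edge = parse_edge(ET.fromstring(POS_EDGE.strip()))
```

Keeping slots in a dictionary assumes that slot names are unique within an edge, which the DTD does not enforce.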

On receiving a SMAF document as input, a deep parser will map the SMAF object into internal data structures. The format has been designed so that this mapping is reasonably straightforward for specific deep parser implementation + grammar combinations (but also general enough to abstract over the specifics of individual software components and grammars). Although many SMAF properties map fairly directly into the internal data structures of individual processors, a certain amount of configuration is required to make this go smoothly.

The lattice structure of the edges (source, target) and inter-edge dependencies (deps) can be mapped straightforwardly into internal data structures of a chart parser. The cfrom/cto properties of edges may be copied as is.
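One way to picture this mapping is the Python sketch below. It is not how the LKB or PET actually do it: the tuple layout, the node names and all function names are assumptions made for illustration, and the lattice is assumed to be acyclic.

```python
from collections import defaultdict

# Edges as (id, type, source, target, deps); node names are illustrative.
EDGES = [
    ("t1", "token", "v0", "v1", []),
    ("t2", "token", "v1", "v2", []),
    ("p1", "pos", "v0", "v1", ["t1"]),
]


def build_chart(edges):
    """Index edges by their source node, roughly as a chart would."""
    chart = defaultdict(list)
    for edge in edges:
        chart[edge[2]].append(edge)
    return chart


def unresolved_deps(edges):
    """List (edge id, dep id) pairs whose dependency is not in the lattice."""
    known = {edge[0] for edge in edges}
    return [(edge[0], dep) for edge in edges for dep in edge[4]
            if dep not in known]


def token_paths(chart, node, final):
    """Enumerate token-edge id sequences from node to final (acyclic only)."""
    if node == final:
        return [[]]
    paths = []
    for eid, etype, _source, target, _deps in chart.get(node, []):
        if etype == "token":
            paths.extend([eid] + rest
                         for rest in token_paths(chart, target, final))
    return paths


paths = token_paths(build_chart(EDGES), "v0", "v2")
```

Here the pos edge shares its source and target nodes with the token edge it depends on, so it occupies the same chart cell without contributing a separate tokenisation path.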

However, configuration is necessary to correctly map content (slots, fs's, rmrs's) into internal data structures. The edge type may be used to configure and constrain this mapping (eg. the content expected for a token edge will differ from that for a pos edge, which will in turn differ from that for a named-entity edge, etc.).
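A minimal Python sketch of such a type-driven constraint follows. The table is read off the "Suitable content" lists for the sample edges on this page; treating those lists as the complete set of expected slots is an assumption (the page leaves open, for instance, whether every edge should allow a weight slot), and the function name `check_slots` is hypothetical.

```python
# Expected slot names per edge type, taken from the sample edges on this
# page. Whether each slot is required or merely allowed is NOT specified;
# this sketch simply reports deviations in both directions.
EXPECTED_SLOTS = {
    "token": {"surface"},
    "pos": {"weight", "tagset", "tag"},
    "namedEntity": {"weight", "surface"},
    "morph": {"weight", "tagset", "reduced"},
}


def check_slots(edge_type, slot_names):
    """Return (missing, unexpected) slot names for an edge of this type."""
    expected = EXPECTED_SLOTS.get(edge_type)
    if expected is None:
        raise KeyError(f"no content configuration for edge type {edge_type!r}")
    names = set(slot_names)
    return sorted(expected - names), sorted(names - expected)
```

For example, `check_slots("token", ["surface", "tag"])` reports no missing slots and flags `tag` as unexpected for a token edge.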

SAMPLE token EDGE

   <edge type='token' id='t1' cfrom='0' cto='6' source='v0' target='v1'>
    <slot name='surface'>Andrew</slot>
   </edge>

Suitable content:

  • slot named surface

  • (slot named weight ???) [should in fact all edges allow this???]

Map to:

  • software-component-internal token edge value

SAMPLE pos EDGE

   <edge type='pos' id='p1' deps='t1' source='v0' target='v1'>
     <slot name='weight'>0.5</slot>
     <slot name='tagset'>CLAWS</slot>
     <slot name='tag'>NNP</slot>
   </edge>

Suitable content:

  • slot weight [real number]

  • slot tagset

  • slot tag

Map to:

  • weight to software-component-internal edge value

  • tag to grammar-specific type

SAMPLE named entity EDGE

   <edge type='namedEntity' id='n1' cfrom='10' cto='20' source='v0' target='v1'>
    <slot name='weight'>0.567</slot>
    <slot name='surface'>1987 to 1997</slot>
    <fs type='timespan'>
       <f name='from'>
          <fs type='point'>
            <f name='year'>
              <fs type='1987'/>
            </f>
          </fs>
       </f>
       <f name='to'>
          <fs type='point'>
            <f name='year'>
              <fs type='1997'/>
            </f>
          </fs>
       </f>
     </fs>
     <!-- OR: can we use RMRS in place of above FS? -->
   </edge>

Suitable content:

  • slot weight [real number]

  • slot surface

  • single typed FS

Maps to:

  • weight, surface to software-component-internal data value

  • top type of FS to grammar-specific type

  • ... + individual path-value pairs of SMAF FS to grammar-specific path-value pairs ???

  • ... could FS above equally well be RMRS ???
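The idea of mapping individual path-value pairs can be sketched in Python by flattening the FS into dotted paths such as from.year. This is only one possible reading of the open question above: the function name `flatten_fs` is hypothetical, the empty path is used here to stand for the top of the FS, and any plain text inside an f element is ignored.

```python
import xml.etree.ElementTree as ET

NAMED_ENTITY_FS = """
<fs type='timespan'>
  <f name='from'><fs type='point'><f name='year'><fs type='1987'/></f></fs></f>
  <f name='to'><fs type='point'><f name='year'><fs type='1997'/></f></fs></f>
</fs>
"""


def flatten_fs(fs, prefix=""):
    """Yield (dotted path, type) pairs; the empty path is the top of the FS."""
    yield prefix, fs.get("type")
    for f in fs.findall("f"):
        path = f"{prefix}.{f.get('name')}" if prefix else f.get("name")
        # Only nested <fs> values are followed; PCDATA values are skipped.
        for child in f.findall("fs"):
            yield from flatten_fs(child, path)


pairs = dict(flatten_fs(ET.fromstring(NAMED_ENTITY_FS.strip())))
```

With the sample timespan FS, `pairs["from.year"]` is the type `1987`, which is the kind of value a grammar-specific rule could pick up.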

SAMPLE external morphosyntax EDGE

   <edge type='morph' id='m1' deps='t1' source='v0' target='v1'>
    <slot name='weight'>0.5</slot>
    <slot name='tagset'>morph</slot>
    <slot name='reduced'>SMILE</slot>
    <!-- plus... FS along lines of MAF? -->
    <!-- or... RMRS encoding morpho info? -->
   </edge>

Suitable content:

  • slot weight

  • slot tagset

  • slot reduced (reduced form, generally lemma)

  • FS describing morphosyntactic features

  • ... OR RMRS describing morphosyntactic features (if applicable)

Maps to:

  • weight, reduced to software-component-internal data values

  • tagset specifies how FS should be interpreted???

  • FS maps to grammar-specific type, according to tagset

  • ... OR RMRS injected at grammar-specific internal FS path ?

...

software-component-specific SMAF mappings

Each deep processor implementing SMAF XML input implements a set of "closed" mappings and a partially configurable set of "open" mappings. The "closed" mappings apply to those aspects of SMAF that are hardwired into the SMAF spec (read: DTD). The "open" mappings must be specified per edge type. Eg. [format=TYPE.SLOT] *.weight, *.surface, morph.reduced must map to specific internal values, independent of the grammar in use.

Specify "open" mappings in a config file such as: smaf_config.lkb, smaf_config.pet, ...

grammar-specific SMAF mappings

Each grammar must specify mappings appropriate to its own type system. Eg.

  • pos.[tag='SPECIFIC_VALUE', tagset='SPECIFIC_TAGSET'] => GRAMMAR_SPECIFIC_TYPE

  • namedEntity.[fs_path=, fs_type='SPECIFIC_TYPE'] => GRAMMAR.SPECIFIC.PATH GRAMMAR_SPECIFIC_TYPE

  • namedEntity.[fs_path='from.year', fs_type=*1] => GRAMMAR.SPECIFIC.PATH2 *1

  • morph.[tagset='SPECIFIC_TAGSET', fs_path='SPECIFIC.PATH', fs_type='SPECIFIC_TYPE'] => grammar.specific.type GRAMMAR_SPECIFIC_TYPE

[expand me...]

Specify these mappings in a config file such as: smaf_config.erg, smaf_config.norsource, smaf_config.jacy, smaf_config.jap, ...
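How such rules might be applied can be sketched in Python as follows. This does not reflect any real config file syntax or processor implementation: the rule tables, the function names and every tagset, path and type name are placeholders copied from the notation above. Two readings are assumptions: an empty fs_path is taken to mean the top of the FS, and `None` plays the role of the *1 wildcard, copying the SMAF type across.

```python
# pos rules: (tagset, tag) -> grammar type. All names are placeholders.
POS_RULES = {
    ("SPECIFIC_TAGSET", "SPECIFIC_VALUE"): "GRAMMAR_SPECIFIC_TYPE",
}

# namedEntity rules: (fs_path, fs_type, grammar path, grammar type).
# fs_type None matches any type; grammar type None copies the matched type.
NE_RULES = [
    ("", "SPECIFIC_TYPE", "GRAMMAR.SPECIFIC.PATH", "GRAMMAR_SPECIFIC_TYPE"),
    ("from.year", None, "GRAMMAR.SPECIFIC.PATH2", None),
]


def map_pos(tagset, tag):
    """Return the grammar type for a pos edge, or None if no rule applies."""
    return POS_RULES.get((tagset, tag))


def map_named_entity(path_types):
    """Map {fs_path: fs_type} pairs to (grammar path, grammar type) pairs."""
    mapped = []
    for fs_path, fs_type in path_types.items():
        for rule_path, rule_type, gpath, gtype in NE_RULES:
            if rule_path == fs_path and rule_type in (None, fs_type):
                mapped.append((gpath, gtype if gtype is not None else fs_type))
    return mapped
```

Path-value pairs that match no rule are silently dropped in this sketch; a real processor might prefer to warn about them.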

SMAF DTD

<?xml version="1.0" encoding="UTF-8"?>

<!-- DTD for SMAF -->
        
<!ELEMENT smaf   (text?, olac:olac? , lattice ) >
<!ATTLIST smaf   document CDATA #REQUIRED >

<!ELEMENT text (#PCDATA) >

<!ELEMENT lattice   (edge*)>
<!ATTLIST lattice   init CDATA #IMPLIED
                    final CDATA #IMPLIED
                    cfrom CDATA #IMPLIED
                    cto CDATA #IMPLIED>

<!ELEMENT edge (slot*, fs*, rmrs*) >
<!ATTLIST edge id ID #REQUIRED
               type CDATA #REQUIRED
               cfrom CDATA #IMPLIED
               cto CDATA #IMPLIED
               source CDATA #REQUIRED
               target CDATA #REQUIRED
               deps IDREFS #IMPLIED >

<!ELEMENT slot (#PCDATA) >
<!ATTLIST slot name CDATA #REQUIRED >

<!ELEMENT fs (f*) >
<!ATTLIST fs type CDATA #IMPLIED>
<!ELEMENT f (#PCDATA|fs)* >
<!ATTLIST f name CDATA #REQUIRED>

<!--!ELEMENT olac:olac (dc:creator?,created?,dc:identifier)-->

<!-- it's too tedious to specify all possible permutations in a DTD, so use ANY instead! -->
<!ELEMENT olac:olac ANY> 

<!ATTLIST olac:olac
    xmlns:olac CDATA #FIXED "http://www.language-archives.org/OLAC/1.0/"
    xmlns:dc CDATA #FIXED "http://purl.org/dc/elements/1.1/"
>

<!ELEMENT created (#PCDATA)>

<!--
THE FOLLOWING IS TAKEN FROM...

     DTD for the OLAC Metadata Set, version 1.0
     Gary Simons and Steven Bird, 8 April 2003
     
     The definitive definition for an OLAC metadata record is the XML schema
     at:  http://www.language-archives.org/OLAC/1.0/olac.xsd
     
     This DTD is offerred for the convenience of users who need to use DTD-based 
     software.  However, since schemas have more functionality than DTDs, validation
     with this DTD does not guarantee that the document is valid with respect to the schema.

-->
    
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator
        %attributes;
>

<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description
        %attributes;
>

<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier
        %attributes;
>
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language
        %attributes;
>

Sample SMAF document

<?xml version='1.0' encoding='UTF-8'?>
 <!DOCTYPE smaf SYSTEM 'smaf.dtd'>
 <smaf document='URL'>
  <text>OPTIONAL INLINE TEXT</text>
  <olac:olac xmlns:olac='http://www.language-archives.org/OLAC/1.0/' xmlns:dc='http://purl.org/dc/elements/1.1/'>
   <dc:creator>CREATOR</dc:creator>
   <created>TIMESTAMP</created>
   <dc:identifier>HOG-LIKE ID</dc:identifier>
   <!-- more OLAC metadata possible -->
  </olac:olac>
  <lattice init='v0' final='v89' cfrom='0' cto='100'>

   <!-- some simple tokens-->
   <edge type='token' id='t1' cfrom='0' cto='6' source='v0' target='v1'>
    <slot name='surface'>Andrew</slot>
   </edge>
   <edge type='token' id='t2' cfrom='7' cto='13' source='v1' target='v2'>
    <slot name='surface'>smiles</slot>
   </edge>

   <!-- part-of-speech -->
   <edge type='pos' id='p1' deps='t1' source='v0' target='v1'>
     <slot name='weight'>0.5</slot>
     <slot name='tagset'>CLAWS</slot>
     <slot name='tag'>NNP</slot>
<!-- for grammar...  <f name='type'>$a_noun_type</f> -->
   </edge>

   <!-- sample named entity -->
   <edge type='namedEntity' id='n1' cfrom='10' cto='20' source='v0' target='v1'>
    <slot name='weight'>0.567</slot>
<!-- for grammar...  <f name='type'>$generic_timespan</f> -->
    <slot name='surface'>1987 to 1997</slot>
    <fs type='timespan'>
       <f name='from'>
          <fs type='point'>
            <f name='year'>
              <fs type='1987'/>
            </f>
          </fs>
       </f>
       <f name='to'>
          <fs type='point'>
            <f name='year'>
              <fs type='1997'/>
            </f>
          </fs>
       </f>
     </fs>
     <!-- OR: can we use RMRS in place of above FS? -->
   </edge>

   <!-- ... -->
  </lattice>
 </smaf>
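For readers who want to experiment, a SMAF document of this shape can be read with Python's standard ElementTree, as in the rough sketch below. The function name `read_smaf` and the trimmed document are illustrative; the DOCTYPE line is left out because ElementTree does not validate against a DTD, and the node names and metadata values are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared on the olac:olac element.
OLAC = "{http://www.language-archives.org/OLAC/1.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SMAF_DOC = """
<smaf document='URL'>
 <text>Andrew smiles</text>
 <olac:olac xmlns:olac='http://www.language-archives.org/OLAC/1.0/'
            xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:creator>CREATOR</dc:creator>
  <dc:identifier>ID</dc:identifier>
 </olac:olac>
 <lattice init='v0' final='v2' cfrom='0' cto='13'>
  <edge type='token' id='t1' cfrom='0' cto='6' source='v0' target='v1'>
   <slot name='surface'>Andrew</slot>
  </edge>
  <edge type='token' id='t2' cfrom='7' cto='13' source='v1' target='v2'>
   <slot name='surface'>smiles</slot>
  </edge>
 </lattice>
</smaf>
"""


def read_smaf(xml_text):
    """Collect the global properties and edge skeletons of a SMAF document."""
    root = ET.fromstring(xml_text.strip())
    lattice = root.find("lattice")
    meta = root.find(f"{OLAC}olac")  # optional metadata block
    return {
        "document": root.get("document"),
        "text": root.findtext("text"),
        "identifier": meta.findtext(f"{DC}identifier") if meta is not None else None,
        "init": lattice.get("init"),
        "final": lattice.get("final"),
        "edges": [(e.get("id"), e.get("type"), e.get("source"), e.get("target"))
                  for e in lattice.findall("edge")],
    }


smaf = read_smaf(SMAF_DOC)
```

Note that the namespaced olac:olac and dc:identifier elements have to be looked up by their full namespace URI rather than by prefix.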