BenjaminWaldron edited this page Jan 11, 2006 · 6 revisions

We have integrated into the LKB a modified version of the Morpho-syntactic Annotation Framework (MAF). MAF is currently an ISO Working Draft (http://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf). It provides a framework in which an (XML) standoff annotation document describes annotations on a processed document. Our modification of the MAF XML format is conceptually compatible with the MAF draft, and syntactically our MAF/LKB XML format is largely compatible with the MAF XML format. MAF/LKB eases the integration of preprocessing components into the LKB setup (work on other DELPH-IN components, such as PET, is ongoing).

The MAF/LKB XML serialization format consists of a <maf/> header followed by <token/> and <wordForm/> annotation definitions. The annotation elements live in a lattice (directed acyclic graph). (MAF XML allows the annotation elements to be listed sequentially, but we insist on the catch-all lattice representation for ease of machine processing.)

The global <maf/> element carries metadata about the annotated document as a whole. The mandatory document and addressing attributes identify the document to which the standoff annotations refer and the pointer addressing scheme used (e.g. character offsets, xpoint-based addressing, ...). Non-mandatory metadata are handled following the recommendations of the OLAC Metadata Standard (http://www.language-archives.org/OLAC/metadata.html).

Sample MAF header:

 <maf document='text.xml' addressing='xchar'>
  <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
   xmlns="http://purl.org/dc/elements/1.1/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/ 
      http://www.language-archives.org/OLAC/1.0/olac.xsd">
   <creator>LKB</creator>
   <created>12:11:43 12/12/2005 (UTC)</created>
  ...
 </maf>
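
As a rough illustration of how a consumer might read the two mandatory header fields, here is a minimal Python sketch. The function name read_maf_header is hypothetical, and the sketch assumes (as in the sample above) that document and addressing are plain XML attributes on <maf>. It does not attempt to interpret the OLAC metadata.

```python
import xml.etree.ElementTree as ET


def read_maf_header(maf_xml):
    """Return (document, addressing) from a <maf> element.

    Hypothetical helper: raises ValueError if the root element is not <maf>
    or if either mandatory attribute is absent.
    """
    root = ET.fromstring(maf_xml)
    if root.tag != "maf":
        raise ValueError(f"expected <maf>, found <{root.tag}>")
    missing = [name for name in ("document", "addressing") if root.get(name) is None]
    if missing:
        raise ValueError("missing mandatory attribute(s): " + ", ".join(missing))
    return root.get("document"), root.get("addressing")


# Minimal header carrying only the mandatory attributes from the sample.
header = read_maf_header("<maf document='text.xml' addressing='xchar'/>")
```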

The standoff annotations are grounded via <token/> elements. A <token/> element anchors annotations to a contiguous span of text, defined via pointers in its from and to attributes. The addressing scheme for these pointers must be specified in the <maf/> header. Each <token/> element has an id attribute so that <wordForm/> elements can refer to it. Additionally, a value attribute may be used to record the contents of the span (i.e. the text between the from and to pointers).

Sample <token/> element:

  <token id='t2' from='4' to='7' value='dog'/>
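
To make the anchoring concrete, the following Python sketch resolves the sample token against a source text. The text "The dog barks" and the function name token_span are hypothetical. The sketch also assumes that from/to are zero-based character offsets with an exclusive end, which is one plausible reading of the 'xchar' scheme rather than something this page specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical source text; the annotated document itself is not shown on this page.
TEXT = "The dog barks"


def token_span(token_xml, text):
    """Return (id, from, to, span) for a single <token/> element.

    Assumption: from/to are zero-based character offsets, end exclusive.
    """
    elem = ET.fromstring(token_xml)
    start = int(elem.get("from"))
    end = int(elem.get("to"))
    return elem.get("id"), start, end, text[start:end]


result = token_span("<token id='t2' from='4' to='7' value='dog'/>", TEXT)
```

Under these assumptions the recovered span matches the token's value attribute, which suggests a simple consistency check for producers that choose to emit value.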

The annotation content is provided by <wordForm/> elements. This content is represented as a typed feature structure... [UNDER CONSTRUCTION]

Each word form is built on top of one or more <token/> or other <wordForm/> elements; i.e. word forms define a hierarchical structure with <token/> elements forming the leaves. The tokens associated in this manner with a <wordForm/> must form a contiguous sequence. (The MAF draft allows a <wordForm/> to reference zero tokens; our approach is instead to introduce pointlike <token/> elements in such cases.) The annotation hierarchy is defined via a daughters attribute on <wordForm/> elements (NOTE: this is our generalisation of the MAF draft's tokens attribute; the MAF draft allows only for explicit XML nesting of <wordForm/> elements, but we find this inadequate in the general case). The daughters attribute is a space-separated list of token and/or wordForm ids; each <wordForm/> element has an id attribute for this purpose (NOTE: the MAF draft does not define such an attribute).
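
The hierarchy described above can be sketched in Python as follows. Everything in the sample document is illustrative: the ids, the offsets, and the placement of <token/> and <wordForm/> as direct children of <maf> are assumptions, and the feature-structure content of <wordForm/> is omitted. The contiguity check further assumes a simple linear token sequence ordered by the from offset, which is a simplification of the general lattice case.

```python
import xml.etree.ElementTree as ET

# Illustrative document; ids, offsets and element placement are assumptions.
SAMPLE = """
<maf document='text.xml' addressing='xchar'>
  <token id='t1' from='0' to='3' value='The'/>
  <token id='t2' from='4' to='7' value='dog'/>
  <token id='t3' from='8' to='13' value='barks'/>
  <wordForm id='w1' daughters='t2'/>
  <wordForm id='w2' daughters='t1 w1'/>
  <wordForm id='w3' daughters='t1 t3'/>
</maf>
""".strip()


def leaf_tokens(root, elem_id):
    """Follow daughters references down to the <token/> leaves."""
    by_id = {e.get("id"): e for e in root if e.tag in ("token", "wordForm")}

    def walk(eid):
        elem = by_id[eid]
        if elem.tag == "token":
            return [eid]
        leaves = []
        for daughter in elem.get("daughters", "").split():
            leaves.extend(walk(daughter))
        return leaves

    return walk(elem_id)


def is_contiguous(root, elem_id):
    """Check that the leaves form an unbroken run in the from-ordered token list."""
    tokens = sorted(root.iter("token"), key=lambda e: int(e.get("from")))
    order = [t.get("id") for t in tokens]
    positions = sorted(order.index(t) for t in leaf_tokens(root, elem_id))
    return positions == list(range(positions[0], positions[0] + len(positions)))


root = ET.fromstring(SAMPLE)
w2_leaves = leaf_tokens(root, "w2")
w2_ok = is_contiguous(root, "w2")
w3_ok = is_contiguous(root, "w3")
```

Here w2 reaches t1 directly and t2 through the nested word form w1, so its leaves are contiguous; w3 skips t2 and would violate the contiguity requirement.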

... [UNDER CONSTRUCTION]
