
ItsdbTreebanking_ItsdbAnnotation

FrancisBond edited this page Jul 21, 2009 · 10 revisions

Annotation

This page describes how to treebank with itsdb (ItsdbTop), as well as how to normalize the annotated profile.

itsdb supports Redwoods-style treebanking, from the Trees menu. It has been used to produce the Redwoods (RedwoodsTop) and Hinoki treebanks. You can annotate a corpus; update an annotated corpus to a new grammar; and train statistical models on the treebanked corpora.

Normally only active and deduced discriminants are written out. To write out all discriminants, set *redwoods-record-void-discriminants-p* to t in the tsdb package, i.e. (setf tsdb::*redwoods-record-void-discriminants-p* t).

After you select a profile, Trees | Annotate brings you into the interface for compiling a treebank. The LKB (LkbTop) must have the same grammar loaded that was used to parse the profile, because the system uses the grammar to reconstruct the parse trees.

The annotator selects the correct analysis (or, occasionally, rejects all analyses). Selection is done through a choice of discriminants. The system selects features that distinguish between different parses, and the annotator selects or rejects the features until only one parse remains. The number of decisions for each sentence is normally around log_2 of the number of parses, although sometimes a single decision can reduce the number of remaining parses by more or less than half. In general, even a sentence with 5,000 parses only requires around 12 decisions.
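The selection process described above can be sketched in Python. This is a toy illustration only, not itsdb's implementation: the feature labels, the representation of parses as sets of features, and the function names discriminants, decide, and ideal_decisions are all assumptions made for this example.

```python
import math

def discriminants(parses):
    """Features present in some, but not all, of the remaining parses."""
    all_features = set().union(*parses.values())
    return {f for f in all_features
            if 0 < sum(f in feats for feats in parses.values()) < len(parses)}

def decide(parses, feature, accept):
    """Keep the parses that have the feature (accept) or lack it (reject)."""
    return {pid: feats for pid, feats in parses.items()
            if (feature in feats) == accept}

def ideal_decisions(n_parses):
    """Decisions needed if every decision halves the remaining parses."""
    return math.ceil(math.log2(n_parses))

# Toy forest: four parses, each a set of hypothetical feature labels.
parses = {
    1: {"pp-attach-verb", "noun-compound"},
    2: {"pp-attach-verb", "adj-noun"},
    3: {"pp-attach-noun", "noun-compound"},
    4: {"pp-attach-noun", "adj-noun"},
}

# The annotator accepts one feature, then rejects another.
remaining = decide(parses, "pp-attach-verb", accept=True)   # parses 1 and 2
remaining = decide(remaining, "adj-noun", accept=False)     # parse 1 only
```

Here two decisions reduce four parses to one, which matches ideal_decisions(4). For 5,000 parses, log_2 is roughly 12.3, so perfectly even splits would need 13 decisions; real discriminants rarely split the forest exactly in half, which is why the figure in the text is only approximate.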

After you completely disambiguate, in the left-hand-side window you will see the elementary dependency structures displayed underneath the tree. Quantifiers and messages are suppressed (by default: see /src/mrs/dependencies.lisp for the configuration options).

The dependencies are color coded:

  • Blue: Good dependency constructed.
  • Orange: Fragmented dependency constructed.
  • Red: Cyclic.

Recommended Settings

  • Number of results stored: 500 --- (setf tsdb::*tsdb-maximal-number-of-results* 500)

    • when you have a model, the good parse is almost always in the top 500, and annotation is much quicker.

Normalizing

When you annotate an item, the old unannotated entry for that item is not deleted. Instead, the database is augmented with another entry recording the updated information about that item, along with a version indicator showing that the annotated entry is more recent than the original one. This version information is not consulted dynamically when you impose conditions, so to make it usable you have to periodically "normalize" the database.
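Conceptually, normalizing amounts to keeping only the most recent entry for each item. The Python sketch below illustrates that idea under stated assumptions: the field names item, version, and status are hypothetical and do not reflect itsdb's actual relations or its normalization code.

```python
def normalize(entries):
    """Return one entry per item: the one with the highest version."""
    latest = {}
    for entry in entries:
        item = entry["item"]
        if item not in latest or entry["version"] > latest[item]["version"]:
            latest[item] = entry
    return latest

# Hypothetical append-only log: item 1 was annotated after the initial parse.
entries = [
    {"item": 1, "version": 0, "status": "unannotated"},
    {"item": 2, "version": 0, "status": "unannotated"},
    {"item": 1, "version": 1, "status": "annotated"},
]

normalized = normalize(entries)
```

After this step each item has exactly one entry, so a condition on the annotation status sees the annotated version of item 1 and not the stale original.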

To normalize, select Trees | Normalize and give a name for the new normalized database (the old one is not overwritten). This step should not be too time-consuming as long as your database has fewer than 3000 items (recommended). In Hinoki, we find that a database with 2000 items and a maximum of 5,000 results is quite slow to normalize, taking several hours (2005-03-25).

NOTE: Remember to set Options | TSQL Condition to no condition; otherwise only some trees will be normalized.

NOTE: Normalizing gets slightly quicker if you iconify (minimize) Emacs, and is much quicker if you run it as a batch job.

Thinning Normalizing

This saves only the results for good trees, making a much smaller profile.

It is possible to save MRSs for the treebanked sentences by setting (setf tsdb::*redwoods-semantix-hook* "mrs::mrs-get-string").

Clear Cutting

If for some reason you wish to delete all the trees (e.g., you made a false start with the update process for Run 2), you can (verrrrryyyy carefully) discard the new annotations by selecting Trees | Clear Cut. Be certain (as in positive) that you are removing the annotations for Run 2, not the real hand-coded annotations you painstakingly constructed for Run 1, as clear cutting fells all the trees.
