ItsdbReference

Reference

This page includes some low-level information about itsdb (ItsdbTop). You may also be interested in ItsdbCustomization.

tsdb database format

The database consists of multiple tables. Each table is a text file consisting of multiple rows. Each row consists of fields separated by an @ and is terminated by a newline. The mapping of columns to identifiers is given in the relations file.

item file format

Here is the structure, along with some examples of values.


Field	Name	Explanation	Example Value
1:	i-id	ID	integer
2:	i-origin	Origin	none
3:	i-register	Register	formal
4:	i-format	Format	none
5:	i-difficulty	Difficulty	1
6:	i-category	Category	S,XP
7:	i-input	String	parse me
8:	i-wf	Well Formedness	0,1,2
9:	i-length	String length (words)	integer
10:	i-comment	Comment
11:	i-author	Author	uname
12:	i-date	Date created	5-8-2003

An actual entry:

1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98
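As an illustration of the row layout, here is a minimal Python sketch that splits one item row into named fields. The field names come from the table above; the function name parse_item_row is hypothetical, and the sketch assumes that field values contain no @ characters (any escaping scheme is not handled).

```python
# Field names in column order, taken from the item file table.
ITEM_FIELDS = [
    "i-id", "i-origin", "i-register", "i-format", "i-difficulty",
    "i-category", "i-input", "i-wf", "i-length", "i-comment",
    "i-author", "i-date",
]


def parse_item_row(line):
    """Split one row of the item file into a dict keyed by field name.

    Assumption: values never contain '@', so a plain split is enough.
    """
    values = line.rstrip("\n").split("@")
    if len(values) != len(ITEM_FIELDS):
        raise ValueError(
            f"expected {len(ITEM_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(ITEM_FIELDS, values))


row = parse_item_row("1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98\n")
print(row["i-input"])
```

All values are kept as strings; converting i-id, i-wf, or i-length to integers is left to the caller.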

Note that [itsdb] does not always check that the i-ids are unique, but they should always be kept unique. Also, it is a good idea to keep the items sorted.
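Since uniqueness is not always enforced, it can be worth checking an item file before use. The following is a rough sketch, not part of itsdb itself: check_item_ids is a hypothetical helper, and "sorted" is assumed here to mean ascending order of i-id.

```python
def check_item_ids(lines):
    """Return (duplicate_ids, is_sorted) for the rows of an item file.

    Assumptions: the i-id is the first @-separated field and is an
    integer; 'sorted' means ascending by i-id.
    """
    ids = [int(line.split("@", 1)[0]) for line in lines if line.strip()]
    seen = set()
    duplicates = []
    for item_id in ids:
        if item_id in seen and item_id not in duplicates:
            duplicates.append(item_id)
        seen.add(item_id)
    is_sorted = all(a <= b for a, b in zip(ids, ids[1:]))
    return duplicates, is_sorted


# Shortened example rows; only the first field matters for this check.
rows = ["1@csli", "3@csli", "2@csli", "3@csli"]
duplicates, is_sorted = check_item_ids(rows)
print(duplicates, is_sorted)
```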

In the Hinoki project, the i-comment is used to give the source of the utterance (definition sentence, example, other corpus), the ID in the source corpus, and, for definition and example sentences, some information about the headword being defined or exemplified.

Output File Format

It is possible to store information about desired outputs, for example translations. They are stored in a skeleton's output file.

A minimal example giving (Japanese) translations of the sentence shown under item file format is:

1@@@@-1@-1@@エーブラムズ が 働く 。@@@-1@@
1@@@@-1@-1@@エーブラムズ が 仕事 する 。@@@-1@@


Field	Name	Explanation	Example Value
1:	i-id	Item for this output specification	integer
8:	o-surface	Expected surface string	string

All the fields are described in the relations file found in each skeleton.

It is possible to have multiple correct outputs (e.g., multiple reference translations).
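Because several rows may share one i-id, a reader of the output file typically needs to group the expected surface strings by item. A minimal sketch, assuming only what the table above states (i-id in field 1, o-surface in field 8) and that values contain no @; the function name read_expected_outputs is hypothetical.

```python
from collections import defaultdict


def read_expected_outputs(lines):
    """Group o-surface strings (field 8) by i-id (field 1)."""
    outputs = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("@")
        if len(fields) < 8:
            continue  # skip blank or truncated rows
        outputs[int(fields[0])].append(fields[7])
    return dict(outputs)


output_rows = [
    "1@@@@-1@-1@@エーブラムズ が 働く 。@@@-1@@",
    "1@@@@-1@-1@@エーブラムズ が 仕事 する 。@@@-1@@",
]
references = read_expected_outputs(output_rows)
print(len(references[1]))
```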

Well Formedness (i-wf)


Value	Meaning
0	Illformed (Ungrammatical)
1	Wellformed (Grammatical)
2	Ignored

Wellformed (Grammatical) is used to mark items that a grammar should parse.
Illformed (Ungrammatical) is used to mark items that a grammar should not parse.
Ignored is used to mark items in a profile that should currently be ignored. For example, a Japanese newspaper corpus may contain senryuu (http://en.wikipedia.org/wiki/Senryu), which is currently beyond the scope of the grammar, and can be excluded when treebanking or analyzing performance.

The grammaticality judgements can be used to measure lack of coverage and overgeneration, respectively:

Lack of Coverage
- test items (plus relevant properties) that are annotated as grammatical but failed to parse;
Overgeneration
- test items (plus relevant properties) that are tagged ill-formed but accepted by the parser (i.e. were assigned at least one analysis).
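The two measures can be sketched in a few lines of Python. This is an illustration only, not how itsdb computes them: the function name coverage_report and the readings mapping (i-id to number of analyses) are hypothetical inputs that would have to be extracted from a profile by other means.

```python
def coverage_report(items, readings):
    """Return (lack_of_coverage, overgeneration) as lists of i-ids.

    items:    dicts with integer 'i-id' and 'i-wf' values
    readings: hypothetical mapping from i-id to number of analyses
    Items with i-wf == 2 (ignored) are left out of both lists.
    """
    lack_of_coverage = []
    overgeneration = []
    for item in items:
        analyses = readings.get(item["i-id"], 0)
        if item["i-wf"] == 1 and analyses == 0:
            lack_of_coverage.append(item["i-id"])
        elif item["i-wf"] == 0 and analyses > 0:
            overgeneration.append(item["i-id"])
    return lack_of_coverage, overgeneration


items = [
    {"i-id": 1, "i-wf": 1},  # grammatical, parsed
    {"i-id": 2, "i-wf": 1},  # grammatical, no parse
    {"i-id": 3, "i-wf": 0},  # ungrammatical, parsed anyway
    {"i-id": 4, "i-wf": 2},  # ignored
]
readings = {1: 2, 2: 0, 3: 1, 4: 0}
lack, over = coverage_report(items, readings)
print(lack, over)
```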

How to make a new Skeleton

Make an item file
Make a new sub-directory with the item and relations files in it
Add the skeleton to skeletons/Index.lisp

  ((:path . "newtest") (:content . "example test suite"))

How to make an item file

One option is to write a quick Perl script that generates the item file.

Alternatively, you can use itsdb to make the item file:

Create a text file with the sentences (one per line), e.g. newtest
- Make sure the encoding is what you want it to be (utf-8 is recommended)
Import the items in the testsuite using itsdb

   file > import > test items
   in-path/newtest
   out-dir/newtest

This makes a directory (out-dir/newtest) containing an item file (with default values for the fields, and numbering starting from 0 or 1) and a relations file.
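For the script route, the following Python sketch turns a list of sentences into item rows with the twelve fields listed under item file format. It does not reproduce the defaults that the itsdb import uses: make_item_rows is a hypothetical helper, unknown fields are left empty, i-wf is set to 1 on the assumption that every sentence is grammatical, and i-length counts only tokens containing a letter or digit (an assumption chosen to agree with the example entry, where "Abrams works ." has length 2).

```python
def make_item_rows(sentences, start=1):
    """Build @-separated item rows from an iterable of sentences.

    Assumptions: empty lines are skipped, i-wf defaults to 1, and
    i-length counts tokens that contain at least one letter or digit.
    """
    rows = []
    next_id = start
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue
        if "@" in text:
            # '@' is the field separator; escaping is not handled here.
            raise ValueError(f"sentence contains '@': {text!r}")
        length = sum(
            1 for token in text.split()
            if any(ch.isalnum() for ch in token)
        )
        fields = [str(next_id), "", "", "", "", "",
                  text, "1", str(length), "", "", ""]
        rows.append("@".join(fields))
        next_id += 1
    return rows


item_rows = make_item_rows(["Abrams works .", "", "Browne sleeps ."])
for row in item_rows:
    print(row)
```

The rows can then be written, one per line, to a file named item in the skeleton directory, using the encoding chosen for the input sentences (utf-8 is recommended).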
