ItsdbReference
This page includes some low-level information about itsdb (ItsdbTop).
The database consists of multiple tables. Each table is a text file consisting of multiple rows. Each row consists of fields separated by an @, and the whole row is terminated by a newline. The mapping of columns to identifiers is given in the relations file.
Here is the structure, along with some examples of values.
Field | Name | Explanation | Example Value |
1: | i-id | ID | integer |
2: | i-origin | Origin | none |
3: | i-register | Register | formal |
4: | i-format | Format | none |
5: | i-difficulty | Difficulty | 1 |
6: | i-category | Category | S,XP |
7: | i-input | String | |
8: | i-wf | Well Formedness | 0,1,2 |
9: | i-length | String length (words) | integer |
10: | i-comment | Comment | |
11: | i-author | Author | uname |
12: | i-date | Date created | 5-8-2003 |
An actual entry:
1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98
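As a minimal sketch of how such a row could be read, the following Python splits a line on @ and pairs the values with the field names from the table above. The names `ITEM_FIELDS` and `parse_item_row` are illustrative and not part of itsdb, and the sketch assumes that no field value contains a literal @.

```python
# Field names in column order, copied from the table above.
ITEM_FIELDS = [
    "i-id", "i-origin", "i-register", "i-format", "i-difficulty",
    "i-category", "i-input", "i-wf", "i-length", "i-comment",
    "i-author", "i-date",
]


def parse_item_row(line):
    """Split one @-separated row into a dict keyed by field name.

    Hypothetical helper: assumes no field value contains a literal '@'.
    """
    values = line.rstrip("\n").split("@")
    if len(values) != len(ITEM_FIELDS):
        raise ValueError(
            f"expected {len(ITEM_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(ITEM_FIELDS, values))


# The entry shown above, with its terminating newline.
row = parse_item_row("1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98\n")
print(row["i-input"])  # Abrams works .
print(row["i-wf"])     # 1
```

All values are kept as strings here; empty fields such as i-comment and i-author in this entry come back as empty strings.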
Note that [itsdb] does not always check that the i-ids are unique, but they should always be kept unique. Also, it is a good idea to keep the items sorted.
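Since uniqueness is not always enforced, a quick external check can help. The sketch below is one possible way to do it, not an itsdb facility: `check_item_ids` is a hypothetical helper, and it assumes that i-id is the first field, that it is an integer, and that "sorted" means ascending i-id order.

```python
from collections import Counter


def check_item_ids(lines):
    """Return (duplicate_ids, is_sorted) for an iterable of item rows.

    Assumes i-id is the first @-separated field and parses as an integer.
    Blank lines are skipped.
    """
    ids = [int(line.split("@", 1)[0]) for line in lines if line.strip()]
    counts = Counter(ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    is_sorted = all(a <= b for a, b in zip(ids, ids[1:]))
    return duplicates, is_sorted


# Made-up rows, truncated to two fields for brevity.
sample = ["1@a", "3@b", "2@c", "3@d"]
duplicates, is_sorted = check_item_ids(sample)
print(duplicates)  # [3]
print(is_sorted)   # False
```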
In the Hinoki project, the i-comment field is used to give the source of the utterance (definition sentence, example, other corpus), the ID in the source corpus, and, for definition and example sentences, some information about the headword being defined or exemplified.
The i-wf (well-formedness) field takes one of the following values:
Value | Meaning |
0 | Ungrammatical |
1 | Grammatical |
2 | Ignored |
- Grammatical is used to mark items that a grammar should parse.
- Ungrammatical is used to mark items that a grammar should not parse.
- Ignored is used to mark items in a profile that should currently be ignored. For example, a Japanese newspaper corpus may contain senryuu (http://en.wikipedia.org/wiki/Senryu), which is currently beyond the scope of the grammar, and can be excluded when treebanking or analyzing performance.
The grammaticality judgements can be used to measure lack of coverage (from items marked grammatical) and overgeneration (from items marked ungrammatical):
- Lack of Coverage: test items (plus relevant properties) that are annotated as grammatical but failed to parse;
- Overgeneration: test items (plus relevant properties) that are tagged ill-formed but accepted by the parser (i.e. were assigned at least one analysis).
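The two measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how itsdb computes them: `coverage_report` is a hypothetical helper, items are dicts carrying at least i-id and i-wf, and `readings` is an assumed mapping from i-id to the number of analyses the parser assigned (this page does not describe where that count is stored).

```python
def coverage_report(items, readings):
    """Return (lack_of_coverage, overgeneration) as lists of i-ids.

    items:    iterable of dicts with "i-id" and "i-wf" keys.
    readings: dict mapping i-id to number of analyses (assumed input);
              a missing i-id is treated as zero analyses.
    """
    lack_of_coverage = []
    overgeneration = []
    for item in items:
        wf = int(item["i-wf"])
        n_analyses = readings.get(item["i-id"], 0)
        if wf == 1 and n_analyses == 0:
            # Marked grammatical but failed to parse.
            lack_of_coverage.append(item["i-id"])
        elif wf == 0 and n_analyses > 0:
            # Marked ungrammatical but assigned at least one analysis.
            overgeneration.append(item["i-id"])
        # Items with i-wf == 2 (ignored) fall through and are not counted.
    return lack_of_coverage, overgeneration


# Made-up data for illustration only.
items = [
    {"i-id": "1", "i-wf": "1"},
    {"i-id": "2", "i-wf": "1"},
    {"i-id": "3", "i-wf": "0"},
    {"i-id": "4", "i-wf": "0"},
    {"i-id": "5", "i-wf": "2"},
]
readings = {"1": 2, "2": 0, "3": 1, "4": 0, "5": 0}
lack, over = coverage_report(items, readings)
print(lack)  # ['2']
print(over)  # ['3']
```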