MatrixChoicesRfc

TJTrimble edited this page Jun 29, 2020 · 17 revisions

Work in Progress!

Abstract

In this document, we outline an overhaul of the choices system for the Grammar Matrix. The primary changes are a move to a standard serialization format, the introduction of a choices schema, and the decoupling of choices from questionnaire display (matrixdef). We propose TOML as the serialization format because it is open source, non-proprietary, and generally available, and we propose providing a schema for choices files. These changes will make development easier and help connect the Grammar Matrix to other projects, both as a component and as a target.

Introduction

The Grammar Matrix is composed of several components, including but not limited to the following:

  1. matrix.tdl

  2. the customization system

  3. the questionnaire

  4. the regression tests

A choices file is a serialized data file containing both 1) the answers to questions in the questionnaire (as well as annotated data such as lexical entries) and 2) information mapping those choices to the matrixdef file used to generate them. Item (2) includes, but is probably not limited to:

  • the sections and section names of the matrixdef file

  • the choice names and (optionally) valid choices of the matrixdef file

  • the ordering of the above

Currently, choices are written in a proprietary storage format, a domain-specific language (DSL).

The code implementing choices is defined in three locations:

  • choices.py (the main data structure definitions, API)

  • deffile.py (choices deserialization)

  • matrix.cgi (choices serialization)

Motivations

The goals for changing this set up include the following:

  1. Ease of human reading: the current choices file format is generally human readable, and any new format must preserve this. However, indexing and non-obvious nesting can make choices difficult for newcomers to understand.

  2. Ease of human editing: along with reading, being able to identify and edit a choice without using the customization system is a critical requirement, for use with matrix.py, debugging, and development.

  3. Ease of creation: it is currently difficult to write a choices file from scratch. One must parse the current matrixdef file to understand available sections, choices, and options. Ideally, one would be able to write choices with very few dependencies.

  4. Computer serialization and deserialization: choices are currently stored in a proprietary format, so reading and writing them requires implementing a significant amount of code.

  5. Ease of development: currently, a change to anything involving choices probably requires edits in several locations, which makes the code brittle.

  6. Choices as a component: a modularized choices format could be used as a data format in other projects, targeting the Grammar Matrix or not.

  7. Matrix as a target: modularizing the choices format could help others develop systems that use the Grammar Matrix (i.e. matrix.py) without the Customization System (e.g. AGGREGATION).

To achieve these goals, we propose the following changes:

  1. Changing the choices serialization format
  2. Introducing a choices schema

Background

The existing choices data serialization consists of the following data types:

  1. Dictionaries (maps)
  2. Lists (arrays/vectors)
  3. Strings / References
  4. Integers

Below is an example snippet of the existing choices format as of June 2020 from the mini English choices file:

section=morphology
  noun-pc1_name=num
  noun-pc1_obligatory=on
  noun-pc1_order=suffix
  noun-pc1_inputs=noun1
    noun-pc1_lrt1_name=singular
      noun-pc1_lrt1_feat1_name=number
      noun-pc1_lrt1_feat1_value=sg
      noun-pc1_lrt1_lri1_inflecting=no
    noun-pc1_lrt2_name=plural
      noun-pc1_lrt2_feat1_name=number
      noun-pc1_lrt2_feat1_value=pl
      noun-pc1_lrt2_lri1_inflecting=yes
      noun-pc1_lrt2_lri1_orth=s
  verb-pc1_name=pernum
  verb-pc1_obligatory=on
  verb-pc1_order=suffix
  verb-pc1_inputs=verb
    verb-pc1_lrt1_name=3sg
      verb-pc1_lrt1_feat1_name=person
      verb-pc1_lrt1_feat1_value=3rd
      verb-pc1_lrt1_feat1_head=subj
      verb-pc1_lrt1_feat2_name=number
      verb-pc1_lrt1_feat2_value=sg
      verb-pc1_lrt1_feat2_head=subj
      verb-pc1_lrt1_lri1_inflecting=yes
      verb-pc1_lrt1_lri1_orth=s
    verb-pc1_lrt2_name=pl
      verb-pc1_lrt2_feat1_name=number
      verb-pc1_lrt2_feat1_value=pl
      verb-pc1_lrt2_feat1_head=subj
      verb-pc1_lrt2_lri1_inflecting=no
    verb-pc1_lrt3_name=non-3rd
      verb-pc1_lrt3_feat1_name=person
      verb-pc1_lrt3_feat1_value=1st, 2nd
      verb-pc1_lrt3_feat1_head=subj
      verb-pc1_lrt3_lri1_inflecting=no
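
As a rough illustration of how little structure this format carries on its own, the sketch below reads it into a dictionary of sections. It is not the code in deffile.py: the function name parse_flat_choices is hypothetical, and the sketch assumes that indentation is purely cosmetic, that every non-blank line has the form key=value, and that values can be kept as raw strings.

```python
def parse_flat_choices(text):
    """Parse the flat choices format into {section: {key: value}}.

    Assumptions (not taken from choices.py or deffile.py): indentation is
    cosmetic, a `section=` line starts a new section, and every other
    non-blank line is `key=value`. Keys that appear before the first
    section line are ignored. Values are kept as raw strings.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        if key == "section":
            current = sections.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return sections


sample = """\
section=morphology
  noun-pc1_name=num
  noun-pc1_order=suffix
    noun-pc1_lrt2_name=plural
      noun-pc1_lrt2_lri1_orth=s
      verb-pc1_lrt3_feat1_value=1st, 2nd
"""

choices = parse_flat_choices(sample)
print(choices["morphology"]["noun-pc1_lrt2_lri1_orth"])
```

Note that the result is still flat within each section: the nesting lives entirely in the key names, which is the difficulty the proposals below address.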

The choices snippet above corresponds to the matrixdef section named morphology. The following matrixdef snippet, as of June 2020, is trimmed for brevity and includes only the items relevant to that snippet:

Section morphology "Morphology"

BeginIter noun-pc{i} "a Position Class" 1 0

  Text name "Noun Position Class {i} Name" "<b>Noun Position Class {i}</b>:<br/>Position Class Name: " " " 20

  Check obligatory "Noun Position Class {i} Obligatory" "<br/>Obligatorily occurs:" ""

  Select order "Noun Position Class {i} Order" "<br/>Appears as a prefix or suffix:" ""
  . prefix "Prefix" "prefix"
  . suffix "Suffix" "suffix"

  MultiSelect inputs "Noun Position Class {i} Input" "<br/>Possible inputs:" ""
  fillcache c=nouns
  fillregex p=(verb|noun)-pc[0-9]+(_lrt[0-9]+)?_name
  . noun "Any noun" "any noun"

  BeginIter lrt{j} "a Lexical Rule Type" 1 0

    Text name "Noun Position Class {i} Lexical Rule Type {j} Name" "<b>Lexical Rule Type {j}</b>:<br/>Name: " "" 20
    BeginIter feat{k} "a Feature"

      Select name "Noun Position Class {i} Lexical Rule Type {j} Feature {k} Name" "Name: " " "
      fillnames c=both

      MultiSelect value "Noun Position Class {i} Lexical Rule Type {j} Feature {k} Value" "Value: " ""
      fillvalues p=noun-pc{i}_lrt{j}_feat{k}_name

    EndIter feat

    BeginIter lri{k} "a Lexical Rule Instance" 0 1

      Radio inflecting "Noun Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Has Orthography" "Instance {k}" ""
      . no "No" "" "No affix"
      . yes "Yes" "" "Affix spelled"

      Text orth "Noun Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Spelling" "" "" 20

    EndIter lri

  EndIter lrt

EndIter noun-pc

BeginIter verb-pc{i} "a Position Class" 1

  Text name "Verb Position Class {i} Name" "<b>Verb Position Class {i}</b>:<br/>Position Class Name: " " " 20

  Check obligatory "Verb Position Class {i} Obligatory" "<br/>Obligatorily occurs:" ""

  Select order "Verb Position Class {i} Order" "<br/>Appears as a prefix or suffix:" ""
  . prefix "Prefix" "prefix"
  . suffix "Suffix" "suffix"

  BeginIter lrt{j} "a Lexical Rule Type" 1 1

    Text name "Verb Position Class {i} Lexical Rule Type {j} Name" "<b>Lexical Rule Type {j}</b>:<br/>Name: " "" 20

    BeginIter feat{k} "a Feature"

      Select name "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Name" "Name: " " "
      fillnames c=both

      MultiSelect value "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Value" "Value: " ""
      fillvalues p=verb-pc{i}_lrt{j}_feat{k}_name

      Select head "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Head" "Specified on: " ""
      . verb "The verb" "the verb"
      . subj "The subject" "the subject NP"
      . obj "The object" "the object NP"

    EndIter feat

    BeginIter lri{k} "a Lexical Rule Instance" 0 1

      Radio inflecting "Verb Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Has Orthography" "Instance {k}" ""
      . no "No" "" "No affix"
      . yes "Yes" "" "Affix spelled"

      Text orth "Verb Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Spelling" "" "" 20

    EndIter lri

  EndIter lrt

EndIter verb-pc
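
The correspondence between the two snippets can be sketched in code. Each nested BeginIter contributes an indexed segment to a choice key, so a template such as noun-pc{i}_lrt{j}_feat{k}_name (taken from the fillvalues line above) describes a whole family of keys. The helper below is hypothetical and is not part of the Grammar Matrix code; it assumes that placeholders are single lowercase letters in braces, that each letter appears at most once per template, and that indices are decimal integers.

```python
import re


def template_to_regex(template):
    """Turn a matrixdef-style key template into a compiled regex.

    Each `{i}`, `{j}`, `{k}` placeholder becomes a named group that
    matches a run of digits; everything else is matched literally.
    """
    pattern = ""
    pos = 0
    for placeholder in re.finditer(r"\{([a-z])\}", template):
        # Literal text before the placeholder, escaped for safety.
        pattern += re.escape(template[pos:placeholder.start()])
        pattern += f"(?P<{placeholder.group(1)}>[0-9]+)"
        pos = placeholder.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern)


rx = template_to_regex("noun-pc{i}_lrt{j}_feat{k}_name")
m = rx.fullmatch("noun-pc1_lrt2_feat1_name")
indices = {name: int(value) for name, value in m.groupdict().items()}
print(indices)
```

A key from the verb iterator, such as verb-pc1_lrt1_feat1_name, does not match this template, which is the kind of check a schema could make explicit.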

Proposals

Serialization Format

Using a standard serialization format allows the Grammar Matrix code, other code, and humans to read and write choices files more easily. Relying on an existing third-party implementation would also reduce the amount of code in the Grammar Matrix repo, making development easier.

Serialization Language

Several candidates were considered for the serialization format: JSON, YAML, StrictYAML, and TOML.

  • JSON: generally considered hard to read, hard to edit, and hard to create.

  • YAML: YAML has several things going for it (widely used, easy to read and write, Python-inspired); however, it also has many problems. It has many unnecessary features (code execution, multiple documents per file, lists as keys, etc.), suffers from ambiguity and typing problems, and is hard to read in large, highly nested documents.

  • StrictYAML: StrictYAML is a subset of YAML that removes many features (e.g. code execution) and fixes many typing and ambiguity problems. However, it is currently only implemented in Python and still has readability problems with large, highly nested documents.

  • TOML: TOML is a configuration language that represents maps and arrays, has fairly readable syntax, has implementations in many languages (Python, JavaScript, etc.), and shares some syntax with the existing choices file format (e.g. key = value pairs, and nested dictionaries expressed through chained keys, which TOML writes with dot notation).

We recommend TOML, in large part because it avoids the problems of YAML and is more readable than JSON.

Serialization Specifics

Beyond the choice of language, decisions remain about how choices are organized within the choices file:

  1. Storing lists of choices
  2. Nested choices
  3. Choice references

Lists of Choices

At the time of writing, lists of choices are stored using indexed keys, e.g.:

section=lexicon
  noun1_det=imp
    noun1_stem1_orth=n1
    noun1_stem1_pred=_n1_n_rel
  noun2_det=imp
    noun2_stem1_orth=n2
    noun2_stem1_pred=_n2_n_rel

Here noun1 is the first noun object defined by the matrixdef's BeginIter noun. That iterator contains a nested BeginIter stem, which yields the object noun1_stem1 with two choices: orth and pred.
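
To make the indexing convention concrete, here is a minimal sketch that splits a key into (name, index) pairs. The names SEGMENT and split_key are hypothetical, and the sketch assumes that segment names consist of letters and hyphens only, with the list index as a trailing run of digits.

```python
import re

# A segment is a name optionally followed by a 1-based index,
# e.g. "noun1" -> ("noun", 1) and "orth" -> ("orth", None).
SEGMENT = re.compile(r"([A-Za-z-]+?)([0-9]*)")


def split_key(key):
    """Split a chained choices key into a list of (name, index) pairs."""
    path = []
    for segment in key.split("_"):
        match = SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError(f"unexpected key segment: {segment!r}")
        name, index = match.groups()
        path.append((name, int(index) if index else None))
    return path


flat = {
    "noun1_det": "imp",
    "noun1_stem1_orth": "n1",
    "noun1_stem1_pred": "_n1_n_rel",
    "noun2_det": "imp",
    "noun2_stem1_orth": "n2",
    "noun2_stem1_pred": "_n2_n_rel",
}

print(split_key("noun1_stem1_orth"))  # [('noun', 1), ('stem', 1), ('orth', None)]

# The distinct indices on the first segment show how many nouns the list holds.
noun_indices = sorted({split_key(key)[0][1] for key in flat})
print(noun_indices)  # [1, 2]
```

The list structure is therefore implicit: a reader (human or program) has to recover it from the numbers embedded in the key names.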

Nested Choices

At the time of writing, choices are nested using chained keys with _ delimiters, e.g. noun1_stem1_orth is the value for orth of the first stem of the first noun.

In TOML, these choices would be represented explicitly as nested dictionaries (maps).
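
The following sketch shows one way the chained keys could be turned into explicit nesting, and what the result might look like when written as TOML arrays of tables. This is an illustration only, not the layout this RFC settles on: the functions nest and to_toml are hypothetical, section grouping is left out, every parent segment is assumed to carry an index, and values are assumed to be plain strings without quotes or backslashes.

```python
import re

# Parent segments such as "noun1" or "noun-pc2": a name plus a 1-based index.
SEGMENT = re.compile(r"([A-Za-z-]+?)([0-9]+)")


def nest(flat):
    """Convert flat `a1_b1_c=value` keys into nested dicts and lists."""
    root = {}
    for key, value in flat.items():
        node = root
        *parents, leaf = key.split("_")
        for segment in parents:
            name, index = SEGMENT.fullmatch(segment).groups()
            items = node.setdefault(name, [])
            # Indices are 1-based; grow the list until the slot exists.
            while len(items) < int(index):
                items.append({})
            node = items[int(index) - 1]
        node[leaf] = value
    return root


def to_toml(table, path=()):
    """Emit string values first, then each list as a TOML array of tables."""
    lines = [f'{k} = "{v}"' for k, v in table.items() if isinstance(v, str)]
    for k, v in table.items():
        if isinstance(v, list):
            for item in v:
                lines.append("")
                lines.append("[[" + ".".join(path + (k,)) + "]]")
                lines.extend(to_toml(item, path + (k,)))
    return lines


flat = {
    "noun1_det": "imp",
    "noun1_stem1_orth": "n1",
    "noun1_stem1_pred": "_n1_n_rel",
    "noun2_det": "imp",
    "noun2_stem1_orth": "n2",
    "noun2_stem1_pred": "_n2_n_rel",
}

nested = nest(flat)
toml_text = "\n".join(to_toml(nested)).strip()
print(toml_text)
```

In the printed text, each noun becomes a [[noun]] table and each stem a [[noun.stem]] table beneath it, so the indices disappear from the key names and the list order is carried by position in the file.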

Serialized Example

Choices Schema

Decoupling Choices and Definitions

Meta

Date: June 2020

Contributors: T.J. Trimble, Michael Goodman, Olga Zamaraeva, Chris Curtis
