MatrixChoicesRfc
Work in Progress!
In this document, we outline an overhaul of the choices system for the Grammar Matrix. The primary changes are a move to a standard serialization format, the introduction of a choices schema, and the decoupling of choices from questionnaire display (matrixdef). Specifically, we propose using TOML, an open-source, non-proprietary, widely available data storage format; providing a schema for choices files; and decoupling choices serialization and deserialization from matrixdef and matrix.cgi. These changes will make development easier and help connect the Grammar Matrix to other projects, both as a component and as a target.
The Grammar Matrix is composed of several components, including but not limited to the following:
- matrix.tdl
- the customization system
- the questionnaire
- the regression tests
A choices file is a serialized data file containing both 1) the answers to questions in the questionnaire (as well as annotated data such as lexical entries) and 2) information mapping choices to the matrixdef file that participated in generating it. (2) includes, but is probably not limited to:
- the sections and section names of the matrixdef file
- the choice names and (optionally) valid choices of the matrixdef file
- the ordering of the above
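As a rough illustration of the two kinds of information described above, the sketch below models the answers and the questionnaire definition as separate Python structures and checks one against the other. Every name in it (`ChoiceDef`, `SectionDef`, `invalid_answers`, and the example section and choice names) is hypothetical; none of it is taken from choices.py or matrixdef.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChoiceDef:
    """One question from the questionnaire definition (hypothetical model)."""
    name: str
    valid_values: Optional[list[str]] = None  # None means any value is accepted


@dataclass
class SectionDef:
    """A named section; list order stands in for questionnaire order."""
    name: str
    choices: list[ChoiceDef] = field(default_factory=list)


def invalid_answers(sections, answers):
    """Return (section, choice, reason) triples for answers the definition rejects."""
    allowed = {
        (section.name, choice.name): choice.valid_values
        for section in sections
        for choice in section.choices
    }
    problems = []
    for section_name, section_answers in answers.items():
        for choice_name, value in section_answers.items():
            key = (section_name, choice_name)
            if key not in allowed:
                problems.append((section_name, choice_name, "unknown choice"))
            elif allowed[key] is not None and value not in allowed[key]:
                problems.append((section_name, choice_name, "invalid value"))
    return problems


# (2) Definition-side information: sections, choice names, valid values, ordering.
sections = [
    SectionDef("general", [ChoiceDef("language")]),
    SectionDef("word-order", [ChoiceDef("order", ["sov", "svo", "vso"])]),
]

# (1) The answers themselves, grouped by section.
answers = {
    "general": {"language": "Example"},
    "word-order": {"order": "unknown-order"},
}

print(invalid_answers(sections, answers))
# [('word-order', 'order', 'invalid value')]
```

Keeping the two structures separate is one way to let a choices file be read or written without consulting the questionnaire definition, with validation as an optional second step.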
Currently, choices are written in a proprietary storage format, a domain-specific language (DSL).
The code implementing choices is defined in three locations:
- choices.py (the main data structure definitions, API)
- deffile.py (choices deserialization)
- matrix.cgi (choices serialization)
The goals for changing this setup include the following:
- Ease of human reading: the current choices file format is generally human-readable, and any new format must preserve this. However, indexing and non-obvious nesting can make choices difficult for newcomers to understand.
- Ease of human editing: being able to identify and edit a choice without using the customization system is a critical requirement for use with matrix.py, debugging, and development.
- Ease of creation: it is currently difficult to write a choices file from scratch. One must read the current matrixdef file to understand the available sections, choices, and options. Ideally, one would be able to write choices with very few dependencies.
- Computer serialization and deserialization: choices are currently stored in a proprietary format, so reading and writing them requires a significant amount of custom code.
- Ease of development: currently, changing anything to do with choices probably requires changes in several locations and is brittle.
- Choices as a component: a modularized choices format could be used as a data format in other projects, whether or not they target the Grammar Matrix.
- Matrix as a target: modularizing the choices format could help others develop systems that use the Matrix customization system (e.g. AGGREGATION).
Using a standard serialization format allows the Grammar Matrix code, other code, and humans to read and write choices files more easily. It would also reduce the amount of code in the Grammar Matrix repository, making development easier.
Several candidates were considered for the serialization format: JSON, YAML, StrictYAML, and TOML.
- JSON: generally considered hard to read, hard to edit, and hard to create.
- YAML: YAML has several things going for it (widely used, easy to read and write, Python-inspired). However, it also has many problems: a lot of unnecessary features (code execution, multi-documents, lists as keys, etc.), problems with ambiguity and typing, and poor readability in large, highly nested documents.
- StrictYAML: StrictYAML is a subset of YAML that removes many features (e.g. code execution) and fixes many typing and ambiguity problems. However, it is currently only implemented in Python and still has readability problems with large, highly nested documents.
- TOML: TOML is a configuration language that represents maps and arrays, has fairly readable syntax, has implementations in many languages (Python, JavaScript, etc.), and uses some of the same syntax as the existing choices file format (e.g. nested dictionaries via dot notation).
We recommend TOML largely because it avoids the problems identified with YAML and is more readable than JSON.
Beyond the choice of language, there are decisions to make about how choices are organized within the choices file.