|
| 1 | +# Common |
| 2 | + |
| 3 | +## Important Key Ideas |
| 4 | + |
| 5 | +### Versioning |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +### Data, Representation and Encoding independence |
| 10 | + |
| 11 | +Everything in the interchange format should clearly distinguish between: |
| 12 | + |
| 13 | + * (a) the data "as a concept", |
| 14 | + * (b) the data "as represented" and |
| 15 | + * (c) the data "as encoded in a file format". |
| 16 | + |
| 17 | +For example, |
| 18 | + |
| 19 | + * (a) the data being represented could be "the name of an object", |
| 20 | + * (b) but it could be represented as an integer pointing to a UTF-8 string |
| 21 | + table, and |
| 22 | + * (c) that could be encoded in either the XML or the Cap'n Proto file format. |
| 23 | + |
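The three layers in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StringTable` class, the `encode_*`/`decode_*` helpers, and the element and key names are hypothetical and not part of the specification, and JSON stands in for the binary encoding so that the sketch needs nothing beyond the standard library.

```python
import json
import xml.etree.ElementTree as ET


class StringTable:
    """Hypothetical representation layer: strings are stored once and
    referred to by integer index."""

    def __init__(self):
        self._strings = []
        self._index = {}

    def intern(self, text):
        # Return the existing index if the string is already in the table.
        if text not in self._index:
            self._index[text] = len(self._strings)
            self._strings.append(text)
        return self._index[text]

    def strings(self):
        return list(self._strings)


def encode_xml(table, name_idx):
    # Encoding 1: the same representation written as XML.
    root = ET.Element("object", name=str(name_idx))
    strings = ET.SubElement(root, "strings")
    for s in table.strings():
        ET.SubElement(strings, "s").text = s
    return ET.tostring(root, encoding="unicode")


def encode_json(table, name_idx):
    # Encoding 2: the same representation written as JSON.
    return json.dumps({"name": name_idx, "strings": table.strings()},
                      sort_keys=True)


def decode_xml_name(text):
    root = ET.fromstring(text)
    strings = [s.text for s in root.find("strings")]
    return strings[int(root.get("name"))]


def decode_json_name(text):
    doc = json.loads(text)
    return doc["strings"][doc["name"]]
```

Both decoders recover the same concept ("the name of an object") from the same representation (an index into a string table), even though the bytes on disk differ.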
| 24 | +### On disk representation |
| 25 | + |
| 26 | +The interchange format should define both: |
| 27 | + |
| 28 | + * (a) a compact, binary, machine-readable format, **and** |
| 29 | + * (b) a text-based, human-readable format. |
| 30 | + |
| 31 | +Tools should exist that convert losslessly between the machine-readable and |
| 32 | +human-readable formats. |
| 33 | + |
| 34 | +The preferred on-disk formats for the interchange format are: |
| 35 | + |
| 36 | + * (a) Binary, machine-readable format - **Cap'n Proto** |
| 37 | + * (b) Text-based, human-readable format - **XML** |
| 38 | + |
| 39 | +These two formats were selected because they have: |
| 40 | + |
| 41 | + * A well-defined schema format. |
| 42 | + * Good support by almost all languages, including the important languages of |
| 43 | + C++, Python and Java. |
| 44 | + * Already in use by core target tools. |
| 45 | + |
| 46 | +While **XML** is the preferred text-based format, to enable wider adoption of |
| 47 | +the interchange format, **optional** support for *alternative* human readable |
| 48 | +text formats is encouraged. |
| 49 | + |
| 50 | +High-value target formats include: |
| 51 | + |
| 52 | + * JSON |
| 53 | + * YAML |
| 54 | + |
| 55 | + |
| 56 | +#### Schemas |
| 57 | + |
| 58 | +To make sure that files comply with the interchange specification, schemas for |
| 59 | +the on-disk file formats which allow at least some automatic validation should |
| 60 | +be provided. |
| 61 | + |
| 62 | +#### Backwards Compatibility |
| 63 | + |
| 64 | +Schemas for the file formats should be extended in ways that maintain backwards |
| 65 | +compatibility with previous on-disk formats. |
| 66 | + |
| 67 | +Making breaking changes to on-disk formats requires a new major version of the |
| 68 | +specification to be published. |
| 69 | + |
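A reader could enforce the major-version rule above with a check like the following. This is a minimal sketch that assumes files carry a `major.minor` version string; the specification does not define such a field here, and the function names are hypothetical.

```python
def parse_version(text):
    """Split a hypothetical 'major.minor' version string into integers."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_read(reader_version, file_version):
    """Return True if a reader built for reader_version can load the file.

    Assumption: only a major bump may break the on-disk format, so a reader
    accepts any file with the same major version and an equal or older
    minor version.
    """
    r_major, r_minor = parse_version(reader_version)
    f_major, f_minor = parse_version(file_version)
    return f_major == r_major and f_minor <= r_minor
```

Refusing files with a newer minor version is a conservative choice for this sketch; whether older readers should tolerate newer minor revisions is not stated in the text above.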
| 70 | + |
| 71 | +#### Common Metadata |
| 72 | + |
| 73 | +All files should have a set of common metadata to make it easy to connect files |
| 74 | +together and understand their relationship. |
| 75 | + |
| 76 | +As the file output should be deterministic, files **should** include the |
| 77 | +details required to reproduce the file output easily. |
| 78 | + |
| 79 | +This includes: |
| 80 | + |
| 81 | + * Checksum of inputs |
| 82 | + * Information (version, command-line arguments, random seed, etc.) about the |
| 83 | + tooling used to create the file. |
| 84 | + |
| 85 | +Files should **not** include: |
| 86 | + |
| 87 | + * Anything which makes builds non-reproducible. |
| 88 | + See https://reproducible-builds.org/docs/ for common examples. |
| 89 | + |
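As an illustration of the metadata described above, the sketch below builds a deterministic metadata record from input checksums and tool details. The field names, the choice of SHA-256, and the JSON serialization are assumptions made for the example; the specification does not prescribe them.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Checksum of one input, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_metadata(inputs, tool_name, tool_version, args, seed):
    """Build a hypothetical common-metadata record.

    `inputs` maps an input name to its raw bytes. Nothing time- or
    host-dependent (timestamps, hostnames, absolute paths) is recorded,
    since that would make the output non-reproducible.
    """
    return {
        "inputs": {name: sha256_of(data) for name, data in inputs.items()},
        "tool": {
            "name": tool_name,
            "version": tool_version,
            "args": list(args),
            "seed": seed,
        },
    }


def serialize_metadata(meta) -> str:
    # Sorted keys and fixed separators give byte-identical output for
    # identical metadata, regardless of dict insertion order.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"))
```

Running the same tool version with the same arguments, seed, and inputs then yields an identical metadata block, which makes it easy to check whether two files were produced from the same sources.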
| 90 | + |
| 91 | +#### String Storage |
| 92 | + |
| 93 | + * A significant percentage of the data in all the files is strings that are |
| 94 | + needed only by humans. |
| 95 | + |
| 96 | + * These strings are frequently used for identifiers. |
| 97 | + |
| 98 | + * For this reason special care has been taken around both the representation |
| 99 | + and the on-disk encoding of these strings. |
| 100 | + |
| 101 | + |
| 102 | + |