add documentation page about datasets versioning (Issue #721).

pierrot0 · copybara-github · commit 3845b546ca8c · 2019-07-04T00:27:30.000-07:00
PiperOrigin-RevId: 256502877
diff --git a/README.md b/README.md
@@ -8,7 +8,13 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
 * [List of datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
 * [Try it in Colab](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
 * [API docs](https://www.tensorflow.org/datasets/api_docs/python/tfds)
-* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
+* Guides
+  * [Overview](https://www.tensorflow.org/datasets/overview)
+  * [Datasets versioning](https://www.tensorflow.org/datasets/datasets_versioning)
+  * [Using splits and slicing API](https://www.tensorflow.org/datasets/splits)
+  * [Add a dataset](https://www.tensorflow.org/datasets/add_dataset)
+  * [Add a huge dataset (>>100GiB)](https://www.tensorflow.org/datasets/beam_datasets)
+
 
 **Table of Contents**
 
diff --git a/docs/add_dataset.md b/docs/add_dataset.md
@@ -138,6 +138,9 @@ If you'd like to follow a test-driven development workflow, which can help you
 iterate faster, jump to the [testing instructions](#testing-mydataset) below,
 add the test, and then return here.
 
+For an explanation of what the version is, please read
+[datasets versioning](datasets_versioning.md).
+
 ## Specifying `DatasetInfo`
 
 [`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) describes the
diff --git a/docs/datasets_versioning.md b/docs/datasets_versioning.md
@@ -0,0 +1,162 @@
+# Datasets versioning
+
+*  [Semantic](#semantic)
+*  [Supported versions](#supported-versions)
+*  [Loading a specific version](#loading-a-specific-version)
+*  [Experiments](#experiments)
+*  [BUILDER_CONFIGS and versions](#builder-configs-and-versions)
+
+## Semantic
+
+Every `DatasetBuilder` defined in TFDS comes with a version, for example:
+
+```py
+class MNIST(tfds.core.GeneratorBasedBuilder):
+  VERSION = tfds.core.Version("1.0.0")
+```
+
+The version follows
+[Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html):
+`MAJOR.MINOR.PATCH`. The purpose of the version is to be able to guarantee
+reproducibility: loading a given dataset at a fixed version yields the same
+data. More specifically:
+
+ - If `PATCH` version is incremented, data as read by the client is the same,
+ although data might be serialized differently on disk. For any given slice, the
+ slicing API returns the same set of records.
+ - If `MINOR` version is incremented, existing data as read by the client is the
+ same, but there is additional data (features in each record). For any given
+ slice, the slicing API returns the same set of records.
+ - If `MAJOR` version is incremented, the existing data has been changed and/or
+ the slicing API doesn't necessarily return the same set of records for a given
+ slice.
+
+When a code change is made to the TFDS library and that code change impacts the
+way a dataset is being serialized and/or read by the client, then the
+corresponding builder version is incremented according to the above guidelines.
+
+Note that the above semantics are best effort, and there might be unnoticed bugs
+impacting a dataset while the version was not incremented. Such bugs are
+eventually fixed, but if you heavily rely on the versioning, we advise you to
+use TFDS from a released version (as opposed to `HEAD`).
+
+Also note that some datasets have another versioning scheme independent from
+the TFDS version. For example, the Open Images dataset has several versions,
+and in TFDS, the corresponding builders are `open_images_v4`, `open_images_v5`,
+...
+
+## Supported versions
+
+A `DatasetBuilder` can support several versions, which can be either higher or
+lower than the canonical version. For example:
+
+```py
+class Imagenet2012(tfds.core.GeneratorBasedBuilder):
+  VERSION = tfds.core.Version('2.0.1')
+  SUPPORTED_VERSIONS = [
+      tfds.core.Version('3.0.0'),
+      tfds.core.Version('2.0.1'),
+      tfds.core.Version('1.0.0'),
+  ]
+  # Version history:
+  # 3.0.0: Fix colorization (all RGB) and format (all jpeg).
+  # 2.0.1: Encoding fix. No changes from user point of view.
+  # 2.0.0: Fix validation labels.
+  # 1.0.0: Initial definition of imagenet dataset.
+```
+
+The choice to continue supporting an older version is made on a case-by-case
+basis, mainly based on the popularity of the dataset and version. Eventually, we
+aim to support only a limited number of versions per dataset, ideally one. In
+the above example, we can see that version `2.0.0` is no longer supported, as it
+is identical to `2.0.1` from a reader's perspective.
+
+Supported versions with a higher number than the canonical version number are
+considered experimental and might be broken. They will however eventually be
+made canonical.
+
+## Loading a specific version
+
+When loading a dataset or a `DatasetBuilder`, you can specify the version to
+use. For example:
+
+```py
+tfds.load('imagenet2012:2.0.1')
+tfds.builder('imagenet2012:2.0.1')
+
+tfds.load('imagenet2012:2.0.0')  # Error: unsupported version.
+
+# Resolves to 3.0.0 for now, but would resolve to 3.1.1 if/when added.
+tfds.load('imagenet2012:3.*.*')
+```
+
+If using TFDS for a publication, we advise you to:
+
+ - **fix the `MAJOR` component of the version only**;
+ - **advertise which version of the dataset was used in your results.**
+
+Doing so should make it easier for your future self, your readers and
+reviewers to reproduce your results.
+
+## Experiments
+
+To gradually roll out changes in TFDS that impact many dataset builders,
+we introduced the notion of experiments. When first introduced, an experiment
+is disabled by default, but specific dataset versions can decide to enable it.
+This will typically be done on "future" versions (not made canonical yet) at
+first. For example:
+
+```py
+class MNIST(tfds.core.GeneratorBasedBuilder):
+  VERSION = tfds.core.Version("1.0.0")
+  SUPPORTED_VERSIONS = [
+      tfds.core.Version("2.0.0", experiments={tfds.core.Experiment.S3: True}),
+      tfds.core.Version("1.0.0"),
+  ]
+  # Version history:
+  # 2.0.0: S3 (new shuffling, sharding and slicing mechanism).
+```
+
+Once an experiment has been verified to work as expected, it will be extended to
+all or most datasets, at which point it can be enabled by default, and the above
+definition would then look like:
+
+```py
+class MNIST(tfds.core.GeneratorBasedBuilder):
+  VERSION = tfds.core.Version("2.0.0")
+  SUPPORTED_VERSIONS = [
+      tfds.core.Version("2.0.0"),
+      tfds.core.Version("1.0.0", experiments={tfds.core.Experiment.S3: False}),
+  ]
+  # Version history:
+  # 2.0.0: S3 (new shuffling, sharding and slicing mechanism), order of records
+  # changes, set of returned records when using slicing API is different.
+```
+
+Once an experiment is used across all dataset versions (there is no dataset
+version left specifying `{experiment: False}`), the experiment can be deleted.
+
+Experiments and their description are defined in `core/utils/version.py`.
+
+## BUILDER_CONFIGS and versions
+
+Some datasets define several `BUILDER_CONFIGS`. When that is the case, `version`
+and `supported_versions` are defined on the config objects themselves. Other
+than that, semantics and usage are identical. For example:
+
+```py
+class OpenImagesV4(tfds.core.GeneratorBasedBuilder):
+
+  BUILDER_CONFIGS = [
+      OpenImagesV4Config(
+          name='original',
+          version=tfds.core.Version('0.2.0'),
+          supported_versions=[
+              tfds.core.Version('1.0.0'),
+          ],
+          description='Images at their original resolution and quality.'),
+      ...
+  ]
+
+tfds.load('open_images_v4/original:1.*.*')
+```