Commit 3845b54

pierrot0 authored and copybara-github committed
add documentation page about datasets versioning (Issue #721).
PiperOrigin-RevId: 256502877
1 parent d4cd045 commit 3845b54

3 files changed: +172 -1 lines changed

README.md

Lines changed: 7 additions & 1 deletion
@@ -8,7 +8,13 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
 * [List of datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
 * [Try it in Colab](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
 * [API docs](https://www.tensorflow.org/datasets/api_docs/python/tfds)
-* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
+* Guides
+* [Overview](https://www.tensorflow.org/datasets/overview)
+* [Datasets versioning](https://www.tensorflow.org/datasets/datasets_versioning)
+* [Using splits and slicing API](https://www.tensorflow.org/datasets/splits)
+* [Add a dataset](https://www.tensorflow.org/datasets/add_dataset)
+* [Add a huge dataset (>>100GiB)](https://www.tensorflow.org/datasets/beam_datasets)
+

 **Table of Contents**

docs/add_dataset.md

Lines changed: 3 additions & 0 deletions
@@ -138,6 +138,9 @@ If you'd like to follow a test-driven development workflow, which can help you
 iterate faster, jump to the [testing instructions](#testing-mydataset) below,
 add the test, and then return here.

+For an explanation of what the version is, please read
+[datasets versioning](datasets_versioning.md).
+
 ## Specifying `DatasetInfo`

 [`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) describes the

docs/datasets_versioning.md

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# Datasets versioning

* [Semantic](#semantic)
* [Supported versions](#supported-versions)
* [Loading a specific version](#loading-a-specific-version)
* [Experiments](#experiments)
* [BUILDER_CONFIGS and versions](#builder-configs-and-versions)

## Semantic

Every `DatasetBuilder` defined in TFDS comes with a version, for example:

```py
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
```

The version follows
[Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html):
`MAJOR.MINOR.PATCH`. The purpose of the version is to guarantee
reproducibility: loading a given dataset at a fixed version yields the same
data. More specifically:

- If `PATCH` version is incremented, data as read by the client is the same,
  although data might be serialized differently on disk. For any given slice, the
  slicing API returns the same set of records.
- If `MINOR` version is incremented, existing data as read by the client is the
  same, but there is additional data (features in each record). For any given
  slice, the slicing API returns the same set of records.
- If `MAJOR` version is incremented, the existing data has been changed and/or
  the slicing API doesn't necessarily return the same set of records for a given
  slice.

When a code change is made to the TFDS library and that code change impacts the
way a dataset is being serialized and/or read by the client, then the
corresponding builder version is incremented according to the above guidelines.

Note that the above semantics are best effort, and there might be unnoticed bugs
impacting a dataset while the version was not incremented. Such bugs are
eventually fixed, but if you heavily rely on the versioning, we advise you to
use TFDS from a released version (as opposed to `HEAD`).

Also note that some datasets have another versioning scheme independent from
the TFDS version. For example, the Open Images dataset has several versions,
and in TFDS, the corresponding builders are `open_images_v4`, `open_images_v5`,
...
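
The `PATCH` / `MINOR` / `MAJOR` guarantees listed in this section can be sketched in plain Python. This is only an illustration of the rules as stated above, not code from TFDS: `parse_version` and `describe_bump` are hypothetical helper names, and the sketch assumes plain `MAJOR.MINOR.PATCH` strings with numeric components.

```python
def parse_version(version):
  """Splits 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
  major, minor, patch = (int(part) for part in version.split("."))
  return major, minor, patch


def describe_bump(old, new):
  """Names the highest-order component that differs between two versions."""
  old_v, new_v = parse_version(old), parse_version(new)
  if new_v[0] != old_v[0]:
    # Existing data changed and/or slices may return different records.
    return "MAJOR: existing data and/or slices may differ"
  if new_v[1] != old_v[1]:
    # Same records per slice, but each record may carry extra features.
    return "MINOR: same records per slice, additional features"
  if new_v[2] != old_v[2]:
    # Only the on-disk serialization may differ.
    return "PATCH: same data as read by the client"
  return "identical version"
```

For instance, `describe_bump("2.0.0", "2.0.1")` reports a `PATCH` change, so a reader pinned to `2.0.0` should see the same data under `2.0.1`.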

## Supported versions

A `DatasetBuilder` can support several versions, which can be either higher or
lower than the canonical version. For example:

```py
class Imagenet2012(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('2.0.1')
  SUPPORTED_VERSIONS = [
      tfds.core.Version('3.0.0'),
      tfds.core.Version('2.0.1'),
      tfds.core.Version('1.0.0'),
  ]
  # Version history:
  # 3.0.0: Fix colorization (all RGB) and format (all jpeg).
  # 2.0.1: Encoding fix. No changes from user point of view.
  # 2.0.0: Fix validation labels.
  # 1.0.0: Initial definition of imagenet dataset.
```

The choice to continue supporting an older version is made on a case-by-case
basis, mainly based on the popularity of the dataset and version. Eventually, we
aim to support only a limited number of versions per dataset, ideally one. In
the above example, version `2.0.0` is no longer supported, as it is identical to
`2.0.1` from a reader's perspective.

Supported versions with a higher number than the canonical version number are
considered experimental and might be broken. They will, however, eventually be
made canonical.
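
As a rough illustration of the relationship described above, the following sketch labels each supported version relative to the canonical one. It is not TFDS code: `as_tuple` and `classify_supported` are hypothetical helpers, the "legacy" label is our own shorthand for a still-supported older version, and versions are assumed to be plain numeric `MAJOR.MINOR.PATCH` strings.

```python
def as_tuple(version):
  """Converts 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
  return tuple(int(part) for part in version.split("."))


def classify_supported(canonical, supported):
  """Labels each supported version relative to the canonical version."""
  labels = {}
  for version in supported:
    if as_tuple(version) > as_tuple(canonical):
      # Higher than canonical: experimental, might be broken.
      labels[version] = "experimental"
    elif as_tuple(version) == as_tuple(canonical):
      labels[version] = "canonical"
    else:
      # Lower than canonical: an older version that is still supported.
      labels[version] = "legacy"
  return labels


# Values taken from the Imagenet2012 example above.
labels = classify_supported("2.0.1", ["3.0.0", "2.0.1", "1.0.0"])
```

With the `Imagenet2012` values, `3.0.0` comes out as experimental, `2.0.1` as canonical and `1.0.0` as legacy.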

## Loading a specific version

When loading a dataset or a `DatasetBuilder`, you can specify the version to
use. For example:

```py
tfds.load('imagenet2012:2.0.1')
tfds.builder('imagenet2012:2.0.1')

tfds.load('imagenet2012:2.0.0')  # Error: unsupported version.

# Resolves to 3.0.0 for now, but would resolve to 3.1.1 if/when it is added.
tfds.load('imagenet2012:3.*.*')
```
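
The wildcard behaviour in the comments above (`3.*.*` resolving to the highest matching supported version) can be sketched as follows. This is an assumption-laden illustration rather than the TFDS implementation: `resolve_version` is a hypothetical helper, it expects exactly three dot-separated components, and it returns `None` where the example above shows TFDS reporting an unsupported-version error.

```python
def resolve_version(requested, supported):
  """Returns the highest supported version matching `requested`, or None.

  `requested` may use '*' for any component, e.g. '3.*.*'.
  """
  wanted = requested.split(".")
  matches = []
  for version in supported:
    parts = version.split(".")
    # A component matches if it is a wildcard or textually equal.
    if all(w == "*" or w == p for w, p in zip(wanted, parts)):
      matches.append(version)
  if not matches:
    return None
  # Compare numerically so that, e.g., 3.10.0 ranks above 3.9.0.
  return max(matches, key=lambda v: tuple(int(p) for p in v.split(".")))
```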

If using TFDS for a publication, we advise you to:

- **fix the `MAJOR` component of the version only**;
- **advertise which version of the dataset was used in your results.**

Doing so should make it easier for your future self, your readers and
reviewers to reproduce your results.

## Experiments

To gradually roll out changes in TFDS that affect many dataset builders,
we introduced the notion of experiments. When first introduced, an experiment
is disabled by default, but specific dataset versions can decide to enable it.
This will typically be done on "future" versions (not made canonical yet) at
first. For example:

```py
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  SUPPORTED_VERSIONS = [
      tfds.core.Version("2.0.0", experiments={tfds.core.Experiment.S3: True}),
      tfds.core.Version("1.0.0"),
  ]
  # Version history:
  # 2.0.0: S3 (new shuffling, sharding and slicing mechanism).
```

Once an experiment has been verified to work as expected, it will be extended to
all or most datasets, at which point it can be enabled by default, and the above
definition would then look like:

```py
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("2.0.0")
  SUPPORTED_VERSIONS = [
      tfds.core.Version("2.0.0"),
      tfds.core.Version("1.0.0", experiments={tfds.core.Experiment.S3: False}),
  ]
  # Version history:
  # 2.0.0: S3 (new shuffling, sharding and slicing mechanism), order of records
  #        changes, set of returned records when using slicing API is different.
```

Once an experiment is used across all dataset versions (there is no dataset
version left specifying `{experiment: False}`), the experiment can be deleted.

Experiments and their descriptions are defined in `core/utils/version.py`.
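
The rollout described in this section amounts to a library-wide default per experiment plus optional per-version overrides. The sketch below illustrates that idea only; it does not reproduce `core/utils/version.py`. The `Experiment` enum, `DEFAULT_EXPERIMENTS` and `uses_experiment` are hypothetical stand-ins defined here for the example.

```python
import enum


class Experiment(enum.Enum):
  """Hypothetical stand-in for the library's experiment enum."""
  S3 = "s3"


# Library-wide defaults: a newly introduced experiment starts disabled.
DEFAULT_EXPERIMENTS = {Experiment.S3: False}


def uses_experiment(experiment, version_overrides=None,
                    defaults=DEFAULT_EXPERIMENTS):
  """Returns whether `experiment` is active for one dataset version.

  A per-version override wins; otherwise the library default applies.
  """
  overrides = version_overrides or {}
  return overrides.get(experiment, defaults[experiment])
```

During the first phase, a "future" version opts in with `{Experiment.S3: True}`. Once the default flips to `True`, an older version can opt out with `{Experiment.S3: False}`, and when no version opts out any more the experiment entry can be removed.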

## BUILDER_CONFIGS and versions

Some datasets define several `BUILDER_CONFIGS`. When that is the case, `version`
and `supported_versions` are defined on the config objects themselves. Other
than that, semantics and usage are identical. For example:

```py
class OpenImagesV4(tfds.core.GeneratorBasedBuilder):

  BUILDER_CONFIGS = [
      OpenImagesV4Config(
          name='original',
          version=tfds.core.Version('0.2.0'),
          supported_versions=[
              tfds.core.Version('1.0.0'),
          ],
          description='Images at their original resolution and quality.'),
      ...
  ]

tfds.load('open_images_v4/original:1.*.*')
```
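
The string passed to `tfds.load` in these examples appears to follow a `name[/config][:version]` pattern. The sketch below splits such a string into its parts. The pattern is inferred from the examples in this page only, and `parse_dataset_name` is a hypothetical helper, not a TFDS function.

```python
def parse_dataset_name(spec):
  """Splits 'name[/config][:version]' into (name, config, version).

  Missing parts are returned as None.
  """
  # Everything after the first ':' is the (possibly wildcarded) version.
  name_and_config, _, version = spec.partition(":")
  # Everything after the first '/' is the builder config name.
  name, _, config = name_and_config.partition("/")
  return name, config or None, version or None
```

For example, `parse_dataset_name('open_images_v4/original:1.*.*')` yields the builder name `open_images_v4`, the config `original` and the version pattern `1.*.*`.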
