| 1 | +# Datasets versioning |
| 2 | + |
| 3 | +* [Semantic](#semantic) |
| 4 | +* [Supported versions](#supported-versions) |
| 5 | +* [Loading a specific version](#loading-a-specific-version) |
| 6 | +* [Experiments](#experiments) |
| 7 | +* [BUILDER_CONFIGS and versions](#builder_configs-and-versions) |
| 8 | + |
| 9 | +## Semantic |
| 10 | + |
| 11 | +Every `DatasetBuilder` defined in TFDS comes with a version, for example: |
| 12 | + |
| 13 | +```py |
| 14 | +class MNIST(tfds.core.GeneratorBasedBuilder): |
| 15 | + VERSION = tfds.core.Version("1.0.0") |
| 16 | +``` |
| 17 | + |
| 18 | +The version follows |
| 19 | +[Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html): |
| 20 | +`MAJOR.MINOR.PATCH`. The purpose of the version is to be able to guarantee |
| 21 | +reproducibility: loading a given dataset at a fixed version yields the same |
| 22 | +data. More specifically: |
| 23 | + |
| 24 | + - If `PATCH` version is incremented, data as read by the client is the same, |
| 25 | + although data might be serialized differently on disk. For any given slice, the |
| 26 | + slicing API returns the same set of records. |
| 27 | + - If `MINOR` version is incremented, existing data as read by the client is the |
| 28 | + same, but there is additional data (features in each record). For any given |
| 29 | + slice, the slicing API returns the same set of records. |
| 30 | + - If `MAJOR` version is incremented, the existing data has been changed and/or |
| 31 | + the slicing API doesn't necessarily return the same set of records for a given |
| 32 | + slice. |
| 33 | + |
| 34 | +When a code change is made to the TFDS library and that code change impacts the |
| 35 | +way a dataset is being serialized and/or read by the client, then the |
| 36 | +corresponding builder version is incremented according to the above guidelines. |
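The bump rules above can be sketched in plain Python. This is only an illustration of the `MAJOR.MINOR.PATCH` semantics described in this section, not TFDS code: `parse_version`, `classify_bump`, and `same_existing_data` are hypothetical helper names, and the sketch assumes plain three-part numeric versions.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Splits 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Returns which component changed first between two versions.

    Hypothetical helper: assumes `new` is strictly greater than `old`.
    """
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not greater than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


def same_existing_data(old: str, new: str) -> bool:
    """True if existing data and slices are expected to be unchanged.

    Per the rules above, only a MAJOR bump may change existing data or
    the records returned for a given slice.
    """
    return classify_bump(old, new) in ("MINOR", "PATCH")
```

For instance, going from `2.0.0` to `2.0.1` is a `PATCH` bump, so a reader should see the same records, whereas going from `2.0.1` to `3.0.0` gives no such guarantee.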
| 37 | + |
| 38 | +Note that the above semantics are best effort, and there might be unnoticed bugs |
| 39 | +impacting a dataset while the version was not incremented. Such bugs are |
| 40 | +eventually fixed, but if you heavily rely on the versioning, we advise you to |
| 41 | +use TFDS from a released version (as opposed to `HEAD`). |
| 42 | + |
| 43 | +Also note that some datasets have another versioning scheme independent from |
| 44 | +the TFDS version. For example, the Open Images dataset has several versions, |
| 45 | +and in TFDS, the corresponding builders are `open_images_v4`, `open_images_v5`, |
| 46 | +... |
| 47 | + |
| 48 | +## Supported versions |
| 49 | + |
| 50 | +A `DatasetBuilder` can support several versions, which can be either higher or |
| 51 | +lower than the canonical version. For example: |
| 52 | + |
| 53 | +```py |
| 54 | +class Imagenet2012(tfds.core.GeneratorBasedBuilder): |
| 55 | + VERSION = tfds.core.Version('2.0.1') |
| 56 | + SUPPORTED_VERSIONS = [ |
| 57 | + tfds.core.Version('3.0.0'), |
| 58 | + tfds.core.Version('2.0.1'), |
| 59 | + tfds.core.Version('1.0.0'), |
| 60 | + ] |
| 61 | + # Version history: |
| 62 | + # 3.0.0: Fix colorization (all RGB) and format (all jpeg). |
| 63 | + # 2.0.1: Encoding fix. No changes from user point of view. |
| 64 | + # 2.0.0: Fix validation labels. |
| 65 | + # 1.0.0: Initial definition of imagenet dataset. |
| 66 | +``` |
| 67 | + |
| 68 | +The choice to continue supporting an older version is made on a case-by-case |
| 69 | +basis, mainly based on the popularity of the dataset and version. Eventually, we |
| 70 | +aim to support only a limited number of versions per dataset, ideally one. In |
| 71 | +the above example, we can see that version `2.0.0` is no longer supported, as |
| 72 | +it is identical to `2.0.1` from a reader's perspective. |
| 73 | + |
| 74 | +Supported versions with a higher number than the canonical version number are |
| 75 | +considered experimental and might be broken. They will however eventually be |
| 76 | +made canonical. |
| 77 | + |
| 78 | +## Loading a specific version |
| 79 | + |
| 80 | +When loading a dataset or a `DatasetBuilder`, you can specify the version to |
| 81 | +use. For example: |
| 82 | + |
| 83 | +```py |
| 84 | +tfds.load('imagenet2012:2.0.1') |
| 85 | +tfds.builder('imagenet2012:2.0.1') |
| 86 | + |
| 87 | +tfds.load('imagenet2012:2.0.0') # Error: unsupported version. |
| 88 | + |
| 89 | +# Resolves to 3.0.0 for now, but would resolve to 3.1.1 if/when added. |
| 90 | +tfds.load('imagenet2012:3.*.*') |
| 91 | +``` |
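How a wildcard request such as `3.*.*` might be matched against a builder's supported versions can be sketched as follows. This is a minimal, standalone illustration and not the TFDS implementation: `resolve_version` is a hypothetical helper, and picking the highest matching version is an assumption drawn from the comment in the example above.

```python
from typing import List, Tuple


def _as_tuple(version: str) -> Tuple[int, ...]:
    """Converts '3.1.1' to (3, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_version(requested: str, supported: List[str]) -> str:
    """Returns the highest supported version matching `requested`.

    `requested` is 'MAJOR.MINOR.PATCH' where any component may be '*'.
    Raises ValueError if no supported version matches.
    """
    pattern = requested.split(".")
    if len(pattern) != 3:
        raise ValueError(f"Malformed version: {requested}")
    matches = [
        version
        for version in supported
        if all(
            want == "*" or want == have
            for want, have in zip(pattern, version.split("."))
        )
    ]
    if not matches:
        raise ValueError(f"Unsupported version: {requested}")
    return max(matches, key=_as_tuple)
```

With the supported versions from the `Imagenet2012` example (`3.0.0`, `2.0.1`, `1.0.0`), `resolve_version("3.*.*", ...)` picks `3.0.0`, and requesting `2.0.0` raises an error because that exact version is not in the list.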
| 92 | + |
| 93 | +If using TFDS for a publication, we advise you to: |
| 94 | + |
| 95 | + - **fix the `MAJOR` component of the version only**; |
| 96 | + - **advertise which version of the dataset was used in your results.** |
| 97 | + |
| 98 | +Doing so should make it easier for your future self, your readers and |
| 99 | +reviewers to reproduce your results. |
| 100 | + |
| 101 | +## Experiments |
| 102 | + |
| 103 | +To gradually roll out changes in TFDS that impact many dataset builders, |
| 104 | +we introduced the notion of experiments. When first introduced, an experiment |
| 105 | +is disabled by default, but specific dataset versions can decide to enable it. |
| 106 | +This will typically be done on "future" versions (not made canonical yet) at |
| 107 | +first. For example: |
| 108 | + |
| 109 | +```py |
| 110 | +class MNIST(tfds.core.GeneratorBasedBuilder): |
| 111 | + VERSION = tfds.core.Version("1.0.0") |
| 112 | + SUPPORTED_VERSIONS = [ |
| 113 | + tfds.core.Version("2.0.0", experiments={tfds.core.Experiment.S3: True}), |
| 114 | + tfds.core.Version("1.0.0"), |
| 115 | + ] |
| 116 | + # Version history: |
| 117 | + # 2.0.0: S3 (new shuffling, sharding and slicing mechanism). |
| 118 | +``` |
| 119 | + |
| 120 | +Once an experiment has been verified to work as expected, it will be extended to |
| 121 | +all or most datasets, at which point it can be enabled by default, and the above |
| 122 | +definition would then look like: |
| 123 | + |
| 124 | +```py |
| 125 | +class MNIST(tfds.core.GeneratorBasedBuilder): |
| 126 | + VERSION = tfds.core.Version("2.0.0") |
| 127 | + SUPPORTED_VERSIONS = [ |
| 128 | + tfds.core.Version("2.0.0"), |
| 129 | + tfds.core.Version("1.0.0", experiments={tfds.core.Experiment.S3: False}), |
| 130 | + ] |
| 131 | + # Version history: |
| 132 | + # 2.0.0: S3 (new shuffling, sharding and slicing mechanism), order of records |
| 133 | + # changes, set of returned records when using slicing API is different. |
| 134 | +``` |
| 135 | + |
| 136 | +Once an experiment is used across all dataset versions (there is no dataset |
| 137 | +version left specifying `{experiment: False}`), the experiment can be deleted. |
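The experiment lifecycle described in this section (disabled by default, enabled per version, then enabled by default with opt-outs, then deleted) can be modelled with plain dictionaries. This is a hedged sketch only: `DEFAULT_EXPERIMENTS`, `effective_experiments`, and `can_delete_experiment` are hypothetical names, experiments are represented as strings rather than `tfds.core.Experiment` members, and the merge rule is an assumption about how per-version overrides combine with defaults.

```python
from typing import Dict, Iterable, Optional

# Hypothetical library-wide defaults: S3 starts out disabled.
DEFAULT_EXPERIMENTS: Dict[str, bool] = {"S3": False}


def effective_experiments(
    overrides: Optional[Dict[str, bool]] = None,
    defaults: Optional[Dict[str, bool]] = None,
) -> Dict[str, bool]:
    """Merges a version's experiment overrides on top of the defaults."""
    merged = dict(DEFAULT_EXPERIMENTS if defaults is None else defaults)
    merged.update(overrides or {})
    return merged


def can_delete_experiment(
    name: str, all_version_overrides: Iterable[Dict[str, bool]]
) -> bool:
    """True once no version explicitly sets `{name: False}`."""
    return not any(
        overrides.get(name) is False for overrides in all_version_overrides
    )
```

In this model, the first `MNIST` definition above corresponds to `effective_experiments({"S3": True})` for version `2.0.0`, while the second corresponds to defaults of `{"S3": True}` with version `1.0.0` passing `{"S3": False}`.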
| 138 | + |
| 139 | +Experiments and their description are defined in `core/utils/version.py`. |
| 140 | + |
| 141 | +## BUILDER_CONFIGS and versions |
| 142 | + |
| 143 | +Some datasets define several `BUILDER_CONFIGS`. When that is the case, `version` |
| 144 | +and `supported_versions` are defined on the config objects themselves. Other |
| 145 | +than that, semantics and usage are identical. For example: |
| 146 | + |
| 147 | +```py |
| 148 | +class OpenImagesV4(tfds.core.GeneratorBasedBuilder): |
| 149 | + |
| 150 | + BUILDER_CONFIGS = [ |
| 151 | + OpenImagesV4Config( |
| 152 | + name='original', |
| 153 | + version=tfds.core.Version('0.2.0'), |
| 154 | +supported_versions=[ |
| 155 | + tfds.core.Version('1.0.0'), |
| 156 | + ], |
| 157 | + description='Images at their original resolution and quality.'), |
| 158 | + ... |
| 159 | + ] |
| 160 | + |
| 161 | +tfds.load('open_images_v4/original:1.*.*') |
| 162 | +``` |