Commit 501ce7c

pierrot0 authored and copybara-github committed

add documentation on S3 slicing API (Issue #737).

PiperOrigin-RevId: 257881514

1 parent 727d8e9 commit 501ce7c

File tree

1 file changed

+154
-14
lines changed

1 file changed

+154
-14
lines changed

docs/splits.md

Lines changed: 154 additions & 14 deletions
Original file line numberDiff line numberDiff line change
# Splits and slicing

All `DatasetBuilder`s expose various data subsets defined as splits (e.g.
`train`, `test`). When constructing a `tf.data.Dataset` instance using either
`tfds.load()` or `tfds.DatasetBuilder.as_dataset()`, one can specify which
split(s) to retrieve. It is also possible to retrieve slice(s) of split(s),
as well as combinations of those.

*   [Two APIs: S3 and legacy](#two-apis-s3-and-legacy)
*   [S3 slicing API](#s3-slicing-api)
    *   [Examples](#examples)
    *   [Percentage slicing and rounding](#percentage-slicing-and-rounding)
    *   [Reproducibility](#reproducibility)
*   [Legacy slicing API](#legacy-slicing-api)
    *   [Adding splits together](#adding-splits-together)
    *   [Subsplit](#subsplit)
        *   [Specifying number of subsplits](#specifying-number-of-subsplits)
        *   [Specifying a percentage slice](#specifying-a-percentage-slice)
        *   [Specifying weights](#specifying-weights)
    *   [Composing split, adding, and subsplitting](#composing-split-adding-and-subsplitting)
    *   [Dataset using non-conventional named split](#dataset-using-non-conventional-named-split)

## Two APIs: S3 and legacy

Each versioned dataset implements either the new S3 API or the legacy API,
which will eventually be retired. All new datasets (except Beam ones, for now)
implement S3, and we're slowly rolling it out to all datasets.

To find out whether a dataset implements S3, one can look at the source code
or call:

```py
ds_builder.version.implements(tfds.core.Experiment.S3)
```
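
For example, with a concrete builder (a minimal sketch; `mnist` is just an
illustrative dataset name):

```py
import tensorflow_datasets as tfds

# Look up the builder for a dataset, then ask its version which
# experiments it implements.
ds_builder = tfds.builder('mnist')
if ds_builder.version.implements(tfds.core.Experiment.S3):
  print('mnist implements the S3 API.')
else:
  print('mnist uses the legacy API.')
```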

## S3 slicing API

Slicing instructions are specified in `tfds.load` or
`tfds.DatasetBuilder.as_dataset`.

Instructions can be provided as either strings or `ReadInstruction`s. Strings
are more compact and readable for simple cases, while `ReadInstruction`s
provide more options and might be easier to use with variable slicing
parameters.

### Examples

The following examples show equivalent instructions:

```py
# The full `train` split.
train_ds = tfds.load('mnist:3.*.*', split='train')
train_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])
train_ds, test_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train'),
    tfds.ReadInstruction('test'),
])

# The full `train` and `test` splits, concatenated together.
train_test_ds = tfds.load('mnist:3.*.*', split='train+test')
ri = tfds.ReadInstruction('train') + tfds.ReadInstruction('test')
train_test_ds = tfds.load('mnist:3.*.*', split=ri)

# From record 10 (included) to record 20 (excluded) of the `train` split.
train_10_20_ds = tfds.load('mnist:3.*.*', split='train[10:20]')
train_10_20_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of the `train` split.
train_10pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]')
train_10pct_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]+train[-80%:]')
ri = (tfds.ReadInstruction('train', to=10, unit='%') +
      tfds.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist:3.*.*', split=ri)

# 10-fold cross-validation (see also the next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist:3.*.*', split=[
    'train[{}%:{}%]'.format(k, k+10) for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=[
    'train[:{}%]+train[{}%:]'.format(k, k+10) for k in range(0, 100, 10)])
# Or using `ReadInstruction`:
vals_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=[
    (tfds.ReadInstruction('train', to=k, unit='%') +
     tfds.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
```
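
The same instructions can also be passed to `tfds.DatasetBuilder.as_dataset`
directly; a minimal sketch (assuming the data can be downloaded and prepared
locally):

```py
builder = tfds.builder('mnist:3.*.*')
builder.download_and_prepare()

# Any string or `ReadInstruction` accepted by `tfds.load` works here too.
train_10pct_ds = builder.as_dataset(split='train[:10%]')
```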

### Percentage slicing and rounding

If a slice of a split is requested using the percent (`%`) unit, and the
requested slice boundaries do not divide evenly by `100`, then the default
behaviour is to round boundaries to the nearest integer (`closest`). This
means that some slices may contain more examples than others. For example:

```py
# Assuming the `train` split contains 101 records.
# 100 records, from 0 to 100.
tfds.load("mnist:3.*.*", split="train[:99%]")
# 2 records, from 49 to 51.
tfds.load("mnist:3.*.*", split="train[49%:50%]")
```

Alternatively, the user can use the `pct1_dropremainder` rounding, so
specified percentage boundaries are treated as multiples of 1%. This option
should be used when consistency is needed (e.g. `len(5%) == 5 * len(1%)`).

Example:

```py
# Records 0 (included) to 99 (excluded).
tfds.load("mnist:3.*.*", split="train[:99%]", rounding="pct1_dropremainder")
```
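
The same rounding can be attached to a `ReadInstruction`; a sketch, assuming
its `rounding` argument mirrors the string API:

```py
# Records 0 (included) to 99 (excluded), as above.
ri = tfds.ReadInstruction('train', to=99, unit='%',
                          rounding='pct1_dropremainder')
tfds.load('mnist:3.*.*', split=ri)
```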

### Reproducibility

The S3 API guarantees that any given split slice (or `ReadInstruction`) will
always produce the same set of records on a given dataset, as long as the
major version of the dataset is constant.

For example, `tfds.load("mnist:3.0.0", split="train[10:20]")` and
`tfds.load("mnist:3.2.0", split="train[10:20]")` will always contain the same
elements, regardless of platform, architecture, etc., even though some of the
records might have different values (e.g. image encoding, label, ...).
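
A quick way to check this determinism is to compare two independent reads of
the same slice (a sketch; the labels stand in for the records themselves):

```py
ds1 = tfds.load('mnist:3.*.*', split='train[10:20]')
ds2 = tfds.load('mnist:3.*.*', split='train[10:20]')

labels1 = sorted(int(ex['label']) for ex in tfds.as_numpy(ds1))
labels2 = sorted(int(ex['label']) for ex in tfds.as_numpy(ds2))
assert labels1 == labels2  # Same set of records on both reads.
```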

## Legacy slicing API

All `DatasetBuilder`s expose various data subsets defined as
[`tfds.Split`](api_docs/python/tfds/Split.md)s (typically `tfds.Split.TRAIN`
and `tfds.Split.TEST`). A given dataset's splits are defined in
[`tfds.DatasetBuilder.info.splits`](api_docs/python/tfds/core/DatasetBuilder.md#info)
and are accessible through [`tfds.load`](api_docs/python/tfds/load.md) and
[`tfds.DatasetBuilder.as_dataset`](api_docs/python/tfds/core/DatasetBuilder.md#as_dataset),
both of which take `split=` as a keyword argument.

`tfds` enables you to combine splits and to subsplit them. The resulting
splits can be passed to `tfds.load` or `tfds.DatasetBuilder.as_dataset`.

### Adding splits together

```py
combined_split = tfds.Split.TRAIN + tfds.Split.TEST

ds = tfds.load("mnist", split=combined_split)
```

Note that a special `tfds.Split.ALL` keyword exists to merge all splits
together:

```py
ds = tfds.load("mnist", split=tfds.Split.ALL)
```

### Subsplit

You have 3 options for how to get a thinner slice of the data than the base
splits, all based on `tfds.Split.subsplit`.

*Warning*: The legacy API does not guarantee the reproducibility of the
subsplit operations. Two different users working on the same dataset at the
same version and using the same subsplit instructions could end up with two
different sets of examples. Also, if a user regenerates the data, the
subsplits may no longer be the same.

*Warning*: If `total_number_examples % 100 != 0`, then remainder examples
may not be evenly distributed among subsplits.

#### Specifying number of subsplits

```py
train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(k=2)

dataset = tfds.load("mnist", split=train_half_1)
```

#### Specifying a percentage slice

```py
first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])
last_2_percent = tfds.Split.TRAIN.subsplit(tfds.percent[-2:])
middle_50_percent = tfds.Split.TRAIN.subsplit(tfds.percent[25:75])

dataset = tfds.load("mnist", split=middle_50_percent)
```

#### Specifying weights

```py
half, quarter1, quarter2 = tfds.Split.TRAIN.subsplit(weighted=[2, 1, 1])

dataset = tfds.load("mnist", split=half)
```

### Composing split, adding, and subsplitting

It's possible to compose the above operations:

```py
# Half of the TRAIN split plus the TEST split
split = tfds.Split.TRAIN.subsplit(tfds.percent[:50]) + tfds.Split.TEST

# Split the combined TRAIN and TEST splits into 2
first_half, second_half = (tfds.Split.TRAIN + tfds.Split.TEST).subsplit(k=2)
```

Note that a split cannot be added twice, and subsplitting can only happen
once. For example, these are invalid:

```py
# INVALID! TRAIN included twice
split = tfds.Split.TRAIN.subsplit(tfds.percent[:25]) + tfds.Split.TRAIN

# INVALID! Subsplit of subsplit
split = tfds.Split.TRAIN.subsplit(tfds.percent[0:25]).subsplit(tfds.percent[0:25])

# INVALID! Subsplit of subsplit
split = (tfds.Split.TRAIN.subsplit(tfds.percent[:25]) +
         tfds.Split.TEST).subsplit(tfds.percent[0:50])
```

### Dataset using non-conventional named split

For datasets using splits not in `tfds.Split.{TRAIN,VALIDATION,TEST}`, you can
still use the subsplit API by defining the custom named split with
`tfds.Split('custom_split')`. For instance:

```py
split = tfds.Split('test2015') + tfds.Split.TEST
ds = tfds.load('coco2014', split=split)
```
