Commit 501ce7c

pierrot0 authored and copybara-github committed

add documentation on S3 slicing API (Issue #737).

PiperOrigin-RevId: 257881514

1 parent 727d8e9 commit 501ce7c

File tree

1 file changed

+154
-14
lines changed

1 file changed

+154
-14
lines changed

docs/splits.md

Lines changed: 154 additions & 14 deletions
Original file line numberDiff line numberDiff line change
# Splits and slicing

All `DatasetBuilder`s expose various data subsets defined as splits (e.g.
`train`, `test`). When constructing a `tf.data.Dataset` instance using either
`tfds.load()` or `tfds.DatasetBuilder.as_dataset()`, one can specify which
split(s) to retrieve. It is also possible to retrieve slice(s) of split(s),
as well as combinations of those.

*   [Two APIs: S3 and legacy](#two-apis-s3-and-legacy)
*   [S3 slicing API](#s3-slicing-api)
    *   [Examples](#examples)
    *   [Percentage slicing and rounding](#percentage-slicing-and-rounding)
    *   [Reproducibility](#reproducibility)
*   [Legacy slicing API](#legacy-slicing-api)
    *   [Adding splits together](#adding-splits-together)
    *   [Subsplit](#subsplit)
        *   [Specifying number of subsplits](#specifying-number-of-subsplits)
        *   [Specifying a percentage slice](#specifying-a-percentage-slice)
        *   [Specifying weights](#specifying-weights)
    *   [Composing split, adding, and subsplitting](#composing-split-adding-and-subsplitting)
    *   [Dataset using non-conventional named split](#dataset-using-non-conventional-named-split)

## Two APIs: S3 and legacy

Each versioned dataset implements either the new S3 API or the legacy API,
which will eventually be retired. All new datasets (except Beam ones, for now)
implement S3, and we're slowly rolling it out to all datasets.

To find out whether a dataset implements S3, one can look at the source code
or call:

```py
ds_builder.version.implements(tfds.core.Experiment.S3)
```
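
For example, with a concrete builder (a minimal sketch; `mnist` is just an
illustrative dataset name):

```py
import tensorflow_datasets as tfds

# Look up the builder for a dataset, then ask its version which
# experiments it implements.
ds_builder = tfds.builder('mnist')
if ds_builder.version.implements(tfds.core.Experiment.S3):
  print('mnist implements the S3 API.')
else:
  print('mnist uses the legacy API.')
```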

## S3 slicing API

Slicing instructions are specified in `tfds.load` or
`tfds.DatasetBuilder.as_dataset`.

Instructions can be provided as either strings or `ReadInstruction`s. Strings
are more compact and readable for simple cases, while `ReadInstruction`s
provide more options and might be easier to use with variable slicing
parameters.

### Examples

The following examples show equivalent instructions:

```py
# The full `train` split.
train_ds = tfds.load('mnist:3.*.*', split='train')
train_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])
train_ds, test_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train'),
    tfds.ReadInstruction('test'),
])

# The full `train` and `test` splits, concatenated together.
train_test_ds = tfds.load('mnist:3.*.*', split='train+test')
ri = tfds.ReadInstruction('train') + tfds.ReadInstruction('test')
train_test_ds = tfds.load('mnist:3.*.*', split=ri)

# From record 10 (included) to record 20 (excluded) of the `train` split.
train_10_20_ds = tfds.load('mnist:3.*.*', split='train[10:20]')
train_10_20_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of the `train` split.
train_10pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]')
train_10pct_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]+train[-80%:]')
ri = (tfds.ReadInstruction('train', to=10, unit='%') +
      tfds.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist:3.*.*', split=ri)

# 10-fold cross-validation (see also the next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist:3.*.*', split=[
    'train[{}%:{}%]'.format(k, k+10) for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=[
    'train[:{}%]+train[{}%:]'.format(k, k+10) for k in range(0, 100, 10)])
# Or using `ReadInstruction`:
vals_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=[
    (tfds.ReadInstruction('train', to=k, unit='%') +
     tfds.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
```
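
The same instructions can also be passed to `tfds.DatasetBuilder.as_dataset`
directly; a minimal sketch (assuming the data can be downloaded and prepared
locally):

```py
builder = tfds.builder('mnist:3.*.*')
builder.download_and_prepare()

# Any string or `ReadInstruction` accepted by `tfds.load` works here too.
train_10pct_ds = builder.as_dataset(split='train[:10%]')
```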

### Percentage slicing and rounding

If a slice of a split is requested using the percent (`%`) unit, and the
requested slice boundaries do not divide evenly by `100`, then the default
behaviour is to round boundaries to the nearest integer (`closest`). This
means that some slices may contain more examples than others. For example:

```py
# Assuming the `train` split contains 101 records.
# 100 records, from 0 to 100.
tfds.load("mnist:3.*.*", split="train[:99%]")
# 2 records, from 49 to 51.
tfds.load("mnist:3.*.*", split="train[49%:50%]")
```

Alternatively, the user can use the `pct1_dropremainder` rounding, so
specified percentage boundaries are treated as multiples of 1%. This option
should be used when consistency is needed (e.g. `len(5%) == 5 * len(1%)`).

Example:

```py
# Records 0 (included) to 99 (excluded).
tfds.load("mnist:3.*.*", split="train[:99%]", rounding="pct1_dropremainder")
```
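
The same rounding can be attached to a `ReadInstruction`; a sketch, assuming
its `rounding` argument mirrors the string API:

```py
# Records 0 (included) to 99 (excluded), as above.
ri = tfds.ReadInstruction('train', to=99, unit='%',
                          rounding='pct1_dropremainder')
tfds.load('mnist:3.*.*', split=ri)
```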

### Reproducibility

The S3 API guarantees that any given split slice (or `ReadInstruction`) will
always produce the same set of records on a given dataset, as long as the
major version of the dataset is constant.

For example, `tfds.load("mnist:3.0.0", split="train[10:20]")` and
`tfds.load("mnist:3.2.0", split="train[10:20]")` will always contain the same
elements, regardless of platform, architecture, etc., even though some of the
records might have different values (e.g. image encoding, label, ...).
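
A quick way to check this determinism is to compare two independent reads of
the same slice (a sketch; the labels stand in for the records themselves):

```py
ds1 = tfds.load('mnist:3.*.*', split='train[10:20]')
ds2 = tfds.load('mnist:3.*.*', split='train[10:20]')

labels1 = sorted(int(ex['label']) for ex in tfds.as_numpy(ds1))
labels2 = sorted(int(ex['label']) for ex in tfds.as_numpy(ds2))
assert labels1 == labels2  # Same set of records on both reads.
```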

## Legacy slicing API

All `DatasetBuilder`s expose various data subsets defined as
[`tfds.Split`](api_docs/python/tfds/Split.md)s (typically `tfds.Split.TRAIN`
and `tfds.Split.TEST`). A given dataset's splits are defined in
[`tfds.DatasetBuilder.info.splits`](api_docs/python/tfds/core/DatasetBuilder.md#info)
and are accessible through [`tfds.load`](api_docs/python/tfds/load.md) and
[`tfds.DatasetBuilder.as_dataset`](api_docs/python/tfds/core/DatasetBuilder.md#as_dataset),
both of which take `split=` as a keyword argument.

`tfds` enables you to combine splits and to subsplit them. The resulting
splits can be passed to `tfds.load` or `tfds.DatasetBuilder.as_dataset`.

### Adding splits together

```py
combined_split = tfds.Split.TRAIN + tfds.Split.TEST

ds = tfds.load("mnist", split=combined_split)
```

Note that a special `tfds.Split.ALL` keyword exists to merge all splits
together:

```py
ds = tfds.load("mnist", split=tfds.Split.ALL)
```

### Subsplit

You have 3 options for how to get a thinner slice of the data than the base
splits, all based on `tfds.Split.subsplit`.

*Warning*: The legacy API does not guarantee the reproducibility of the
subsplit operations. Two different users working on the same dataset at the
same version and using the same subsplit instructions could end up with two
different sets of examples. Also, if a user regenerates the data, the
subsplits may no longer be the same.

*Warning*: If `total_number_examples % 100 != 0`, then remainder examples
may not be evenly distributed among subsplits.

#### Specifying number of subsplits

```py
train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(k=2)

dataset = tfds.load("mnist", split=train_half_1)
```

#### Specifying a percentage slice

```py
first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])
last_2_percent = tfds.Split.TRAIN.subsplit(tfds.percent[-2:])
middle_50_percent = tfds.Split.TRAIN.subsplit(tfds.percent[25:75])

dataset = tfds.load("mnist", split=middle_50_percent)
```

#### Specifying weights

```py
half, quarter1, quarter2 = tfds.Split.TRAIN.subsplit(weighted=[2, 1, 1])

dataset = tfds.load("mnist", split=half)
```

### Composing split, adding, and subsplitting

It's possible to compose the above operations:

```py
# Half of the TRAIN split plus the TEST split
split = tfds.Split.TRAIN.subsplit(tfds.percent[:50]) + tfds.Split.TEST

# Split the combined TRAIN and TEST splits into 2
first_half, second_half = (tfds.Split.TRAIN + tfds.Split.TEST).subsplit(k=2)
```

Note that a split cannot be added twice, and subsplitting can only happen
once. For example, these are invalid:

```py
# INVALID! TRAIN included twice
split = tfds.Split.TRAIN.subsplit(tfds.percent[:25]) + tfds.Split.TRAIN

# INVALID! Subsplit of subsplit
split = tfds.Split.TRAIN.subsplit(tfds.percent[0:25]).subsplit(tfds.percent[0:25])

# INVALID! Subsplit of subsplit
split = (tfds.Split.TRAIN.subsplit(tfds.percent[:25]) +
         tfds.Split.TEST).subsplit(tfds.percent[0:50])
```

### Dataset using non-conventional named split

For datasets using splits not in `tfds.Split.{TRAIN,VALIDATION,TEST}`, you can
still use the subsplit API by defining the custom named split with
`tfds.Split('custom_split')`. For instance:

```py
split = tfds.Split('test2015') + tfds.Split.TEST
ds = tfds.load('coco2014', split=split)
```
