# Splits and slicing

All `DatasetBuilder`s expose various data subsets defined as splits (e.g.
`train`, `test`). When constructing a `tf.data.Dataset` instance using either
`tfds.load()` or `tfds.DatasetBuilder.as_dataset()`, one can specify which
split(s) to retrieve. It is also possible to retrieve slice(s) of split(s)
as well as combinations of those.

*   [Two APIs: S3 and legacy](#two-apis-s3-and-legacy)
*   [S3 slicing API](#s3-slicing-api)
    *   [Examples](#examples)
    *   [Percentage slicing and rounding](#percentage-slicing-and-rounding)
    *   [Reproducibility](#reproducibility)
*   [Legacy slicing API](#legacy-slicing-api)
    *   [Adding splits together](#adding-splits-together)
    *   [Subsplit](#subsplit)
        *   [Specifying number of subsplits](#specifying-number-of-subsplits)
        *   [Specifying a percentage slice](#specifying-a-percentage-slice)
        *   [Specifying weights](#specifying-weights)
    *   [Composing split, adding, and subsplitting](#composing-split-adding-and-subsplitting)
    *   [Dataset using non-conventional named split](#dataset-using-non-conventional-named-split)

## Two APIs: S3 and legacy

Each versioned dataset either implements the new S3 API or the legacy API,
which will eventually be retired. New datasets (except Beam ones for now) all
implement S3, and we're slowly rolling it out to all datasets.

To find out whether a dataset implements S3, one can look at the source code
or call:

```py
ds_builder.version.implements(tfds.core.Experiment.S3)
```

## S3 slicing API

Slicing instructions are specified in `tfds.load` or
`tfds.DatasetBuilder.as_dataset`.

Instructions can be provided as either strings or `ReadInstruction`s.
Strings are more compact and readable for simple cases, while
`ReadInstruction`s provide more options and might be easier to use with
variable slicing parameters.

### Examples

The following examples show equivalent instructions:

```py
# The full `train` split.
train_ds = tfds.load('mnist:3.*.*', split='train')
train_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])
train_ds, test_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train'),
    tfds.ReadInstruction('test'),
])

# The full `train` and `test` splits, concatenated together.
train_test_ds = tfds.load('mnist:3.*.*', split='train+test')
ri = tfds.ReadInstruction('train') + tfds.ReadInstruction('test')
train_test_ds = tfds.load('mnist:3.*.*', split=ri)

# From record 10 (included) to record 20 (excluded) of the `train` split.
train_10_20_ds = tfds.load('mnist:3.*.*', split='train[10:20]')
train_10_20_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of the `train` split.
train_10pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]')
train_10pct_ds = tfds.load('mnist:3.*.*', split=tfds.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist:3.*.*', split='train[:10%]+train[-80%:]')
ri = (tfds.ReadInstruction('train', to=10, unit='%') +
      tfds.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist:3.*.*', split=ri)

# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist:3.*.*', split=['train[{}%:{}%]'.format(k, k+10)
                                          for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=['train[:{}%]+train[{}%:]'.format(k, k+10)
                                            for k in range(0, 100, 10)])
# or using the `ReadInstruction`:
vals_ds = tfds.load('mnist:3.*.*', split=[
    tfds.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist:3.*.*', split=[
    (tfds.ReadInstruction('train', to=k, unit='%') +
     tfds.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
```
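
The cross-validation split instructions above are plain Python strings, so the fold boundaries can be checked without loading any data. A minimal sketch (pure Python, no TFDS required):

```python
# Build the 10-fold split strings used in the cross-validation example above.
val_splits = ['train[{}%:{}%]'.format(k, k + 10) for k in range(0, 100, 10)]
train_splits = ['train[:{}%]+train[{}%:]'.format(k, k + 10)
                for k in range(0, 100, 10)]

# Each validation slice is 10% wide, and its training counterpart is the
# complementary 90% of the `train` split.
print(val_splits[0])    # train[0%:10%]
print(train_splits[0])  # train[:0%]+train[10%:]
```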

### Percentage slicing and rounding

If a slice of a split is requested using the percent (`%`) unit, and the
requested slice boundaries do not divide evenly by `100`, then the default
behaviour is to round boundaries to the nearest integer (`closest`). This means
that some slices may contain more examples than others. For example:

```py
# Assuming the "test" split contains 101 records.
# 100 records, from 0 to 100.
tfds.load("mnist:3.*.*", split="test[:99%]")
# 2 records, from 49 to 51.
tfds.load("mnist:3.*.*", split="test[49%:50%]")
```

Alternatively, the user can use the `pct1_dropremainder` rounding, so specified
percentage boundaries are treated as multiples of 1%. This option should be used
when consistency is needed (e.g. `len(5%) == 5 * len(1%)`).

Example:

```py
# Records 0 (included) to 99 (excluded).
tfds.load("mnist:3.*.*", split="test[:99%]", rounding="pct1_dropremainder")
```
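
The difference between the two rounding modes can be reproduced with plain integer arithmetic. The helper below is a hypothetical sketch of the boundary computation, not the actual TFDS implementation:

```python
def pct_boundary(pct, num_examples, rounding='closest'):
    """Hypothetical sketch: map a percent boundary to a record index."""
    if rounding == 'closest':
        # Round the boundary to the nearest record (halves round up).
        return int(pct * num_examples / 100 + 0.5)
    elif rounding == 'pct1_dropremainder':
        # Each 1% is exactly num_examples // 100 records; remainder is dropped.
        return pct * (num_examples // 100)
    raise ValueError('unknown rounding: %s' % rounding)

# With 101 records, `closest` rounding produces uneven slices:
n = 101
print(pct_boundary(99, n))                        # 100 -> test[:99%] has 100 records
print(pct_boundary(50, n) - pct_boundary(49, n))  # 2   -> test[49%:50%] has 2 records
# `pct1_dropremainder` keeps len(k%) == k * len(1%):
print(pct_boundary(99, n, 'pct1_dropremainder'))  # 99
```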

### Reproducibility

The S3 API guarantees that any given split slice (or `ReadInstruction`) will
always produce the same set of records on a given dataset, as long as the major
version of the dataset is constant.

For example, `tfds.load("mnist:3.0.0", split="train[10:20]")` and
`tfds.load("mnist:3.2.0", split="train[10:20]")` will always contain the same
elements, regardless of platform, architecture, etc., even though some of
the records might have different values (e.g. image encoding, label, ...).

## Legacy slicing API

All `DatasetBuilder`s expose various data subsets defined as
[`tfds.Split`](api_docs/python/tfds/Split.md)s (typically `tfds.Split.TRAIN` and
`tfds.Split.TEST`). A given dataset's splits are defined in
[`tfds.DatasetBuilder.info.splits`](api_docs/python/tfds/core/DatasetBuilder.md#info)
and are accessible through [`tfds.load`](api_docs/python/tfds/load.md) and
[`tfds.DatasetBuilder.as_dataset`](api_docs/python/tfds/core/DatasetBuilder.md#as_dataset),
both of which take `split=` as a keyword argument.

`tfds` enables you to combine splits or subsplit them further. The resulting
splits can be passed to `tfds.load` or `tfds.DatasetBuilder.as_dataset`.

### Adding splits together

```py
combined_split = tfds.Split.TRAIN + tfds.Split.TEST

ds = tfds.load("mnist", split=combined_split)
```

A special `tfds.Split.ALL` keyword also exists to merge all splits
together:

```py
ds = tfds.load("mnist", split=tfds.Split.ALL)
```

### Subsplit

You have 3 options for how to get a thinner slice of the data than the
base splits, all based on `tfds.Split.subsplit`.

*Warning*: The legacy API does not guarantee the reproducibility of the subsplit
operations. Two different users working on the same dataset at the same version
and using the same subsplit instructions could end up with two different sets
of examples. Also, if a user regenerates the data, the subsplits may no longer
be the same.

*Warning*: If `total_number_examples % 100 != 0`, then remainder examples
may not be evenly distributed among subsplits.

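The second warning can be illustrated with simple arithmetic: examples are spread over 100 buckets, so a total that is not a multiple of 100 leaves some buckets with one extra example. The sketch below is a hypothetical illustration of that effect, not the TFDS implementation:

```python
def bucket_sizes(total_examples, num_buckets=100):
    """Hypothetical sketch: spread examples over buckets, remainder first."""
    base, remainder = divmod(total_examples, num_buckets)
    # The first `remainder` buckets each receive one extra example.
    return [base + 1] * remainder + [base] * (num_buckets - remainder)

sizes = bucket_sizes(101)
print(sizes[0], sizes[99])  # 2 1 -> the first bucket holds twice as many examples
print(sum(sizes))           # 101
```

With 200 examples every bucket would hold exactly 2, and any subsplit of the same width would have the same size.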
#### Specifying number of subsplits

```py
train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(k=2)

dataset = tfds.load("mnist", split=train_half_1)
```

#### Specifying a percentage slice

```py
first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])
middle_50_percent = tfds.Split.TRAIN.subsplit(tfds.percent[25:75])

dataset = tfds.load("mnist", split=middle_50_percent)
```

#### Specifying weights

```py
half, quarter1, quarter2 = tfds.Split.TRAIN.subsplit(weighted=[2, 1, 1])

dataset = tfds.load("mnist", split=half)
```
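
As a rough mental model (an assumption for illustration, not the library's actual code), relative weights can be thought of as carving the 100 percent-buckets into contiguous ranges proportional to each weight:

```python
def weights_to_percent_ranges(weights):
    """Hypothetical sketch: map relative weights to percent ranges."""
    total = sum(weights)
    ranges, start = [], 0
    for w in weights:
        end = start + 100 * w // total
        ranges.append((start, end))
        start = end
    return ranges

# weighted=[2, 1, 1] -> a half and two quarters, as in the example above.
print(weights_to_percent_ranges([2, 1, 1]))  # [(0, 50), (50, 75), (75, 100)]
```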

### Composing split, adding, and subsplitting

It's possible to compose the above operations:

```py
# The first 25% of the `train` split, combined with the full `test` split,
# then cut down to the first half of that combination.
split = (tfds.Split.TRAIN.subsplit(tfds.percent[:25]) +
         tfds.Split.TEST).subsplit(tfds.percent[0:50])
```

### Dataset using non-conventional named split

For datasets using splits not in `tfds.Split.{TRAIN,VALIDATION,TEST}`, you can
still use the subsplit API by defining the custom named split with
`tfds.Split('custom_split')`. For instance:

```py
split = tfds.Split('test2015') + tfds.Split.TEST
ds = tfds.load('coco2014', split=split)
```