Commit 97cdcb7

Conchylicultor authored and copybara-github committed
Add performance tips section to the doc
PiperOrigin-RevId: 296014436
1 parent b6f9b00 commit 97cdcb7

File tree

2 files changed: +121 -0 lines


docs/_book.yaml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ upper_tabs:
       contents:
       - title: Overview
         path: /datasets/overview
+      - title: Performance tips
+        path: /datasets/performances
       - title: Versioning
         path: /datasets/datasets_versioning
       - title: Splits

docs/performances.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Performance tips

This document provides TFDS-specific performance tips. Note that TFDS provides
datasets as `tf.data.Dataset`s, so the advice from the
[`tf.data` guide](https://www.tensorflow.org/guide/data_performance#optimize_performance)
still applies.

## Small datasets (< 1 GB)

All TFDS datasets store the data on disk in the
[`TFRecord`](https://www.tensorflow.org/tutorials/load_data/tfrecord) format.
For small datasets (e.g. MNIST, CIFAR, ...), reading from `.tfrecord` can add
significant overhead.

As those datasets fit in memory, it is possible to significantly improve the
performance by caching or pre-loading the dataset. Note that TFDS automatically
caches small datasets (see the next section for details).

### Caching the dataset

Here is an example of a data pipeline which explicitly caches the dataset after
normalizing the images.

```python
import tensorflow as tf
import tensorflow_datasets as tfds


def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label


ds, ds_info = tfds.load(
    'mnist',
    split='train',
    as_supervised=True,  # returns `(img, label)` instead of dict(image=, ...)
    with_info=True,
)
# Applying normalization before `ds.cache()` to re-use it.
# Note: Random transformations (e.g. image augmentations) should be applied
# after both `ds.cache()` (to avoid caching randomness) and `ds.batch()` (for
# vectorization [1]).
ds = ds.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.cache()
# For true randomness, we set the shuffle buffer to the full dataset size.
ds = ds.shuffle(ds_info.splits['train'].num_examples)
# Batch after shuffling to get unique batches at each epoch.
ds = ds.batch(128)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```

* [[1] Vectorizing mapping](https://www.tensorflow.org/guide/data_performance#vectorizing_mapping)

When iterating over this dataset, the second iteration will be much faster than
the first one thanks to the caching.
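
To make the effect visible, one can time two passes over the pipeline above. A
minimal sketch (the timing loop itself is not part of the original example):

```python
import time

# The second pass is served from the in-memory cache populated during the
# first pass, so it should be noticeably faster.
for epoch in range(2):
  start = time.time()
  for _ in ds:
    pass
  print(f'Epoch {epoch}: {time.time() - start:.1f}s')
```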

### Auto-caching

By default, TFDS auto-caches datasets which satisfy the following constraints:

* Total dataset size (all splits) is defined and < 250 MiB
* `shuffle_files` is disabled, or only a single shard is read

It is possible to opt out of auto-caching by passing
`read_config=tfds.ReadConfig(try_autocaching=False)` to `tfds.load`. Have a look
at the dataset catalog documentation to see if a specific dataset will use
auto-cache.
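
For example, a minimal sketch of opting out when loading a dataset (using only
the `tfds.ReadConfig` argument named above):

```python
ds = tfds.load(
    'mnist',
    split='train',
    # Disable auto-caching for this read.
    read_config=tfds.ReadConfig(try_autocaching=False),
)
```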

### Loading the full data as a single Tensor

If your dataset fits into memory, you can also load the full dataset as a single
Tensor or NumPy array. It is possible to do so by setting `batch_size=-1` to
batch all examples in a single `tf.Tensor`. Then use `tfds.as_numpy` for the
conversion from `tf.Tensor` to `np.array`.

```python
(img_train, label_train), (img_test, label_test) = tfds.as_numpy(tfds.load(
    'mnist',
    split=['train', 'test'],
    batch_size=-1,
    as_supervised=True,
))
```
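
With MNIST, the result is a pair of plain NumPy arrays per split. As a quick
sanity check (the shapes in the comments are what one would expect for this
dataset, not output taken from the original doc):

```python
print(img_train.shape)    # (60000, 28, 28, 1)
print(label_train.shape)  # (60000,)
```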

## Large datasets

Large datasets are sharded (split into multiple files) and typically do not fit
in memory, so they should not be cached.

### Shuffle and training

During training, it's important to shuffle the data well; poorly shuffled data
can result in lower training accuracy.

In addition to using `ds.shuffle` to shuffle records, you should also set
`shuffle_files=True` to get good shuffling behavior for larger datasets that are
sharded into multiple files. Otherwise, epochs will read the shards in the same
order, and so data won't be truly randomized.

```python
ds = tfds.load('imagenet2012', split='train', shuffle_files=True)
```
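
For instance, a sketch combining file-level and record-level shuffling (the
buffer size of 10,000 is an illustrative value, not a recommendation from this
guide):

```python
ds = tfds.load('imagenet2012', split='train', shuffle_files=True)
# Record-level shuffling on top of the file-level shuffling done by TFDS.
ds = ds.shuffle(10_000)
```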

Additionally, when `shuffle_files=True`, TFDS disables
[`options.experimental_deterministic`](https://www.tensorflow.org/api_docs/python/tf/data/Options?version=nightly#experimental_deterministic),
which may give a slight performance boost. To get deterministic shuffling, it is
possible to opt out of this feature with `tfds.ReadConfig`: either by setting
`read_config.shuffle_seed` or overwriting
`read_config.options.experimental_deterministic`.
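
As a sketch (assuming only the `tfds.ReadConfig` fields mentioned above),
deterministic, reproducible shuffling could look like:

```python
read_config = tfds.ReadConfig(
    shuffle_seed=42,  # Fixed seed for reproducible file shuffling.
)
ds = tfds.load(
    'imagenet2012',
    split='train',
    shuffle_files=True,
    read_config=read_config,
)
```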

### Faster image decoding

By default, TFDS automatically decodes images. However, there are cases where it
can be more performant to skip the image decoding with
`tfds.decode.SkipDecoding` and manually apply the `tf.io.decode_image` op:

* When filtering examples (with `ds.filter`), to decode images after examples
  have been filtered.
* When cropping images, to use the fused `tf.image.decode_and_crop_jpeg` op.

The code for both examples is available in the
[decode guide](https://www.tensorflow.org/datasets/decode#usage_examples).
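
As a rough sketch of the first case (decoding only after filtering; this
assumes only the `tfds.decode.SkipDecoding` and `tf.io.decode_image` APIs named
above and may differ from the decode guide's exact code):

```python
ds = tfds.load(
    'mnist',
    split='train',
    # Keep images as encoded bytes instead of decoding them at read time.
    decoders={'image': tfds.decode.SkipDecoding()},
)
# Filter on the cheap fields first, while images are still undecoded.
ds = ds.filter(lambda ex: ex['label'] == 0)
# Decode only the examples that survived the filter.
ds = ds.map(lambda ex: {**ex, 'image': tf.io.decode_image(ex['image'])})
```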
