Commit fd8048d

sync changes.
2 parents 1caadb8 + edfd9f7 commit fd8048d

44 files changed: 1002 additions and 219 deletions.

README.md

Lines changed: 7 additions & 1 deletion
@@ -8,7 +8,13 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
 * [List of datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
 * [Try it in Colab](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
 * [API docs](https://www.tensorflow.org/datasets/api_docs/python/tfds)
-* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
+* Guides
+* [Overview](https://www.tensorflow.org/datasets/overview)
+* [Datasets versioning](https://www.tensorflow.org/datasets/datasets_versioning)
+* [Using splits and slicing API](https://www.tensorflow.org/datasets/splits)
+* [Add a dataset](https://www.tensorflow.org/datasets/add_dataset)
+* [Add a huge dataset (>>100GiB)](https://www.tensorflow.org/datasets/beam_datasets)
+
 
 **Table of Contents**
 

docs/add_dataset.md

Lines changed: 3 additions & 0 deletions
@@ -138,6 +138,9 @@ If you'd like to follow a test-driven development workflow, which can help you
 iterate faster, jump to the [testing instructions](#testing-mydataset) below,
 add the test, and then return here.
 
+For an explanation of what the version is, please read
+[datasets versioning](datasets_versioning.md).
+
 ## Specifying `DatasetInfo`
 
 [`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) describes the

docs/api_docs/python/tfds/_api_cache.json

Lines changed: 0 additions & 12 deletions
@@ -139,8 +139,6 @@
 "tfds.core": false,
 "tfds.core.BeamBasedBuilder": false,
 "tfds.core.BeamBasedBuilder.BUILDER_CONFIGS": true,
-"tfds.core.BeamBasedBuilder.GOOGLE_DISABLED": true,
-"tfds.core.BeamBasedBuilder.IN_DEVELOPMENT": true,
 "tfds.core.BeamBasedBuilder.SUPPORTED_VERSIONS": true,
 "tfds.core.BeamBasedBuilder.VERSION": true,
 "tfds.core.BeamBasedBuilder.__init__": true,
@@ -160,8 +158,6 @@
 "tfds.core.BuilderConfig.version": true,
 "tfds.core.DatasetBuilder": false,
 "tfds.core.DatasetBuilder.BUILDER_CONFIGS": true,
-"tfds.core.DatasetBuilder.GOOGLE_DISABLED": true,
-"tfds.core.DatasetBuilder.IN_DEVELOPMENT": true,
 "tfds.core.DatasetBuilder.SUPPORTED_VERSIONS": true,
 "tfds.core.DatasetBuilder.VERSION": true,
 "tfds.core.DatasetBuilder.__init__": true,
@@ -200,8 +196,6 @@
 "tfds.core.Experiment.S3": true,
 "tfds.core.GeneratorBasedBuilder": false,
 "tfds.core.GeneratorBasedBuilder.BUILDER_CONFIGS": true,
-"tfds.core.GeneratorBasedBuilder.GOOGLE_DISABLED": true,
-"tfds.core.GeneratorBasedBuilder.IN_DEVELOPMENT": true,
 "tfds.core.GeneratorBasedBuilder.SUPPORTED_VERSIONS": true,
 "tfds.core.GeneratorBasedBuilder.VERSION": true,
 "tfds.core.GeneratorBasedBuilder.__init__": true,
@@ -599,7 +593,6 @@
 "tfds.testing.DatasetBuilderTestCase.DATASET_CLASS": true,
 "tfds.testing.DatasetBuilderTestCase.DL_EXTRACT_RESULT": true,
 "tfds.testing.DatasetBuilderTestCase.EXAMPLE_DIR": true,
-"tfds.testing.DatasetBuilderTestCase.INTERNAL_DATASET": true,
 "tfds.testing.DatasetBuilderTestCase.MOCK_MONARCH": true,
 "tfds.testing.DatasetBuilderTestCase.MOCK_OUT_FORBIDDEN_OS_FUNCTIONS": true,
 "tfds.testing.DatasetBuilderTestCase.OVERLAPPING_SPLITS": true,
@@ -745,8 +738,6 @@
 "tfds.testing.DatasetBuilderTestCase.test_session": true,
 "tfds.testing.DummyDatasetSharedGenerator": false,
 "tfds.testing.DummyDatasetSharedGenerator.BUILDER_CONFIGS": true,
-"tfds.testing.DummyDatasetSharedGenerator.GOOGLE_DISABLED": true,
-"tfds.testing.DummyDatasetSharedGenerator.IN_DEVELOPMENT": true,
 "tfds.testing.DummyDatasetSharedGenerator.SUPPORTED_VERSIONS": true,
 "tfds.testing.DummyDatasetSharedGenerator.VERSION": true,
 "tfds.testing.DummyDatasetSharedGenerator.__init__": true,
@@ -760,8 +751,6 @@
 "tfds.testing.DummyDatasetSharedGenerator.version": true,
 "tfds.testing.DummyMnist": false,
 "tfds.testing.DummyMnist.BUILDER_CONFIGS": true,
-"tfds.testing.DummyMnist.GOOGLE_DISABLED": true,
-"tfds.testing.DummyMnist.IN_DEVELOPMENT": true,
 "tfds.testing.DummyMnist.SUPPORTED_VERSIONS": true,
 "tfds.testing.DummyMnist.VERSION": true,
 "tfds.testing.DummyMnist.__init__": true,
@@ -1199,7 +1188,6 @@
 "tfds.units.MiB": true,
 "tfds.units.PiB": true,
 "tfds.units.TiB": true,
-"tfds.units.absolute_import": true,
 "tfds.units.division": true,
 "tfds.units.print_function": true,
 "tfds.units.size_str": false

docs/api_docs/python/tfds/core/BeamBasedBuilder.md

Lines changed: 6 additions & 5 deletions
@@ -9,8 +9,6 @@
 <meta itemprop="property" content="as_dataset"/>
 <meta itemprop="property" content="download_and_prepare"/>
 <meta itemprop="property" content="BUILDER_CONFIGS"/>
-<meta itemprop="property" content="GOOGLE_DISABLED"/>
-<meta itemprop="property" content="IN_DEVELOPMENT"/>
 <meta itemprop="property" content="SUPPORTED_VERSIONS"/>
 <meta itemprop="property" content="VERSION"/>
 <meta itemprop="property" content="builder_configs"/>
@@ -80,7 +78,8 @@ as_dataset(
 split=None,
 batch_size=None,
 shuffle_files=None,
-as_supervised=False
+as_supervised=False,
+in_memory=None
 )
 ```
 
@@ -105,6 +104,10 @@ Callers must pass arguments as keyword arguments.
 will have a 2-tuple structure `(input, label)` according to
 `builder.info.supervised_keys`. If `False`, the default, the returned
 `tf.data.Dataset` will have a dictionary with all the features.
+* <b>`in_memory`</b>: `bool`, if `True`, loads the dataset in memory which
+increases iteration speeds. Note that if `True` and the dataset has unknown
+dimensions, the features will be padded to the maximum size across the
+dataset.
 
 #### Returns:
 
@@ -142,8 +145,6 @@ Downloads and prepares dataset for reading.
 ## Class Members
 
 * `BUILDER_CONFIGS` <a id="BUILDER_CONFIGS"></a>
-* `GOOGLE_DISABLED = False` <a id="GOOGLE_DISABLED"></a>
-* `IN_DEVELOPMENT = False` <a id="IN_DEVELOPMENT"></a>
 * `SUPPORTED_VERSIONS` <a id="SUPPORTED_VERSIONS"></a>
 * `VERSION = None` <a id="VERSION"></a>
 * `builder_configs` <a id="builder_configs"></a>

docs/api_docs/python/tfds/core/DatasetBuilder.md

Lines changed: 6 additions & 5 deletions
@@ -9,8 +9,6 @@
 <meta itemprop="property" content="as_dataset"/>
 <meta itemprop="property" content="download_and_prepare"/>
 <meta itemprop="property" content="BUILDER_CONFIGS"/>
-<meta itemprop="property" content="GOOGLE_DISABLED"/>
-<meta itemprop="property" content="IN_DEVELOPMENT"/>
 <meta itemprop="property" content="SUPPORTED_VERSIONS"/>
 <meta itemprop="property" content="VERSION"/>
 <meta itemprop="property" content="builder_configs"/>
@@ -111,7 +109,8 @@ as_dataset(
 split=None,
 batch_size=None,
 shuffle_files=None,
-as_supervised=False
+as_supervised=False,
+in_memory=None
 )
 ```
 
@@ -136,6 +135,10 @@ Callers must pass arguments as keyword arguments.
 will have a 2-tuple structure `(input, label)` according to
 `builder.info.supervised_keys`. If `False`, the default, the returned
 `tf.data.Dataset` will have a dictionary with all the features.
+* <b>`in_memory`</b>: `bool`, if `True`, loads the dataset in memory which
+increases iteration speeds. Note that if `True` and the dataset has unknown
+dimensions, the features will be padded to the maximum size across the
+dataset.
 
 #### Returns:
 
@@ -173,8 +176,6 @@ Downloads and prepares dataset for reading.
 ## Class Members
 
 * `BUILDER_CONFIGS` <a id="BUILDER_CONFIGS"></a>
-* `GOOGLE_DISABLED = False` <a id="GOOGLE_DISABLED"></a>
-* `IN_DEVELOPMENT = False` <a id="IN_DEVELOPMENT"></a>
 * `SUPPORTED_VERSIONS` <a id="SUPPORTED_VERSIONS"></a>
 * `VERSION = None` <a id="VERSION"></a>
 * `builder_configs` <a id="builder_configs"></a>

docs/api_docs/python/tfds/core/GeneratorBasedBuilder.md

Lines changed: 6 additions & 5 deletions
@@ -9,8 +9,6 @@
 <meta itemprop="property" content="as_dataset"/>
 <meta itemprop="property" content="download_and_prepare"/>
 <meta itemprop="property" content="BUILDER_CONFIGS"/>
-<meta itemprop="property" content="GOOGLE_DISABLED"/>
-<meta itemprop="property" content="IN_DEVELOPMENT"/>
 <meta itemprop="property" content="SUPPORTED_VERSIONS"/>
 <meta itemprop="property" content="VERSION"/>
 <meta itemprop="property" content="builder_configs"/>
@@ -89,7 +87,8 @@ as_dataset(
 split=None,
 batch_size=None,
 shuffle_files=None,
-as_supervised=False
+as_supervised=False,
+in_memory=None
 )
 ```
 
@@ -114,6 +113,10 @@ Callers must pass arguments as keyword arguments.
 will have a 2-tuple structure `(input, label)` according to
 `builder.info.supervised_keys`. If `False`, the default, the returned
 `tf.data.Dataset` will have a dictionary with all the features.
+* <b>`in_memory`</b>: `bool`, if `True`, loads the dataset in memory which
+increases iteration speeds. Note that if `True` and the dataset has unknown
+dimensions, the features will be padded to the maximum size across the
+dataset.
 
 #### Returns:
 
@@ -151,8 +154,6 @@ Downloads and prepares dataset for reading.
 ## Class Members
 
 * `BUILDER_CONFIGS` <a id="BUILDER_CONFIGS"></a>
-* `GOOGLE_DISABLED = False` <a id="GOOGLE_DISABLED"></a>
-* `IN_DEVELOPMENT = False` <a id="IN_DEVELOPMENT"></a>
 * `SUPPORTED_VERSIONS` <a id="SUPPORTED_VERSIONS"></a>
 * `VERSION = None` <a id="VERSION"></a>
 * `builder_configs` <a id="builder_configs"></a>

docs/api_docs/python/tfds/disable_progress_bar.md

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,10 @@ Defined in
 
 ### Used in the tutorials:
 
+* [CycleGAN](https://www.tensorflow.org/beta/tutorials/generative/cyclegan)
+* [Distributed training with Keras](https://www.tensorflow.org/beta/tutorials/distribute/keras)
+* [Multi-worker Training with Estimator](https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator)
+* [Multi-worker Training with Keras](https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras)
 * [Transfer Learning Using Pretrained ConvNets](https://www.tensorflow.org/beta/tutorials/images/transfer_learning)
 
 #### Usage:

docs/api_docs/python/tfds/features/text/TokenTextEncoder.md

Lines changed: 4 additions & 2 deletions
@@ -46,8 +46,10 @@ __init__(
 
 Constructs a TokenTextEncoder.
 
-To load from a file saved with `TokenTextEncoder.save_to_file`, use
-`TokenTextEncoder.load_from_file`.
+To load from a file saved with
+<a href="../../../tfds/features/text/TokenTextEncoder.md#save_to_file"><code>TokenTextEncoder.save_to_file</code></a>,
+use
+<a href="../../../tfds/features/text/TokenTextEncoder.md#load_from_file"><code>TokenTextEncoder.load_from_file</code></a>.
 
 #### Args:
 

docs/api_docs/python/tfds/load.md

Lines changed: 12 additions & 3 deletions
@@ -13,6 +13,7 @@ tfds.load(
 split=None,
 data_dir=None,
 batch_size=None,
+in_memory=None,
 download=True,
 as_supervised=False,
 with_info=False,
@@ -31,6 +32,7 @@ Defined in [`core/registered.py`](https://github.com/tensorflow/datasets/tree/ma
 
 ### Used in the tutorials:
 
+* [CycleGAN](https://www.tensorflow.org/beta/tutorials/generative/cyclegan)
 * [Distributed training with Keras](https://www.tensorflow.org/beta/tutorials/distribute/keras)
 * [Multi-worker Training with Estimator](https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_estimator)
 * [Multi-worker Training with Keras](https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras)
@@ -42,9 +44,12 @@ Defined in [`core/registered.py`](https://github.com/tensorflow/datasets/tree/ma
 If `split=None` (the default), returns all splits for the dataset. Otherwise,
 returns the specified split.
 
-`load` is a convenience method that fetches the <a href="../tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a> by
-string name, optionally calls `DatasetBuilder.download_and_prepare`
-(if `download=True`), and then calls `DatasetBuilder.as_dataset`.
+`load` is a convenience method that fetches the
+<a href="../tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>
+by string name, optionally calls
+<a href="../tfds/core/DatasetBuilder.md#download_and_prepare"><code>DatasetBuilder.download_and_prepare</code></a>
+(if `download=True`), and then calls
+<a href="../tfds/core/DatasetBuilder.md#as_dataset"><code>DatasetBuilder.as_dataset</code></a>.
 This is roughly equivalent to:
 
 ```
@@ -86,6 +91,10 @@ of hundreds of GiB to disk. Refer to the `download` argument.
 * <b>`batch_size`</b>: `int`, if set, add a batch dimension to examples. Note
 that variable length features will be 0-padded. If `batch_size=-1`, will
 return the full dataset as `tf.Tensor`s.
+* <b>`in_memory`</b>: `bool`, if `True`, loads the dataset in memory which
+increases iteration speeds. Note that if `True` and the dataset has unknown
+dimensions, the features will be padded to the maximum size across the
+dataset.
 * <b>`download`</b>: `bool` (optional), whether to call
 <a href="../tfds/core/DatasetBuilder.md#download_and_prepare"><code>tfds.core.DatasetBuilder.download_and_prepare</code></a>
 before calling `tf.DatasetBuilder.as_dataset`. If `False`, data is expected
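
The `in_memory` description added in this commit says that features with unknown dimensions are padded to the maximum size across the dataset. The sketch below illustrates that idea in plain Python and is not the TFDS implementation. `pad_to_max` is a hypothetical helper, and the pad value of 0 is an assumption borrowed from the `batch_size` note about 0-padding; the diff does not state which value `in_memory` uses.

```python
def pad_to_max(sequences, pad_value=0):
    """Pad each variable-length sequence to the longest length in the collection.

    Hypothetical helper for illustration only; pad_value=0 is an assumption.
    """
    # The target length is the maximum across the whole collection, not per batch.
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Three examples whose feature length differs from one example to the next.
examples = [[1, 2, 3], [4], [5, 6]]
padded = pad_to_max(examples)
print(padded)
```

Padding to the dataset-wide maximum means one unusually long example increases the memory used by every other example, which may matter when deciding whether to load a dataset in memory.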

docs/api_docs/python/tfds/testing/DatasetBuilderTestCase.md

Lines changed: 9 additions & 14 deletions
@@ -137,7 +137,6 @@
 <meta itemprop="property" content="DATASET_CLASS"/>
 <meta itemprop="property" content="DL_EXTRACT_RESULT"/>
 <meta itemprop="property" content="EXAMPLE_DIR"/>
-<meta itemprop="property" content="INTERNAL_DATASET"/>
 <meta itemprop="property" content="MOCK_MONARCH"/>
 <meta itemprop="property" content="MOCK_OUT_FORBIDDEN_OS_FUNCTIONS"/>
 <meta itemprop="property" content="OVERLAPPING_SPLITS"/>
@@ -178,18 +177,15 @@ MOCK_OUT_FORBIDDEN_OS_FUNCTIONS: `bool`, defaults to True. Set to False to
 disable checks preventing usage of `os` or builtin functions instead of
 recommended `tf.io.gfile` API.
 
-This test case will check for the following:
-- the dataset builder is correctly registered, i.e. `tfds.load(name)` works;
-- the dataset builder can read the fake examples stored in
-testing/test_data/fake_examples/${dataset_name};
-- the dataset builder can produce serialized data;
-- the dataset builder produces a valid Dataset object from serialized data
-- in eager mode;
-- in graph mode.
-- the produced Dataset examples have the expected dimensions and types;
-- the produced Dataset has and the expected number of examples;
-- a example is not part of two splits, or one of these splits is whitelisted
-in OVERLAPPING_SPLITS.
+This test case will check for the following: - the dataset builder is correctly
+registered, i.e. <a href="../../tfds/load.md"><code>tfds.load(name)</code></a>
+works; - the dataset builder can read the fake examples stored in
+testing/test_data/fake_examples/${dataset_name}; - the dataset builder can
+produce serialized data; - the dataset builder produces a valid Dataset object
+from serialized data - in eager mode; - in graph mode. - the produced Dataset
+examples have the expected dimensions and types; - the produced Dataset has and
+the expected number of examples; - a example is not part of two splits, or one
+of these splits is whitelisted in OVERLAPPING_SPLITS.
 
 <h2 id="__init__"><code>__init__</code></h2>
 
@@ -2450,7 +2446,6 @@ Use `self.session()` or `self.cached_session()` instead.
 * `DATASET_CLASS = None` <a id="DATASET_CLASS"></a>
 * `DL_EXTRACT_RESULT = None` <a id="DL_EXTRACT_RESULT"></a>
 * `EXAMPLE_DIR = None` <a id="EXAMPLE_DIR"></a>
-* `INTERNAL_DATASET = False` <a id="INTERNAL_DATASET"></a>
 * `MOCK_MONARCH = True` <a id="MOCK_MONARCH"></a>
 * `MOCK_OUT_FORBIDDEN_OS_FUNCTIONS = True`
 <a id="MOCK_OUT_FORBIDDEN_OS_FUNCTIONS"></a>
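
The last check listed in the `DatasetBuilderTestCase` docstring above (an example must not appear in two splits unless one of them is listed in `OVERLAPPING_SPLITS`) can be sketched as follows. This is an illustrative stand-in, not the code the test case runs. `find_overlapping_examples` is a hypothetical function, and identifying each example by a hashable key is an assumption, since the diff does not show how examples are compared.

```python
def find_overlapping_examples(splits, overlapping_splits=()):
    """Map each example key found in more than one split to those split names.

    Hypothetical helper: `splits` maps a split name to an iterable of hashable
    example keys. Splits named in `overlapping_splits` are exempt from the check.
    """
    seen = {}
    for split_name, keys in splits.items():
        if split_name in overlapping_splits:
            continue  # whitelisted split: allowed to share examples with others
        for key in set(keys):
            seen.setdefault(key, []).append(split_name)
    return {key: names for key, names in seen.items() if len(names) > 1}


splits = {"train": ["a", "b", "c"], "test": ["c", "d"], "validation": ["e"]}
overlaps = find_overlapping_examples(splits)
allowed = find_overlapping_examples(splits, overlapping_splits=("test",))
print(overlaps, allowed)
```

Exempting a split by name mirrors the docstring's wording that the check passes when "one of these splits" is whitelisted.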
