Commit c52b969

Merge pull request #10 from tensorflow/master
update master-11 from tfds-master
2 parents d341cee + 32aa040 commit c52b969

166 files changed: +3749 -808 lines changed

CONTRIBUTING.md

Lines changed: 6 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -4,11 +4,14 @@ Thanks for thinking about contributing to our library !
44

55

66
## Before you start
7+
78
* Please accept the [Contributor License Agreement](https://cla.developers.google.com) (see below)
8-
* [Ask here](https://github.com/tensorflow/datasets/issues/142) to be added to
9-
the list of collaborators so that issues can be assigned to you.
109
* Comment on the issue that you plan to work on so we can assign it to you and
11-
there isn't unnecessary duplication of work.
10+
there isn't unnecessary duplication of work. If this is your first time
11+
contributing, we'll send you an invitation on GitHub to be a contributor;
12+
you must accept this invitation
13+
[here](https://github.com/tensorflow/datasets/settings/collaboration)
14+
before we can assign you the issue.
1215
* When you plan to work on something larger (for example, adding new
1316
`FeatureConnectors`), please respond on the issue (or create one if there
1417
isn't one) to explain your plan and give others a chance to discuss.

docs/_index.ipynb

Lines changed: 4 additions & 0 deletions
@@ -62,6 +62,10 @@
6262
"name": "tensorflow/datasets",
6363
"provenance": [],
6464
"version": "0.3.2"
65+
},
66+
"kernelspec": {
67+
"display_name": "Python 3",
68+
"name": "python3"
6569
}
6670
},
6771
"nbformat": 4,

docs/add_dataset.md

Lines changed: 23 additions & 0 deletions
@@ -360,6 +360,11 @@ Note that most datasets will find the [current set of
360360
`tfds.features.FeatureConnector`s](api_docs/python/tfds/features.md)
361361
sufficient, but sometimes a new one may need to be defined.
362362

363+
Note: If you need a new `FeatureConnector` not present in the default set and
364+
are planning to submit it to `tensorflow/datasets`, please open a
365+
[new issue](https://github.com/tensorflow/datasets/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=)
366+
on GitHub with your proposal.
367+
363368
[`tfds.features.FeatureConnector`s](api_docs/python/tfds/features/FeatureConnector.md)
364369
in `DatasetInfo` correspond to the elements returned in the
365370
`tf.data.Dataset` object. For instance, with:
@@ -445,14 +450,27 @@ import to its subdirectory's `__init__.py`
445450

446451
### 2. Run `download_and_prepare` locally.
447452

453+
If you're contributing the dataset to `tensorflow/datasets`, add a checksums
454+
file for the dataset. On first download, the `DownloadManager` will
455+
automatically add the sizes and checksums for all downloaded URLs to that file.
456+
This ensures that on subsequent data generation, the downloaded files are
457+
as expected.
458+
459+
```sh
460+
touch tensorflow_datasets/url_checksums/my_new_dataset.txt
461+
```
462+
448463
Run `download_and_prepare` locally to ensure that data generation works:
449464

450465
```
451466
# default data_dir is ~/tensorflow_datasets
452467
python -m tensorflow_datasets.scripts.download_and_prepare \
468+
--register_checksums \
453469
--datasets=my_new_dataset
454470
```
455471

472+
Note that the `--register_checksums` flag must only be used while in development.
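
The check described above (recording a size and checksum per downloaded URL, then comparing on later runs) can be sketched in plain Python. This is only an illustration, not the `DownloadManager` implementation: the helper names `checksum_record` and `verify`, the use of SHA-256, and the `(url, size, hash)` tuple are assumptions, since this diff does not show the layout of the checksums file.

```python
import hashlib


def checksum_record(url, path, chunk_size=1 << 20):
    """Return (url, size_in_bytes, sha256_hex) for a downloaded file.

    Hypothetical helper; the real checksums file format is not shown here.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return url, size, digest.hexdigest()


def verify(record, path):
    """Recompute size and hash for `path` and compare with a stored record."""
    url, size, sha256_hex = record
    return checksum_record(url, path)[1:] == (size, sha256_hex)
```

Registering corresponds to calling `checksum_record` once and storing the result; later runs would call `verify` and fail if the file changed upstream.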
473+
456474
Copy in the contents of the `dataset_info.json` file(s) to a [GitHub gist](https://gist.github.com/) and link to it in your pull request.
457475

458476

@@ -483,6 +501,11 @@ Most datasets in TFDS should have a unit test and your reviewer may ask you
483501
to add one if you haven't already. See the
484502
[testing section](#testing-mydataset) below.
485503

504+
### 5. Send for review!
505+
506+
Send the pull request for review.
507+
508+
486509
## Large datasets and distributed generation
487510

488511
Some datasets are so large as to require multiple machines to download and

docs/api_docs/python/tfds/Split.md

Lines changed: 2 additions & 2 deletions
@@ -29,8 +29,8 @@ stages of training and evaluation.
2929
model architecture, etc.).
3030
* `TEST`: the testing data. This is the data to report metrics on. Typically
3131
you do not want to use this during model iteration as you may overfit to it.
32-
* `ALL`: Special value corresponding to all existing splits of a dataset
33-
merged together
32+
* `ALL`: Special value, never defined by a dataset, but corresponding to all
33+
defined splits of a dataset merged together.
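
The semantics of `ALL` can be illustrated with a small plain-Python sketch. It does not use the TFDS API: `defined_splits` and `read_split` are hypothetical stand-ins chosen for the example, and the only point is that `"all"` is never a key a dataset defines but resolves to every defined split merged together.

```python
from itertools import chain

# Hypothetical stand-in for a prepared dataset: split name -> examples.
defined_splits = {
    "train": ["t0", "t1", "t2"],
    "validation": ["v0"],
    "test": ["s0", "s1"],
}


def read_split(splits, name):
    """Return the examples of one split, or of all splits for "all"."""
    if name == "all":
        # "all" is resolved by the reader, not stored by the dataset.
        return list(chain.from_iterable(splits.values()))
    return list(splits[name])
```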
3434

3535
Note: All splits, including compositions inherit from <a href="../tfds/core/SplitBase.md"><code>tfds.core.SplitBase</code></a>
3636

docs/api_docs/python/tfds/_api_cache.json

Lines changed: 16 additions & 4 deletions
@@ -1,5 +1,5 @@
11
{
2-
"current_doc_full_name": "tfds.core.GeneratorBasedBuilder.__getattribute__",
2+
"current_doc_full_name": "tfds.core.Version.__sizeof__",
33
"duplicate_of": {
44
"tfds.GenerateMode": "tfds.download.GenerateMode",
55
"tfds.GenerateMode.FORCE_REDOWNLOAD": "tfds.download.GenerateMode.FORCE_REDOWNLOAD",
@@ -50,6 +50,7 @@
5050
"tfds.core.GeneratorBasedBuilder.__str__": "tfds.core.BuilderConfig.__str__",
5151
"tfds.core.GeneratorBasedBuilder.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
5252
"tfds.core.GeneratorBasedBuilder.builder_config": "tfds.core.DatasetBuilder.builder_config",
53+
"tfds.core.GeneratorBasedBuilder.data_dir": "tfds.core.DatasetBuilder.data_dir",
5354
"tfds.core.GeneratorBasedBuilder.info": "tfds.core.DatasetBuilder.info",
5455
"tfds.core.NamedSplit.__delattr__": "tfds.core.BuilderConfig.__delattr__",
5556
"tfds.core.NamedSplit.__format__": "tfds.core.BuilderConfig.__format__",
@@ -497,6 +498,7 @@
497498
"tfds.testing.DummyDatasetSharedGenerator.__str__": "tfds.core.BuilderConfig.__str__",
498499
"tfds.testing.DummyDatasetSharedGenerator.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
499500
"tfds.testing.DummyDatasetSharedGenerator.builder_config": "tfds.core.DatasetBuilder.builder_config",
501+
"tfds.testing.DummyDatasetSharedGenerator.data_dir": "tfds.core.DatasetBuilder.data_dir",
500502
"tfds.testing.DummyDatasetSharedGenerator.info": "tfds.core.DatasetBuilder.info",
501503
"tfds.testing.DummyMnist.BUILDER_CONFIGS": "tfds.core.DatasetBuilder.BUILDER_CONFIGS",
502504
"tfds.testing.DummyMnist.__abstractmethods__": "tfds.core.NamedSplit.__abstractmethods__",
@@ -513,6 +515,7 @@
513515
"tfds.testing.DummyMnist.__str__": "tfds.core.BuilderConfig.__str__",
514516
"tfds.testing.DummyMnist.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
515517
"tfds.testing.DummyMnist.builder_config": "tfds.core.DatasetBuilder.builder_config",
518+
"tfds.testing.DummyMnist.data_dir": "tfds.core.DatasetBuilder.data_dir",
516519
"tfds.testing.DummyMnist.info": "tfds.core.DatasetBuilder.info",
517520
"tfds.testing.FeatureExpectationItem.__delattr__": "tfds.core.BuilderConfig.__delattr__",
518521
"tfds.testing.FeatureExpectationItem.__format__": "tfds.core.BuilderConfig.__format__",
@@ -641,6 +644,7 @@
641644
"tfds.core.BuilderConfig.version": true,
642645
"tfds.core.DatasetBuilder": false,
643646
"tfds.core.DatasetBuilder.BUILDER_CONFIGS": true,
647+
"tfds.core.DatasetBuilder.GOOGLE_DISABLED": true,
644648
"tfds.core.DatasetBuilder.IN_DEVELOPMENT": true,
645649
"tfds.core.DatasetBuilder.VERSION": true,
646650
"tfds.core.DatasetBuilder.__abstractmethods__": true,
@@ -664,6 +668,7 @@
664668
"tfds.core.DatasetBuilder.as_dataset": true,
665669
"tfds.core.DatasetBuilder.builder_config": true,
666670
"tfds.core.DatasetBuilder.builder_configs": true,
671+
"tfds.core.DatasetBuilder.data_dir": true,
667672
"tfds.core.DatasetBuilder.download_and_prepare": true,
668673
"tfds.core.DatasetBuilder.info": true,
669674
"tfds.core.DatasetBuilder.name": true,
@@ -690,13 +695,13 @@
690695
"tfds.core.DatasetInfo.citation": true,
691696
"tfds.core.DatasetInfo.compute_dynamic_properties": true,
692697
"tfds.core.DatasetInfo.description": true,
693-
"tfds.core.DatasetInfo.download_checksums": true,
694698
"tfds.core.DatasetInfo.features": true,
695699
"tfds.core.DatasetInfo.full_name": true,
696700
"tfds.core.DatasetInfo.initialize_from_bucket": true,
697701
"tfds.core.DatasetInfo.initialized": true,
698702
"tfds.core.DatasetInfo.name": true,
699703
"tfds.core.DatasetInfo.read_from_directory": true,
704+
"tfds.core.DatasetInfo.redistribution_info": true,
700705
"tfds.core.DatasetInfo.size_in_bytes": true,
701706
"tfds.core.DatasetInfo.splits": true,
702707
"tfds.core.DatasetInfo.supervised_keys": true,
@@ -706,6 +711,7 @@
706711
"tfds.core.DatasetInfo.write_to_directory": true,
707712
"tfds.core.GeneratorBasedBuilder": false,
708713
"tfds.core.GeneratorBasedBuilder.BUILDER_CONFIGS": true,
714+
"tfds.core.GeneratorBasedBuilder.GOOGLE_DISABLED": true,
709715
"tfds.core.GeneratorBasedBuilder.IN_DEVELOPMENT": true,
710716
"tfds.core.GeneratorBasedBuilder.VERSION": true,
711717
"tfds.core.GeneratorBasedBuilder.__abstractmethods__": true,
@@ -729,6 +735,7 @@
729735
"tfds.core.GeneratorBasedBuilder.as_dataset": true,
730736
"tfds.core.GeneratorBasedBuilder.builder_config": true,
731737
"tfds.core.GeneratorBasedBuilder.builder_configs": true,
738+
"tfds.core.GeneratorBasedBuilder.data_dir": true,
732739
"tfds.core.GeneratorBasedBuilder.download_and_prepare": true,
733740
"tfds.core.GeneratorBasedBuilder.info": true,
734741
"tfds.core.GeneratorBasedBuilder.name": true,
@@ -960,12 +967,12 @@
960967
"tfds.download.DownloadManager.download": true,
961968
"tfds.download.DownloadManager.download_and_extract": true,
962969
"tfds.download.DownloadManager.download_kaggle_data": true,
963-
"tfds.download.DownloadManager.download_sizes": true,
970+
"tfds.download.DownloadManager.downloaded_size": true,
964971
"tfds.download.DownloadManager.extract": true,
965972
"tfds.download.DownloadManager.iter_archive": true,
966973
"tfds.download.DownloadManager.manual_dir": true,
967-
"tfds.download.DownloadManager.recorded_download_checksums": true,
968974
"tfds.download.ExtractMethod": false,
975+
"tfds.download.ExtractMethod.BZIP2": true,
969976
"tfds.download.ExtractMethod.GZIP": true,
970977
"tfds.download.ExtractMethod.NO_EXTRACT": true,
971978
"tfds.download.ExtractMethod.TAR": true,
@@ -1659,6 +1666,7 @@
16591666
"tfds.testing.DatasetBuilderTestCase.BUILDER_CONFIG_NAMES_TO_TEST": true,
16601667
"tfds.testing.DatasetBuilderTestCase.DATASET_CLASS": true,
16611668
"tfds.testing.DatasetBuilderTestCase.DL_EXTRACT_RESULT": true,
1669+
"tfds.testing.DatasetBuilderTestCase.EXAMPLE_DIR": true,
16621670
"tfds.testing.DatasetBuilderTestCase.INTERNAL_DATASET": true,
16631671
"tfds.testing.DatasetBuilderTestCase.MOCK_MONARCH": true,
16641672
"tfds.testing.DatasetBuilderTestCase.MOCK_OUT_FORBIDDEN_OS_FUNCTIONS": true,
@@ -1836,6 +1844,7 @@
18361844
"tfds.testing.DatasetBuilderTestCase.test_session": true,
18371845
"tfds.testing.DummyDatasetSharedGenerator": false,
18381846
"tfds.testing.DummyDatasetSharedGenerator.BUILDER_CONFIGS": true,
1847+
"tfds.testing.DummyDatasetSharedGenerator.GOOGLE_DISABLED": true,
18391848
"tfds.testing.DummyDatasetSharedGenerator.IN_DEVELOPMENT": true,
18401849
"tfds.testing.DummyDatasetSharedGenerator.VERSION": true,
18411850
"tfds.testing.DummyDatasetSharedGenerator.__abstractmethods__": true,
@@ -1859,11 +1868,13 @@
18591868
"tfds.testing.DummyDatasetSharedGenerator.as_dataset": true,
18601869
"tfds.testing.DummyDatasetSharedGenerator.builder_config": true,
18611870
"tfds.testing.DummyDatasetSharedGenerator.builder_configs": true,
1871+
"tfds.testing.DummyDatasetSharedGenerator.data_dir": true,
18621872
"tfds.testing.DummyDatasetSharedGenerator.download_and_prepare": true,
18631873
"tfds.testing.DummyDatasetSharedGenerator.info": true,
18641874
"tfds.testing.DummyDatasetSharedGenerator.name": true,
18651875
"tfds.testing.DummyMnist": false,
18661876
"tfds.testing.DummyMnist.BUILDER_CONFIGS": true,
1877+
"tfds.testing.DummyMnist.GOOGLE_DISABLED": true,
18671878
"tfds.testing.DummyMnist.IN_DEVELOPMENT": true,
18681879
"tfds.testing.DummyMnist.VERSION": true,
18691880
"tfds.testing.DummyMnist.__abstractmethods__": true,
@@ -1887,6 +1898,7 @@
18871898
"tfds.testing.DummyMnist.as_dataset": true,
18881899
"tfds.testing.DummyMnist.builder_config": true,
18891900
"tfds.testing.DummyMnist.builder_configs": true,
1901+
"tfds.testing.DummyMnist.data_dir": true,
18901902
"tfds.testing.DummyMnist.download_and_prepare": true,
18911903
"tfds.testing.DummyMnist.info": true,
18921904
"tfds.testing.DummyMnist.name": true,

docs/api_docs/python/tfds/core/DatasetBuilder.md

Lines changed: 8 additions & 0 deletions
@@ -2,11 +2,13 @@
22
<meta itemprop="name" content="tfds.core.DatasetBuilder" />
33
<meta itemprop="path" content="Stable" />
44
<meta itemprop="property" content="builder_config"/>
5+
<meta itemprop="property" content="data_dir"/>
56
<meta itemprop="property" content="info"/>
67
<meta itemprop="property" content="__init__"/>
78
<meta itemprop="property" content="as_dataset"/>
89
<meta itemprop="property" content="download_and_prepare"/>
910
<meta itemprop="property" content="BUILDER_CONFIGS"/>
11+
<meta itemprop="property" content="GOOGLE_DISABLED"/>
1012
<meta itemprop="property" content="IN_DEVELOPMENT"/>
1113
<meta itemprop="property" content="VERSION"/>
1214
<meta itemprop="property" content="builder_configs"/>
@@ -86,6 +88,10 @@ Callers must pass arguments as keyword arguments.
8688

8789
<a href="../../tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a> for this builder.
8890

91+
<h3 id="data_dir"><code>data_dir</code></h3>
92+
93+
94+
8995
<h3 id="info"><code>info</code></h3>
9096

9197
<a href="../../tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a> for this builder.
@@ -161,6 +167,8 @@ Downloads and prepares dataset for reading.
161167

162168
<h3 id="BUILDER_CONFIGS"><code>BUILDER_CONFIGS</code></h3>
163169

170+
<h3 id="GOOGLE_DISABLED"><code>GOOGLE_DISABLED</code></h3>
171+
164172
<h3 id="IN_DEVELOPMENT"><code>IN_DEVELOPMENT</code></h3>
165173

166174
<h3 id="VERSION"><code>VERSION</code></h3>

docs/api_docs/python/tfds/core/DatasetInfo.md

Lines changed: 11 additions & 6 deletions
@@ -5,11 +5,11 @@
55
<meta itemprop="property" content="as_proto"/>
66
<meta itemprop="property" content="citation"/>
77
<meta itemprop="property" content="description"/>
8-
<meta itemprop="property" content="download_checksums"/>
98
<meta itemprop="property" content="features"/>
109
<meta itemprop="property" content="full_name"/>
1110
<meta itemprop="property" content="initialized"/>
1211
<meta itemprop="property" content="name"/>
12+
<meta itemprop="property" content="redistribution_info"/>
1313
<meta itemprop="property" content="size_in_bytes"/>
1414
<meta itemprop="property" content="splits"/>
1515
<meta itemprop="property" content="supervised_keys"/>
@@ -52,7 +52,8 @@ __init__(
5252
features=None,
5353
supervised_keys=None,
5454
urls=None,
55-
citation=None
55+
citation=None,
56+
redistribution_info=None
5657
)
5758
```
5859

@@ -69,6 +70,10 @@ Constructs DatasetInfo.
6970
supervised learning, if applicable for the dataset.
7071
* <b>`urls`</b>: `list(str)`, optional, the homepage(s) for this dataset.
7172
* <b>`citation`</b>: `str`, optional, the citation to use for this dataset.
73+
* <b>`redistribution_info`</b>: `dict`, optional, information needed for
74+
redistribution, as specified in `dataset_info_pb2.RedistributionInfo`.
75+
The content of the `license` subfield will automatically be written to a
76+
LICENSE file stored with the dataset.
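
As a rough illustration of the `license` behaviour described for `redistribution_info`, the sketch below writes that subfield to a `LICENSE` file in a dataset directory. The helper `write_license` is hypothetical and is not the library's code; it assumes `redistribution_info` is a plain `dict` with an optional `"license"` key, as the argument description suggests.

```python
import os


def write_license(redistribution_info, dataset_dir):
    """Write the `license` subfield to <dataset_dir>/LICENSE, if present.

    Hypothetical helper for illustration only. Returns the path written,
    or None when there is no license text.
    """
    license_text = (redistribution_info or {}).get("license")
    if not license_text:
        return None
    path = os.path.join(dataset_dir, "LICENSE")
    with open(path, "w", encoding="utf-8") as f:
        f.write(license_text)
    return path
```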
7277

7378

7479

@@ -90,10 +95,6 @@ Constructs DatasetInfo.
9095

9196

9297

93-
<h3 id="download_checksums"><code>download_checksums</code></h3>
94-
95-
96-
9798
<h3 id="features"><code>features</code></h3>
9899

99100

@@ -110,6 +111,10 @@ Whether DatasetInfo has been fully initialized.
110111

111112

112113

114+
<h3 id="redistribution_info"><code>redistribution_info</code></h3>
115+
116+
117+
113118
<h3 id="size_in_bytes"><code>size_in_bytes</code></h3>
114119

115120

docs/api_docs/python/tfds/core/GeneratorBasedBuilder.md

Lines changed: 8 additions & 0 deletions
@@ -2,11 +2,13 @@
22
<meta itemprop="name" content="tfds.core.GeneratorBasedBuilder" />
33
<meta itemprop="path" content="Stable" />
44
<meta itemprop="property" content="builder_config"/>
5+
<meta itemprop="property" content="data_dir"/>
56
<meta itemprop="property" content="info"/>
67
<meta itemprop="property" content="__init__"/>
78
<meta itemprop="property" content="as_dataset"/>
89
<meta itemprop="property" content="download_and_prepare"/>
910
<meta itemprop="property" content="BUILDER_CONFIGS"/>
11+
<meta itemprop="property" content="GOOGLE_DISABLED"/>
1012
<meta itemprop="property" content="IN_DEVELOPMENT"/>
1113
<meta itemprop="property" content="VERSION"/>
1214
<meta itemprop="property" content="builder_configs"/>
@@ -58,6 +60,10 @@ Builder constructor.
5860

5961
<a href="../../tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a> for this builder.
6062

63+
<h3 id="data_dir"><code>data_dir</code></h3>
64+
65+
66+
6167
<h3 id="info"><code>info</code></h3>
6268

6369
<a href="../../tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a> for this builder.
@@ -133,6 +139,8 @@ Downloads and prepares dataset for reading.
133139

134140
<h3 id="BUILDER_CONFIGS"><code>BUILDER_CONFIGS</code></h3>
135141

142+
<h3 id="GOOGLE_DISABLED"><code>GOOGLE_DISABLED</code></h3>
143+
136144
<h3 id="IN_DEVELOPMENT"><code>IN_DEVELOPMENT</code></h3>
137145

138146
<h3 id="VERSION"><code>VERSION</code></h3>

docs/api_docs/python/tfds/core/SplitBase.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ See the
2424
for more information.
2525

2626
There are three parts to the composition:
27-
1) The splits are composed (defined, merged, splitted,...) together before
27+
1) The splits are composed (defined, merged, split,...) together before
2828
calling the `.as_dataset()` function. This is done with the `__add__`,
2929
`__getitem__`, which return a tree of `SplitBase` (whose leaf
3030
are the `NamedSplit` objects)

docs/api_docs/python/tfds/core/lazy_imports.md

Lines changed: 1 addition & 1 deletion
@@ -16,6 +16,6 @@ Defined in [`core/lazy_imports.py`](https://github.com/tensorflow/datasets/tree/
1616
Lazy importer for heavy dependencies.
1717

1818
Some datasets require heavy dependencies for data generation. To allow for
19-
the default installation to remain lean, those heavy depdencies are
19+
the default installation to remain lean, those heavy dependencies are
2020
lazily imported here.
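
A minimal sketch of the lazy-import pattern described above, assuming nothing about how `core/lazy_imports.py` is actually written. The class name `LazyImporter` is hypothetical, and the standard-library `json` module stands in for a heavy optional dependency so the example runs anywhere.

```python
import importlib


class LazyImporter:
    """Imports registered modules on first attribute access, then caches them."""

    def __init__(self, *module_names):
        self._allowed = set(module_names)
        self._loaded = {}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so the private
        # attributes set in __init__ never reach this method.
        if name.startswith("_") or name not in self._allowed:
            raise AttributeError(name)
        if name not in self._loaded:
            try:
                self._loaded[name] = importlib.import_module(name)
            except ImportError as err:
                raise ImportError(
                    f"This dataset needs the optional module {name!r}; "
                    "install it to generate the data."
                ) from err
        return self._loaded[name]


# `json` is a lightweight stand-in; the second name is deliberately missing.
lazy = LazyImporter("json", "not_a_real_module_xyz")
```

Nothing is imported when `lazy` is created; `lazy.json` triggers the import on first use, and a missing module only raises when a dataset actually asks for it.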
2121
