Commit 8175693

Merge pull request #13 from tensorflow/master
sync from tfds-master
2 parents c52b969 + 90c0b38 commit 8175693

617 files changed, with 67053 additions and 13971 deletions.
README.md

Lines changed: 15 additions & 1 deletion
@@ -24,13 +24,16 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
 ```sh
 pip install tensorflow-datasets

-# Requires TF 1.12+ to be installed.
+# Requires TF 1.13+ to be installed.
 # Some datasets require additional libraries; see setup.py extras_require
 pip install tensorflow
 # or:
 pip install tensorflow-gpu
 ```

+Join [our Google group](https://groups.google.com/forum/#!forum/tensorflow-datasets-public-announce)
+to receive updates on the project.
+
 ### Usage

 ```python
@@ -111,6 +114,17 @@ print(info)
 )
 ```

+You can also get details about the classes (number of classes and their names).
+
+```python
+info = tfds.builder('cats_vs_dogs').info
+
+info.features['label'].num_classes # 2
+info.features['label'].names # ['cat', 'dog']
+info.features['label'].int2str(1) # "dog"
+info.features['label'].str2int('cat') # 0
+```
+
 ### NumPy Usage with `tfds.as_numpy`

 As a convenience for users that want simple NumPy arrays in their programs, you
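
The label lookups added to the README in this hunk (`num_classes`, `names`, `int2str`, `str2int`) amount to a two-way mapping between class names and integer ids. The sketch below is a pure-Python stand-in that mirrors those four accessors so the behaviour can be tried without installing TensorFlow Datasets. The class name `LabelVocabulary` is hypothetical and is not part of the library, and nothing here is taken from the library's implementation.

```python
class LabelVocabulary:
    """Hypothetical stand-in for the name/id lookups shown in the README diff."""

    def __init__(self, names):
        # Keep the names in order; the position of a name is its integer id.
        self.names = list(names)
        self._ids = {name: index for index, name in enumerate(self.names)}

    @property
    def num_classes(self):
        return len(self.names)

    def int2str(self, index):
        # Integer id -> class name.
        return self.names[index]

    def str2int(self, name):
        # Class name -> integer id.
        return self._ids[name]


label = LabelVocabulary(["cat", "dog"])
print(label.num_classes, label.names, label.int2str(1), label.str2int("cat"))
```

With the `['cat', 'dog']` names from the README snippet, the stand-in reproduces the values given in its comments: two classes, `int2str(1)` is `"dog"`, and `str2int('cat')` is `0`.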

docs/_book.yaml

Lines changed: 4 additions & 0 deletions
@@ -23,6 +23,10 @@ upper_tabs:
 path: /datasets/splits
 - title: Add a dataset
 path: /datasets/add_dataset
+- title: Add huge datasets
+path: /datasets/beam_datasets
+- title: Store your dataset on GCS
+path: /datasets/gcs
 - name: API
 skip_translation: true
 contents:

docs/_project.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 name: TensorFlow Datasets
-breadcrumb_name: Datasets v1.0.1
+breadcrumb_name: Datasets v1.0.2
 home_url: /datasets/
 parent_project_metadata_path: /_project.yaml
 description: >

docs/add_dataset.md

Lines changed: 70 additions & 21 deletions
@@ -5,21 +5,34 @@ Follow this guide to add a dataset to TFDS.
 See our [list of datasets](datasets.md) to see if the dataset you want isn't
 already added.

-* [Overview](#overview)
-* [Writing `my_dataset.py`](#writing-my-datasetpy)
-* [Specifying `DatasetInfo`](#specifying-datasetinfo)
-* [`FeatureConnector`s](#featureconnectors)
-* [Downloading and extracting source data](#downloading-and-extracting-source-data)
-* [Manual download and extraction](#manual-download-and-extraction)
-* [Specifying dataset splits](#specifying-dataset-splits)
-* [Writing an example generator](#writing-an-example-generator)
-* [File access and `tf.io.gfile`](#file-access-and-tfiogfile)
-* [Extra dependencies](#extra-dependencies)
-* [Dataset configuration](#dataset-configuration)
-* [Create your own `FeatureConnector`](#create-your-own-featureconnector)
-* [Adding the dataset to `tensorflow/datasets`](#adding-the-dataset-to-tensorflowdatasets)
-* [Large datasets and distributed generation](#large-datasets-and-distributed-generation)
-* [Testing `MyDataset`](#testing-mydataset)
+* [Overview](#overview)
+* [Writing `my_dataset.py`](#writing-my-datasetpy)
+* [Use the default template](#use-the-default-template)
+* [DatasetBuilder](#datasetbuilde)
+* [my_dataset.py](#my-datasetpy)
+* [Specifying `DatasetInfo`](#specifying-datasetinfo)
+* [`FeatureConnector`s](#featureconnectors)
+* [Downloading and extracting source data](#downloading-and-extracting-source-data)
+* [Manual download and extraction](#manual-download-and-extraction)
+* [Specifying dataset splits](#specifying-dataset-splits)
+* [Writing an example generator](#writing-an-example-generator)
+* [File access and `tf.io.gfile`](#file-access-and-tfiogfile)
+* [Extra dependencies](#extra-dependencies)
+* [Corrupted data](#corrupted-data)
+* [Inconsistent data](#inconsistent-data)
+* [Dataset configuration](#dataset-configuration)
+* [Heavy configuration with BuilderConfig](#heavy-configuration-with-builderconfig)
+* [Light configuration with constructor args](#light-configuration-with-constructor-args)
+* [Create your own `FeatureConnector`](#create-your-own-featureconnector)
+* [Adding the dataset to `tensorflow/datasets`](#adding-the-dataset-to-tensorflowdatasets)
+* [1. Add an import for registration](#1-add-an-import-for-registration)
+* [2. Run download_and_prepare locally](#2-run-download-and-prepare-locally)
+* [3. Double-check the citation](#3-double-check-the-citation)
+* [4. Add a test](#4-add-a-test)
+* [5. Check your code style](#5-check-your-code-style)
+* [6. Send for review!](#6-send-for-review)
+* [Large datasets and distributed generation](#large-datasets-and-distributed-generation)
+* [Testing `MyDataset`](#testing-mydataset)

 ## Overview

@@ -49,6 +62,24 @@ generate on a single machine. See the

 ## Writing `my_dataset.py`

+### Use the default template
+
+If you want to
+[contribute to our repo](https://github.com/tensorflow/datasets/blob/master/CONTRIBUTING.md)
+and add a new dataset, the following script will help you get started by
+generating the required python files,...
+To use it, clone the `tfds` repository and run the following command:
+
+```
+python tensorflow_datasets/scripts/create_new_dataset.py \
+--dataset my_dataset \
+--type image # text, audio, translation,...
+```
+
+
+Then search for `TODO(my_dataset)` in the generated files to do the
+modifications.
+
 ### `DatasetBuilder`

 Each dataset is defined as a subclass of
@@ -193,15 +224,15 @@ through [`tfds.Split.subsplit`](splits.md#subsplit).
 # Specify the splits
 return [
 tfds.core.SplitGenerator(
-name="train",
+name=tfds.Split.TRAIN,
 num_shards=10,
 gen_kwargs={
 "images_dir_path": os.path.join(extracted_path, "train"),
 "labels": os.path.join(extracted_path, "train_labels.csv"),
 },
 ),
 tfds.core.SplitGenerator(
-name="test",
+name=tfds.Split.TEST,
 num_shards=1,
 gen_kwargs={
 "images_dir_path": os.path.join(extracted_path, "test"),
@@ -501,17 +532,35 @@ Most datasets in TFDS should have a unit test and your reviewer may ask you
 to add one if you haven't already. See the
 [testing section](#testing-mydataset) below.

-### 5. Send for review!
+### 5. Check your code style
+
+Follow the [PEP 8 Python style guide](https://www.python.org/dev/peps/pep-0008),
+except TensorFlow uses 2 spaces instead of 4. Please conform to the
+[Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md),
+
+Most importantly, use
+[`tensorflow_datasets/oss_scripts/lint.sh`](https://github.com/tensorflow/datasets/blob/master/oss_scripts/lint.sh)
+to ensure your code is properly formatted. For example, to lint the `image`
+directory:
+
+```sh
+./oss_scripts/lint.sh tensorflow_datasets/image
+```
+
+See
+[TensorFlow code style guide](https://www.tensorflow.org/community/contribute/code_style)
+for more information.
+
+### 6. Send for review!

 Send the pull request for review.


 ## Large datasets and distributed generation

 Some datasets are so large as to require multiple machines to download and
-generate. We intend to soon support this use case using Apache Beam. Follow
-[our tracking issue](https://github.com/tensorflow/datasets/issues/10)
-to be updated.
+generate. We support this use case using Apache Beam. Please read the
+[Beam Dataset Guide](beam_datasets.md) to get started.

 ## Testing MyDataset

docs/api_docs/python/_redirects.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@ redirects:
 to: /datasets/api_docs/python/tfds/download/GenerateMode
 - from: /datasets/api_docs/python/tfds/testing/FeatureExpectationsTestCase/failureException
 to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
+- from: /datasets/api_docs/python/tfds/testing/SubTestCase/failureException
+to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
 - from: /datasets/api_docs/python/tfds/testing/TestCase/failureException
 to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
 - from: /datasets/api_docs/python/tfds/features/text
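
The `from`/`to` pairs in `_redirects.yaml` form a lookup table from an old documentation path to its replacement. As a rough illustration of how such a table could be applied, the sketch below resolves a path against a list of entries copied from this hunk. The function name `resolve_redirect` and the `max_hops` guard are assumptions made for the example; the commit does not show how the documentation site actually consumes this file.

```python
def resolve_redirect(path, redirects, max_hops=10):
    """Follow from -> to entries until the path no longer redirects.

    `redirects` is a list of {"from": ..., "to": ...} dicts, mirroring the
    shape of the YAML entries. `max_hops` is a hypothetical safety limit
    that stops a cyclic table from looping forever.
    """
    table = {entry["from"]: entry["to"] for entry in redirects}
    for _ in range(max_hops):
        if path not in table:
            return path
        path = table[path]
    raise ValueError("redirect chain is too long or cyclic")


BASE = "/datasets/api_docs/python/tfds/testing"
redirects = [
    {"from": BASE + "/SubTestCase/failureException",
     "to": BASE + "/DatasetBuilderTestCase/failureException"},
    {"from": BASE + "/TestCase/failureException",
     "to": BASE + "/DatasetBuilderTestCase/failureException"},
]

resolved = resolve_redirect(BASE + "/SubTestCase/failureException", redirects)
print(resolved)
```

A path with no entry in the table is returned unchanged, so the same call works for pages that were never moved.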

docs/api_docs/python/_toc.yaml

Lines changed: 12 additions & 4 deletions
@@ -8,6 +8,10 @@ toc:
 path: /datasets/api_docs/python/tfds/as_numpy
 - title: builder
 path: /datasets/api_docs/python/tfds/builder
+- title: disable_progress_bar
+path: /datasets/api_docs/python/tfds/disable_progress_bar
+- title: is_dataset_on_gcs
+path: /datasets/api_docs/python/tfds/is_dataset_on_gcs
 - title: list_builders
 path: /datasets/api_docs/python/tfds/list_builders
 - title: load
@@ -20,6 +24,8 @@ toc:
 section:
 - title: Overview
 path: /datasets/api_docs/python/tfds/core
+- title: BeamBasedBuilder
+path: /datasets/api_docs/python/tfds/core/BeamBasedBuilder
 - title: BuilderConfig
 path: /datasets/api_docs/python/tfds/core/BuilderConfig
 - title: DatasetBuilder
@@ -32,6 +38,10 @@ toc:
 path: /datasets/api_docs/python/tfds/core/get_tfds_path
 - title: lazy_imports
 path: /datasets/api_docs/python/tfds/core/lazy_imports
+- title: Metadata
+path: /datasets/api_docs/python/tfds/core/Metadata
+- title: MetadataDict
+path: /datasets/api_docs/python/tfds/core/MetadataDict
 - title: NamedSplit
 path: /datasets/api_docs/python/tfds/core/NamedSplit
 - title: SplitBase
@@ -82,8 +92,6 @@ toc:
 path: /datasets/api_docs/python/tfds/features/Image
 - title: Sequence
 path: /datasets/api_docs/python/tfds/features/Sequence
-- title: SequenceDict
-path: /datasets/api_docs/python/tfds/features/SequenceDict
 - title: Tensor
 path: /datasets/api_docs/python/tfds/features/Tensor
 - title: TensorInfo
@@ -112,8 +120,6 @@ toc:
 section:
 - title: Overview
 path: /datasets/api_docs/python/tfds/file_adapter
-- title: CSVAdapter
-path: /datasets/api_docs/python/tfds/file_adapter/CSVAdapter
 - title: FileFormatAdapter
 path: /datasets/api_docs/python/tfds/file_adapter/FileFormatAdapter
 - title: TFRecordExampleAdapter
@@ -142,6 +148,8 @@ toc:
 path: /datasets/api_docs/python/tfds/testing/rm_tmp_dir
 - title: run_in_graph_and_eager_modes
 path: /datasets/api_docs/python/tfds/testing/run_in_graph_and_eager_modes
+- title: SubTestCase
+path: /datasets/api_docs/python/tfds/testing/SubTestCase
 - title: TestCase
 path: /datasets/api_docs/python/tfds/testing/TestCase
 - title: test_main