Automated documentation update

Conchylicultor · copybara-github · commit 3c69b42456fb · 2020-02-21T15:17:20.000-08:00
PiperOrigin-RevId: 296520398
diff --git a/docs/catalog/_toc.yaml b/docs/catalog/_toc.yaml
@@ -4,6 +4,8 @@ toc:
 - section:
   - path: /datasets/catalog/groove
     title: groove
+  - path: /datasets/catalog/librispeech
+    title: librispeech
   - path: /datasets/catalog/nsynth
     title: nsynth
   title: Audio
@@ -242,6 +244,8 @@ toc:
     title: glue
   - path: /datasets/catalog/imdb_reviews
     title: imdb_reviews
+  - path: /datasets/catalog/librispeech_lm
+    title: librispeech_lm
   - path: /datasets/catalog/lm1b
     title: lm1b
   - path: /datasets/catalog/math_dataset
diff --git a/docs/catalog/beans.md b/docs/catalog/beans.md
@@ -28,7 +28,7 @@ and collected by the Makerere AI research lab.
 *   **Dataset size**: `171.63 MiB`
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes (test, validation), Only when `shuffle_files=False` (train)
+    Yes (validation, test), Only when `shuffle_files=False` (train)
 *   **Splits**:
 
 Split        | Examples
diff --git a/docs/catalog/c4.md b/docs/catalog/c4.md
@@ -44,7 +44,7 @@ https://www.tensorflow.org/datasets/beam_datasets.
     `manual_dir`.
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/docs/catalog/cityscapes.md b/docs/catalog/cityscapes.md
@@ -60,7 +60,7 @@ get the files.
     Other configs do require additional files - please see code for more details.
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/docs/catalog/cnn_dailymail.md b/docs/catalog/cnn_dailymail.md
@@ -31,7 +31,7 @@ each highlight, which is the target summary
 *   **Dataset size**: `Unknown size`
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/docs/catalog/image_label_folder.md b/docs/catalog/image_label_folder.md
@@ -31,7 +31,7 @@ Generic image classification dataset.
     This is a 'template' dataset.
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/docs/catalog/librispeech.md b/docs/catalog/librispeech.md
@@ -0,0 +1,75 @@
+<div itemscope itemtype="http://schema.org/Dataset">
+  <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
+    <meta itemprop="name" content="TensorFlow Datasets" />
+  </div>
+
+  <meta itemprop="name" content="librispeech" />
+  <meta itemprop="description" content="LibriSpeech is a corpus of approximately 1000 hours of read English speech with sampling rate of 16 kHz,&#10;prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read&#10;audiobooks from the LibriVox project, and has been carefully segmented and aligned.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;librispeech&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10;  print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;information on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
+  <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/librispeech" />
+  <meta itemprop="sameAs" content="http://www.openslr.org/12" />
+  <meta itemprop="citation" content="@inproceedings{panayotov2015librispeech,&#10;  title={Librispeech: an ASR corpus based on public domain audio books},&#10;  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},&#10;  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},&#10;  pages={5206--5210},&#10;  year={2015},&#10;  organization={IEEE}&#10;}&#10;" />
+</div>
+
+# `librispeech`
+
+*   **Description**:
+
+LibriSpeech is a corpus of approximately 1000 hours of read English speech with
+sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of
+Daniel Povey. The data is derived from read audiobooks from the LibriVox
+project, and has been carefully segmented and aligned.
+
+*   **Homepage**: [http://www.openslr.org/12](http://www.openslr.org/12)
+*   **Source code**:
+    [`tfds.audio.librispeech.Librispeech`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/audio/librispeech.py)
+*   **Versions**:
+    *   **`1.1.0`** (default): No release notes.
+*   **Download size**: `Unknown size`
+*   **Dataset size**: `Unknown size`
+*   **Auto-cached**
+    ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
+    Unknown
+*   **Splits**:
+
+Split | Examples
+:---- | -------:
+
+*   **Features**:
+
+```python
+FeaturesDict({
+    'chapter_id': Tensor(shape=(), dtype=tf.int64),
+    'id': Tensor(shape=(), dtype=tf.string),
+    'speaker_id': Tensor(shape=(), dtype=tf.int64),
+    'speech': Audio(shape=(None,), dtype=tf.int64),
+    'text': Text(shape=(), dtype=tf.string),
+})
+```
+
+*   **Supervised keys** (See
+    [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)):
+    `('speech', 'text')`
+*   **Citation**:
+
+```
+@inproceedings{panayotov2015librispeech,
+  title={Librispeech: an ASR corpus based on public domain audio books},
+  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
+  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
+  pages={5206--5210},
+  year={2015},
+  organization={IEEE}
+}
+```
+
+## librispeech/plain_text (default config)
+
+*   **Config description**: Transcriptions are in plain text.
+
+## librispeech/subwords8k
+
+*   **Config description**: Transcriptions use the SubwordTextEncoder.
+
+## librispeech/subwords32k
+
+*   **Config description**: Transcriptions use the SubwordTextEncoder.
diff --git a/docs/catalog/librispeech_lm.md b/docs/catalog/librispeech_lm.md
@@ -0,0 +1,57 @@
+<div itemscope itemtype="http://schema.org/Dataset">
+  <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
+    <meta itemprop="name" content="TensorFlow Datasets" />
+  </div>
+
+  <meta itemprop="name" content="librispeech_lm" />
+  <meta itemprop="description" content="Language modeling resources to be used in conjunction with the LibriSpeech ASR corpus.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;librispeech_lm&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10;  print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;information on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
+  <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/librispeech_lm" />
+  <meta itemprop="sameAs" content="http://www.openslr.org/11" />
+  <meta itemprop="citation" content="@inproceedings{panayotov2015librispeech,&#10;  title={Librispeech: an ASR corpus based on public domain audio books},&#10;  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},&#10;  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},&#10;  pages={5206--5210},&#10;  year={2015},&#10;  organization={IEEE}&#10;}&#10;" />
+</div>
+
+# `librispeech_lm`
+
+*   **Description**:
+
+Language modeling resources to be used in conjunction with the LibriSpeech ASR
+corpus.
+
+*   **Homepage**: [http://www.openslr.org/11](http://www.openslr.org/11)
+*   **Source code**:
+    [`tfds.text.librispeech_lm.LibrispeechLm`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/text/librispeech_lm.py)
+*   **Versions**:
+    *   **`0.1.0`** (default): No release notes.
+*   **Download size**: `Unknown size`
+*   **Dataset size**: `Unknown size`
+*   **Auto-cached**
+    ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
+    Unknown
+*   **Splits**:
+
+Split | Examples
+:---- | -------:
+
+*   **Features**:
+
+```python
+FeaturesDict({
+    'text': Text(shape=(), dtype=tf.string),
+})
+```
+
+*   **Supervised keys** (See
+    [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)):
+    `('text', 'text')`
+*   **Citation**:
+
+```
+@inproceedings{panayotov2015librispeech,
+  title={Librispeech: an ASR corpus based on public domain audio books},
+  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
+  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
+  pages={5206--5210},
+  year={2015},
+  organization={IEEE}
+}
+```
diff --git a/docs/catalog/opinosis.md b/docs/catalog/opinosis.md
@@ -2,14 +2,12 @@
   <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
     <meta itemprop="name" content="TensorFlow Datasets" />
   </div>
-
   <meta itemprop="name" content="opinosis" />
   <meta itemprop="description" content="&#10;The Opinosis Opinion Dataset consists of sentences extracted from reviews for 51 topics.&#10;Topics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;opinosis&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10;  print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;informations on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
   <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/opinosis" />
   <meta itemprop="sameAs" content="http://kavita-ganesan.com/opinosis/" />
   <meta itemprop="citation" content="&#10;@inproceedings{ganesan2010opinosis,&#10;  title={Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions},&#10;  author={Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei},&#10;  booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},&#10;  pages={340--348},&#10;  year={2010},&#10;  organization={Association for Computational Linguistics}&#10;}&#10;" />
 </div>
-
 # `opinosis`
 
 *   **Description**:
@@ -43,7 +41,6 @@ FeaturesDict({
     'summaries': Sequence(Text(shape=(), dtype=tf.string)),
 })
 ```
-
 *   **Supervised keys** (See
     [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)):
     `('review_sents', 'summaries')`
diff --git a/docs/catalog/overview.md b/docs/catalog/overview.md
@@ -36,6 +36,7 @@ np_datasets = tfds.as_numpy(datasets)
 
 *   `Audio`
     *   [`groove`](groove.md)
+    *   [`librispeech`](librispeech.md)
     *   [`nsynth`](nsynth.md)
 *   `Image`
     *   [`abstract_reasoning`](abstract_reasoning.md)
@@ -155,6 +156,7 @@ np_datasets = tfds.as_numpy(datasets)
     *   [`gap`](gap.md)
     *   [`glue`](glue.md)
     *   [`imdb_reviews`](imdb_reviews.md)
+    *   [`librispeech_lm`](librispeech_lm.md)
     *   [`lm1b`](lm1b.md)
     *   [`math_dataset`](math_dataset.md)
     *   [`movie_rationales`](movie_rationales.md)
diff --git a/docs/catalog/qa4mre.md b/docs/catalog/qa4mre.md
@@ -2,14 +2,12 @@
   <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
     <meta itemprop="name" content="TensorFlow Datasets" />
   </div>
-
   <meta itemprop="name" content="qa4mre" />
   <meta itemprop="description" content="&#10;QA4MRE dataset was created for the CLEF 2011/2012/2013 shared tasks to promote research in &#10;question answering and reading comprehension. The dataset contains a supporting &#10;passage and a set of questions corresponding to the passage. Multiple options &#10;for answers are provided for each question, of which only one is correct. The &#10;training and test datasets are available for the main track.&#10;Additional gold standard documents are available for two pilot studies: one on &#10;alzheimers data, and the other on entrance exams data.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;qa4mre&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10;  print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;informations on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
   <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/qa4mre" />
   <meta itemprop="sameAs" content="http://nlp.uned.es/clef-qa/repository/pastCampaigns.php" />
   <meta itemprop="citation" content="&#10;@InProceedings{10.1007/978-3-642-40802-1_29,&#10;author=&quot;Pe{\~{n}}as, Anselmo&#10;and Hovy, Eduard&#10;and Forner, Pamela&#10;and Rodrigo, {\&#x27;A}lvaro&#10;and Sutcliffe, Richard&#10;and Morante, Roser&quot;,&#10;editor=&quot;Forner, Pamela&#10;and M{\&quot;u}ller, Henning&#10;and Paredes, Roberto&#10;and Rosso, Paolo&#10;and Stein, Benno&quot;,&#10;title=&quot;QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation&quot;,&#10;booktitle=&quot;Information Access Evaluation. Multilinguality, Multimodality, and Visualization&quot;,&#10;year=&quot;2013&quot;,&#10;publisher=&quot;Springer Berlin Heidelberg&quot;,&#10;address=&quot;Berlin, Heidelberg&quot;,&#10;pages=&quot;303--320&quot;,&#10;abstract=&quot;This paper describes the methodology for testing the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the attempt of the QA4MRE challenge which was run as a Lab at CLEF 2011--2013. The traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts and in which systems were required to choose one answer, by analysing the corresponding test document in conjunction with background text collections provided by the organization. Four different tasks have been organized during these years: Main Task, Processing Modality and Negation for Machine Reading, Machine Reading of Biomedical Texts about Alzheimer&#x27;s disease, and Entrance Exams. This paper describes their motivation, their goals, their methodology for preparing the data sets, their background collections, their metrics used for the evaluation, and the lessons learned along these three years.&quot;,&#10;isbn=&quot;978-3-642-40802-1&quot;&#10;}&#10;" />
 </div>
-
 # `qa4mre`
 
 *   **Description**:
diff --git a/docs/catalog/so2sat.md b/docs/catalog/so2sat.md
@@ -37,7 +37,7 @@ http://creativecommons.org/licenses/by/4.0
 *   **Dataset size**: `Unknown size`
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/docs/catalog/wmt18_translate.md b/docs/catalog/wmt18_translate.md
@@ -49,7 +49,7 @@ builder = tfds.builder("wmt_translate", config=config)
     be downloaded.
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes
+    Unknown
 *   **Splits**:
 
 Split | Examples
diff --git a/tensorflow_datasets/testing/metadata/missing.txt b/tensorflow_datasets/testing/metadata/missing.txt
@@ -16,6 +16,10 @@ cnn_dailymail/subwords32k/3.0.0
 diabetic_retinopathy_detection/btgraham-300/1.0.0
 glue/ax/0.0.2
 image_label_folder/2.0.0
+librispeech/plain_text/1.1.0
+librispeech/subwords32k/1.1.0
+librispeech/subwords8k/1.1.0
+librispeech_lm/0.1.0
 oxford_iiit_pet/1.2.0
 so2sat/all/0.0.1
 so2sat/all/2.0.0
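
Note on the `missing.txt` entries above: they appear to follow a `dataset[/config]/version` layout, so `librispeech` is listed once per config while `librispeech_lm`, which has no configs, is listed once. That layout is inferred from the entries in this diff, not from any documented format. A minimal sketch of splitting such entries, where `MetadataEntry` and `parse_entry` are hypothetical names for illustration and not part of `tensorflow_datasets`:

```python
from typing import NamedTuple, Optional


class MetadataEntry(NamedTuple):
    """One line of missing.txt, split into its parts (hypothetical helper)."""
    dataset: str
    config: Optional[str]  # None when the dataset has no configs
    version: str


def parse_entry(line: str) -> MetadataEntry:
    """Parse 'dataset/version' or 'dataset/config/version' (assumed layout)."""
    parts = line.strip().split("/")
    if len(parts) == 2:
        dataset, version = parts
        return MetadataEntry(dataset, None, version)
    if len(parts) == 3:
        dataset, config, version = parts
        return MetadataEntry(dataset, config, version)
    raise ValueError(f"Unexpected entry layout: {line!r}")


# The four entries added by this commit.
added_entries = [
    "librispeech/plain_text/1.1.0",
    "librispeech/subwords32k/1.1.0",
    "librispeech/subwords8k/1.1.0",
    "librispeech_lm/0.1.0",
]

parsed = [parse_entry(entry) for entry in added_entries]
for entry in parsed:
    print(entry)
```

Under this reading, the three `librispeech` entries correspond to the `plain_text`, `subwords8k`, and `subwords32k` configs documented in `librispeech.md`, all at version `1.1.0`.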