Commit f679fc1

Conchylicultor authored and copybara-github committed

Automated documentation update

PiperOrigin-RevId: 296311032
1 parent ff4e90c commit f679fc1

File tree

9 files changed: +152 additions, -31 deletions


docs/catalog/_toc.yaml

Lines changed: 2 additions & 0 deletions
@@ -210,6 +210,8 @@ toc:
     title: multi_news
   - path: /datasets/catalog/newsroom
     title: newsroom (manual)
+  - path: /datasets/catalog/opinosis
+    title: opinosis
   - path: /datasets/catalog/reddit_tifu
     title: reddit_tifu
   - path: /datasets/catalog/scientific_papers

docs/catalog/beans.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ and collected by the Makerere AI research lab.
 *   **Dataset size**: `171.63 MiB`
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    Yes (validation, test), Only when `shuffle_files=False` (train)
+    Yes (test, validation), Only when `shuffle_files=False` (train)
 *   **Splits**:
 
 Split | Examples
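The auto-caching entry above encodes a two-part rule: the test and validation splits are always auto-cached, while the train split is auto-cached only when `shuffle_files=False`. A minimal sketch of that documented rule as a predicate (`is_auto_cached` is a hypothetical helper for illustration, not part of the `tensorflow_datasets` API):

```python
def is_auto_cached(split: str, shuffle_files: bool = False) -> bool:
    """Return True if the beans split would be auto-cached, per the docs above."""
    if split in ("test", "validation"):
        return True  # always auto-cached
    if split == "train":
        return not shuffle_files  # only when shuffle_files=False
    raise ValueError(f"unknown split: {split!r}")

print(is_auto_cached("train", shuffle_files=True))  # False
```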

docs/catalog/c4.md

Lines changed: 9 additions & 27 deletions
@@ -27,11 +27,12 @@ https://www.tensorflow.org/datasets/beam_datasets.
 *   **Source code**:
     [`tfds.text.c4.C4`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/text/c4.py)
 *   **Versions**:
-    *   **`2.2.0`** (default): No release notes.
+    *   **`2.2.1`** (default): Update dataset_info.json
+    *   `2.2.0`: No release notes.
     *   `1.1.0`: No release notes.
     *   `1.0.1`: No release notes.
     *   `1.0.0`: No release notes.
-*   **Download size**: `6.96 TiB`
+*   **Download size**: `Unknown size`
 *   **Dataset size**: `Unknown size`
 *   **Manual download instructions**: This dataset requires you to download the
     source data manually into `download_config.manual_dir`
@@ -43,7 +44,12 @@ https://www.tensorflow.org/datasets/beam_datasets.
     `manual_dir`.
 *   **Auto-cached**
     ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
-    No
+    Yes
+*   **Splits**:
+
+    Split | Examples
+    :---- | -------:
+
 *   **Features**:
 
 ```python
@@ -74,44 +80,20 @@ FeaturesDict({
 ## c4/en (default config)
 
 *   **Config description**: English C4 dataset.
-*   **Splits**:
-
-    Split        | Examples
-    :----------- | ----------:
-    'train'      | 364,684,602
-    'validation' | 364,525
 
 ## c4/en.noclean
 
 *   **Config description**: Disables all cleaning (deduplication, removal based
     on bad words, etc.)
-*   **Splits**:
-
-    Split        | Examples
-    :----------- | ------------:
-    'train'      | 1,063,805,630
-    'validation' | 1,065,028
 
 ## c4/en.realnewslike
 
 *   **Config description**: Filters from the default config to only include
     content from the domains used in the 'RealNews' dataset (Zellers et al.,
     2019).
-*   **Splits**:
-
-    Split        | Examples
-    :----------- | ---------:
-    'train'      | 13,659,362
-    'validation' | 13,727
 
 ## c4/en.webtextlike
 
 *   **Config description**: Filters from the default config to only include
     content from the URLs in OpenWebText
     (https://github.com/jcpeterson/openwebtext).
-*   **Splits**:
-
-    Split        | Examples
-    :----------- | --------:
-    'train'      | 4,441,108
-    'validation' | 4,417
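The c4 page above says the source data must be placed manually into `download_config.manual_dir` before building. A small sketch of a pre-flight check one might run first; `missing_manual_files` is a hypothetical helper (not part of TFDS), and the filename shown is a placeholder since the real names depend on which Common Crawl dumps you fetch:

```python
import pathlib
import tempfile

def missing_manual_files(manual_dir: str, required_files: list) -> list:
    """Return the required files not yet present in manual_dir (hypothetical helper)."""
    root = pathlib.Path(manual_dir)
    return [name for name in required_files if not (root / name).exists()]

# Demo with a throwaway directory; 'example.warc.wet.gz' is a placeholder name.
manual_dir = tempfile.mkdtemp()
print(missing_manual_files(manual_dir, ["example.warc.wet.gz"]))  # ['example.warc.wet.gz']
```

Running such a check before `tfds.load` avoids a partial build that fails midway through the (very large) Beam pipeline.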

docs/catalog/opinosis.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+<div itemscope itemtype="http://schema.org/Dataset">
+  <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
+    <meta itemprop="name" content="TensorFlow Datasets" />
+  </div>
+
+  <meta itemprop="name" content="opinosis" />
+  <meta itemprop="description" content="&#10;The Opinosis Opinion Dataset consists of sentences extracted from reviews for 51 topics.&#10;Topics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;opinosis&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10; print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;informations on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
+  <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/opinosis" />
+  <meta itemprop="sameAs" content="http://kavita-ganesan.com/opinosis/" />
+  <meta itemprop="citation" content="&#10;@inproceedings{ganesan2010opinosis,&#10; title={Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions},&#10; author={Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei},&#10; booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},&#10; pages={340--348},&#10; year={2010},&#10; organization={Association for Computational Linguistics}&#10;}&#10;" />
+</div>
+
+# `opinosis`
+
+*   **Description**:
+
+The Opinosis Opinion Dataset consists of sentences extracted from reviews for 51
+topics. Topics and opinions are obtained from Tripadvisor, Edmunds.com and
+Amazon.com.
+
+*   **Homepage**:
+    [http://kavita-ganesan.com/opinosis/](http://kavita-ganesan.com/opinosis/)
+*   **Source code**:
+    [`tfds.summarization.opinosis.Opinosis`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/summarization/opinosis.py)
+*   **Versions**:
+    *   **`1.0.0`** (default): No release notes.
+*   **Download size**: `739.65 KiB`
+*   **Dataset size**: `725.45 KiB`
+*   **Auto-cached**
+    ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
+    Yes
+*   **Splits**:
+
+    Split   | Examples
+    :------ | -------:
+    'train' | 51
+
+*   **Features**:
+
+```python
+FeaturesDict({
+    'review_sents': Text(shape=(), dtype=tf.string),
+    'summaries': Sequence(Text(shape=(), dtype=tf.string)),
+})
+```
+
+*   **Supervised keys** (See
+    [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)):
+    `('review_sents', 'summaries')`
+*   **Citation**:
+
+```
+@inproceedings{ganesan2010opinosis,
+  title={Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions},
+  author={Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei},
+  booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},
+  pages={340--348},
+  year={2010},
+  organization={Association for Computational Linguistics}
+}
+```
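The new page lists the supervised keys as `('review_sents', 'summaries')`, meaning `as_supervised=True` yields (input, target) pairs in that order. A minimal sketch of that mapping on a plain feature dict; the sample record is invented for illustration and `to_supervised` is a hypothetical stand-in for what TFDS does internally:

```python
# Supervised keys from the opinosis catalog page above.
SUPERVISED_KEYS = ("review_sents", "summaries")

def to_supervised(example: dict) -> tuple:
    """Map a feature dict to an (input, target) pair, as as_supervised=True would."""
    inp, target = SUPERVISED_KEYS
    return example[inp], example[target]

# Invented sample record matching the documented feature structure.
record = {
    "review_sents": "The battery life is great. It charges quickly.",
    "summaries": ["Great battery life."],
}
x, y = to_supervised(record)
print(x)  # the concatenated review sentences (model input)
print(y)  # the list of human summaries (target)
```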

docs/catalog/overview.md

Lines changed: 1 addition & 0 deletions
@@ -139,6 +139,7 @@ np_datasets = tfds.as_numpy(datasets)
 *   [`gigaword`](gigaword.md)
 *   [`multi_news`](multi_news.md)
 *   [`newsroom`](newsroom.md)
+*   [`opinosis`](opinosis.md)
 *   [`reddit_tifu`](reddit_tifu.md)
 *   [`scientific_papers`](scientific_papers.md)
 *   [`wikihow`](wikihow.md)

docs/catalog/qa4mre.md

Lines changed: 5 additions & 3 deletions
@@ -2,12 +2,14 @@
 <div itemscope itemprop="includedInDataCatalog" itemtype="http://schema.org/DataCatalog">
   <meta itemprop="name" content="TensorFlow Datasets" />
 </div>
+
 <meta itemprop="name" content="qa4mre" />
 <meta itemprop="description" content="&#10;QA4MRE dataset was created for the CLEF 2011/2012/2013 shared tasks to promote research in &#10;question answering and reading comprehension. The dataset contains a supporting &#10;passage and a set of questions corresponding to the passage. Multiple options &#10;for answers are provided for each question, of which only one is correct. The &#10;training and test datasets are available for the main track.&#10;Additional gold standard documents are available for two pilot studies: one on &#10;alzheimers data, and the other on entrance exams data.&#10;&#10;&#10;To use this dataset:&#10;&#10;```python&#10;import tensorflow_datasets as tfds&#10;&#10;ds = tfds.load(&#x27;qa4mre&#x27;, split=&#x27;train&#x27;)&#10;for ex in ds.take(4):&#10; print(ex)&#10;```&#10;&#10;See [the guide](https://www.tensorflow.org/datasets/overview) for more&#10;informations on [tensorflow_datasets](https://www.tensorflow.org/datasets).&#10;&#10;" />
 <meta itemprop="url" content="https://www.tensorflow.org/datasets/catalog/qa4mre" />
 <meta itemprop="sameAs" content="http://nlp.uned.es/clef-qa/repository/pastCampaigns.php" />
-<meta itemprop="citation" content="&#10;@InProceedings{10.1007/978-3-642-40802-1_29,&#10;author=&quot;Pe{\~{n}}as, Anselmo&#10;and Hovy, Eduard&#10;and Forner, Pamela&#10;and Rodrigo, {&#x27;A}lvaro&#10;and Sutcliffe, Richard&#10;and Morante, Roser&quot;,&#10;editor=&quot;Forner, Pamela&#10;and M{&quot;u}ller, Henning&#10;and Paredes, Roberto&#10;and Rosso, Paolo&#10;and Stein, Benno&quot;,&#10;title=&quot;QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation&quot;,&#10;booktitle=&quot;Information Access Evaluation. Multilinguality, Multimodality, and Visualization&quot;,&#10;year=&quot;2013&quot;,&#10;publisher=&quot;Springer Berlin Heidelberg&quot;,&#10;address=&quot;Berlin, Heidelberg&quot;,&#10;pages=&quot;303--320&quot;,&#10;abstract=&quot;This paper describes the methodology for testing the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the attempt of the QA4MRE challenge which was run as a Lab at CLEF 2011--2013. The traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts and in which systems were required to choose one answer, by analysing the corresponding test document in conjunction with background text collections provided by the organization. Four different tasks have been organized during these years: Main Task, Processing Modality and Negation for Machine Reading, Machine Reading of Biomedical Texts about Alzheimer&#x27;s disease, and Entrance Exams. This paper describes their motivation, their goals, their methodology for preparing the data sets, their background collections, their metrics used for the evaluation, and the lessons learned along these three years.&quot;,&#10;isbn=&quot;978-3-642-40802-1&quot;&#10;}&#10;" />
+<meta itemprop="citation" content="&#10;@InProceedings{10.1007/978-3-642-40802-1_29,&#10;author=&quot;Pe{\~{n}}as, Anselmo&#10;and Hovy, Eduard&#10;and Forner, Pamela&#10;and Rodrigo, {\&#x27;A}lvaro&#10;and Sutcliffe, Richard&#10;and Morante, Roser&quot;,&#10;editor=&quot;Forner, Pamela&#10;and M{\&quot;u}ller, Henning&#10;and Paredes, Roberto&#10;and Rosso, Paolo&#10;and Stein, Benno&quot;,&#10;title=&quot;QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation&quot;,&#10;booktitle=&quot;Information Access Evaluation. Multilinguality, Multimodality, and Visualization&quot;,&#10;year=&quot;2013&quot;,&#10;publisher=&quot;Springer Berlin Heidelberg&quot;,&#10;address=&quot;Berlin, Heidelberg&quot;,&#10;pages=&quot;303--320&quot;,&#10;abstract=&quot;This paper describes the methodology for testing the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the attempt of the QA4MRE challenge which was run as a Lab at CLEF 2011--2013. The traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts and in which systems were required to choose one answer, by analysing the corresponding test document in conjunction with background text collections provided by the organization. Four different tasks have been organized during these years: Main Task, Processing Modality and Negation for Machine Reading, Machine Reading of Biomedical Texts about Alzheimer&#x27;s disease, and Entrance Exams. This paper describes their motivation, their goals, their methodology for preparing the data sets, their background collections, their metrics used for the evaluation, and the lessons learned along these three years.&quot;,&#10;isbn=&quot;978-3-642-40802-1&quot;&#10;}&#10;" />
 </div>
+
 # `qa4mre`
 
 *   **Description**:
@@ -60,11 +62,11 @@ FeaturesDict({
 author="Pe{\~{n}}as, Anselmo
 and Hovy, Eduard
 and Forner, Pamela
-and Rodrigo, {'A}lvaro
+and Rodrigo, {\'A}lvaro
 and Sutcliffe, Richard
 and Morante, Roser",
 editor="Forner, Pamela
-and M{"u}ller, Henning
+and M{\"u}ller, Henning
 and Paredes, Roberto
 and Rosso, Paolo
 and Stein, Benno",

tensorflow_datasets/testing/metadata/missing.txt

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@
 # This is used for reference and debugging.
 bigearthnet/all/0.0.2
 bigearthnet/rgb/0.0.2
+c4/en.noclean/2.2.1
+c4/en.realnewslike/2.2.1
+c4/en.webtextlike/2.2.1
+c4/en/2.2.1
 cityscapes/semantic_segmentation/1.0.0
 cityscapes/semantic_segmentation_extra/1.0.0
 cityscapes/stereo_disparity/1.0.0
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+{
+  "citation": "\n@inproceedings{ganesan2010opinosis,\n title={Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions},\n author={Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei},\n booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},\n pages={340--348},\n year={2010},\n organization={Association for Computational Linguistics}\n}\n",
+  "description": "\nThe Opinosis Opinion Dataset consists of sentences extracted from reviews for 51 topics.\nTopics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.\n",
+  "downloadSize": "757398",
+  "location": {
+    "urls": [
+      "http://kavita-ganesan.com/opinosis/"
+    ]
+  },
+  "name": "opinosis",
+  "schema": {
+    "feature": [
+      {
+        "name": "review_sents",
+        "type": "BYTES"
+      },
+      {
+        "name": "summaries",
+        "shape": {
+          "dim": [
+            {
+              "size": "-1"
+            }
+          ]
+        },
+        "type": "BYTES"
+      }
+    ]
+  },
+  "splits": [
+    {
+      "name": "train",
+      "numBytes": "742862",
+      "numShards": "1",
+      "shardLengths": [
+        "51"
+      ],
+      "statistics": {
+        "features": [
+          {
+            "bytesStats": {
+              "commonStats": {
+                "numNonMissing": "51"
+              }
+            },
+            "name": "review_sents",
+            "type": "BYTES"
+          },
+          {
+            "bytesStats": {
+              "commonStats": {
+                "numNonMissing": "51"
+              }
+            },
+            "name": "summaries",
+            "type": "BYTES"
+          }
+        ],
+        "numExamples": "51"
+      }
+    }
+  ],
+  "supervisedKeys": {
+    "input": "review_sents",
+    "output": "summaries"
+  },
+  "version": "1.0.0"
+}
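The dataset_info.json added above is plain JSON, so its split metadata can be inspected with nothing beyond the stdlib. A sketch using an abbreviated copy of the file (only the fields needed here), checking the invariant that shard lengths sum to the split's example count:

```python
import json

# Abbreviated copy of the opinosis dataset_info.json above, for illustration.
INFO = json.loads("""
{
  "name": "opinosis",
  "downloadSize": "757398",
  "splits": [
    {
      "name": "train",
      "numBytes": "742862",
      "numShards": "1",
      "shardLengths": ["51"],
      "statistics": {"numExamples": "51"}
    }
  ],
  "supervisedKeys": {"input": "review_sents", "output": "summaries"},
  "version": "1.0.0"
}
""")

# Shard lengths must sum to the split's example count.
train = INFO["splits"][0]
expected = int(train["statistics"]["numExamples"])
assert sum(int(n) for n in train["shardLengths"]) == expected
print(INFO["name"], train["name"], expected)  # opinosis train 51
```

Note that numeric fields are stored as strings (a protobuf JSON convention), hence the `int(...)` conversions.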

tensorflow_datasets/testing/metadata/supported.txt

Lines changed: 1 addition & 0 deletions
@@ -761,6 +761,7 @@ open_images_v4/300k/0.2.1
 open_images_v4/300k/2.0.0
 open_images_v4/original/0.2.0
 open_images_v4/original/2.0.0
+opinosis/1.0.0
 oxford_flowers102/0.0.1
 oxford_flowers102/2.0.0
 oxford_iiit_pet/3.1.0
