Commit a577cce

Updated user guide.
1 parent 07f204b commit a577cce

File tree

6 files changed: +67 -49 lines changed

docs/source/user_guide/data_transformation/data_transformation.rst

Lines changed: 8 additions & 5 deletions

@@ -3,7 +3,7 @@
 Transform Data
 ##############

-When datasets are loaded with DatasetFactory, they can be transformed and manipulated easily with the built-in functions. Underlying, an ``ADSDataset`` object is a Pandas dataframe. Any operation that can be performed to a `Pandas dataframe <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ can also be applied to an ADS Dataset.
+When datasets are loaded, they can be transformed and manipulated easily with the built-in functions. Underlying, an ``ADSDataset`` object is a Pandas dataframe. Any operation that can be performed to a `Pandas dataframe <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ can also be applied to an ADS Dataset.

 Loading the Dataset
 ********************
@@ -12,9 +12,9 @@ You can load a ``pandas`` dataframe into an ``ADSDataset`` by calling.

 .. code-block:: python3

-    from ads.dataset.factory import DatasetFactory
+    from ads.dataset.dataset import ADSDataset

-    ds = DatasetFactory.from_dataframe(df)
+    ds = ADSDataset.from_dataframe(df)


 Automated Transformations
@@ -513,11 +513,14 @@ The resulting three data subsets each have separate data (X) and labels (y).
     print(train.X) # print out all features in train dataset
     print(train.y) # print out labels in train dataset

-You can split the dataset right after the ``DatasetFactory.open()`` statement:
+You can split the dataset right after the ``ADSDatasetWithTarget.from_dataframe()`` statement:

 .. code-block:: python3

-    ds = DatasetFactory.open("path/data.csv").set_target('target')
+    ds = ADSDatasetWithTarget.from_dataframe(
+        df=pd.read_csv("path/data.csv"),
+        target="target"
+    )
     train, test = ds.train_test_split(test_size=0.25)

 Text Data
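Taken together, the new split flow shown in these hunks would look roughly like the sketch below. It assumes a recent oracle-ads release in which ``ADSDatasetWithTarget.from_dataframe`` replaces ``DatasetFactory.open(...).set_target(...)``, and a hypothetical ``path/data.csv`` containing a ``target`` column; the sketch itself is not part of the commit.

.. code-block:: python3

    import pandas as pd
    from ads.dataset.dataset_with_target import ADSDatasetWithTarget

    # Build a supervised dataset from a local CSV (hypothetical path and target column).
    ds = ADSDatasetWithTarget.from_dataframe(
        df=pd.read_csv("path/data.csv"),
        target="target",
    )

    # Split into train and test subsets; features (X) and labels (y) stay paired.
    train, test = ds.train_test_split(test_size=0.25)
    print(train.X.shape, train.y.shape)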

docs/source/user_guide/loading_data/connect.rst

Lines changed: 6 additions & 6 deletions

@@ -526,34 +526,34 @@ To load a dataframe from a remote web server source, use ``pandas`` directly and
 Convert Pandas DataFrame to ``ADSDataset``
 ==========================================

-To convert a Pandas dataframe to ``ADSDataset``, pass the ``pandas.DataFrame`` object directly into the ADS ``DatasetFactory.open`` method:
+To convert a Pandas dataframe to ``ADSDataset``, pass the ``pandas.DataFrame`` object directly into the ADS ``ADSDataset`` constructor or ``ADSDataset.from_dataframe()`` method:

 .. code-block:: python3

     import pandas as pd
-    from ads.dataset.factory import DatasetFactory
+    from ads.dataset.dataset import ADSDataset

     df = pd.read_csv('/path/some_data.csv') # load data with Pandas

     # use open...

-    ds = DatasetFactory.open(df) # construct **ADS** Dataset from DataFrame
+    ds = ADSDataset(df) # construct **ADS** Dataset from DataFrame

     # alternative form...

-    ds = DatasetFactory.from_dataframe(df)
+    ds = ADSDataset.from_dataframe(df)

     # an example using Pandas to parse data on the clipboard as a CSV and construct an ADS Dataset object
     # this allows easily transferring data from an application like Microsoft Excel, Apple Numbers, etc.

-    ds = DatasetFactory.from_dataframe(pd.read_clipboard())
+    ds = ADSDataset.from_dataframe(pd.read_clipboard())

     # use Pandas to query a SQL database:

     from sqlalchemy import create_engine
     engine = create_engine('dialect://user:pass@host:port/schema', echo=False)
     df = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')
-    ds = DatasetFactory.from_dataframe(df)
+    ds = ADSDataset.from_dataframe(df)


 Using ``PyArrow``
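As a quick check of the replacement calls in this file, here is a minimal sketch with an in-memory frame (assuming a recent oracle-ads release; no storage credentials are involved):

.. code-block:: python3

    import pandas as pd
    from ads.dataset.dataset import ADSDataset

    df = pd.DataFrame({"c1": [1, 2, 3], "c2": ["a", "b", "c"]})

    # The constructor and the factory method shown in the diff are interchangeable here.
    ds_a = ADSDataset(df)
    ds_b = ADSDataset.from_dataframe(df)

    # Both keep the Pandas-like interface described in the quickstart.
    print(ds_a.head())
    print(ds_b.head())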

docs/source/user_guide/loading_data/connect_legacy.rst

Lines changed: 35 additions & 23 deletions

@@ -1,5 +1,5 @@
-Connect with ``DatasetFactory``
-*******************************
+Connect with ``ADSDataset`` and ``ADSDatasetWithTarget``
+********************************************************


 .. admonition:: Deprecation Note |deprecated|
@@ -25,7 +25,8 @@ Begin by loading the required libraries and modules:
     import pandas as pd

     from ads.dataset.dataset_browser import DatasetBrowser
-    from ads.dataset.factory import DatasetFactory
+    from ads.dataset.dataset import ADSDataset
+    from ads.dataset.dataset_with_target import ADSDatasetWithTarget

 Object Storage
 ==============
@@ -37,14 +38,15 @@ To open a dataset from Object Storage using the resource principal method, you c
     import ads
     import os

-    from ads.dataset.factory import DatasetFactory
-
     ads.set_auth(auth='resource_principal')
     bucket_name = <bucket-name>
     file_name = <file-name>
     namespace = <namespace>
     storage_options = {'config':{}, 'tenancy': os.environ['TENANCY_OCID'], 'region': os.environ['NB_REGION']}
-    ds = DatasetFactory.open(f"oci://{bucket_name}@{namespace}/{file_name}", storage_options=storage_options)
+    ds = ADSDataset(
+        df=pd.read_csv(f"oci://{bucket_name}@{namespace}/{file_name}.csv"),
+        storage_options=storage_options
+    )


 To open a dataset from Object Storage using the Oracle Cloud Infrastructure configuration file method, include the location of the file using this format ``oci://<bucket_name>@<namespace>/<file_name>`` and modify the optional parameter ``storage_options``. Insert:
@@ -56,19 +58,22 @@ For example:

 .. code-block:: python3

-    ds = DatasetFactory.open("oci://<bucket_name>@<namespace>/<file_name>", storage_options = {
+    ds = ADSDataset(
+        df=pd.read_csv(f"oci://{bucket_name}@{namespace}/{file_name}.csv"),
+        storage_options={
            "config": "~/.oci/config",
            "profile": "DEFAULT"
-    })
+        }
+    )

 Local Storage
 =============

-To open a dataset from a local source, use ``DatasetFactory.open`` and specify the path of the data file:
+To open a dataset from a local source, use ``ADSDataset`` and specify the path of the data file:

 .. code-block:: python3

-    ds = DatasetFactory.open("/path/to/data.data", format='csv', delimiter=" ")
+    ds = ADSDataset(df=pd.read_csv("/path/to/data.csv"))

 Oracle Database
 ---------------
@@ -122,9 +127,11 @@ You can also use ``cx_Oracle`` within ADS by creating a connection string:
 .. code-block:: python3

     os.environ['TNS_ADMIN'] = creds['tns_admin']
-    from ads.dataset.factory import DatasetFactory
+    from ads.dataset.dataset import ADSDataset
     uri = 'oracle+cx_oracle://' + creds['user'] + ':' + creds['password'] + '@' + creds['sid']
-    ds = DatasetFactory.open(uri, format="sql", table=table, index_col=index_col)
+    ds = ADSDataset(
+        df=pd.read_sql(uri, table=table, index_col=index_col)
+    )

 Autonomous Database
 ===================
@@ -148,13 +155,13 @@ After you have stored the ADB username, password, and database name (SID) as var

     uri = 'oracle+cx_oracle://' + creds['user'] + ':' + creds['password'] + '@' + creds['sid']

-You can use ADS to query a table from your database, and then load that table as an ``ADSDataset`` object through ``DatasetFactory``.
-When you open ``DatasetFactory``, specify the name of the table you want to pull using the ``table`` variable for a given table. For SQL expressions, use the table parameter also. For example, *(`table="SELECT * FROM sh.times WHERE rownum <= 30"`)*.
+You can use ADS to query a table from your database, and then load that table as an ``ADSDatasetWithTarget`` object.
+When you open ``ADSDatasetWithTarget``, specify the name of the table you want to pull using the ``table`` variable for a given table. For SQL expressions, use the table parameter also. For example, *(`table="SELECT * FROM sh.times WHERE rownum <= 30"`)*.

 .. code-block:: python3

     os.environ['TNS_ADMIN'] = creds['tns_admin']
-    ds = DatasetFactory.open(uri, format="sql", table=table, target='label')
+    ds = ADSDatasetWithTarget(df=pd.read_sql(uri, table=table), target='label')

 Query ADB
 ---------
@@ -172,11 +179,11 @@ Query ADB
     engine = create_engine(uri)
     df = pd.read_sql('SELECT * from <TABLENAME>', con=engine)

-You can convert the ``pd.DataFrame`` into ``ADSDataset`` using the ``DatasetFactory.from_dataframe()`` function.
+You can convert the ``pd.DataFrame`` into ``ADSDataset`` using the ``ADSDataset.from_dataframe()`` function.

 .. code-block:: python3

-    ds = DatasetFactory.from_dataframe(df)
+    ds = ADSDataset.from_dataframe(df)

 These two examples run a simple query on ADW data. With ``read_sql_query`` you can use SQL expressions not just for tables, but also to limit the number of rows and to apply conditions with filters, such as (``where``).

@@ -207,7 +214,7 @@ You can also query data from ADW using cx_Oracle. Use the cx_Oracle 7.0.0 versio
     data = results.fetchall()
     df = pd.DataFrame(np.array(data))

-    ds = DatasetFactory.from_dataframe(df)
+    ds = ADSDataset.from_dataframe(df)

 .. code-block:: python3

@@ -230,7 +237,7 @@ This example adds predictions programmatically using cx_Oracle. It uses ``execut

 .. code-block:: python3

-    ds = DatasetFactory.open("iris.csv")
+    ds = ADSDataset(pd.read_csv("iris.csv"))

     create_table = '''CREATE TABLE IRIS_PREDICTED (,
         sepal_length number,
@@ -269,24 +276,29 @@ You can open Amazon S3 public or private files in ADS. For private files, you mu

 .. code-block:: python3

-    ds = DatasetFactory.open("s3://bucket_name/iris.csv", storage_options = {
+    ds = ADSDataset(
+        df=pd.read_csv("s3://bucket_name/iris.csv"),
+        storage_options = {
         'key': 'aws key',
         'secret': 'aws secret',
         'blocksize': 1000000,
         'client_kwargs': {
-            "endpoint_url": "https://s3-us-west-1.amazonaws.com"
+                "endpoint_url": "https://s3-us-west-1.amazonaws.com"
         }
     })


 HTTP(S) Sources
 ===============

-To open a dataset from a remote web server source, use ``DatasetFactory.open()`` and specify the URL of the data:
+To open a dataset from a remote web server source, use ``ADSDatasetWithTarget`` and specify the URL of the data:

 .. code-block:: python3

-    ds = DatasetFactory.open('https://example.com/path/to/data.csv', target='label')
+    ds = ADSDatasetWithTarget(
+        df=pd.read_csv('https://example.com/path/to/data.csv'),
+        target='label'
+    )


 ``DatasetBrowser``
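One caveat on the Object Storage and S3 hunks above: when the object itself sits behind authenticated storage, ``pandas.read_csv`` also needs the credentials, which it accepts through its own ``storage_options`` argument (forwarded to ``ocifs`` or ``s3fs``). A sketch of that variant, offered as an assumption rather than as part of this commit:

.. code-block:: python3

    import os

    import ads
    import pandas as pd
    from ads.dataset.dataset import ADSDataset

    ads.set_auth(auth='resource_principal')

    bucket_name = "<bucket-name>"
    file_name = "<file-name>"
    namespace = "<namespace>"
    storage_options = {'config': {}, 'tenancy': os.environ['TENANCY_OCID'], 'region': os.environ['NB_REGION']}

    # Let pandas authenticate the read, then wrap the resulting frame.
    df = pd.read_csv(
        f"oci://{bucket_name}@{namespace}/{file_name}.csv",
        storage_options=storage_options,
    )
    ds = ADSDataset(df)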

docs/source/user_guide/loading_data/supported_format.rst

Lines changed: 2 additions & 2 deletions

@@ -3,11 +3,11 @@ Supported Formats

 You can load datasets into ADS, either locally or from network file systems.

-You can open datasets with ``DatasetFactory``, ``DatasetBrowser`` or ``pandas``. ``DatasetFactory`` allows datasets to be loaded into ADS.
+You can open datasets with ``DatasetBrowser`` or ``pandas``.

 ``DatasetBrowser`` supports opening the datasets from web sites and libraries, such as scikit-learn directly into ADS.

-When you open a dataset in ``DatasetFactory``, you can get the summary statistics, correlations, and visualizations of the dataset.
+When you load a dataset in ``ADSDataset`` from ``pandas.DataFrame``, you can get the summary statistics, correlations, and visualizations of the dataset.

 ADS Supports:

docs/source/user_guide/model_catalog/model_catalog.rst

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,14 +17,15 @@ provenance, reproduced, and deployed.
1717
import os
1818
import tempfile
1919
import warnings
20+
import pandas as pd
2021
2122
from ads.catalog.model import ModelCatalog
2223
from ads.common.model import ADSModel
2324
from ads.common.model_export_util import prepare_generic_model
2425
from ads.common.model_metadata import (MetadataCustomCategory,
2526
UseCaseType,
2627
Framework)
27-
from ads.dataset.factory import DatasetFactory
28+
from ads.dataset.dataset_with_target import ADSDatasetWithTarget
2829
from ads.feature_engineering.schema import Expression, Schema
2930
from os import path
3031
from sklearn.ensemble import RandomForestClassifier
@@ -97,7 +98,7 @@ The ``RandomForestClassifier`` object is converted to into an ``ADSModel`` using
9798
# Load the dataset
9899
ds_path = path.join("/", "opt", "notebooks", "ads-examples", "oracle_data", "oracle_classification_dataset1_150K.csv")
99100
100-
ds = DatasetFactory.open(ds_path, target="class")
101+
ds = ADSDatasetWithTarget(df=pd.read_csv(ds_path), target="class")
101102
102103
# Data preprocessing
103104
transformed_ds = ds.auto_transform(fix_imbalance=False)
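For reference, the updated loading step combined with the preprocessing already in this guide would read roughly as follows; the trailing ``train_test_split`` is my extrapolation from the data-transformation page and is not shown in this hunk:

.. code-block:: python3

    from os import path

    import pandas as pd
    from ads.dataset.dataset_with_target import ADSDatasetWithTarget

    # Example CSV path used throughout the model catalog guide.
    ds_path = path.join("/", "opt", "notebooks", "ads-examples", "oracle_data",
                        "oracle_classification_dataset1_150K.csv")

    ds = ADSDatasetWithTarget(df=pd.read_csv(ds_path), target="class")

    # Data preprocessing, then a train/test split on the transformed dataset.
    transformed_ds = ds.auto_transform(fix_imbalance=False)
    train, test = transformed_ds.train_test_split(test_size=0.25)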

docs/source/user_guide/quickstart/quickstart.rst

Lines changed: 13 additions & 11 deletions

@@ -59,9 +59,9 @@ variable during modeling. The type of this target determines what type of modeli
 to use (regression, binary, and multi-class classification, or time series forecasting).

 There are several ways to turn data into an ``ADSDataset``. The simplest way is to
-use `DatasetFactory`, which takes as its first argument as a string URI or a
-``Pandas Dataframe`` object. The URI supports many formats, such as Object Storage
-or S3 files. The
+use the `ADSDataset` or `ADSDatasetWithTarget` constructor, which takes a
+``Pandas Dataframe`` object as its first argument. The ``Pandas Dataframe`` can be loaded
+from many sources, such as Object Storage or S3 files. The
 `class documentation <https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/modules.html>_` describes all classes.

 For example:
@@ -77,12 +77,12 @@ For example:
     df = pd.DataFrame(data.data, columns=data.feature_names)
     df["species"] = data.target

-    from ads.dataset.factory import DatasetFactory
+    from ads.dataset.dataset_with_target import ADSDatasetWithTarget

     # these two are equivalent:
-    ds = DatasetFactory.open(df, target="species")
+    ds = ADSDatasetWithTarget(df, target="species")
     # OR
-    ds = DatasetFactory.from_dataframe(df, target="species")
+    ds = ADSDatasetWithTarget.from_dataframe(df, target="species")

 The ``ds`` (``ADSDataset``) object is ``Pandas`` like. For example, you can use ``ds.head()``. It's
 an encapsulation of a `Pandas` Dataframe with immutability. Any attempt to
@@ -93,7 +93,7 @@ modify the data yields a new copy-on-write of the ``ADSDataset``.
 to memory. ADS also samples the dataset for visualization purposes, computes
 co-correlation of the columns in the dataset, and performs type discovery on the
 different columns in the dataset. That is why loading a dataset with
-``DatasetFactory`` can be slower than simply reading the same dataset
+``ADSDataset`` can be slower than simply reading the same dataset
 with ``Pandas``. In return, you get the added data visualizations and data
 profiling benefits of the ``ADSDataset`` object.

@@ -113,10 +113,12 @@ modify the data yields a new copy-on-write of the ``ADSDataset``.

     pd.DataFrame({'c1':[1,2,3], 'target': ['yes', 'no', 'yes']}).to_csv('Users/ysz/data/sample.csv')

-    ds = DatasetFactory.open('Users/ysz/data/sample.csv',
-                             target = 'target',
-                             type_discovery = False, # turn off ADS type discovery
-                             types = {'target': 'category'}) # specify target type
+    ds = ADSDatasetWithTarget(
+        df=pd.read_csv('Users/ysz/data/sample.csv'),
+        target='target',
+        type_discovery=False, # turn off ADS type discovery
+        types={'target': 'category'} # specify target type
+    )

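Finally, a runnable version of the quickstart example above; it assumes the sample frame is built from scikit-learn's iris data as in the surrounding guide, while the ``type_discovery`` and ``types`` arguments come straight from the hunk:

.. code-block:: python3

    import pandas as pd
    from sklearn.datasets import load_iris
    from ads.dataset.dataset_with_target import ADSDatasetWithTarget

    # Sample data; the guide builds a similar frame from the iris dataset.
    data = load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df["species"] = data.target

    ds = ADSDatasetWithTarget(
        df=df,
        target="species",
        type_discovery=False,           # turn off ADS type discovery
        types={"species": "category"},  # declare the target type explicitly
    )
    print(ds.head())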