Commit 90f2e30

Update documentation for feature store (#253)
2 parents ca61db6 + 5b00331 commit 90f2e30

File tree

8 files changed: +199 −48 lines changed

ads/feature_store/docs/source/dataset.rst

Lines changed: 40 additions & 15 deletions
@@ -124,41 +124,60 @@ With a Dataset instance, we can get the last dataset job details using ``get_las
     df = dataset_job.get_validation_output().to_dataframe()
     df.show()
 
-
 Save expectation entity
 =======================
+Feature store lets you define expectations on the data being materialized into a dataset. With a ``Dataset`` instance, you can save the expectation entity using ``save_expectation()``.
 
-With a Dataset instance, we can save the expectation entity using ``save_expectation()``
-
-.. note::
-
-   Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
 
 .. image:: figures/validation.png
 
-
 The ``.save_expectation()`` method takes the following optional parameters:
 
-- ``expectation_suite: ExpectationSuite``. Expectation suite of great expectation
+- ``expectation: Expectation``. Expectation of Great Expectations
 - ``expectation_type: ExpectationType``. Type of expectation
 
   - ``ExpectationType.STRICT``: Fail the job if the expectation is not met
   - ``ExpectationType.LENIENT``: Pass the job even if the expectation is not met
 
 .. code-block:: python3
 
     dataset.save_expectation(expectation_suite, expectation_type="STRICT")
+
+.. seealso::
+
+   :ref:`Feature Validation`
+
+Statistics Computation
+======================
+During materialization, feature store computes statistical metrics for all the features by default. This can be configured using the ``StatisticsConfig`` object, which can be passed when the dataset is created or updated later.
+
+.. code-block:: python3
+
+    # Define statistics configuration for selected features
+    stats_config = StatisticsConfig().with_is_enabled(True).with_columns(["column1", "column2"])
+
 
+This can be used with a dataset instance.
 
-Statistics Results
-==================
-You can call the ``get_statistics()`` method of the Dataset instance to fetch feature statistics results of a dataset job.
+.. code-block:: python3
 
-.. note::
+    from ads.feature_store.dataset import Dataset
 
-   PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
+    dataset = (
+        Dataset()
+        .with_name("<dataset_name>")
+        .with_entity_id(<entity_id>)
+        .with_feature_store_id("<feature_store_id>")
+        .with_description("<dataset_description>")
+        .with_compartment_id("<compartment_id>")
+        .with_dataset_ingestion_mode(DatasetIngestionMode.SQL)
+        .with_query('SELECT col FROM <entity_id>.<feature_group_name>')
+        .with_statistics_config(stats_config)
+    )
 
+You can call the ``get_statistics()`` method of the dataset to fetch metrics for a specific ingestion job.
 
-The ``.get_statistics()`` method takes the following optional parameter:
+The ``get_statistics()`` method takes the following optional parameter:
 
 - ``job_id: string``. Id of dataset job
 
@@ -167,6 +186,12 @@ The ``.get_statistics()`` method takes the following optional parameter:
     # Fetch stats results for a dataset job
     df = dataset.get_statistics(job_id).to_pandas()
 
+.. image:: figures/stats_1.png
+
+.. seealso::
+
+   :ref:`Statistics`
+
 
 Get features
 ============
ads/feature_store/docs/source/feature_group.rst

Lines changed: 45 additions & 30 deletions
@@ -152,49 +152,60 @@ Feature store provides an API similar to Pandas to join feature groups together
 
 Save expectation entity
 =======================
-With a ``FeatureGroup`` instance, You can save the expectation details using ``with_expectation_suite()`` with parameters
+Feature store lets you define expectations on the data being materialized into a feature group. With a ``FeatureGroup`` instance, you can save the expectation entity using ``save_expectation()``.
 
-- ``expectation_suite: ExpectationSuite``. ExpectationSuit of great expectation
+
+.. image:: figures/validation.png
+
+The ``.save_expectation()`` method takes the following optional parameters:
+
+- ``expectation: Expectation``. Expectation of Great Expectations
 - ``expectation_type: ExpectationType``. Type of expectation
 
   - ``ExpectationType.STRICT``: Fail the job if the expectation is not met
   - ``ExpectationType.LENIENT``: Pass the job even if the expectation is not met
 
-.. note::
+.. code-block:: python3
 
-   Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
+    feature_group.save_expectation(expectation_suite, expectation_type="STRICT")
+
+.. seealso::
+
+   :ref:`Feature Validation`
 
-.. image:: figures/validation.png
+
+Statistics Computation
+======================
+During materialization, feature store computes statistical metrics for all the features by default. This can be configured using the ``StatisticsConfig`` object, which can be passed when the feature group is created or updated later.
 
 .. code-block:: python3
 
-    expectation_suite = ExpectationSuite(
-        expectation_suite_name="expectation_suite_name"
-    )
-    expectation_suite.add_expectation(
-        ExpectationConfiguration(
-            expectation_type="expect_column_values_to_not_be_null",
-            kwargs={"column": "<column>"},
-        )
+    # Define statistics configuration for selected features
+    stats_config = StatisticsConfig().with_is_enabled(True).with_columns(["column1", "column2"])
 
-    feature_group_resource = (
-        FeatureGroup()
-        .with_feature_store_id(feature_store.id)
-        .with_primary_keys(["<key>"])
-        .with_name("<name>")
-        .with_entity_id(entity.id)
-        .with_compartment_id(<compartment_id>)
-        .with_schema_details_from_dataframe(<datframe>)
-        .with_expectation_suite(
-            expectation_suite=expectation_suite,
-            expectation_type=ExpectationType.STRICT,
-        )
-    )
 
-You can call the ``get_validation_output()`` method of the FeatureGroup instance to fetch validation results for a specific ingestion job.
+This can be used with a feature group instance.
+
+.. code-block:: python3
+
+    from ads.feature_store.feature_group import FeatureGroup
 
-Statistics Results
-==================
-You can call the ``get_statistics()`` method of the FeatureGroup instance to fetch statistics for a specific ingestion job.
+    feature_group_resource = (
+        FeatureGroup()
+        .with_feature_store_id(feature_store.id)
+        .with_primary_keys(["<key>"])
+        .with_name("<name>")
+        .with_entity_id(entity.id)
+        .with_compartment_id(<compartment_id>)
+        .with_schema_details_from_dataframe(<dataframe>)
+        .with_statistics_config(stats_config)
+    )
+
+You can call the ``get_statistics()`` method of the feature group to fetch metrics for a specific ingestion job.
+
+The ``get_statistics()`` method takes the following optional parameter:
+
+- ``job_id: string``. Id of feature group job
 
 .. code-block:: python3
 
@@ -203,6 +214,10 @@ You can call the ``get_statistics()`` method of the FeatureGroup instance to fet
 
 .. image:: figures/stats_1.png
 
+.. seealso::
+
+   :ref:`Statistics`
+
 Get last feature group job
 ==========================
 Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.
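The chained ``with_*`` calls shown in the blocks above follow a fluent builder pattern: each setter records a value and returns ``self`` so the next call can be chained. A minimal toy sketch of that mechanic (``FeatureGroupBuilder`` and its fields are hypothetical illustrations, not the ADS classes):

```python
class FeatureGroupBuilder:
    """Toy illustration of the fluent with_* builder pattern."""

    def __init__(self):
        self._spec = {}

    def with_name(self, name):
        self._spec["name"] = name
        return self  # returning self is what makes the calls chainable

    def with_primary_keys(self, keys):
        self._spec["primary_keys"] = list(keys)
        return self

    def with_statistics_config(self, config):
        self._spec["statistics_config"] = config
        return self


fg = (
    FeatureGroupBuilder()
    .with_name("driver_features")
    .with_primary_keys(["driver_id"])
    .with_statistics_config({"enabled": True, "columns": ["trips"]})
)
print(fg._spec["name"])  # driver_features
```

Because every setter returns the builder itself, the whole definition reads as one declarative expression, which is why the docs wrap it in parentheses.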
ads/feature_store/docs/source/feature_validation.rst

Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+.. _Feature Validation:
+
+Feature Validation
+******************
+
+Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model. This is important because inaccurate or unreliable features can lead to poor model performance.
+Feature store lets you define expectations on the data being materialized into feature groups and datasets. This is achieved using the open-source library Great Expectations.
+
+.. note::
+
+   `Great Expectations <https://docs.greatexpectations.io/docs/>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
+
+
+Expectations
+============
+An Expectation is a verifiable assertion about your data. You can define an expectation as follows:
+
+.. code-block:: python3
+
+    from great_expectations.core.expectation_configuration import ExpectationConfiguration
+
+    # Create an Expectation
+    expect_config = ExpectationConfiguration(
+        # Name of the expectation type being added
+        expectation_type="expect_table_columns_to_match_ordered_list",
+        # These are the arguments of the expectation.
+        # The keys allowed in the dictionary are the Parameters and
+        # Keyword Arguments of this Expectation Type.
+        kwargs={
+            "column_list": [
+                "column1",
+                "column2",
+                "column3",
+                "column4",
+            ]
+        },
+        # This is how you can optionally add a comment about this expectation.
+        meta={
+            "notes": {
+                "format": "markdown",
+                "content": "details about this expectation. **Markdown** `Supported`",
+            }
+        },
+    )
+
+Expectation Suite
+=================
+
+An Expectation Suite is a collection of verifiable assertions, i.e. expectations, about your data. You can define an expectation suite as follows:
+
+.. code-block:: python3
+
+    # Create an Expectation Suite
+    suite = context.add_expectation_suite(expectation_suite_name="example_suite")
+    suite.add_expectation(expect_config)
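To make the semantics of ``expect_table_columns_to_match_ordered_list`` concrete, here is a minimal plain-Python stand-in that returns a Great-Expectations-style result dictionary. The helper function itself is illustrative only, not part of the library:

```python
def expect_table_columns_to_match_ordered_list(actual_columns, column_list):
    """Succeed only if the table's columns match column_list exactly, in order."""
    success = list(actual_columns) == list(column_list)
    return {
        "success": success,
        "result": {"observed_value": list(actual_columns)},
    }


expected = ["column1", "column2", "column3", "column4"]

# Same columns, same order: the expectation is met
print(expect_table_columns_to_match_ordered_list(expected, expected)["success"])  # True

# Same columns but reordered: the expectation fails
print(expect_table_columns_to_match_ordered_list(
    ["column2", "column1", "column3", "column4"], expected)["success"])  # False
```

Note that ordering matters for this expectation type, unlike set-based column checks.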

ads/feature_store/docs/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ Welcome to oci-feature-store's documentation!
    feature_group_job
    dataset
    dataset_job
+   statistics
+   feature_validation
    demo
    notebook
    release_notes

ads/feature_store/docs/source/overview.rst

Lines changed: 5 additions & 1 deletion
@@ -15,6 +15,10 @@ Oracle feature store is a stack based solution that is deployed in the customer
 - ``Dataset``: A dataset is a collection of feature that are used together to either train a model or perform model inference.
 - ``Dataset Job``: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results.
 
+.. important::
+
+   Prerequisite: Contact #oci-feature-store_early-preview to get your tenancy whitelisted for early access to feature store.
+
 .. important::
 
    The OCI Feature Store support following versions
@@ -32,4 +36,4 @@ Oracle feature store is a stack based solution that is deployed in the customer
     * - delta-spark
       - .. image:: https://img.shields.io/badge/delta-2.0.1-blue?style=for-the-badge&logo=pypi&logoColor=white
     * - pyspark
-      - .. image:: https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white
+      - .. image:: https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white
ads/feature_store/docs/source/statistics.rst

Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+.. _Statistics:
+
+Statistics
+**********
+
+Feature store provides functionality to compute statistics for feature groups as well as datasets and persist them along with the metadata. These statistics can help you derive insights about data quality. The statistical metrics are computed at materialization time and persisted with the other metadata.
+
+.. note::
+
+   Feature store utilizes MLM Insights, a Python API that helps evaluate and monitor data across the entire ML observability lifecycle. It performs data summarization, which reduces a dataset to a set of descriptive statistics.
+
+The statistical metrics computed by feature store depend on the feature type.
+
++------------------------+-----------------------+
+| Numerical Metrics      | Categorical Metrics   |
++========================+=======================+
+| Skewness               | Count                 |
++------------------------+-----------------------+
+| StandardDeviation      | TopKFrequentElements  |
++------------------------+-----------------------+
+| Min                    | TypeMetric            |
++------------------------+-----------------------+
+| IsConstantFeature      | DuplicateCount        |
++------------------------+-----------------------+
+| IQR                    | Mode                  |
++------------------------+-----------------------+
+| Range                  | DistinctCount         |
++------------------------+-----------------------+
+| ProbabilityDistribution|                       |
++------------------------+-----------------------+
+| Variance               |                       |
++------------------------+-----------------------+
+| FrequencyDistribution  |                       |
++------------------------+-----------------------+
+| Count                  |                       |
++------------------------+-----------------------+
+| Max                    |                       |
++------------------------+-----------------------+
+| DistinctCount          |                       |
++------------------------+-----------------------+
+| Sum                    |                       |
++------------------------+-----------------------+
+| IsQuasiConstantFeature |                       |
++------------------------+-----------------------+
+| Quartiles              |                       |
++------------------------+-----------------------+
+| Mean                   |                       |
++------------------------+-----------------------+
+| Kurtosis               |                       |
++------------------------+-----------------------+
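Several of the numerical metrics in the table map directly onto Python's standard library. A minimal sketch, with an illustrative dataset and assuming the population variants of variance and standard deviation (the actual engine may use different estimators):

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative feature values

# A subset of the numerical metrics listed above, computed with the stdlib
metrics = {
    "Count": len(values),
    "Min": min(values),
    "Max": max(values),
    "Sum": sum(values),
    "Mean": statistics.mean(values),
    "StandardDeviation": statistics.pstdev(values),   # population std dev
    "Variance": statistics.pvariance(values),         # population variance
    "Range": max(values) - min(values),
    "DistinctCount": len(set(values)),
    "Mode": statistics.mode(values),
    "IsConstantFeature": len(set(values)) == 1,
}

print(metrics["Mean"], metrics["Variance"])  # 5.0 4.0
```

Metrics such as Skewness, Kurtosis, and TopKFrequentElements need more than the stdlib, but follow the same per-feature summarization idea.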

ads/feature_store/docs/source/terraform.rst

Lines changed: 2 additions & 2 deletions
@@ -95,8 +95,8 @@ Steps
     rm -f feature-store-terraform.zip \
     && wget https://objectstorage.us-ashburn-1.oraclecloud.com/p/vZogtXWwHqbkGLeqyKiqBmVxdbR4MK4nyOBqDsJNVE4sHGUY5KFi4T3mOFGA3FOy/n/idogsu2ylimg/b/oci-feature-store/o/beta/terraform/feature-store-terraform.zip \
     && oci resource-manager stack create \
-    --compartment-id <compartment-id> \
-    --config-source <path-to-downloaded-zip-file> \
+    --compartment-id <COMPARTMENT_OCID> \
+    --config-source feature-store-terraform.zip \
     --terraform-version 1.1.x \
     --variables '{
     "service_version": "<SERVICE_VERSION>",
