Commit 90f2e30

Update documentation for feature store (#253)
2 parents ca61db6 + 5b00331 commit 90f2e30

File tree

8 files changed: +199 −48 lines changed

ads/feature_store/docs/source/dataset.rst

Lines changed: 40 additions & 15 deletions
@@ -124,41 +124,60 @@ With a Dataset instance, we can get the last dataset job details using ``get_las
     df = dataset_job.get_validation_output().to_dataframe()
     df.show()
 
-
 Save expectation entity
 =======================
+Feature store lets you define expectations on the data being materialized into a dataset. With a ``Dataset`` instance, you can save the expectation entity using ``save_expectation()``.
 
-With a Dataset instance, we can save the expectation entity using ``save_expectation()``
-
-.. note::
-
-   Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
 
 .. image:: figures/validation.png
 
-
 The ``.save_expectation()`` method takes the following optional parameters:
 
-- ``expectation_suite: ExpectationSuite``. Expectation suite of great expectation
+- ``expectation: Expectation``. Expectation of Great Expectations
 - ``expectation_type: ExpectationType``. Type of expectation
 
   - ``ExpectationType.STRICT``: Fail the job if the expectation is not met
   - ``ExpectationType.LENIENT``: Pass the job even if the expectation is not met
 
 .. code-block:: python3
 
     dataset.save_expectation(expectation_suite, expectation_type="STRICT")
+
+.. seealso::
+
+   :ref:`Feature Validation`
+
+Statistics Computation
+======================
+During materialization, feature store computes statistical metrics for all the features by default. This can be configured using the ``StatisticsConfig`` object, which can be passed when the dataset is created or updated later.
+
+.. code-block:: python3
+
+    # Define statistics configuration for selected features
+    stats_config = StatisticsConfig().with_is_enabled(True).with_columns(["column1", "column2"])
+
 
+This can be used with a dataset instance.
 
-Statistics Results
-==================
-You can call the ``get_statistics()`` method of the Dataset instance to fetch feature statistics results of a dataset job.
+.. code-block:: python3
 
-.. note::
+    from ads.feature_store.dataset import Dataset
 
-   PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
+    dataset = (
+        Dataset()
+        .with_name("<dataset_name>")
+        .with_entity_id(<entity_id>)
+        .with_feature_store_id("<feature_store_id>")
+        .with_description("<dataset_description>")
+        .with_compartment_id("<compartment_id>")
+        .with_dataset_ingestion_mode(DatasetIngestionMode.SQL)
+        .with_query('SELECT col FROM <entity_id>.<feature_group_name>')
+        .with_statistics_config(stats_config)
+    )
 
+You can call the ``get_statistics()`` method of the dataset to fetch metrics for a specific ingestion job.
 
-The ``.get_statistics()`` method takes the following optional parameter:
+The ``get_statistics()`` method takes the following optional parameter:
 
 - ``job_id: string``. Id of dataset job
 
@@ -167,6 +186,12 @@ The ``.get_statistics()`` method takes the following optional parameter:
     # Fetch stats results for a dataset job
     df = dataset.get_statistics(job_id).to_pandas()
 
+.. image:: figures/stats_1.png
+
+.. seealso::
+
+   :ref:`Statistics`
+
 
 Get features
 ============
ads/feature_store/docs/source/feature_group.rst

Lines changed: 45 additions & 30 deletions
@@ -152,49 +152,60 @@ Feature store provides an API similar to Pandas to join feature groups together
 
 Save expectation entity
 =======================
-With a ``FeatureGroup`` instance, You can save the expectation details using ``with_expectation_suite()`` with parameters
+Feature store lets you define expectations on the data being materialized into a feature group. With a ``FeatureGroup`` instance, you can save the expectation entity using ``save_expectation()``.
 
-- ``expectation_suite: ExpectationSuite``. ExpectationSuit of great expectation
+
+.. image:: figures/validation.png
+
+The ``.save_expectation()`` method takes the following optional parameters:
+
+- ``expectation: Expectation``. Expectation of Great Expectations
 - ``expectation_type: ExpectationType``. Type of expectation
 
   - ``ExpectationType.STRICT``: Fail the job if the expectation is not met
   - ``ExpectationType.LENIENT``: Pass the job even if the expectation is not met
 
-.. note::
+.. code-block:: python3
 
-   Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
+    feature_group.save_expectation(expectation_suite, expectation_type="STRICT")
+
+.. seealso::
+
+   :ref:`Feature Validation`
 
-.. image:: figures/validation.png
+
+Statistics Computation
+======================
+During materialization, feature store computes statistical metrics for all the features by default. This can be configured using the ``StatisticsConfig`` object, which can be passed when the feature group is created or updated later.
 
 .. code-block:: python3
 
-    expectation_suite = ExpectationSuite(
-        expectation_suite_name="expectation_suite_name"
-    )
-    expectation_suite.add_expectation(
-        ExpectationConfiguration(
-            expectation_type="expect_column_values_to_not_be_null",
-            kwargs={"column": "<column>"},
-        )
+    # Define statistics configuration for selected features
+    stats_config = StatisticsConfig().with_is_enabled(True).with_columns(["column1", "column2"])
 
-    feature_group_resource = (
-        FeatureGroup()
-        .with_feature_store_id(feature_store.id)
-        .with_primary_keys(["<key>"])
-        .with_name("<name>")
-        .with_entity_id(entity.id)
-        .with_compartment_id(<compartment_id>)
-        .with_schema_details_from_dataframe(<datframe>)
-        .with_expectation_suite(
-            expectation_suite=expectation_suite,
-            expectation_type=ExpectationType.STRICT,
-        )
-    )
 
-You can call the ``get_validation_output()`` method of the FeatureGroup instance to fetch validation results for a specific ingestion job.
+This can be used with a feature group instance.
+
+.. code-block:: python3
+
+    from ads.feature_store.feature_group import FeatureGroup
 
-Statistics Results
-==================
-You can call the ``get_statistics()`` method of the FeatureGroup instance to fetch statistics for a specific ingestion job.
+    feature_group_resource = (
+        FeatureGroup()
+        .with_feature_store_id(feature_store.id)
+        .with_primary_keys(["<key>"])
+        .with_name("<name>")
+        .with_entity_id(entity.id)
+        .with_compartment_id(<compartment_id>)
+        .with_schema_details_from_dataframe(<dataframe>)
+        .with_statistics_config(stats_config)
+    )
+
+You can call the ``get_statistics()`` method of the feature group to fetch metrics for a specific ingestion job.
+
+The ``get_statistics()`` method takes the following optional parameter:
+
+- ``job_id: string``. Id of feature group job
 
 .. code-block:: python3
 
@@ -203,6 +214,10 @@ You can call the ``get_statistics()`` method of the FeatureGroup instance to fet
 
 .. image:: figures/stats_1.png
 
+.. seealso::
+
+   :ref:`Statistics`
+
 Get last feature group job
 ==========================
 Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.
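The chained ``with_*`` calls shown in the blocks above follow a fluent builder pattern: each setter records a value and returns ``self`` so the next call can be chained. A minimal toy sketch of that mechanic (``FeatureGroupBuilder`` and its fields are hypothetical illustrations, not the ADS classes):

```python
class FeatureGroupBuilder:
    """Toy illustration of the fluent with_* builder pattern."""

    def __init__(self):
        self._spec = {}

    def with_name(self, name):
        self._spec["name"] = name
        return self  # returning self is what makes the calls chainable

    def with_primary_keys(self, keys):
        self._spec["primary_keys"] = list(keys)
        return self

    def with_statistics_config(self, config):
        self._spec["statistics_config"] = config
        return self


fg = (
    FeatureGroupBuilder()
    .with_name("driver_features")
    .with_primary_keys(["driver_id"])
    .with_statistics_config({"enabled": True, "columns": ["trips"]})
)
print(fg._spec["name"])  # driver_features
```

Because every setter returns the builder itself, the whole definition reads as one declarative expression, which is why the docs wrap it in parentheses.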
ads/feature_store/docs/source/feature_validation.rst

Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+.. _Feature Validation:
+
+Feature Validation
+******************
+
+Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model. This is important because inaccurate or unreliable features can lead to poor model performance.
+Feature store lets you define expectations on the data being materialized into feature groups and datasets. This is achieved using the open-source library Great Expectations.
+
+.. note::
+
+   `Great Expectations <https://docs.greatexpectations.io/docs/>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you maintain data quality and improve communication about data between teams. Software developers have long known that automated testing is essential for managing complex codebases.
+
+
+Expectations
+============
+An Expectation is a verifiable assertion about your data. You can define an expectation as follows:
+
+.. code-block:: python3
+
+    from great_expectations.core.expectation_configuration import ExpectationConfiguration
+
+    # Create an Expectation
+    expect_config = ExpectationConfiguration(
+        # Name of the expectation type being added
+        expectation_type="expect_table_columns_to_match_ordered_list",
+        # These are the arguments of the expectation.
+        # The keys allowed in the dictionary are the Parameters and
+        # Keyword Arguments of this Expectation Type.
+        kwargs={
+            "column_list": [
+                "column1",
+                "column2",
+                "column3",
+                "column4",
+            ]
+        },
+        # This is how you can optionally add a comment about this expectation.
+        meta={
+            "notes": {
+                "format": "markdown",
+                "content": "details about this expectation. **Markdown** `Supported`",
+            }
+        },
+    )
+
+Expectation Suite
+=================
+
+An Expectation Suite is a collection of verifiable assertions, i.e. expectations, about your data. You can define an expectation suite as follows:
+
+.. code-block:: python3
+
+    # Create an Expectation Suite
+    suite = context.add_expectation_suite(expectation_suite_name="example_suite")
+    suite.add_expectation(expect_config)
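To make the semantics of ``expect_table_columns_to_match_ordered_list`` concrete, here is a minimal plain-Python stand-in that returns a Great-Expectations-style result dictionary. The helper function itself is illustrative only, not part of the library:

```python
def expect_table_columns_to_match_ordered_list(actual_columns, column_list):
    """Succeed only if the table's columns match column_list exactly, in order."""
    success = list(actual_columns) == list(column_list)
    return {
        "success": success,
        "result": {"observed_value": list(actual_columns)},
    }


expected = ["column1", "column2", "column3", "column4"]

# Same columns, same order: the expectation is met
print(expect_table_columns_to_match_ordered_list(expected, expected)["success"])  # True

# Same columns but reordered: the expectation fails
print(expect_table_columns_to_match_ordered_list(
    ["column2", "column1", "column3", "column4"], expected)["success"])  # False
```

Note that ordering matters for this expectation type, unlike set-based column checks.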

ads/feature_store/docs/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ Welcome to oci-feature-store's documentation!
    feature_group_job
    dataset
    dataset_job
+   statistics
+   feature_validation
    demo
    notebook
    release_notes

ads/feature_store/docs/source/overview.rst

Lines changed: 5 additions & 1 deletion
@@ -15,6 +15,10 @@ Oracle feature store is a stack based solution that is deployed in the customer
 - ``Dataset``: A dataset is a collection of feature that are used together to either train a model or perform model inference.
 - ``Dataset Job``: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results.
 
+.. important::
+
+   Prerequisite: Contact #oci-feature-store_early-preview to get your tenancy whitelisted for early access to feature store.
+
 .. important::
 
    The OCI Feature Store support following versions
@@ -32,4 +36,4 @@ Oracle feature store is a stack based solution that is deployed in the customer
     * - delta-spark
       - .. image:: https://img.shields.io/badge/delta-2.0.1-blue?style=for-the-badge&logo=pypi&logoColor=white
     * - pyspark
-      - .. image:: https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white
+      - .. image:: https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white
ads/feature_store/docs/source/statistics.rst

Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+.. _Statistics:
+
+Statistics
+**********
+
+Feature store provides functionality to compute statistics for feature groups as well as datasets and persist them along with the metadata. These statistics can help you derive insights about data quality. The statistical metrics are computed at materialization time and persisted with the other metadata.
+
+.. note::
+
+   Feature store utilizes MLM Insights, a Python API that helps evaluate and monitor data across the entire ML observability lifecycle. It performs data summarization, which reduces a dataset to a set of descriptive statistics.
+
+The statistical metrics computed by feature store depend on the feature type.
+
++------------------------+-----------------------+
+| Numerical Metrics      | Categorical Metrics   |
++========================+=======================+
+| Skewness               | Count                 |
++------------------------+-----------------------+
+| StandardDeviation      | TopKFrequentElements  |
++------------------------+-----------------------+
+| Min                    | TypeMetric            |
++------------------------+-----------------------+
+| IsConstantFeature      | DuplicateCount        |
++------------------------+-----------------------+
+| IQR                    | Mode                  |
++------------------------+-----------------------+
+| Range                  | DistinctCount         |
++------------------------+-----------------------+
+| ProbabilityDistribution|                       |
++------------------------+-----------------------+
+| Variance               |                       |
++------------------------+-----------------------+
+| FrequencyDistribution  |                       |
++------------------------+-----------------------+
+| Count                  |                       |
++------------------------+-----------------------+
+| Max                    |                       |
++------------------------+-----------------------+
+| DistinctCount          |                       |
++------------------------+-----------------------+
+| Sum                    |                       |
++------------------------+-----------------------+
+| IsQuasiConstantFeature |                       |
++------------------------+-----------------------+
+| Quartiles              |                       |
++------------------------+-----------------------+
+| Mean                   |                       |
++------------------------+-----------------------+
+| Kurtosis               |                       |
++------------------------+-----------------------+
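Several of the numerical metrics in the table map directly onto Python's standard library. A minimal sketch, with an illustrative dataset and assuming the population variants of variance and standard deviation (the actual engine may use different estimators):

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative feature values

# A subset of the numerical metrics listed above, computed with the stdlib
metrics = {
    "Count": len(values),
    "Min": min(values),
    "Max": max(values),
    "Sum": sum(values),
    "Mean": statistics.mean(values),
    "StandardDeviation": statistics.pstdev(values),   # population std dev
    "Variance": statistics.pvariance(values),         # population variance
    "Range": max(values) - min(values),
    "DistinctCount": len(set(values)),
    "Mode": statistics.mode(values),
    "IsConstantFeature": len(set(values)) == 1,
}

print(metrics["Mean"], metrics["Variance"])  # 5.0 4.0
```

Metrics such as Skewness, Kurtosis, and TopKFrequentElements need more than the stdlib, but follow the same per-feature summarization idea.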

ads/feature_store/docs/source/terraform.rst

Lines changed: 2 additions & 2 deletions
@@ -95,8 +95,8 @@ Steps
     rm -f feature-store-terraform.zip \
     && wget https://objectstorage.us-ashburn-1.oraclecloud.com/p/vZogtXWwHqbkGLeqyKiqBmVxdbR4MK4nyOBqDsJNVE4sHGUY5KFi4T3mOFGA3FOy/n/idogsu2ylimg/b/oci-feature-store/o/beta/terraform/feature-store-terraform.zip \
     && oci resource-manager stack create \
-    --compartment-id <compartment-id> \
-    --config-source <path-to-downloaded-zip-file> \
+    --compartment-id <COMPARTMENT_OCID> \
+    --config-source feature-store-terraform.zip \
     --terraform-version 1.1.x \
     --variables '{
     "service_version": "<SERVICE_VERSION>",
