[FSTORE-1285] Model Dependent Transformation Functions #390

Merged · 4 commits · Jul 11, 2024
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_view/index.md
Expand Up @@ -9,6 +9,6 @@ This section serves to provide guides and examples for the common usage of abstr
- [Feature Server](feature-server.md)
- [Query](query.md)
- [Helper columns](helper-columns.md)
- [Transformation Functions](transformation-function.md)
- [Model-Dependent Transformation Functions](transformation-function.md)
- [Spines](spine-query.md)
- [Feature Monitoring](feature_monitoring.md)
206 changes: 166 additions & 40 deletions docs/user_guides/fs/feature_view/transformation-function.md
@@ -1,75 +1,151 @@
# Transformation Functions
# Model Dependent Transformation Functions

HSFS provides functionality to attach transformation functions to [feature views](./overview.md).
Hopsworks provides functionality to attach transformation functions to [feature views](./overview.md).

User defined, custom transformation functions need to be registered in the feature store to make them accessible for feature view creation. To register them in the feature store, they either have to be part of the library [installed](../../../user_guides/projects/python/python_install.md) in Hopsworks or attached when starting a [Jupyter notebook](../../../user_guides/projects/jupyter/python_notebook.md) or [Hopsworks job](../../../user_guides/projects/jobs/spark_job.md).
These transformation functions are primarily [model-dependent transformations](https://www.hopsworks.ai/dictionary/model-dependent-transformations). Model-dependent transformations generate feature data tailored to a specific model, often requiring the computation of training dataset statistics. Hopsworks enables you to define custom model-dependent transformation functions that can take multiple features and their associated statistics as input and produce multiple transformed features as output. Hopsworks automatically executes a defined transformation function as a [`@pandas_udf`](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) in PySpark applications and as a Pandas function in Python clients.

Custom transformation functions created in Hopsworks can be directly attached to feature views or stored in the feature store for later retrieval and attachment. These custom functions can be part of a library [installed](../../../user_guides/projects/python/python_install.md) in Hopsworks or added when starting a [Jupyter notebook](../../../user_guides/projects/jupyter/python_notebook.md) or [Hopsworks job](../../../user_guides/projects/jobs/spark_job.md).

Hopsworks also includes built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler`, `label_encoder`, and `one_hot_encoder` that can be easily imported and used.
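As an illustration of what these built-ins compute, `min_max_scaler` maps each value into [0, 1] using the minimum and maximum observed in the training data, so inference-time data is transformed with the same mapping as training data. A minimal pure-Pandas sketch of that logic (for intuition only, not the Hopsworks implementation):

```python
import pandas as pd

def min_max_scale(feature: pd.Series, train_min: float, train_max: float) -> pd.Series:
    # Scale into [0, 1] using statistics computed on the training split,
    # so batch data and feature vectors are transformed consistently.
    return (feature - train_min) / (train_max - train_min)

amounts = pd.Series([10.0, 25.0, 40.0])
scaled = min_max_scale(amounts, train_min=10.0, train_max=40.0)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```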

!!! warning "Pyspark decorators"

Don't decorate transformation functions with Pyspark `@udf` or `@pandas_udf`, and also make sure not to use any Pyspark dependencies. That is because, the transformation functions may be executed by Python clients. HSFS will decorate transformation function for you only if it is used inside Pyspark application.
Don't decorate transformation functions with the PySpark `@udf` or `@pandas_udf` decorators, and make sure not to use any PySpark dependencies, because the transformation functions may also be executed by Python clients. Hopsworks automatically runs transformation functions as Pandas UDFs only when they are used inside a PySpark application.

!!! warning "Java/Scala support"

Creating and attaching transformation functions to feature views is not supported in the HSFS Java or Scala clients. If a feature view with transformation functions was created using the Python client, you cannot get training data or feature vectors from the HSFS Java or Scala clients.


## Creation of Custom Transformation Functions

## Creation
Hopsworks ships built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.
User-defined, custom transformation functions can be created in Hopsworks using the [`@udf`](http://docs.hopsworks.ai/hopsworks-api/{{{hopsworks_version}}}/generated/api/udf/) decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

You can also create new functions. Let's assume that you have already installed Python library [transformation_fn_template](https://github.com/logicalclocks/transformation_fn_template) containing the transformation function `plus_one`.
The `@udf` decorator in Hopsworks creates a metadata class called `HopsworksUdf`. This class manages the necessary operations to supply feature statistics to custom transformation functions and execute them as `@pandas_udf` in PySpark applications or as pure Pandas functions in Python clients. The decorator requires the `return_type` of the transformation function, which indicates the type of features returned. This can be a single Python type if the transformation function returns a single transformed feature as a Pandas Series, or a list of Python types if it returns multiple transformed features as a Pandas DataFrame. The supported types include `str`, `int`, `float`, `bool`, `datetime.datetime`, `datetime.date`, and `datetime.time`.
> **Reviewer (Contributor):** would be nice to provide a link to the api reference of hopsworks udf
>
> **Author:** added API reference to udf
>
> **Reviewer (Contributor):** version should be a variable `{{{ hopsworks_version }}}`
>
> **Author:** Used `{{{hopsworks_version}}}` in the link.


Hopsworks supports four types of transformation functions:

1. One to One: Transforms one feature into one transformed feature.
2. One to Many: Transforms one feature into multiple transformed features.
3. Many to One: Transforms multiple features into one transformed feature.
4. Many to Many: Transforms multiple features into multiple transformed features.

To create a One to One transformation function, the Hopsworks `@udf` decorator must be provided with the return type as a Python type, and the transformation function should take one argument as input and return a Pandas Series.

=== "Python"

!!! example "Register transformation function `plus_one` in the Hopsworks feature store."
!!! example "Creation of a Custom One to One Transformation Function in Hopsworks."
```python
from custom_functions import transformations
plus_one_meta = fs.create_transformation_function(
transformation_function=transformations.plus_one,
output_type=int,
version=1)
plus_one_meta.save()
from hopsworks import udf

@udf(int)
def add_one(feature):
return feature + 1
```
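Since the body of a decorated function is plain Pandas, its logic can be sanity-checked locally before registering it — shown here without the `@udf` decorator, which only attaches Hopsworks metadata:

```python
import pandas as pd

def add_one(feature: pd.Series) -> pd.Series:
    # Same element-wise logic as the decorated UDF above.
    return feature + 1

result = add_one(pd.Series([1, 2, 3]))
print(result.tolist())  # [2, 3, 4]
```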

## Retrieval
To retrieve all transformation functions from the feature store, use `get_transformation_functions` which will return the list of available `TransformationFunction` objects. A specific transformation function can be retrieved with the `get_transformation_function` method where you can provide its name and version of the transformation function. If only the function name is provided then it will default to version 1.
Creation of a Many to One transformation function is similar to that of a One to One transformation function, the only difference being that the transformation function accepts multiple features as input.

=== "Python"
!!! example "Creation of a Many to One Custom Transformation Function in Hopsworks."
```python
from hopsworks import udf

@udf(int)
def add_features(feature1, feature2, feature3):
return feature1 + feature2 + feature3
```

To create a One to Many transformation function, the Hopsworks `@udf` decorator must be provided with the return type as a list of Python types, and the transformation function should take one argument as input and return multiple features as a Pandas DataFrame. The return types provided to the decorator must match the types of each column in the returned Pandas DataFrame.

=== "Python"
!!! example "Creation of a One to Many Custom Transformation Function in Hopsworks."
```python
from hopsworks import udf
import pandas as pd

@udf([int, int])
def add_one_and_two(feature1):
return pd.DataFrame({"add_one":feature1 + 1, "add_two":feature1 + 2})
```

Creation of a Many to Many transformation function is similar to that of a One to Many transformation function, the only difference being that the transformation function accepts multiple features as input and returns multiple transformed features.

=== "Python"
!!! example "Creation of a Many to Many Custom Transformation Function in Hopsworks."
```python
from hopsworks import udf
import pandas as pd

@udf([int, int, int])
def add_one_multiple(feature1, feature2, feature3):
return pd.DataFrame({"add_one_feature1":feature1 + 1, "add_one_feature2":feature2 + 1, "add_one_feature3":feature3 + 1})
```
To access statistics for the arguments provided as input to a transformation function, define a keyword argument named `statistics` whose default value is an instance of the `TransformationStatistics` class. The `TransformationStatistics` instance must be initialized with the names of the arguments for which statistical information is required.

The `TransformationStatistics` instance contains separate objects with the same name as the arguments used to initialize it. These objects encapsulate statistics related to the argument as instances of the `FeatureTransformationStatistics` class. Upon instantiation, instances of `FeatureTransformationStatistics` are initialized with `None` values. These placeholders are subsequently populated with the required statistics when the training dataset is created.

=== "Python"
!!! example "Creation of a Custom Transformation Function in Hopsworks that accesses Feature Statistics"
```python
from hopsworks import udf
from hsfs.transformation_statistics import TransformationStatistics

stats = TransformationStatistics("argument1", "argument2", "argument3")

@udf(int)
def add_features(argument1, argument2, argument3, statistics=stats):
return argument1 + argument2 + argument3 + statistics.argument1.mean + statistics.argument2.mean + statistics.argument3.mean
```
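The behaviour of such a statistics-aware function can be previewed locally by standing in for the `statistics` object with plain values — a sketch with hypothetical means; in Hopsworks they are populated from training-dataset statistics:

```python
import pandas as pd
from types import SimpleNamespace

# Stand-in for the TransformationStatistics object, with hypothetical means.
stats = SimpleNamespace(
    argument1=SimpleNamespace(mean=10.0),
    argument2=SimpleNamespace(mean=20.0),
    argument3=SimpleNamespace(mean=30.0),
)

def add_features(argument1, argument2, argument3, statistics=stats):
    # Same logic as the decorated UDF above: sum of the features and their means.
    return (argument1 + argument2 + argument3
            + statistics.argument1.mean
            + statistics.argument2.mean
            + statistics.argument3.mean)

out = add_features(pd.Series([1.0]), pd.Series([2.0]), pd.Series([3.0]))
print(out.tolist())  # [66.0]
```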

The output column generated by the transformation function follows a naming convention structured as `functionName_features_outputColumnNumber`. For instance, for the function named `add_one_multiple`, the output columns would be labeled as `add_one_multiple_feature1_feature2_feature3_0`, `add_one_multiple_feature1_feature2_feature3_1`, and `add_one_multiple_feature1_feature2_feature3_2`.
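The convention can be sketched as a small helper (illustrative only — Hopsworks generates these names internally):

```python
def output_column_names(function_name: str, input_features: list, n_outputs: int) -> list:
    # functionName_inputFeatureNames_outputColumnNumber
    prefix = "_".join([function_name] + input_features)
    return [f"{prefix}_{i}" for i in range(n_outputs)]

names = output_column_names("add_one_multiple", ["feature1", "feature2", "feature3"], 3)
print(names[0])  # add_one_multiple_feature1_feature2_feature3_0
```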

## Apply transformation functions to features

You can define in the feature view transformation functions as dict, where key is feature name and value is online transformation function name. Then the transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation).
Transformation functions can be attached to a feature view as a list. Each transformation function can specify which features it uses by explicitly providing their names as arguments. If no feature names are provided, the transformation function defaults to using the features of the feature view that match the names of the transformation function's arguments. The transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation). The generated data includes both transformed and untransformed features in a DataFrame. The transformed features are ordered alphabetically by their output column names and positioned after the untransformed features. By default, all features provided as input to a transformation function are dropped when training data, batch data, or feature vectors are created.

=== "Python"

!!! example "Attaching transformation functions to the feature view"
```python
feature_view = fs.create_feature_view(
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions=[
add_one,
add_features,
add_one_and_two,
add_one_multiple
]
)
```

To explicitly pass features to a transformation function, provide the feature names as arguments when attaching the function to the feature view.


=== "Python"

!!! example "Attaching transformation functions to the feature view by explicitly specifying features to be passed to transformation function"
```python
feature_view = fs.create_feature_view(
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions=[
add_one("feature_1"),
add_one("feature_2"),
add_features("feature_1", "feature_2", "feature_3"),
add_one_and_two("feature_4"),
add_one_multiple("feature_5", "feature_6", "feature_7")
]
)
```

Built-in transformation functions are attached in the same way. The only difference is that it will compute the necessary statistics for the specific function in the background. For example min and max values for `min_max_scaler`; mean and standard deviation for `standard_scaler` etc.
Built-in transformation functions are attached in the same way. The only difference is that they can either be retrieved from the Hopsworks feature store or imported from the `hsfs` module.

=== "Python"

!!! example "Attaching built-in transformation functions to the feature view"
!!! example "Attaching built-in transformation functions to the feature view by retrieving from Hopsworks"
```python
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
standard_scaler = fs.get_transformation_function(name="standard_scaler")
@@ -80,15 +156,65 @@ Built-in transformation functions are attached in the same way. The only differe
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions = [
label_encoder("category"),
robust_scaler("amount"),
min_max_scaler("loc_delta"),
standard_scaler("age_at_transaction")
]
)
```

!!! warning "Java/Scala support"
Built-in transformation functions can also be imported directly from `hsfs.builtin_transformations` and attached to the feature view.

=== "Python"

!!! example "Attaching built-in transformation functions to the feature view by importing from hsfs"
```python
from hsfs.builtin_transformations import min_max_scaler, label_encoder, robust_scaler, standard_scaler

feature_view = fs.create_feature_view(
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions = [
label_encoder("category"),
robust_scaler("amount"),
min_max_scaler("loc_delta"),
standard_scaler("age_at_transaction")
]
)
```

## Saving Transformation Functions to Feature Store
To save a transformation function to the feature store, use `create_transformation_function`, which creates a `TransformationFunction` object. The object can then be saved to the feature store by calling its `save` method.

=== "Python"

!!! example "Register transformation function `add_one` in the Hopsworks feature store."
```python
add_one_meta = fs.create_transformation_function(
transformation_function=add_one,
version=1)
add_one_meta.save()
```

## Retrieval from Feature Store
To retrieve all transformation functions from the feature store, use `get_transformation_functions`, which returns a list of the available `TransformationFunction` objects. A specific transformation function can be retrieved with the `get_transformation_function` method by providing its name and version. If only the name is provided, the version defaults to 1.

=== "Python"

!!! example "Retrieving transformation functions from the feature store"
```python
# get all transformation functions
fs.get_transformation_functions()

# get transformation function by name. This will default to version 1
add_one_fn = fs.get_transformation_function(name="add_one")

# get built-in transformation function min max scaler
min_max_scaler_fn = fs.get_transformation_function(name="min_max_scaler")

# get transformation function by name and version.
add_one_fn = fs.get_transformation_function(name="add_one", version=2)
```
2 changes: 1 addition & 1 deletion mkdocs.yml
Expand Up @@ -94,7 +94,7 @@ nav:
- Feature server: user_guides/fs/feature_view/feature-server.md
- Query: user_guides/fs/feature_view/query.md
- Helper Columns: user_guides/fs/feature_view/helper-columns.md
- Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Model-Dependent Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Spines: user_guides/fs/feature_view/spine-query.md
- Feature Monitoring:
- Getting started: user_guides/fs/feature_view/feature_monitoring.md