[FSTORE-1424] Feature logging #396

Merged
merged 8 commits into from Jul 30, 2024
Changes from 4 commits
208 changes: 208 additions & 0 deletions docs/user_guides/fs/feature_view/feature_logging.md
@@ -0,0 +1,208 @@
# User Guide: Feature Logging with Feature View

Feature logging is essential for tracking and auditing the data your models use. This guide explains how to log features and predictions, and how to retrieve and manage these logs, with a feature view in Hopsworks.

## Logging Features and Predictions

After you have trained a model, logging the features it uses and the predictions it makes is crucial. This helps track what data was used during inference and allows predictions to be validated later. You can log transformed and/or untransformed features.

### Enabling Feature Logging

To enable logging, set `logging_enabled=True` when creating the feature view. Two feature groups will be created for storing the transformed and untransformed features, but they are not visible in the UI. The logged features are written to the offline feature store every hour by scheduled materialization jobs, which are created automatically.

```python
feature_view = fs.create_feature_view("name", query, logging_enabled=True)
```

Alternatively, you can call `feature_view.enable_logging()` on an existing feature view. Calling `feature_view.log()` will also implicitly enable logging if it is not already enabled.
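
A minimal sketch of these two paths (assuming a feature view previously created without logging, retrieved via `fs.get_feature_view`; `features_df` is a hypothetical pandas DataFrame):

```python
# Enable logging explicitly on an existing feature view
feature_view = fs.get_feature_view("name", version=1)
feature_view.enable_logging()

# Or rely on the first call to log() to enable logging implicitly
feature_view.log(features_df)
```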

### Logging Features and Predictions

You can log features and predictions by calling `feature_view.log`. The logged features are written periodically to the offline store. If you need the logs to be available immediately, call `feature_view.materialize_log`.
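
For example, a brief sketch of logging followed by on-demand materialization (`features_df` is a hypothetical DataFrame of the features to log):

```python
# Log the features; they will be picked up by the hourly materialization job
feature_view.log(features_df)

# Or force the logs into the offline store right away
feature_view.materialize_log(wait=True)
```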

You can log transformed and/or untransformed features. To get untransformed features, specify `transform=False` in `feature_view.get_batch_data` or `feature_view.get_feature_vector(s)`. Inference helper columns are returned along with the untransformed features. To get transformed features, call `feature_view.transform_batch_data` or `feature_view.transform_feature_vector(s)`. Inference helper columns are not returned as transformed features (see Example 3 below). [link to transformed features]()
Contributor:

You can log transformed and/or untransformed features (by transformed, we mean that the categorical/numerical features have been encoded).

Inference helper columns are returned along with the untransformed features (if they have been defined in the feature view).

You are missing the URL here:
link to transformed features

Contributor Author:

MDT does not only encode features; you can also create new features. Also, I modified it: inference helper columns are not returned with untransformed features and hence are not logged.


You can also log predictions, and optionally the training dataset version and the model used for prediction. Predictions can be provided as one or more columns in the feature DataFrame or separately in the `prediction` argument; if you provide the features as a `list` instead of a pandas DataFrame, predictions must be passed in the `prediction` argument. The training dataset version will also be logged if it has been cached, which happens when you provide the training dataset version when calling `feature_view.init_serving(...)` or `feature_view.init_batch_scoring(...)`.
Contributor:

Predictions can be optionally provided as one or more columns in the DataFrame containing the features or separately in the prediction argument. There must be the same number of prediction columns as there are labels in the feature view.

Contributor:

I don't understand this point:
This is useful for logging real-time features and predictions which are often in type list, avoiding the need to ensure feature order of the labels.

Typically, when there are 2 label columns, you call something like:
df[['prediction_col1', 'prediction_col2']] = model.predict(df)
Scikit-learn and xgboost models that are trained on a dataframe can make predictions on the dataframe.

Contributor:

The training dataset version will also be logged if you have called either feature_view.init_serving(...) or feature_view.init_batch_scoring(...).

Contributor Author:

This is useful for logging real-time features and predictions which are often in type list, avoiding the need to ensure feature order of the labels.

I rewrote it:
It is required to provide predictions in the `prediction` argument if you provide the features as a list instead of a pandas DataFrame.


The time of calling `feature_view.log` is automatically logged, enabling filtering by logging time when retrieving logs.

#### Example 1: Log Features Only

You have a DataFrame of features you want to log.

```python
import pandas as pd

features = pd.DataFrame({
    "feature1": [1.1, 2.2, 3.3],
    "feature2": [4.4, 5.5, 6.6]
})

# Log features
feature_view.log(features)
```

#### Example 2: Log Features, Predictions, and Model

You can also log predictions, and optionally the training dataset version and the model used for prediction.

```python
predictions = pd.DataFrame({
    "prediction": [0, 1, 0]
})

# Log features and predictions
feature_view.log(
    features,
    prediction=predictions,
    training_dataset_version=1,
    hsml_model=Model(1, "model", version=1)  # "Model" is the hsml model class, identifying the model used
)
```
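
As noted above, predictions can alternatively be provided as label columns inside the feature DataFrame itself rather than through the `prediction` argument. A sketch, assuming the feature view defines a single label named `prediction`:

```python
# The label column in the DataFrame is picked up as the prediction
features_with_predictions = pd.DataFrame({
    "feature1": [1.1, 2.2, 3.3],
    "feature2": [4.4, 5.5, 6.6],
    "prediction": [0, 1, 0]  # hypothetical label column name
})

feature_view.log(features_with_predictions)
```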

#### Example 3: Log Both Transformed and Untransformed Features

**Batch Features**
```python
untransformed_df = feature_view.get_batch_data(transform=False)
# then apply the transformations after:
transformed_df = feature_view.transform_batch_data(untransformed_df)
# Log untransformed features
feature_view.log(untransformed_df)
# Log transformed features
feature_view.log(transformed_df, transformed=True)
```

**Real-time Features**
```python
untransformed_vector = feature_view.get_feature_vector({"id": 1}, transform=False)
# then apply the transformations after:
transformed_vector = feature_view.transform_feature_vector(untransformed_vector)
# Log untransformed features
feature_view.log(untransformed_vector)
# Log transformed features
feature_view.log(transformed_vector, transformed=True)
```

## Retrieving the Log Timeline

To audit and review the data logs, you might want to retrieve the timeline of log entries. This helps you understand when data was logged and lets you monitor the logging process.

### Retrieve Log Timeline
Contributor:

This is not a "log timeline" as I understand it.
For me, a "log timeline" is about the log itself - the log was created, it was rotated, it was deleted.
It is not about the log entries, which is what we are interested in here.

Contributor Author:

It is actually the Hudi commit timeline.


Get the latest 10 log entries.

```python
# Retrieve the latest 10 log entries
log_timeline = feature_view.get_log_timeline(limit=10)
print(log_timeline)
```

Contributor:

Duplicate of line 92

Contributor:

Why not just call this tail without a named parameter?

log_timeline = feature_view.tail(10)

Contributor Author:

get_log_timeline is clearer.

## Reading Log Entries

You may need to read specific log entries for analysis, such as entries within a particular time range or for a specific model version and training dataset version.

### Read All Log Entries

Read all log entries for comprehensive analysis. The output returns all logged values for each primary key, not just the latest value.

```python
# Read all log entries
log_entries = feature_view.read_log()
print(log_entries)
```

### Read Log Entries Within a Time Range

Focus on logs within a specific time frame. You can specify `start_time` and `end_time` for filtering, but the time columns will not be returned in the DataFrame.

```python
# Read log entries from January 2022
log_entries = feature_view.read_log(start_time="2022-01-01", end_time="2022-01-31")
print(log_entries)
```

### Read Log Entries by Training Dataset Version

Analyze logs from a particular version of the training dataset. The training dataset version column will be returned in the DataFrame.

```python
# Read log entries of training dataset version 1
log_entries = feature_view.read_log(training_dataset_version=1)
print(log_entries)
```

### Read Log Entries by HSML Model

Analyze logs from a particular name and version of the HSML model. The HSML model column will be returned in the DataFrame.

```python
# Read log entries of a specific HSML model
log_entries = feature_view.read_log(hsml_model=Model(1, "model", version=1))
print(log_entries)
```

### Read Log Entries by Custom Filter

Provide filters, which work in the same way as the `filter` method of the `Query` class. The filter must reference features that are part of the feature view's query.

```python
# Read log entries where feature1 is greater than 0
# ("fg" is a feature group in the feature view's query that provides feature1)
log_entries = feature_view.read_log(filter=fg.feature1 > 0)
print(log_entries)
```

## Pausing and Resuming Logging

During maintenance or updates, you might need to pause logging to save computation resources.

### Pause Logging

Pause the schedule of the materialization job for writing logs to the offline store.

```python
# Pause logging
feature_view.pause_logging()
```

### Resume Logging

Resume the schedule of the materialization job for writing logs to the offline store.

```python
# Resume logging
feature_view.resume_logging()
```

## Materializing Logs

Besides the scheduled materialization job, you can materialize logs from Kafka to the offline store on demand. This does not pause the scheduled job.

### Materialize Logs

Materialize logs and optionally wait for the process to complete.

```python
# Materialize logs and wait for completion
materialization_result = feature_view.materialize_log(wait=True)
print(materialization_result)
```

## Deleting Logs

When log data is no longer needed, you might want to delete it to free up space and maintain data hygiene. This operation deletes the logging feature groups and recreates new ones. The scheduled materialization job and the log timeline are reset as well.
Contributor:

The scheduled materialization job is reset as well.

What does the "log timeline" mean here?

Contributor Author:

The Hudi commit timeline of the logging feature group.


### Delete Logs

Remove all log entries, optionally specifying whether to delete transformed/untransformed logs.

```python
# Delete all log entries
feature_view.delete_log()

# Delete only transformed log entries
feature_view.delete_log(transformed=True)
```

## Summary

Feature logging is a crucial part of maintaining and monitoring your machine learning workflows. By following these examples, you can effectively log, retrieve, and delete logs to keep your data pipeline robust and auditable.
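
As a quick reference, here is a condensed end-to-end sketch of the workflow covered above (the feature view name, query, and versions are illustrative):

```python
# Create a feature view with logging enabled
feature_view = fs.create_feature_view("name", query, logging_enabled=True)

# Log features and predictions
feature_view.log(features, prediction=predictions, training_dataset_version=1)

# Flush the logs to the offline store and read them back
feature_view.materialize_log(wait=True)
log_entries = feature_view.read_log(training_dataset_version=1)

# Delete the logs when they are no longer needed
feature_view.delete_log()
```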
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -99,6 +99,7 @@ nav:
- Feature Monitoring:
- Getting started: user_guides/fs/feature_view/feature_monitoring.md
- Advanced guide: user_guides/fs/feature_monitoring/feature_monitoring_advanced.md
- Feature Logging: user_guides/fs/feature_view/feature_logging.md
- Vector Similarity Search: user_guides/fs/vector_similarity_search.md
- Compute Engines: user_guides/fs/compute_engines.md
- Client Integrations: