[FSTORE-1789] Data source docs #495

Open · wants to merge 6 commits into base: main
4 changes: 2 additions & 2 deletions docs/concepts/fs/feature_group/external_fg.md
@@ -1,6 +1,6 @@
- External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
+ External feature groups are offline feature groups whose data is stored in an external table. An external table requires a data source, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
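
For illustration, here is a minimal sketch of defining an external feature group with the HSFS Python client; the data source name, SQL string, and feature group details are hypothetical, and parameter names may vary slightly between client versions:

```python
import hopsworks

# Connect to the project and its feature store
project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical data source created beforehand in the UI
snowflake_ds = fs.get_storage_connector("snowflake_sales")

# The SQL string is stored with the feature group and executed on demand
sales_ext_fg = fs.create_external_feature_group(
    name="sales_external",
    version=1,
    description="Aggregated sales, read on demand from Snowflake",
    primary_key=["store_id"],
    query="SELECT store_id, SUM(amount) AS total_amount FROM sales GROUP BY store_id",
    storage_connector=snowflake_ds,
)
sales_ext_fg.save()  # registers the metadata; no data is ingested into Hopsworks
```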

- In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka
+ In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, RDS, and Kafka.

<img src="../../../../assets/images/concepts/fs/fg-connector-api.svg">

2 changes: 1 addition & 1 deletion docs/concepts/fs/feature_group/fg_overview.md
@@ -19,4 +19,4 @@ The online store stores only the latest values of features for a feature group.

The offline store stores the historical values of features for a feature group, and so it may store much more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models.

- In most cases, offline data is stored in Hopsworks, but through the implementation of storage connectors, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining [External Feature Group](external_fg.md).
+ In most cases, offline data is stored in Hopsworks, but by using data sources it can also reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups, or it can be used for reading only by defining an [External Feature Group](external_fg.md).
19 changes: 13 additions & 6 deletions docs/index.md
@@ -40,42 +40,49 @@ hide:
</div>
</div>
<div class="side-content">
<div class="name_item ingrey">Storage connectors</div>
<div class="name_item ingrey">Data Sources</div>
<div class="w-layout-grid">
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/storage_connector/creation/jdbc/">JDBC</a></div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/jdbc/">JDBC</a></div>
</div>
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/storage_connector/creation/bigquery/">BigQuery</a></div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/bigquery/">BigQuery</a></div>
</div>
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/storage_connector/creation/s3/">Object Store</a></div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/s3/">Object Store</a></div>
</div>
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/storage_connector/creation/snowflake/">Snowflake</a></div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/snowflake/">Snowflake</a></div>
</div>
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/storage_connector/creation/redshift/">RedShift</a></div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/redshift/">RedShift</a></div>
</div>
<div class="db_frame">
<div class="icondb">
<div class="db_frame-top"></div>
<div class="db_frame-mid"></div>
</div>
<div class="name_item db"><a href="./user_guides/fs/data_source/creation/rds/">RDS</a></div>
</div>
</div>
</div>
2 changes: 1 addition & 1 deletion docs/setup_installation/admin/roleChaining.md
@@ -121,6 +121,6 @@ Add mappings by clicking on *New role chaining*. Enter the project name. Select
<figcaption>Create Role Chaining</figcaption>
</figure>

- Project member can now create connectors using *temporary credentials* to assume the role you configured. More detail about using temporary credentials can be found [here](../../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials).
+ Project members can now create data sources using *temporary credentials* to assume the role you configured. More detail about using temporary credentials can be found [here](../../user_guides/fs/data_source/creation/s3.md#temporary-credentials).

Project members can see the list of roles they can assume by going to the _Project Settings_ -> [Assuming IAM Roles](../../../user_guides/projects/iam_role/iam_role_chaining) page.
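
As a hedged illustration of the temporary-credentials flow, the sketch below reads from an S3 data source whose credentials are issued by assuming the chained role; the data source name, bucket path, and the availability of `read()` in your execution engine are assumptions rather than part of this guide:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical S3 data source configured with "Temporary credentials"
# and the chained IAM role selected when it was created
s3_ds = fs.get_storage_connector("s3_temp_creds")

# Hopsworks assumes the role and injects short-lived credentials for the read
df = s3_ds.read(data_format="parquet", path="s3://my-bucket/landing/transactions/")
```
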
6 changes: 3 additions & 3 deletions docs/setup_installation/on_prem/external_kafka_cluster.md
@@ -60,8 +60,8 @@ As mentioned above, when configuring Hopsworks to use an external Kafka cluster,
</figure>
</p>

- #### Storage connector configuration
+ #### Data Source configuration

- Users should create a [Kafka storage connector](../../user_guides/fs/storage_connector/creation/kafka.md) named `kafka_connector` which is going to be used by the feature store clients to configure the necessary Kafka producers to send data.
+ Users should create a [Kafka Data Source](../../user_guides/fs/data_source/creation/kafka.md) named `kafka_connector`, which will be used by the feature store clients to configure the necessary Kafka producers to send data.
The configuration is done for each project to ensure its members have the necessary authentication/authorization.
- If the storage connector is not found in the project, default values referring to Hopsworks managed Kafka will be used.
+ If the data source is not found in the project, default values referring to Hopsworks-managed Kafka will be used.
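
For illustration only, a minimal sketch (Spark engine, hypothetical topic name) of how a client could pick up the project's `kafka_connector` data source; the producer configuration used by `insert_stream` is otherwise handled automatically inside the feature store clients:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# The project-level Kafka data source; the name must be exactly "kafka_connector"
kafka_ds = fs.get_storage_connector("kafka_connector")

# Spark only: consume a topic from the external cluster as a streaming dataframe
stream_df = kafka_ds.read_stream(topic="transactions")
```
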
6 changes: 3 additions & 3 deletions docs/user_guides/fs/compute_engines.md
@@ -25,10 +25,10 @@ Hopsworks is aiming to provide functional parity between the computational engin
| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | - | Functionality was deprecated in version 3.0 |
| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) <br/> [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | - | `insert_stream` does not perform any data validation even when an expectation suite is attached. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. |
- | Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported |
- | Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. |
+ | Reading from Streaming Data Sources | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported |
+ | Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Training data that was written to external storage using a Data Source other than S3 cannot currently be read using HSFS APIs; instead, you will have to use the storage's native client. |
| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
- | Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. |
+ | Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported; however, you can use the Query to create Feature Views/Training Data and write the data to a Data Source, from where you can read the data into a Pandas/Polars Dataframe (see the sketch after this table). |
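
To make the workaround in the last two rows concrete, here is a hedged sketch (hypothetical feature group names and join key) that builds a feature view from a query containing an external feature group and materializes training data through the backend instead of reading the query directly:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

sales_fg = fs.get_feature_group("sales", version=1)
sales_ext_fg = fs.get_external_feature_group("sales_external", version=1)

# Queries containing external feature groups can back feature views and training data
query = sales_fg.select_all().join(sales_ext_fg.select_all(), on=["store_id"])

fv = fs.create_feature_view(name="sales_fv", version=1, query=query)

# Materializes the data via the backend; returns the training dataset version and the job
td_version, job = fv.create_training_data(data_format="parquet")
```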

## Python

4 changes: 2 additions & 2 deletions docs/user_guides/fs/feature_group/create.md
@@ -85,9 +85,9 @@ By using partitioning the system will write the feature data in different subdir

When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter. The currently supported values are "HUDI", "DELTA", "NONE" (which defaults to Parquet).

- ##### Storage connector
+ ##### Data Source

- During the creation of a feature group, it is possible to define the `storage_connector` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster. Currently, only [S3](../storage_connector/creation/s3.md) connectors and "DELTA" `time_travel_format` format is supported.
+ During the creation of a feature group, it is possible to define the `storage_connector` parameter; this allows offline data to be managed in the desired table format outside the Hopsworks cluster. Currently, only [S3](../data_source/creation/s3.md) data sources with the "DELTA" `time_travel_format` are supported.
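
A hedged sketch of what this could look like with the Python client; the data source name, feature group schema, and the dataframe `transactions_df` are hypothetical:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical S3 data source pointing at the external bucket
s3_ds = fs.get_storage_connector("my_s3_bucket")

trans_fg = fs.create_feature_group(
    name="transactions",
    version=1,
    primary_key=["tid"],
    time_travel_format="DELTA",   # Delta Lake table format
    storage_connector=s3_ds,      # offline data is written outside the Hopsworks cluster
)
trans_fg.insert(transactions_df)  # transactions_df is an existing Spark/Pandas dataframe
```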

##### Online Table Configuration
