diff --git a/docs/assets/images/guides/feature_group/credentials_selection.png b/docs/assets/images/guides/feature_group/credentials_selection.png new file mode 100644 index 000000000..8cbb87d0e Binary files /dev/null and b/docs/assets/images/guides/feature_group/credentials_selection.png differ diff --git a/docs/assets/images/guides/feature_group/data_source.png b/docs/assets/images/guides/feature_group/data_source.png new file mode 100644 index 000000000..9db62e434 Binary files /dev/null and b/docs/assets/images/guides/feature_group/data_source.png differ diff --git a/docs/assets/images/guides/feature_group/ext_table_selection.png b/docs/assets/images/guides/feature_group/ext_table_selection.png new file mode 100644 index 000000000..8b418bfd0 Binary files /dev/null and b/docs/assets/images/guides/feature_group/ext_table_selection.png differ diff --git a/docs/assets/images/guides/feature_group/primary_key_selection.png b/docs/assets/images/guides/feature_group/primary_key_selection.png new file mode 100644 index 000000000..05f68753f Binary files /dev/null and b/docs/assets/images/guides/feature_group/primary_key_selection.png differ diff --git a/docs/assets/images/guides/feature_group/validation_ext_feature_group.png b/docs/assets/images/guides/feature_group/validation_ext_feature_group.png new file mode 100644 index 000000000..d996f16ba Binary files /dev/null and b/docs/assets/images/guides/feature_group/validation_ext_feature_group.png differ diff --git a/docs/assets/images/guides/fs/storage_connector/adls-copy-app-id.png b/docs/assets/images/guides/fs/data_source/adls-copy-app-id.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/adls-copy-app-id.png rename to docs/assets/images/guides/fs/data_source/adls-copy-app-id.png diff --git a/docs/assets/images/guides/fs/storage_connector/adls-copy-secret.png b/docs/assets/images/guides/fs/data_source/adls-copy-secret.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/adls-copy-secret.png rename to docs/assets/images/guides/fs/data_source/adls-copy-secret.png diff --git a/docs/assets/images/guides/fs/storage_connector/adls-copy-tenant-id.png b/docs/assets/images/guides/fs/data_source/adls-copy-tenant-id.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/adls-copy-tenant-id.png rename to docs/assets/images/guides/fs/data_source/adls-copy-tenant-id.png diff --git a/docs/assets/images/guides/fs/data_source/adls_creation.png b/docs/assets/images/guides/fs/data_source/adls_creation.png new file mode 100644 index 000000000..afdc6c900 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/adls_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/bigquery_creation.png b/docs/assets/images/guides/fs/data_source/bigquery_creation.png new file mode 100644 index 000000000..40afd4520 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/bigquery_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/data_source_overview.png b/docs/assets/images/guides/fs/data_source/data_source_overview.png new file mode 100644 index 000000000..7ef38089b Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/data_source_overview.png differ diff --git a/docs/assets/images/guides/fs/storage_connector/driver_upload.png b/docs/assets/images/guides/fs/data_source/driver_upload.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/driver_upload.png rename to 
docs/assets/images/guides/fs/data_source/driver_upload.png diff --git a/docs/assets/images/guides/fs/data_source/gcs_creation.png b/docs/assets/images/guides/fs/data_source/gcs_creation.png new file mode 100644 index 000000000..5c4de54b0 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/gcs_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/hopsfs_creation.png b/docs/assets/images/guides/fs/data_source/hopsfs_creation.png new file mode 100644 index 000000000..d73698c70 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/hopsfs_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/jdbc_creation.png b/docs/assets/images/guides/fs/data_source/jdbc_creation.png new file mode 100644 index 000000000..9cff97e4d Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/jdbc_creation.png differ diff --git a/docs/assets/images/guides/fs/storage_connector/job_default_config.png b/docs/assets/images/guides/fs/data_source/job_default_config.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/job_default_config.png rename to docs/assets/images/guides/fs/data_source/job_default_config.png diff --git a/docs/assets/images/guides/fs/storage_connector/jupyter_config.png b/docs/assets/images/guides/fs/data_source/jupyter_config.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/jupyter_config.png rename to docs/assets/images/guides/fs/data_source/jupyter_config.png diff --git a/docs/assets/images/guides/fs/data_source/kafka_creation.png b/docs/assets/images/guides/fs/data_source/kafka_creation.png new file mode 100644 index 000000000..8e053a705 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/kafka_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/rds_creation.png b/docs/assets/images/guides/fs/data_source/rds_creation.png new file mode 100644 index 000000000..6635f2e41 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/rds_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/redshift_creation.png b/docs/assets/images/guides/fs/data_source/redshift_creation.png new file mode 100644 index 000000000..e45a00364 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/redshift_creation.png differ diff --git a/docs/assets/images/guides/fs/data_source/s3_creation.png b/docs/assets/images/guides/fs/data_source/s3_creation.png new file mode 100644 index 000000000..6e226c507 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/s3_creation.png differ diff --git a/docs/assets/images/guides/fs/storage_connector/snowflake_account_url.png b/docs/assets/images/guides/fs/data_source/snowflake_account_url.png similarity index 100% rename from docs/assets/images/guides/fs/storage_connector/snowflake_account_url.png rename to docs/assets/images/guides/fs/data_source/snowflake_account_url.png diff --git a/docs/assets/images/guides/fs/data_source/snowflake_creation.png b/docs/assets/images/guides/fs/data_source/snowflake_creation.png new file mode 100644 index 000000000..ddc9a57af Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/snowflake_creation.png differ diff --git a/docs/assets/images/guides/fs/storage_connector/adls_creation.png b/docs/assets/images/guides/fs/storage_connector/adls_creation.png deleted file mode 100644 index 70f0ea987..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/adls_creation.png and 
/dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/bigquery_creation.png b/docs/assets/images/guides/fs/storage_connector/bigquery_creation.png deleted file mode 100644 index 35a485793..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/bigquery_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/gcs_creation.png b/docs/assets/images/guides/fs/storage_connector/gcs_creation.png deleted file mode 100644 index 8f8068dc4..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/gcs_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/hopsfs_creation.png b/docs/assets/images/guides/fs/storage_connector/hopsfs_creation.png deleted file mode 100644 index 89c0e0604..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/hopsfs_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/jdbc_creation.png b/docs/assets/images/guides/fs/storage_connector/jdbc_creation.png deleted file mode 100644 index 2d7593084..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/jdbc_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/kafka_creation.png b/docs/assets/images/guides/fs/storage_connector/kafka_creation.png deleted file mode 100644 index 83bcb55c0..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/kafka_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/redshift_creation.png b/docs/assets/images/guides/fs/storage_connector/redshift_creation.png deleted file mode 100644 index f4e238b2b..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/redshift_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/s3_creation.png b/docs/assets/images/guides/fs/storage_connector/s3_creation.png deleted file mode 100644 index e67dafe45..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/s3_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/snowflake_creation.png b/docs/assets/images/guides/fs/storage_connector/snowflake_creation.png deleted file mode 100644 index a95be8e66..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/snowflake_creation.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/storage_connector_create.png b/docs/assets/images/guides/fs/storage_connector/storage_connector_create.png deleted file mode 100644 index ace9694e4..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/storage_connector_create.png and /dev/null differ diff --git a/docs/assets/images/guides/fs/storage_connector/storage_connector_overview.png b/docs/assets/images/guides/fs/storage_connector/storage_connector_overview.png deleted file mode 100644 index ace9694e4..000000000 Binary files a/docs/assets/images/guides/fs/storage_connector/storage_connector_overview.png and /dev/null differ diff --git a/docs/concepts/fs/feature_group/external_fg.md b/docs/concepts/fs/feature_group/external_fg.md index 8a41adc7f..7d260b816 100644 --- a/docs/concepts/fs/feature_group/external_fg.md +++ b/docs/concepts/fs/feature_group/external_fg.md @@ -1,6 +1,6 @@ -External feature groups are offline feature groups where their data is stored in an external table. 
An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table. +External feature groups are offline feature groups where their data is stored in an external table. An external table requires a data source, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table. -In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka +In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, RDS, and Kafka diff --git a/docs/concepts/fs/feature_group/fg_overview.md b/docs/concepts/fs/feature_group/fg_overview.md index de8440cf6..d7a9311af 100644 --- a/docs/concepts/fs/feature_group/fg_overview.md +++ b/docs/concepts/fs/feature_group/fg_overview.md @@ -19,4 +19,4 @@ The online store stores only the latest values of features for a feature group. The offline store stores the historical values of features for a feature group so that it may store much more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models. -In most cases, offline data is stored in Hopsworks, but through the implementation of storage connectors, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining [External Feature Group](external_fg.md). \ No newline at end of file +In most cases, offline data is stored in Hopsworks, but through the implementation of data sources, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining [External Feature Group](external_fg.md). \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 2d72b7ab1..9344be312 100644 --- a/docs/index.md +++ b/docs/index.md @@ -40,42 +40,49 @@ hide:
-
Storage connectors
+
Data Sources
- +
+
+
+
+
+
+
diff --git a/docs/setup_installation/admin/roleChaining.md b/docs/setup_installation/admin/roleChaining.md index 7073004cd..ccf6bff69 100644 --- a/docs/setup_installation/admin/roleChaining.md +++ b/docs/setup_installation/admin/roleChaining.md @@ -121,6 +121,6 @@ Add mappings by clicking on *New role chaining*. Enter the project name. Select
Create Role Chaining
-Project member can now create connectors using *temporary credentials* to assume the role you configured. More detail about using temporary credentials can be found [here](../../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials). +Project member can now create connectors using *temporary credentials* to assume the role you configured. More detail about using temporary credentials can be found [here](../../user_guides/fs/data_source/creation/s3.md#temporary-credentials). Project member can see the list of role they can assume by going the _Project Settings_ -> [Assuming IAM Roles](../../../user_guides/projects/iam_role/iam_role_chaining) page. diff --git a/docs/setup_installation/on_prem/external_kafka_cluster.md b/docs/setup_installation/on_prem/external_kafka_cluster.md index d7bb893a1..746a1fef7 100644 --- a/docs/setup_installation/on_prem/external_kafka_cluster.md +++ b/docs/setup_installation/on_prem/external_kafka_cluster.md @@ -60,8 +60,8 @@ As mentioned above, when configuring Hopsworks to use an external Kafka cluster,

-#### Storage connector configuration +#### Data Source configuration -Users should create a [Kafka storage connector](../../user_guides/fs/storage_connector/creation/kafka.md) named `kafka_connector` which is going to be used by the feature store clients to configure the necessary Kafka producers to send data. +Users should create a [Kafka Data Source](../../user_guides/fs/data_source/creation/kafka.md) named `kafka_connector` which is going to be used by the feature store clients to configure the necessary Kafka producers to send data. The configuration is done for each project to ensure its members have the necessary authentication/authorization. -If the storage connector is not found in the project, default values referring to Hopsworks managed Kafka will be used. +If the data source is not found in the project, default values referring to Hopsworks managed Kafka will be used. diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index 26e44acef..529f36c29 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -25,10 +25,10 @@ Hopsworks is aiming to provide functional parity between the computational engin | Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | - | Functionality was deprecated in version 3.0 | | Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate)
[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. | | Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | -| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported | -| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | +| Reading from Streaming Data Sources | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported | +| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Reading training data that was written to external storage using a Data Source other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | | Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. | -| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. 
| +| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Data Source, from where you can read up the data into a Pandas/Polars Dataframe. | ## Python diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md index 1abce1c54..d39b827f1 100644 --- a/docs/user_guides/fs/feature_group/create.md +++ b/docs/user_guides/fs/feature_group/create.md @@ -85,9 +85,9 @@ By using partitioning the system will write the feature data in different subdir When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter. The currently support values are "HUDI", "DELTA", "NONE" (which defaults to Parquet). -##### Storage connector +##### Data Source -During the creation of a feature group, it is possible to define the `storage_connector` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster. Currently, only [S3](../storage_connector/creation/s3.md) connectors and "DELTA" `time_travel_format` format is supported. +During the creation of a feature group, it is possible to define the `storage_connector` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster. Currently, only [S3](../data_source/creation/s3.md) connectors and "DELTA" `time_travel_format` format is supported. ##### Online Table Configuration diff --git a/docs/user_guides/fs/feature_group/create_external.md b/docs/user_guides/fs/feature_group/create_external.md index 9c0bbceb1..283a73b34 100644 --- a/docs/user_guides/fs/feature_group/create_external.md +++ b/docs/user_guides/fs/feature_group/create_external.md @@ -14,9 +14,9 @@ Before you begin this guide we suggest you read the [External Feature Group](../ ## Create using the HSFS APIs -### Retrieve the storage connector +### Retrieve the Data Source -To create an external feature group using the HSFS APIs you need to provide an existing [storage connector](../storage_connector/index.md). +To create an external feature group using the HSFS APIs you need to provide an existing [data source](../data_source/index.md). === "Python" @@ -79,7 +79,7 @@ The full method documentation is available [here](https://docs.hopsworks.ai/hops The version number is optional, if you don't specify the version number the APIs will create a new version by default with a version number equals to the highest existing version number plus one. -If the storage connector is defined for a data warehouse (e.g. JDBC, Snowflake, Redshift) you need to provide a SQL statement that will be executed to compute the features. If the storage connector is defined for a data lake, the location of the data as well as the format need to be provided. +If the data source is defined for a data warehouse (e.g. JDBC, Snowflake, Redshift) you need to provide a SQL statement that will be executed to compute the features. If the data source is defined for a data lake, the location of the data as well as the format need to be provided. 
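As an illustration of the two cases, here is a minimal sketch using the HSFS Python API. The data source names, feature group names, SQL statement, path, and column names below are hypothetical placeholders, and parameter details may vary between Hopsworks versions.

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Case 1: data warehouse source (e.g. Snowflake) - provide a SQL statement.
snowflake_sc = fs.get_storage_connector("snowflake_sc")  # hypothetical data source name
profiles_fg = fs.create_external_feature_group(
    name="user_profiles",
    version=1,
    description="Profiles computed on-demand in the warehouse",
    storage_connector=snowflake_sc,
    query="SELECT user_id, age, country FROM PROFILES",
    primary_key=["user_id"],
)
profiles_fg.save()  # persists metadata only; the data stays in the external table

# Case 2: data lake source (e.g. S3) - provide the location and format of the data.
s3_sc = fs.get_storage_connector("s3_bucket")  # hypothetical data source name
clicks_fg = fs.create_external_feature_group(
    name="clickstream",
    version=1,
    storage_connector=s3_sc,
    path="datasets/clickstream/",  # path relative to the bucket root
    data_format="parquet",
    primary_key=["event_id"],
    event_time="event_ts",
)
clicks_fg.save()
```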
Additionally we specify which columns of the DataFrame will be used as primary key, and event time. Composite primary keys are also supported. @@ -125,7 +125,7 @@ The `insert()` method takes a DataFrame as parameter and writes it _only_ to the Hopsworks Feature Store does not support time-travel queries on external feature groups. Additionally, support for `.read()` and `.show()` methods when using by the Python engine is limited to external feature groups defined on BigQuery and Snowflake and only when using the [Feature Query Service](../../../setup_installation/common/arrow_flight_duckdb.md). -Nevertheless, external feature groups defined top of any storage connector can be used to create a training dataset from a Python environment invoking one of the following methods: [create_training_data](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_training_data), [create_train_test_split](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_train_test_split) or the [create_train_validation_test_split](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_train_validation_test_split) +Nevertheless, external feature groups defined top of any data source can be used to create a training dataset from a Python environment invoking one of the following methods: [create_training_data](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_training_data), [create_train_test_split](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_train_test_split) or the [create_train_validation_test_split](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#create_train_validation_test_split) ### API Reference @@ -134,18 +134,40 @@ Nevertheless, external feature groups defined top of any storage connector can b ## Create using the UI -You can also create a new feature group through the UI. For this, navigate to the `Feature Groups` section and press the `Create` button at the top-right corner. +You can also create a new feature group through the UI. For this, navigate to the `Data Source` section and select existing credentials or create new ones for your prefered data source.

- List of Feature Groups + Data Source UI

-Subsequently, you will be able to define its properties (such as name, mode, features, and more). Refer to the documentation above for an explanation of the parameters available, they are the same as when you create a feature group using the SDK. Finally, complete the creation by clicking `Create New Feature Group` at the bottom of the page. +If you have existing credentials, simply proceed by clicking `Next: Select Tables`. If not, create and save the credentials first.

- Create new Feature Group + setup credentials in Data Sources +
+

+ +The database navigation structure depends on your specific data source. You'll navigate through the appropriate hierarchy for your platform, such as Database → Schema → Table for Snowflake, or Project → Dataset → Table for BigQuery. In the UI you can select one or more tables; for each selected table, you must designate one or more columns as primary keys before proceeding. You can also optionally designate a single column as the event timestamp for the row (supported types are timestamp, date and bigint), and edit the names and data types of the individual columns you want to include. +

+

+ Select Table in Data Sources for External feature Group +
+

+ +

+

+ select details of external feature group +
+

+ +Complete the creation by clicking `Next: Review Configuration` at the bottom of the page. You will then be prompted with a final validation window where you can set a name for your external feature group. +

+

+ Validate the creation of a new external feature group

diff --git a/docs/user_guides/fs/index.md b/docs/user_guides/fs/index.md index 9e64e72c5..4d6ad1f0b 100644 --- a/docs/user_guides/fs/index.md +++ b/docs/user_guides/fs/index.md @@ -2,7 +2,7 @@ This section serves to provide guides and examples for the common usage of abstractions and functionality of the Feature Store through the Hopsworks UI and APIs. -- [Storage Connectors](storage_connector/index.md) +- [Data Sources](data_source/index.md) - [Feature Groups](feature_group/index.md) - [Feature Views](feature_view/index.md) - [Vector Similarity Search](vector_similarity_search.md) diff --git a/docs/user_guides/fs/provenance/provenance.md b/docs/user_guides/fs/provenance/provenance.md index 9b42e7e8a..4362d9247 100644 --- a/docs/user_guides/fs/provenance/provenance.md +++ b/docs/user_guides/fs/provenance/provenance.md @@ -4,7 +4,7 @@ Hopsworks allows users to track provenance (lineage) between: -- storage connectors +- data sources - feature groups - feature views - training datasets @@ -15,7 +15,7 @@ In the provenance pages we will call a provenance artifact or shortly artifact, With the following provenance graph: ``` -storage connector -> feature group -> feature group -> feature view -> training dataset -> model +data source -> feature group -> feature group -> feature view -> training dataset -> model ``` we will call the parent, the artifact to the left, and the child, the artifact to the right. So a feature view has a number of feature groups as parents and can have a number of training datasets as children. @@ -24,14 +24,14 @@ Tracking provenance allows users to determine where and if an artifact is being You can interact with the provenance graph using the UI or the APIs. -## Step 1: Storage connector lineage +## Step 1: Data Source lineage -The relationship between storage connectors and feature groups is captured automatically when you create an external feature group. You can inspect the relationship between storage connectors and feature groups using the APIs. +The relationship between data sources and feature groups is captured automatically when you create an external feature group. You can inspect the relationship between data sources and feature groups using the APIs. === "Python" ```python - # Retrieve the storage connector + # Retrieve the data source snowflake_sc = fs.get_storage_connector("snowflake_sc") # Create the user profiles feature group @@ -46,37 +46,37 @@ The relationship between storage connectors and feature groups is captured autom ### Using the APIs -Starting from a feature group metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the storage connectors that are part of the feature group. To do so, you can use the [get_storage_connector_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_storage_connector_provenance) method. +Starting from a feature group metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the data sources that are part of the feature group. To do so, you can use the [get_storage_connector_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_storage_connector_provenance) method. 
=== "Python" ```python - # Returns all storage connectors linked to the provided feature group + # Returns all data sources linked to the provided feature group lineage = user_profiles_fg.get_storage_connector_provenance() - # List all accessible parent storage connectors + # List all accessible parent data sources lineage.accessible - # List all deleted parent storage connectors + # List all deleted parent data sources lineage.deleted - # List all the inaccessible parent storage connectors + # List all the inaccessible parent data sources lineage.inaccessible ``` === "Python" ```python - # Returns an accessible storage connector linked to the feature group (if it exists) + # Returns an accessible data source linked to the feature group (if it exists) user_profiles_fg.get_storage_connector() ``` -To traverse the provenance graph in the opposite direction (i.e. from the storage connector to the feature group), you can use the [get_feature_groups_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#get_feature_groups_provenance) method. When navigating the provenance graph downstream, the `deleted` feature groups are not tracked by provenance, as such, the `deleted` property will always return an empty list. +To traverse the provenance graph in the opposite direction (i.e. from the data source to the feature group), you can use the [get_feature_groups_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#get_feature_groups_provenance) method. When navigating the provenance graph downstream, the `deleted` feature groups are not tracked by provenance, as such, the `deleted` property will always return an empty list. === "Python" ```python - # Returns all feature groups linked to the provided storage connector + # Returns all feature groups linked to the provided data source lineage = snowflake_sc.get_feature_groups_provenance() # List all accessible downstream feature groups @@ -89,7 +89,7 @@ To traverse the provenance graph in the opposite direction (i.e. from the storag === "Python" ```python - # Returns all accessible feature groups linked to the storage connector (if any exists) + # Returns all accessible feature groups linked to the data source (if any exists) snowflake_sc.get_feature_groups() ``` @@ -183,7 +183,7 @@ To traverse the provenance graph in the opposite direction (i.e. from the parent lineage.inaccessible ``` -You can also visualize the relationship between the parent and child feature groups in the UI. In each feature group overview page you can find a provenance section with the graph of parent storage connectors/feature groups and child feature groups/feature views. +You can also visualize the relationship between the parent and child feature groups in the UI. In each feature group overview page you can find a provenance section with the graph of parent data source/feature groups and child feature groups/feature views.

diff --git a/docs/user_guides/fs/storage_connector/creation/adls.md b/docs/user_guides/fs/storage_connector/creation/adls.md index 1d31f502e..e497a1811 100644 --- a/docs/user_guides/fs/storage_connector/creation/adls.md +++ b/docs/user_guides/fs/storage_connector/creation/adls.md @@ -1,14 +1,14 @@ -# How-To set up a ADLS Storage Connector +# How-To set up a ADLS Data Source ## Introduction -Azure Data Lake Storage (ADLS) Gen2 is a HDFS-compatible filesystem on Azure for data analytics. The ADLS Gen2 filesystem stores its data in Azure Blob storage, ensuring low-cost storage, high availability, and disaster recovery. In Hopsworks, you can access ADLS Gen2 by defining a Storage Connector and creating and granting permissions to a service principal. +Azure Data Lake Storage (ADLS) Gen2 is a HDFS-compatible filesystem on Azure for data analytics. The ADLS Gen2 filesystem stores its data in Azure Blob storage, ensuring low-cost storage, high availability, and disaster recovery. In Hopsworks, you can access ADLS Gen2 by defining a Data Source and creating and granting permissions to a service principal. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your Azure ADLS filesystem. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a connection to your Azure ADLS filesystem. When you're finished, you'll be able to read files using Spark through HSFS APIs. You can also use the connector to write out training data from the Feature Store, in order to make it accessible by third parties. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -19,16 +19,16 @@ Before you begin this guide you'll need to retrieve the following information fr - **Service Principal Registration:** Register the service principal, granting it a role assignment such as Storage Blob Data Contributor, on the Azure Data Lake Storage Gen2 account. !!! info - When you specify the 'container name' in the ADLS storage connector, you need to have previously created that container - the Hopsworks Feature Store will not create that storage container for you. + When you specify the 'container name' in the ADLS data source, you need to have previously created that container - the Hopsworks Feature Store will not create that storage container for you. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter ADLS Information @@ -36,10 +36,18 @@ Head to the Storage Connector View on Hopsworks (1) and set up a new storage con Enter the details for your ADLS connector. Start by giving it a **name** and an optional **description**.
- ![ADLS Connector Creation](../../../../assets/images/guides/fs/storage_connector/adls_creation.png) + ![ADLS Connector Creation](../../../../assets/images/guides/fs/data_source/adls_creation.png)
ADLS Connector Creation Form
+1. Select "Azure Data Lake" as the storage. +2. Set directory ID. +3. Enter the Application ID. +4. Paste the Service Credentials. +5. Specify account name. +6. Provide the container name. +7. Click on "Save Credentials". + ### Step 3: Azure Create an ADLS Resource When programmatically signing in, you need to pass the tenant ID with your authentication request and the application ID. You also need a certificate or an authentication key (described in the following section). To get those values, use the following steps: @@ -48,20 +56,20 @@ When programmatically signing in, you need to pass the tenant ID with your authe 2. From App registrations in Azure AD, select your application. 3. Copy the Directory (tenant) ID and store it in your application code.
- ![ADLS select tenant-id](../../../../assets/images/guides/fs/storage_connector/adls-copy-tenant-id.png) -
You need to copy the Directory (tenant) id and paste it to the Hopsworks ADLS storage connector "Directory id" text field.
+ ![ADLS select tenant-id](../../../../assets/images/guides/fs/data_source/adls-copy-tenant-id.png) +
You need to copy the Directory (tenant) id and paste it to the Hopsworks ADLS Data Source "Directory id" text field.
4. Copy the Application ID and store it in your application code.
- ![ADLS select app-id](../../../../assets/images/guides/fs/storage_connector/adls-copy-app-id.png) -
>You need to copy the Application id and paste it to the Hopsworks ADLS storage connector "Application id" text field.
+ ![ADLS select app-id](../../../../assets/images/guides/fs/data_source/adls-copy-app-id.png) +
>You need to copy the Application id and paste it to the Hopsworks ADLS Data Source "Application id" text field.
5. Create an Application Secret and copy it into the Service Credential field.
- ![ADLS enter application secret](../../../../assets/images/guides/fs/storage_connector/adls-copy-secret.png) -
You need to copy the Application Secret and paste it to the Hopsworks ADLS storage connector "Service Credential" text field.
+ ![ADLS enter application secret](../../../../assets/images/guides/fs/data_source/adls-copy-secret.png) +
You need to copy the Application Secret and paste it to the Hopsworks ADLS Data Source "Service Credential" text field.
#### Common Problems @@ -76,4 +84,4 @@ If you get an error "StatusCode=404 StatusDescription=The specified filesystem d ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created ADLS connector. \ No newline at end of file +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created ADLS connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/bigquery.md b/docs/user_guides/fs/storage_connector/creation/bigquery.md index 0ed26e353..1a18c557f 100644 --- a/docs/user_guides/fs/storage_connector/creation/bigquery.md +++ b/docs/user_guides/fs/storage_connector/creation/bigquery.md @@ -1,23 +1,23 @@ -# How-To set up a BigQuery Storage Connector +# How-To set up a BigQuery Data Source ## Introduction -A BigQuery storage connector provides integration to Google Cloud BigQuery. +A BigQuery data source provides integration to Google Cloud BigQuery. BigQuery is Google Cloud's managed data warehouse supporting that lets you run analytics and execute SQL queries over large scale data. Such data warehouses are often the source of raw data for feature engineering pipelines. -In this guide, you will configure a Storage Connector in Hopsworks to connect to your BigQuery project by saving the +In this guide, you will configure a Data Source in Hopsworks to connect to your BigQuery project by saving the necessary information. When you're finished, you'll be able to execute queries and read results of BigQuery using Spark through HSFS APIs. -The storage connector uses the Google `spark-bigquery-connector` behind the scenes. +The data source uses the Google `spark-bigquery-connector` behind the scenes. To read more about the spark connector, like the spark options or usage, check [Apache Spark SQL connector for Google BigQuery.](https://github.com/GoogleCloudDataproc/spark-bigquery-connector#usage 'github.com/GoogleCloudDataproc/spark-bigquery-connector') !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -35,44 +35,46 @@ Before you begin this guide you'll need to retrieve the following information ab To read data, the BigQuery service account user needs permission to `create read sesssion` which is available in **BigQuery Admin role**. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
-### Step 2: Enter connector details -Enter the details for your BigQuery connector. Start by giving +### Step 2: Enter source details +Enter the details for your BigQuery storage. Start by giving it a unique **name** and an optional **description**.
- ![BigQuery Connector Creation](../../../../assets/images/guides/fs/storage_connector/bigquery_creation.png) -
BigQuery Connector Creation Form
+ ![BigQuery Creation](../../../../assets/images/guides/fs/data_source/bigquery_creation.png) +
BigQuery Creation Form
-1. Choose `Google BigQuery` from the connector options. +1. Select "Google BigQuery" as the storage. 2. Next, set the name of the parent BigQuery project. This is used for billing by GCP. 3. Authentication: Here you should upload your `JSON keyfile for service account` used for authentication. You can choose to either upload from your local using `Upload new file` or choose an existing file within project using `From Project`. -4. Read Options: There are two ways to read via BigQuery, using the **BigQuery Table** or **BigQuery Query** option: - - 1. **BigQuery Table** - This option reads directly from BigQuery table reference. Note that it can only be used in `read` API for reading data from BigQuery. Creating external Feature Groups using this option is not yet supported. In the UI set the below fields, - 1. *BigQuery Project*: The BigQuery project - 2. *BigQuery Dataset*: The dataset of the table - 3. *BigQuery Table*: The table to read - 2. **BigQuery Query** - This option executes a SQL query at runtime. It can be used for both reading data and **creating external Feature Groups**. - 1. *Materialization Dataset*: Temporary dataset used by BigQuery for writing. It must be set to a dataset where the GCP user has table creation permission. The queried table must be in the same location as the `materializationDataset` (e.g 'EU' or 'US'). Also, if a table in the `SQL statement` is from project other than the `parentProject` then use the fully qualified table name i.e. `[project].[dataset].[table]` +4. Read Options: + In the UI set the below fields, + 1. *BigQuery Project*: The BigQuery project to read + 2. *BigQuery Dataset*: The dataset of the table (Optional) + 3. *BigQuery Table*: The table to read (Optional) + + +!!! note + *Materialization Dataset*: Temporary dataset used by BigQuery for writing. It must be set to a dataset where the GCP user has table creation permission. The queried table must be in the same location as the `materializationDataset` (e.g 'EU' or 'US'). Also, if a table in the `SQL statement` is from project other than the `parentProject` then use the fully qualified table name i.e. `[project].[dataset].[table]` (Read more details from Google documentation on usage of query for BigQuery spark connector [here](https://github.com/GoogleCloudDataproc/spark-bigquery-connector#reading-data-from-a-bigquery-query)). 5. Spark Options: Optionally, you can set additional spark options using the `Key - Value` pairs. +6. Click on "Save Credentials". ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created BigQuery +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created BigQuery connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/gcs.md b/docs/user_guides/fs/storage_connector/creation/gcs.md index 0e0acf4ab..c4b343ba0 100644 --- a/docs/user_guides/fs/storage_connector/creation/gcs.md +++ b/docs/user_guides/fs/storage_connector/creation/gcs.md @@ -1,23 +1,23 @@ -# How-To set up a GCS Storage Connector +# How-To set up a GCS Data Source ## Introduction -This particular type of storage connector provides integration to Google Cloud Storage (GCS). GCS is +This particular type of Data Source provides integration to Google Cloud Storage (GCS). GCS is an object storage service offered by Google Cloud. An object could be simply any piece of immutable data consisting of a file of any format, for example a `CSV` or `PARQUET`. 
These objects are stored in containers called as `buckets`. These types of storages are often the source for raw data from which features can be engineered. -In this guide, you will configure a Storage Connector in Hopsworks to connect to your GCS bucket by saving the +In this guide, you will configure a Data Source in Hopsworks to connect to your GCS bucket by saving the necessary information. When you're finished, you'll be able to read files from the GCS bucket using Spark through HSFS APIs. -The storage connector uses the Google `gcs-connector-hadoop` behind the scenes. For more information, check out [Google Cloud Storage Connector for Spark and Hadoop]( +The Data Source uses the Google `gcs-connector-hadoop` behind the scenes. For more information, check out [Google Cloud Data Source for Spark and Hadoop]( https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs#google-cloud-storage-connector-for-spark-and-hadoop 'google-cloud-storage-connector-for-spark-and-hadoop') !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -36,13 +36,13 @@ Before you begin this guide you'll need to retrieve the following information ab Read more about encryption on [Google Documentation.](https://cloud.google.com/storage/docs/encryption/customer-supplied-keys) ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter connector details @@ -52,19 +52,20 @@ it a unique **name** and an optional **description**.
- ![GCS Connector Creation](../../../../assets/images/guides/fs/storage_connector/gcs_creation.png) + ![GCS Connector Creation](../../../../assets/images/guides/fs/data_source/gcs_creation.png)
GCS Connector Creation Form
-1. Choose `Google Cloud Storage` from the connector options. +1. Select "Google Cloud Storage" as the storage. 2. Next, set the name of the GCS Bucket you wish to connect with. 3. Authentication: Here you should upload your `JSON keyfile for service account` used for authentication. You can choose to either upload from your local using `Upload new file` or choose an existing file within project using `From Project`. 4. GCS Server Side Encryption: You can leave this to `Default Encryption` if you do not wish to provide explicit encrypting keys. Otherwise, optionally you can set the encryption setting for `AES-256` and provide the encryption key and hash when selected. +5. Click on "Save Credentials". ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created GCS +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created GCS connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/hopsfs.md b/docs/user_guides/fs/storage_connector/creation/hopsfs.md index 5fb579c44..d15c35d53 100644 --- a/docs/user_guides/fs/storage_connector/creation/hopsfs.md +++ b/docs/user_guides/fs/storage_connector/creation/hopsfs.md @@ -1,42 +1,42 @@ -# How-To set up a HopsFS Storage Connector +# How-To set up a HopsFS Data Source ## Introduction -HopsFS is a HDFS-compatible filesystem on AWS/Azure/on-premises for data analytics. HopsFS stores its data on object storage in the cloud (S3 in AWs and Blob storage on Azure) and on commodity servers on-premises, ensuring low-cost storage, high availability, and disaster recovery. In Hopsworks, you can access HopsFS natively in programs (Spark, TensorFlow, etc) without the need to define a Storage Connector. By default, every Project has a Storage Connector for Training Datasets. When you create training datasets from features in the Feature Store the HopsFS connector is the default Storage Connector. However, if you want to output data to a different dataset, you can define a new Storage Connector for that dataset. +HopsFS is a HDFS-compatible filesystem on AWS/Azure/on-premises for data analytics. HopsFS stores its data on object storage in the cloud (S3 in AWs and Blob storage on Azure) and on commodity servers on-premises, ensuring low-cost storage, high availability, and disaster recovery. In Hopsworks, you can access HopsFS natively in programs (Spark, TensorFlow, etc) without the need to define a Data Source. By default, every Project has a Data Source for Training Datasets. When you create training datasets from features in the Feature Store the HopsFS connector is the default Data Source. However, if you want to output data to a different dataset, you can define a new Data Source for that dataset. -In this guide, you will configure a HopsFS Storage Connector in Hopsworks which points at a different directory on the file system than the Training Datasets directory. +In this guide, you will configure a HopsFS Data Source in Hopsworks which points at a different directory on the file system than the Training Datasets directory. When you're finished, you'll be able to write training data to different locations in your cluster through HSFS APIs. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. 
## Prerequisites -Before you begin this guide you'll need to identify a **directory on the filesystem** of Hopsworks, to which you want to point the Storage Connector that you are going to create. +Before you begin this guide you'll need to identify a **directory on the filesystem** of Hopsworks, to which you want to point the Data Source that you are going to create. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter HopsFS Settings Enter the details for your HopsFS connector. Start by giving it a **name** and an optional **description**. -1. Select "HopsFS" as connector protocol. -2. Select the top-level directory to point the connector to. -3. Click "Setup storage connector". +1. Select "HopsFS" as the storage. +2. Select the top-level dataset to point the connector to. +3. Click on "Save Credentials".
- ![HopsFS Connector Creation](../../../../assets/images/guides/fs/storage_connector/hopsfs_creation.png) + ![HopsFS Connector Creation](../../../../assets/images/guides/fs/data_source/hopsfs_creation.png)
HopsFS Connector Creation Form
## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created HopsFS connector. \ No newline at end of file +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created HopsFS connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/jdbc.md b/docs/user_guides/fs/storage_connector/creation/jdbc.md index 1d538a315..971bfd549 100644 --- a/docs/user_guides/fs/storage_connector/creation/jdbc.md +++ b/docs/user_guides/fs/storage_connector/creation/jdbc.md @@ -1,14 +1,14 @@ -# How-To set up a JDBC Storage Connector +# How-To set up a JDBC Data Source ## Introduction JDBC is an API provided by many database systems. Using JDBC connections one can query and update data in a database, usually oriented towards relational databases. Examples of databases you can connect to using JDBC are MySQL, Postgres, Oracle, DB2, MongoDB or Microsoft SQLServer. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a JDBC connection to your database of choice. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a JDBC connection to your database of choice. When you're finished, you'll be able to query the database using Spark through HSFS APIs. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -23,13 +23,13 @@ jdbc:mysql://10.0.2.15:3306/[databaseName]?useSSL=false&allowPublicKeyRetrieval= - **Username and Password:** Typically, you will need to add username and password in your JDBC URL or as key/value parameters. So make sure you have retrieved a username and password with the suitable permissions for the database and table you want to query. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter JDBC Settings @@ -37,11 +37,11 @@ Head to the Storage Connector View on Hopsworks (1) and set up a new storage con Enter the details for your JDBC enabled database.
- ![JDBC Connector Creation](../../../../assets/images/guides/fs/storage_connector/jdbc_creation.png) + ![JDBC Connector Creation](../../../../assets/images/guides/fs/data_source/jdbc_creation.png)
JDBC Connector Creation Form
-1. Select "JDBC" as connector protocol. +1. Select "JDBC" as the storage. 2. Enter the JDBC connection url. This can for example also contain the username and password. 3. Add additional key/value arguments to be passed to the connection, such as username or password. These might differ by database. @@ -50,7 +50,7 @@ Enter the details for your JDBC enabled database. Driver class name is a mandatory argument even if using the default MySQL driver. Add it by specifying a property with the name `driver` and class name as value. The driver class name will differ based on the database. For MySQL databases, the class name is `com.mysql.cj.jdbc.Driver`, as shown in the example image. -4. Click "Setup storage connector". +4. Click on "Save Credentials". !!! note @@ -59,4 +59,4 @@ Enter the details for your JDBC enabled database. ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created JDBC connector. \ No newline at end of file +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created JDBC connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/kafka.md b/docs/user_guides/fs/storage_connector/creation/kafka.md index cadc39658..8dfe8970b 100644 --- a/docs/user_guides/fs/storage_connector/creation/kafka.md +++ b/docs/user_guides/fs/storage_connector/creation/kafka.md @@ -1,14 +1,14 @@ -# How-To set up a Kafka Storage Connector +# How-To set up a Kafka Data Source ## Introduction Apache Kafka is a distributed event store and stream-processing platform. It's a very popular framework for handling realtime data streams and is often used as a message broker for events coming from production systems until they are being processed and either loaded into a data warehouse or aggregated into features for Machine Learning. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your Kafka cluster. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a connection to your Kafka cluster. When you're finished, you'll be able to read from Kafka topics in your cluster using Spark through HSFS APIs. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -16,23 +16,23 @@ Before you begin this guide you'll need to retrieve the following information fr - **Kafka Bootstrap servers:** It is the url of one of the Kafka brokers which you give to fetch the initial metadata about your Kafka cluster. The metadata consists of the topics, their partitions, the leader brokers for those partitions etc. Depending upon this metadata your producer or consumer produces or consumes the data. - **Security Protocol:** The security protocol you want to use to authenticate with your Kafka cluster. Make sure the chosen protocol is supported by your cluster. For an overview of the available protocols, please see the [Confluent Kafka Documentation](https://docs.confluent.io/platform/current/kafka/overview-authentication-methods.html). -- **Certificates:** Depending on the chosen security protocol, you might need TrustStore and KeyStore files along with the corresponding key password. 
Contact your Kafka administrator, if you don't know how to retrieve these. If you want to setup a storage connector to Hopsworks' internal Kafka cluster, you can download the needed certificates from the integration tab in your project settings. +- **Certificates:** Depending on the chosen security protocol, you might need TrustStore and KeyStore files along with the corresponding key password. Contact your Kafka administrator, if you don't know how to retrieve these. If you want to setup a data source to Hopsworks' internal Kafka cluster, you can download the needed certificates from the integration tab in your project settings. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter Kafka Settings Enter the details for your Kafka connector. Start by giving it a **name** and an optional **description**. -1. Select "Kafka" as connector protocol. +1. Select "Kafka" as the storage. 2. Add all the bootstrap server addresses and ports that you want the consumers/producers to connect to. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. 3. Choose the Security protocol. @@ -58,13 +58,13 @@ Enter the details for your Kafka connector. Start by giving it a **name** and an 4. The endpoint identification algorithm used by clients to validate server host name. The default value is `https`. Clients including client connections created by the broker for inter-broker communication verify that the broker host name matches the host name in the broker’s certificate. 5. Optional additional key/value arguments. -6. Click "Setup storage connector". +6. Click on "Save Credentials".
- ![Kafka Connector Creation](../../../../assets/images/guides/fs/storage_connector/kafka_creation.png) + ![Kafka Connector Creation](../../../../assets/images/guides/fs/data_source/kafka_creation.png)
Kafka Connector Creation Form
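Once the connector is saved, reading a topic as a Spark Structured Streaming dataframe might look like the sketch below. The connector name `my_kafka_ds` and the topic `machine_sensors` are placeholders, and the snippet assumes the streaming `read_stream` call referenced in the [usage guide](../usage.md); depending on your topic you may also need to pass message format and schema options.

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
project = hopsworks.login()
feature_store = project.get_feature_store()

# Retrieve the Kafka data source created above
kafka_connector = feature_store.get_storage_connector("my_kafka_ds")

# Read a topic into a Spark Structured Streaming dataframe
# (additional message format/schema options may be required for your topic)
stream_df = kafka_connector.read_stream(topic="machine_sensors")

# While prototyping, write the stream to the console
query = stream_df.writeStream.format("console").start()
```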
## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created Kafka connector. \ No newline at end of file +Move on to the [usage guide for Data Sources](../usage.md) to see how you can use your newly created Kafka connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/rds.md b/docs/user_guides/fs/storage_connector/creation/rds.md new file mode 100644 index 000000000..ec530ba5b --- /dev/null +++ b/docs/user_guides/fs/storage_connector/creation/rds.md @@ -0,0 +1,64 @@ +# How-To set up an Amazon RDS JDBC Data Source + +## Introduction + +Amazon RDS (Relational Database Service) is a managed relational database service that supports several popular database engines, such as MySQL, PostgreSQL, Oracle, and Microsoft SQL Server. Using JDBC connections, you can query and update data in your RDS database from Hopsworks. + +In this guide, you will configure a Data Source in Hopsworks to securely store the authentication information needed to set up a JDBC connection to your Amazon RDS instance. +Once configured, you will be able to query your RDS database. + +!!! note + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. + +## Prerequisites + +Before you begin, ensure you have the following information from your Amazon RDS instance: + +- **Host:** You can find the endpoint for your RDS instance in the AWS Console. + + 1. Go to the AWS Console → `Aurora and RDS` + 2. Click on your DB instance. + 3. Under `Connectivity & security`, you'll find the endpoint + +Example: + +``` +mydb.abcdefg1234.us-west-2.rds.amazonaws.com +``` + +- **Database:** You can specify which database to use + +- **Port:** Provide the port to connect to + +- **Username and Password:** Obtain the username and password for your RDS database with the necessary permissions to access the required tables. + +## Creation in the UI + +### Step 1: Set up a new Data Source + +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2). + +
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
+
+ +### Step 2: Enter RDS Settings + +Enter the details for your Amazon RDS database. + +
+ ![RDS Connector Creation](../../../../assets/images/guides/fs/data_source/rds_creation.png) +
RDS Connector Creation Form
+
+ +1. Select "RDS" as the storage. +2. Paste the Host details. +3. Enter the database name. +4. Specify which port to use. +5. Provide the username and password. +6. Click on "Save Credentials". + +## Next Steps + +Proceed to the [usage guide for data sources](../usage.md) to learn how to use your newly created connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/creation/redshift.md b/docs/user_guides/fs/storage_connector/creation/redshift.md index 8939747b1..6fe67aeb1 100644 --- a/docs/user_guides/fs/storage_connector/creation/redshift.md +++ b/docs/user_guides/fs/storage_connector/creation/redshift.md @@ -1,4 +1,4 @@ -# How-To set up a Redshift Storage Connector +# How-To set up a Redshift Data Source ## Introduction @@ -6,11 +6,11 @@ Amazon Redshift is a popular managed data warehouse on AWS, used as a data wareh Data warehouses are often the source of raw data for feature engineering pipelines and Redshift supports scalable feature computation with SQL. However, Redshift is not viable as an online feature store that serves features to models in production, with its columnar database layout its latency is too high compared to OLTP databases or key-value stores. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your AWS Redshift cluster. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a connection to your AWS Redshift cluster. When you're finished, you'll be able to query the database using Spark through HSFS APIs. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -26,20 +26,20 @@ Read more about IAM roles in our [AWS credentials pass-through guide](../../../. option `Instance Role` will use the default ARN Role configured for the cluster instance. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter The Connector Information Enter the details for your Redshift connector. Start by giving it a **name** and an optional **description**. -1. Select "Redshift" as connector protocol. +1. Select "Redshift" as the storage. 2. The name of the cluster. 3. The database endpoint. Should be in the format `[UUID].eu-west-1.redshift.amazonaws.com`. For example, if the endpoint info displayed in Redshift is `cluster-id.uuid.eu-north-1.redshift.amazonaws.com:5439/dev` the value to enter @@ -51,18 +51,19 @@ Enter the details for your Redshift connector. Start by giving it a **name** and included in Hopsworks or set a different driver (More on this later). 8. Optionally provide the database group and table for the connector. A database group is the group created for the user if applicable. More information, at [redshift documentation](https://docs.aws.amazon.com/redshift/latest/dg/r_Groups.html) -9. Set the appropriate authentication method. +9. Set the appropriate authentication method. +10. Click on "Save Credentials".
- ![Redshift Connector Creation](../../../../assets/images/guides/fs/storage_connector/redshift_creation.png) + ![Redshift Connector Creation](../../../../assets/images/guides/fs/data_source/redshift_creation.png)
Redshift Connector Creation Form
!!! warning "Session Duration" By default, the session duration that the role will be assumed for is 1 hour or 3600 seconds. - This means if you want to use the storage connector for example to [read or create an external Feature Group from Redshift](../usage.md##creating-an-external-feature-group), the operation cannot take longer than one hour. + This means if you want to use the data source for example to [read or create an external Feature Group from Redshift](../usage.md##creating-an-external-feature-group), the operation cannot take longer than one hour. - Your administrator can change the default session duration for AWS storage connectors, by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming. And then changing the `fs_storage_connector_session_duration` [configuration property](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds. + Your administrator can change the default session duration for AWS data sources, by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming. And then changing the `fs_data_source_session_duration` [configuration property](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds. ### Step 3: Upload the Redshift database driver (optional) @@ -79,7 +80,7 @@ You can now add the driver file to the default job and Jupyter configuration. Th 4. Under "Additional Jars" choose "Upload new file" to upload the driver jar file.
- ![Redshift Driver Job and Jupyter Configuration](../../../../assets/images/guides/fs/storage_connector/jupyter_config.png) + ![Redshift Driver Job and Jupyter Configuration](../../../../assets/images/guides/fs/data_source/jupyter_config.png)
Attaching the Redshift Driver to all Jobs and Jupyter Instances of the Project
@@ -91,7 +92,7 @@ file, you can select it using the "From Project" option. To upload the jar file 3. Upload the jar file
- ![Redshift Driver Upload](../../../../assets/images/guides/fs/storage_connector/driver_upload.png) + ![Redshift Driver Upload](../../../../assets/images/guides/fs/data_source/driver_upload.png)
Redshift Driver Upload in the File Browser
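With the connector saved, and the driver attached if your setup requires it, you can query Redshift through the HSFS API. The sketch below is illustrative only: the connector name `my_redshift_ds` and the table and column names are placeholders, and the query-based `read` call follows the pattern described in the [usage guide](../usage.md).

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
project = hopsworks.login()
feature_store = project.get_feature_store()

# Retrieve the Redshift data source by its name
redshift_connector = feature_store.get_storage_connector("my_redshift_ds")

# Run a SQL query on Redshift and load the result into a Spark dataframe
df = redshift_connector.read(
    query="SELECT store_id, SUM(amount) AS total_sales FROM sales GROUP BY store_id"
)
```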
@@ -106,4 +107,4 @@ file, you can select it using the "From Project" option. To upload the jar file ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created Redshift connector. +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created Redshift connector. diff --git a/docs/user_guides/fs/storage_connector/creation/s3.md b/docs/user_guides/fs/storage_connector/creation/s3.md index b1b5af332..d7711b91c 100644 --- a/docs/user_guides/fs/storage_connector/creation/s3.md +++ b/docs/user_guides/fs/storage_connector/creation/s3.md @@ -1,4 +1,4 @@ -# How-To set up a S3 Storage Connector +# How-To set up a S3 Data Source ## Introduction @@ -6,11 +6,11 @@ Amazon S3 or Amazon Simple Storage Service is a service offered by AWS that prov There are so called Data Lake House technologies such as Delta Lake or Apache Hudi, building an additional layer on top of object based storage with files, to provide database semantics like ACID transactions among others. This has the advantage that cheap storage can be turned into a cloud native data warehouse. These kind of storages are often the source for raw data from which features can be engineered. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your AWS S3 bucket. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a connection to your AWS S3 bucket. When you're finished, you'll be able to read files using Spark through HSFS APIs. You can also use the connector to write out training data from the Feature Store, in order to make it accessible by third parties. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -18,18 +18,18 @@ Before you begin this guide you'll need to retrieve the following information fr - **Bucket:** You will need a S3 bucket that you have access to. The bucket is identified by its name. - **Path (Optional):** If needed, a path can be defined to ensure that all operations are restricted to a specific location within the bucket. -- **Region (Optional):** You will need an S3 region to have complete control over data when managing the feature group that relies on this storage connector. The region is identified by its code. +- **Region (Optional):** You will need an S3 region to have complete control over data when managing the feature group that relies on this data source. The region is identified by its code. - **Authentication Method:** You can authenticate using Access Key/Secret, or use IAM roles. If you want to use an IAM role it either needs to be attached to the entire Hopsworks cluster or Hopsworks needs to be able to assume the role. See [IAM role documentation](../../../../setup_installation/admin/roleChaining.md) for more information. - **Server Side Encryption details:** If your bucket has server side encryption (SSE) enabled, make sure you know which algorithm it is using (AES256 or SSE-KMS). If you are using SSE-KMS, you need the resource ARN of the managed key. 
## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter Bucket Information @@ -39,7 +39,7 @@ And set the name of the S3 Bucket you want to point the connector to. Optionally, specify the region if you wish to have a Hopsworks-managed feature group stored using this connector.
- ![S3 Connector Creation](../../../../assets/images/guides/fs/storage_connector/s3_creation.png) + ![S3 Connector Creation](../../../../assets/images/guides/fs/data_source/s3_creation.png)
S3 Connector Creation Form
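After completing the remaining steps below and saving the credentials, reading from the bucket might look like the following sketch. The connector name `my_s3_ds`, the path and the file format are placeholders, and the `read` call follows the `path`/`data_format` pattern described in the [usage guide](../usage.md).

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
project = hopsworks.login()
feature_store = project.get_feature_store()

# Retrieve the S3 data source by its name
s3_connector = feature_store.get_storage_connector("my_s3_ds")

# Read a Parquet dataset from the configured bucket into a Spark dataframe
# (the path is a placeholder inside the bucket set in the creation form)
df = s3_connector.read(data_format="parquet", path="sales/2024/")
```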
@@ -53,9 +53,9 @@ Choose temporary credentials if you are using [AWS Role chaining](../../../../se !!! warning "Session Duration" By default, the session duration that the role will be assumed for is 1 hour or 3600 seconds. - This means if you want to use the storage connector for example to write [training data to S3](../usage.md#writing-training-data), the training dataset creation cannot take longer than one hour. + This means if you want to use the data source for example to write [training data to S3](../usage.md#writing-training-data), the training dataset creation cannot take longer than one hour. - Your administrator can change the default session duration for AWS storage connectors, by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming. And then changing the `fs_storage_connector_session_duration` [configuration variable](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds. + Your administrator can change the default session duration for AWS data sources, by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming. And then changing the `fs_data_source_session_duration` [configuration variable](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds. #### Access Key/Secret The most simple authentication method are Access Key/Secret, choose this option to get started quickly, if you are able to retrieve the keys using the IAM user administration. @@ -74,12 +74,15 @@ If you have SSE-KMS enabled for your bucket, you can find the key ARN in the "Pr ### Step 5: Add Spark Options (optional) Here you can specify any additional spark options that you wish to add to the spark context at runtime. Multiple options can be added as key - value pairs. -To connect to a S3 compatiable storage other than AWS S3, you can add the option with key as `fs.s3a.endpoint` and the endpoint you want to use as value. The storage connector will then be able to read from your specified S3 compatible storage. +To connect to an S3 compatible storage other than AWS S3, you can add the option with key as `fs.s3a.endpoint` and the endpoint you want to use as value. The data source will then be able to read from your specified S3 compatible storage. !!! warning "Spark Configuration" - When using the storage connector within a Spark application, the credentials are set at application level. This allows users to access multiple buckets with the same storage connector within the same application (assuming the credentials allow it). - You can disable this behaviour by setting the option `fs.s3a.global-conf` to `False`. If the `global-conf` option is disabled, the credentials are set on a per-bucket basis and users will be able to use the credentials to access data only from the bucket specified in the storage connector configuration. + When using the data source within a Spark application, the credentials are set at application level. This allows users to access multiple buckets with the same data source within the same application (assuming the credentials allow it). + You can disable this behaviour by setting the option `fs.s3a.global-conf` to `False`. 
If the `global-conf` option is disabled, the credentials are set on a per-bucket basis and users will be able to use the credentials to access data only from the bucket specified in the data source configuration. + +### Step 6: Save changes +Click on "Save Credentials". ## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created S3 connector. +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created S3 connector. diff --git a/docs/user_guides/fs/storage_connector/creation/snowflake.md b/docs/user_guides/fs/storage_connector/creation/snowflake.md index 50c0ba8ca..49ad242b2 100644 --- a/docs/user_guides/fs/storage_connector/creation/snowflake.md +++ b/docs/user_guides/fs/storage_connector/creation/snowflake.md @@ -1,4 +1,4 @@ -# How-To set up a Snowflake Storage Connector +# How-To set up a Snowflake Data Source ## Introduction @@ -6,11 +6,11 @@ Snowflake provides a cloud-based data storage and analytics service, used as a d Data warehouses are often the source of raw data for feature engineering pipelines and Snowflake supports scalable feature computation with SQL. However, Snowflake is not viable as an online feature store that serves features to models in production, with its columnar database layout its latency is too high compared to OLTP databases or key-value stores. -In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your Snowflake database. +In this guide, you will configure a Data Source in Hopsworks to save all the authentication information needed in order to set up a connection to your Snowflake database. When you're finished, you'll be able to query the database using Spark through HSFS APIs. !!! note - Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically. + Currently, it is only possible to create data sources in the Hopsworks UI. You cannot create a data source programmatically. ## Prerequisites @@ -30,12 +30,12 @@ SQL, as explained in [Snowflake documentation](https://docs.snowflake.com/en/user-guide/organizations-gs.html#viewing-the-name-of-your-organization-and-its-accounts). Below is an example of how to view the account and organization to get the account identifier from the Snowsight UI.
- ![Viewing Snowflake account identifier](../../../../assets/images/guides/fs/storage_connector/snowflake_account_url.png) + ![Viewing Snowflake account identifier](../../../../assets/images/guides/fs/data_source/snowflake_account_url.png)
Viewing Snowflake account identifier
!!! warning "Token-based authentication or password based" - The Snowflake storage connector supports both username and password authentication as well as token-based authentication. + The Snowflake data source supports both username and password authentication as well as token-based authentication. Currently token-based authentication is in beta phase. Users are advised to use username/password and/or create a service account for accessing Snowflake from Hopsworks. @@ -50,35 +50,35 @@ These are a few additional **optional** arguments: - **Application:** The application field can also be specified to have better observability in Snowflake with regards to which application is running which query. The application field can be a simple string like “Hopsworks” or, for instance, the project name, to track usage and queries from each Hopsworks project. ## Creation in the UI -### Step 1: Set up new storage connector +### Step 1: Set up new Data Source -Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2). +Head to the Data Source View on Hopsworks (1) and set up a new Data Source (2).
- ![Storage Connector Creation](../../../../assets/images/guides/fs/storage_connector/storage_connector_create.png) -
The Storage Connector View in the User Interface
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
### Step 2: Enter Snowflake Settings Enter the details for your Snowflake connector. Start by giving it a **name** and an optional **description**. -1. Select "Snowflake" as connector protocol. +1. Select "Snowflake" as storage. 2. Specify the hostname for your account in the following format `.snowflakecomputing.com` or `https://-.snowflakecomputing.com`. 3. Login name for the Snowflake user. 4. Password for the Snowflake user or Token. -5. The database to connect to. -6. The schema to use for the connection to the database. -7. Additional optional arguments. For example, you can point the connector to a specific table in the database only. +5. The warehouse to connect to. +6. The database to use for the connection. +7. Add any additional optional arguments. For example, you can specify `Application`. 8. Optional additional key/value arguments. -9. Click "Setup storage connector". +9. Click on "Save Credentials".
- ![Snowflake Connector Creation](../../../../assets/images/guides/fs/storage_connector/snowflake_creation.png) + ![Snowflake Connector Creation](../../../../assets/images/guides/fs/data_source/snowflake_creation.png)
Snowflake Connector Creation Form
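Once the connector is saved, querying Snowflake through the HSFS API might look like the sketch below. The connector name `my_snowflake_ds` and the table name are placeholders; the query-based `read` call follows the pattern described in the [usage guide](../usage.md).

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
project = hopsworks.login()
feature_store = project.get_feature_store()

# Retrieve the Snowflake data source by its name
sf_connector = feature_store.get_storage_connector("my_snowflake_ds")

# Query Snowflake and load the result into a Spark dataframe
df = sf_connector.read(query="SELECT * FROM TELCO.CUSTOMER_CHURN LIMIT 100")
```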
## Next Steps -Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created Snowflake connector. \ No newline at end of file +Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created Snowflake connector. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/index.md b/docs/user_guides/fs/storage_connector/index.md index cc23cbbb7..5ab2e83da 100644 --- a/docs/user_guides/fs/storage_connector/index.md +++ b/docs/user_guides/fs/storage_connector/index.md @@ -1,21 +1,21 @@ -# Storage Connector Guides +# Data Source Guides -You can define storage connectors in Hopsworks for batch and streaming data sources. Storage connectors securely store the authentication information about how to connect to an external data store. They can be used from programs within Hopsworks or externally. +You can define data sources in Hopsworks for batch and streaming data sources. Data Sources securely store the authentication information about how to connect to an external data store. They can be used from programs within Hopsworks or externally. -There are three main use cases for Storage Connectors: +There are three main use cases for Data Sources: - Simply use it to read data from the storage into a dataframe. -- [External (on-demand) Feature Groups](../../../concepts/fs/feature_group/external_fg.md) can be defined with storage connectors as data source. This way, Hopsworks stores only the metadata about the features, but does not keep a copy of the data itself. This is also called the Connector API. +- [External (on-demand) Feature Groups](../../../concepts/fs/feature_group/external_fg.md) can be defined with data sources. This way, Hopsworks stores only the metadata about the features, but does not keep a copy of the data itself. This is also called the Connector API. - Write [training data](../../../concepts/fs/feature_view/offline_api.md) to an external storage system to make it accessible by third parties. - Manage [feature group](../../../user_guides/fs/feature_group/create.md) that stores offline data in an external storage system. -Storage connectors provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure). Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project. +Data Sources provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure). Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project. -By default, each project is created with three default Storage Connectors: A JDBC connector to the online feature store, a HopsFS connector to the Training Datasets directory of the project and a JDBC connector to the offline feature store. +By default, each project is created with three default Data Sources: A JDBC connector to the online feature store, a HopsFS connector to the Training Datasets directory of the project and a JDBC connector to the offline feature store.
- ![Image title](../../../assets/images/guides/fs/storage_connector/storage_connector_overview.png) -
The Storage Connector View in the User Interface
+ ![Image title](../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
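As an illustration of the use cases listed above, the sketch below writes training data to an external system through a data source. The data source name, the feature view and the exact arguments are placeholders rather than a fixed recipe; see the [usage guide](usage.md) for details.

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
project = hopsworks.login()
feature_store = project.get_feature_store()

# Retrieve an existing data source and feature view (names are placeholders)
connector = feature_store.get_storage_connector("my_s3_ds")
feature_view = feature_store.get_feature_view(name="transactions_view", version=1)

# Materialise training data to the external storage behind the data source
feature_view.create_training_data(
    description="Training data exported for an external consumer",
    data_format="csv",
    storage_connector=connector,
)
```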
## Cloud Agnostic @@ -49,4 +49,4 @@ For GCP the following storage systems are supported: ## Next Steps -Move on to the [Configuration and Creation Guides](creation/jdbc.md) to learn how to set up a storage connector. \ No newline at end of file +Move on to the [Configuration and Creation Guides](creation/jdbc.md) to learn how to set up a data source. \ No newline at end of file diff --git a/docs/user_guides/fs/storage_connector/usage.md b/docs/user_guides/fs/storage_connector/usage.md index 0f224126e..6ddb39a2f 100644 --- a/docs/user_guides/fs/storage_connector/usage.md +++ b/docs/user_guides/fs/storage_connector/usage.md @@ -1,7 +1,7 @@ -# Storage Connector Usage -Here, we look at how to use a Storage Connector after it has been created. -Storage Connectors provide an important first step for integrating with external data sources. -The 3 fundamental functionalities where storage connectors are used are: +# Data Source Usage +Here, we look at how to use a Data Source after it has been created. +Data Sources provide an important first step for integrating with external data. +The 3 fundamental functionalities where data sources are used are: 1. Reading data into Spark Dataframes 2. Creating external feature groups @@ -9,8 +9,8 @@ The 3 fundamental functionalities where storage connectors are used are: We will walk through each functionality in the sections below. -## Retrieving a Storage Connector -We retrieve a storage connector simply by its unique name. +## Retrieving a Data Source +We retrieve a data source simply by its unique name. === "PySpark" ```python @@ -18,7 +18,7 @@ We retrieve a storage connector simply by its unique name. # Connect to the Hopsworks feature store project = hopsworks.login() feature_store = project.get_feature_store() - # Retrieve storage connector + # Retrieve data source connector = feature_store.get_storage_connector('connector_name') ``` @@ -31,13 +31,13 @@ We retrieve a storage connector simply by its unique name. val connector = featureStore.getGcsConnector("connector_name") ``` -## Reading a Spark Dataframe from a Storage Connector +## Reading a Spark Dataframe from a Data Source -One of the most common usages of a Storage Connector is to read data directly into a Spark Dataframe. +One of the most common usages of a Data Source is to read data directly into a Spark Dataframe. It's achieved via the `read` API of the connector object, which hides all the complexity of authentication and integration with a data storage source. -The `read` API primarily has two parameters for specifying the data source, `path` and `query`, depending on the storage connector type. -The exact behaviour could change depending on the storage connector type, but broadly they could be classified as below +The `read` API primarily has two parameters for specifying the data source, `path` and `query`, depending on the data source type. +The exact behaviour could change depending on the data source type, but broadly they could be classified as below ### Data lake/object based connectors @@ -67,8 +67,8 @@ Instead, user can do this setup explicitly with the `prepare_spark` method and t use multiple connectors in one Spark session. `prepare_spark` handles only one bucket associated with that particular connector, however, it is possible to set up multiple connectors with different types as long as their Spark properties do not interfere with each other. 
So, for example a S3 connector and a Snowflake connector can be used in the same session, without calling `prepare_spark` multiple times, as the properties don’t interfere with each other. -If the storage connector is used in another API call, `prepare_spark` gets implicitly invoked, for example, -when a user materialises a training dataset using a storage connector or uses the storage connector to set up an External Feature Group. +If the data source is used in another API call, `prepare_spark` gets implicitly invoked, for example, +when a user materialises a training dataset using a data source or uses the data source to set up an External Feature Group. So users do not need to call `prepare_spark` every time they do an operation with a connector, it is only necessary when reading directly using Spark . Using `prepare_spark` is also not necessary when using the `read` API. @@ -104,7 +104,7 @@ passing any SQL query to the `query` argument. This is mostly relevant for Googl ### Streaming based connector -For reading data streams, the Kafka Storage Connector supports reading a Kafka topic into Spark Structured Streaming Dataframes +For reading data streams, the Kafka Data Source supports reading a Kafka topic into Spark Structured Streaming Dataframes instead of a static Dataframe as in other connector types. === "PySpark" @@ -115,19 +115,19 @@ instead of a static Dataframe as in other connector types. ## Creating an External Feature Group -Another important aspect of a storage connector is its ability to facilitate creation of external feature groups with +Another important aspect of a data source is its ability to facilitate creation of external feature groups with the [Connector API](../../../concepts/fs/feature_group/external_fg.md). [External feature groups](../feature_group/create_external.md) are basically offline feature groups and essentially stored as tables on external data sources. -The `Connector API` relies on storage connectors behind the scenes to integrate with external datasource. -This enables seamless integration with any data source as long as there is a storage connector defined. +The `Connector API` relies on data sources behind the scenes to integrate with the external data source. +This enables seamless integration with any data source as long as there is a data source defined. To create an external feature group, we use the `create_external_feature_group` API, also known as `Connector API`, -and simply pass the storage connector created before to the `storage_connector` argument. -Depending on the external data source, we should set either the `query` argument for data warehouse based data sources, or +and simply pass the data source created before to the `storage_connector` argument. +Depending on the external source, we should set either the `query` argument for data warehouse based sources, or the `path` and `data_format` arguments for data lake based sources, similar to reading into dataframes as explained in above section. Example for any data warehouse/SQL based external sources, we set the desired SQL to `query` argument, and set the `storage_connector` -argument to the storage connector object of desired data source. +argument to the data source object of the desired data source. 
=== "PySpark" ```python fg = feature_store.create_external_feature_group(name="sales", @@ -146,7 +146,7 @@ For more information on `Connector API`, read detailed guide about [external fea ## Writing Training Data -Storage connectors are also used while writing training data to external sources. While calling the +Data Sources are also used while writing training data to external sources. While calling the [Feature View](../../../concepts/fs/feature_view/fv_overview.md) API `create_training_data` , we can pass the `storage_connector` argument which is necessary to materialise the data to external sources, as shown below. @@ -164,7 +164,7 @@ the data to external sources, as shown below. Read more about training data creation [here](../feature_view/training-data.md). ## Next Steps -We have gone through the basic use cases of a storage connector. +We have gone through the basic use cases of a data source. For more details about the API functionality for any specific connector type, checkout the [API section](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#storage-connector). diff --git a/docs/user_guides/index.md b/docs/user_guides/index.md index 549180953..55e0146a7 100644 --- a/docs/user_guides/index.md +++ b/docs/user_guides/index.md @@ -3,7 +3,7 @@ This section serves to provide guides and examples for the common usage of abstractions and functionality of the Hopsworks platform through the Hopsworks UI and APIs. - [Client Installation](client_installation/index.md): How to get started with the Hopsworks Client libraries. -- [Feature Store](fs/index.md): Learn about the common usage of the core Hopsworks Feature Store abstractions, such as Feature Groups, Feature Views, Data Validation and Storage Connectors. Also, learn from the [Client Integrations](integrations/index.md) guides how to connect to the Feature Store from external environments such as a local Python environment, Databricks, or AWS Sagemaker +- [Feature Store](fs/index.md): Learn about the common usage of the core Hopsworks Feature Store abstractions, such as Feature Groups, Feature Views, Data Validation and Data Sources. Also, learn from the [Client Integrations](integrations/index.md) guides how to connect to the Feature Store from external environments such as a local Python environment, Databricks, or AWS Sagemaker - [MLOps](mlops/index.md): Learn about the common usage of Hopsworks MLOps abstractions, such as the Model Registry or Model Serving. - [Projects](projects/index.md): The core abstraction on Hopsworks are [Projects](../concepts/projects/governance.md). Learn in this section how to manage your projects and the services therein. - [Migration](migration/40_migration.md): Learn how to migrate to newer versions of Hopsworks. diff --git a/docs/user_guides/integrations/databricks/configuration.md b/docs/user_guides/integrations/databricks/configuration.md index fa69b6250..137779edf 100644 --- a/docs/user_guides/integrations/databricks/configuration.md +++ b/docs/user_guides/integrations/databricks/configuration.md @@ -81,7 +81,7 @@ During the cluster configuration the following steps will be taken: - Configure the necessary Spark properties to authenticate and communicate with the Feature Store !!! note "HopsFS configuration" - It is not necessary to configure HopsFS if data is stored outside the Hopsworks file system. 
To do this define [Storage Connectors](../../fs/storage_connector/index.md) and link them to [Feature Groups](../../fs/feature_group/create.md) and [Training Datasets](../../fs/feature_view/training-data.md). + It is not necessary to configure HopsFS if data is stored outside the Hopsworks file system. To do this, define [Data Sources](../../fs/data_source/index.md) and link them to [Feature Groups](../../fs/feature_group/create.md) and [Training Datasets](../../fs/feature_view/training-data.md). When a cluster is configured for a specific project user, all the operations with the Hopsworks Feature Store will be executed as that project user. If another user needs to re-use the same cluster, the cluster can be reconfigured by following the same steps above. diff --git a/docs/user_guides/migration/30_migration.md b/docs/user_guides/migration/30_migration.md index 4226faf84..8f1eda08f 100644 --- a/docs/user_guides/migration/30_migration.md +++ b/docs/user_guides/migration/30_migration.md @@ -164,10 +164,10 @@ Furthermore, the functionality provided by the `model` and `serving` module in ` This list is meant to serve as a starting point to explore the new features of the Hopsworks 3.0 release, which can significantly improve your workflows. -### Added new Storage Connectors: GCS, BigQuery and Kafka -With the added support for Google Cloud, we added also two new [storage connectors](../fs/storage_connector/index.md): [Google Cloud Storage](../fs/storage_connector/creation/gcs.md) and [Google BigQuery](../fs/storage_connector/creation/bigquery.md). Users can use these connectors to create external feature groups or write out training data. +### Added new Data Sources: GCS, BigQuery and Kafka +With the added support for Google Cloud, we also added two new [data sources](../fs/data_source/index.md): [Google Cloud Storage](../fs/data_source/creation/gcs.md) and [Google BigQuery](../fs/data_source/creation/bigquery.md). Users can use these connectors to create external feature groups or write out training data. -Additionally, to make it easier for users to get started with Spark Streaming applications, we added a [Kafka connector](../fs/storage_connector/creation/kafka.md), which let’s you easily read a Kafka topic into a Spark Streaming Dataframe. +Additionally, to make it easier for users to get started with Spark Streaming applications, we added a [Kafka connector](../fs/data_source/creation/kafka.md), which lets you easily read a Kafka topic into a Spark Streaming Dataframe. ### Optimized Default Hudi Options By default, Hudi tends to over-partition input, and therefore the layout of Feature Groups. The default parallelism is 200, to ensure each Spark partition stays within the 2GB limit for inputs up to 500GB. 
The new default is the following for all insert/upsert operations: diff --git a/docs/user_guides/mlops/provenance/provenance.md b/docs/user_guides/mlops/provenance/provenance.md index 611366af4..344d3d8c1 100644 --- a/docs/user_guides/mlops/provenance/provenance.md +++ b/docs/user_guides/mlops/provenance/provenance.md @@ -4,7 +4,7 @@ Hopsworks allows users to track provenance (lineage) between: -- storage connectors +- data sources - feature groups - feature views - training datasets @@ -15,7 +15,7 @@ In the provenance pages we will call a provenance artifact or shortly artifact, With the following provenance graph: ``` -storage connector -> feature group -> feature group -> feature view -> training dataset -> model +data source -> feature group -> feature group -> feature view -> training dataset -> model ``` we will call the parent, the artifact to the left, and the child, the artifact to the right. So a feature view has a number of feature groups as parents and can have a number of training datasets as children. diff --git a/docs/user_guides/projects/iam_role/iam_role_chaining.md b/docs/user_guides/projects/iam_role/iam_role_chaining.md index 7fd0df3cd..654fac898 100644 --- a/docs/user_guides/projects/iam_role/iam_role_chaining.md +++ b/docs/user_guides/projects/iam_role/iam_role_chaining.md @@ -26,4 +26,4 @@ In the _Project Settings_ page you can find the _IAM Role Chaining_ section show ### Step 2: Use the IAM role -You can now use the IAM roles listed in your project when creating a storage connector with [Temporary Credentials](../../../fs/storage_connector/creation/s3/#temporary-credentials). +You can now use the IAM roles listed in your project when creating a Data Source with [Temporary Credentials](../../../fs/data_source/creation/s3/#temporary-credentials). diff --git a/docs/user_guides/projects/project/add_members.md b/docs/user_guides/projects/project/add_members.md index 811965e88..53e0c3516 100644 --- a/docs/user_guides/projects/project/add_members.md +++ b/docs/user_guides/projects/project/add_members.md @@ -35,7 +35,7 @@ Data owners hold the highest authority in the project, having full control of it They are allowed to: - Share a project - Manage the project and its members -- Work with all feature store abstractions (such as Feature groups, Feature views, Storage connectors, etc.) +- Work with all feature store abstractions (such as Feature groups, Feature views, Data Sources, etc.) It is worth mentioning that the project's creator (aka. `author`) is a special type of `Data owner`. He is the only user capable of deleting the project and it is impossible to change his role to `Data scientist`. 
diff --git a/mkdocs.yml b/mkdocs.yml index 2deb9ff5f..1b11def12 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -56,19 +56,20 @@ nav: - user_guides/index.md - Feature Store: - user_guides/fs/index.md - - Storage Connector: - - user_guides/fs/storage_connector/index.md + - Data Source: + - user_guides/fs/data_source/index.md - Configuration and Creation: - - JDBC: user_guides/fs/storage_connector/creation/jdbc.md - - Snowflake: user_guides/fs/storage_connector/creation/snowflake.md - - Kafka: user_guides/fs/storage_connector/creation/kafka.md - - HopsFS: user_guides/fs/storage_connector/creation/hopsfs.md - - S3: user_guides/fs/storage_connector/creation/s3.md - - Redshift: user_guides/fs/storage_connector/creation/redshift.md - - ADLS: user_guides/fs/storage_connector/creation/adls.md - - BigQuery: user_guides/fs/storage_connector/creation/bigquery.md - - GCS: user_guides/fs/storage_connector/creation/gcs.md - - Usage: user_guides/fs/storage_connector/usage.md + - JDBC: user_guides/fs/data_source/creation/jdbc.md + - Snowflake: user_guides/fs/data_source/creation/snowflake.md + - Kafka: user_guides/fs/data_source/creation/kafka.md + - HopsFS: user_guides/fs/data_source/creation/hopsfs.md + - S3: user_guides/fs/data_source/creation/s3.md + - Redshift: user_guides/fs/data_source/creation/redshift.md + - ADLS: user_guides/fs/data_source/creation/adls.md + - BigQuery: user_guides/fs/data_source/creation/bigquery.md + - GCS: user_guides/fs/data_source/creation/gcs.md + - RDS: user_guides/fs/data_source/creation/rds.md + - Usage: user_guides/fs/data_source/usage.md - Feature Group: - user_guides/fs/feature_group/index.md - Create: user_guides/fs/feature_group/create.md