Feature Store#
You can retrieve the current feature store instance using Project.get_feature_store.
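For example, a minimal sketch assuming you have already logged in to a project:
import hopsworks
# log in to Hopsworks and get the project's feature store handle
project = hopsworks.login()
fs = project.get_feature_store()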
FeatureStore #
Feature Store class used to manage feature store entities, like feature groups and feature views.
offline_featurestore_name property #
offline_featurestore_name: str
Name of the offline feature store database.
online_featurestore_name property #
online_featurestore_name: str | None
Name of the online feature store database.
project_name property #
project_name: str
Name of the project in which the feature store is located.
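For example, these properties can be inspected directly on a feature store handle (a minimal sketch):
# connect to the Feature Store
fs = ...
print(fs.project_name)
print(fs.offline_featurestore_name)
# None if the online feature store is not enabled for the project
print(fs.online_featurestore_name)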
create_external_feature_group #
create_external_feature_group(
name: str,
storage_connector: storage_connector.StorageConnector,
query: str | None = None,
data_format: str | None = None,
path: str | None = "",
options: dict[str, str] | None = None,
version: int | None = None,
description: str | None = "",
primary_key: list[str] | None = None,
foreign_key: list[str] | None = None,
embedding_index: EmbeddingIndex | None = None,
features: list[feature.Feature] | None = None,
statistics_config: StatisticsConfig
| bool
| dict
| None = None,
event_time: str | None = None,
expectation_suite: expectation_suite.ExpectationSuite
| TypeVar("great_expectations.core.ExpectationSuite")
| None = None,
online_enabled: bool = False,
topic_name: str | None = None,
notification_topic_name: str | None = None,
online_config: OnlineConfig
| dict[str, Any]
| None = None,
data_source: ds.DataSource
| dict[str, Any]
| None = None,
ttl: float | timedelta | None = None,
ttl_enabled: bool | None = None,
online_disk: bool | None = None,
) -> feature_group.ExternalFeatureGroup
Create an external feature group metadata object.
Example
# connect to the Feature Store
fs = ...
external_fg = fs.create_external_feature_group(
name="sales",
version=1,
description="Physical shop sales features",
query=query,
storage_connector=connector,
primary_key=['ss_store_sk'],
event_time='sale_date',
ttl=timedelta(days=30),
)
Lazy
This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.
You can enable online storage for external feature groups; however, the sync from the external storage to the Hopsworks online storage needs to be done manually:
external_fg = fs.create_external_feature_group(
name="sales",
version=1,
description="Physical shop sales features",
query=query,
storage_connector=connector,
primary_key=['ss_store_sk'],
event_time='sale_date',
online_enabled=True,
online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
online_disk=True, # Online data will be stored on disk instead of in memory
ttl=timedelta(days=30),
)
external_fg.save()
# read from external storage and filter data to sync to online
df = external_fg.read().filter(external_fg.customer_status == "active")
# insert to online storage
external_fg.insert(df)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to create. TYPE: |
storage_connector | The storage connector used to establish connectivity with the data source. TYPE: |
query | A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used. TYPE: |
data_format | If the external feature group refers to a directory with data, the data format to use when reading it. TYPE: |
path | The location within the scope of the storage connector, from where to read the data for the external feature group. TYPE: |
options | Additional options to be used by the engine when reading data from the specified storage connector. For example, |
version | Version of the external feature group to retrieve, defaults to TYPE: |
description | A string describing the contents of the external feature group to improve discoverability for Data Scientists. TYPE: |
primary_key | A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list |
foreign_key | A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list |
features | Optionally, define the schema of the external feature group manually as a list of |
statistics_config | A configuration object, or a dictionary with keys:
The values should be booleans indicating the setting. To fully turn off statistics computation pass TYPE: |
event_time | Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins. Note: Event time data type restriction The supported data types for the event time column are: TYPE: |
online_enabled | Define whether it should be possible to sync the feature group to the online feature store for low latency access. TYPE: |
expectation_suite | Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion. TYPE: |
topic_name | Optionally, define the name of the topic used for data ingestion. If left undefined it defaults to using project topic. TYPE: |
notification_topic_name | Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent. TYPE: |
online_config | Optionally, define configuration which is used to configure online table. |
data_source | The data source specifying the location of the data. Overrides the path and query arguments when specified. |
ttl | Optional time-to-live duration for features in this group. Can be specified as:
This ttl value is added to the event time of the feature group; when the system time (in UTC) exceeds the event time + ttl, the entries are automatically removed. By default, no TTL is set. |
ttl_enabled | Optionally, enable TTL for this feature group. Defaults to True if ttl is set. TYPE: |
online_disk | Optionally, specify online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. Defaults to using the cluster-wide configuration 'featurestore_online_tablespace' to identify the tablespace for disk storage. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.ExternalFeatureGroup | The external feature group metadata object. |
create_feature_group #
create_feature_group(
name: str,
version: int | None = None,
description: str = "",
online_enabled: bool = False,
time_travel_format: str | None = None,
partition_key: list[str] | None = None,
primary_key: list[str] | None = None,
foreign_key: list[str] | None = None,
embedding_index: EmbeddingIndex | None = None,
hudi_precombine_key: str | None = None,
features: list[feature.Feature] | None = None,
statistics_config: StatisticsConfig
| bool
| dict
| None = None,
event_time: str | None = None,
stream: bool = False,
expectation_suite: expectation_suite.ExpectationSuite
| TypeVar("great_expectations.core.ExpectationSuite")
| None = None,
parents: list[feature_group.FeatureGroup] | None = None,
topic_name: str | None = None,
notification_topic_name: str | None = None,
transformation_functions: list[
TransformationFunction | HopsworksUdf
]
| None = None,
online_config: OnlineConfig
| dict[str, Any]
| None = None,
offline_backfill_every_hr: int | str | None = None,
storage_connector: storage_connector.StorageConnector
| dict[str, Any] = None,
path: str | None = None,
data_source: ds.DataSource
| dict[str, Any]
| None = None,
ttl: float | timedelta | None = None,
ttl_enabled: bool | None = None,
online_disk: bool | None = None,
) -> feature_group.FeatureGroup
Create a feature group metadata object.
Example
# connect to the Feature Store
fs = ...
# define the on-demand transformation functions
@udf(int)
def plus_one(value):
return value + 1
@udf(int)
def plus_two(value):
return value + 2
# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_two("feature2")]
fg = fs.create_feature_group(
name='air_quality',
description='Air Quality characteristics of each day',
version=1,
primary_key=['city','date'],
online_enabled=True,
event_time='date',
transformation_functions=transformation_functions,
online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
online_disk=True, # Online data will be stored on disk instead of in memory
ttl=timedelta(days=7) # features will be deleted after 7 days
)
Lazy
This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the save() method with a DataFrame.
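For example, continuing the air_quality example above, a sketch of persisting the feature group with a small DataFrame (the column values are purely illustrative):
import pandas as pd
# illustrative data matching the schema used above
df = pd.DataFrame({
    "city": ["stockholm"],
    "date": ["2024-01-01"],
    "feature1": [1],
    "feature2": [2],
})
# persist the feature group metadata and write the feature data
fg.save(df)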
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature group to create. TYPE: |
version | Version of the feature group to create, defaults to TYPE: |
description | A string describing the contents of the feature group to improve discoverability for Data Scientists. TYPE: |
online_enabled | Define whether the feature group should be made available also in the online feature store for low latency access. TYPE: |
time_travel_format | Format used for time travel, defaults to TYPE: |
partition_key | A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list |
primary_key | A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list |
foreign_key | A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list |
embedding_index |
TYPE: |
hudi_precombine_key | A feature name to be used as a precombine key for the TYPE: |
features | Optionally, define the schema of the feature group manually as a list of |
statistics_config | A configuration object, or a dictionary with keys:
The values should be booleans indicating the setting. To fully turn off statistics computation pass TYPE: |
event_time | Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins. Note: Event time data type restriction The supported data types for the event time column are: TYPE: |
stream | Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled feature groups have a unified API for writing streaming features transparently to both the online and offline store. TYPE: |
expectation_suite | Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion. TYPE: |
parents | Optionally, define the parents of this feature group as the origin where the data is coming from. TYPE: |
topic_name | Optionally, define the name of the topic used for data ingestion. If left undefined it defaults to using project topic. TYPE: |
notification_topic_name | Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent. TYPE: |
transformation_functions | On-Demand Transformation functions attached to the feature group. It can be a list of user defined functions defined using the hopsworks TYPE: |
online_config | Optionally, define configuration which is used to configure online table. |
offline_backfill_every_hr | If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done. |
storage_connector | The storage connector used to establish connectivity with the data source. TYPE: |
path | The location within the scope of the storage connector, from where to read the data for the external feature group. TYPE: |
data_source | The data source specifying the location of the data. Overrides the path and query arguments when specified. |
ttl | Optional time-to-live duration for features in this group. Can be specified as:
This ttl value is added to the event time of the feature group; when the system time (in UTC) exceeds the event time + ttl, the entries are automatically removed. By default, no TTL is set. |
ttl_enabled | Optionally, enable TTL for this feature group. Defaults to True if ttl is set. TYPE: |
online_disk | Optionally, specify online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. Defaults to using the cluster-wide configuration 'featurestore_online_tablespace' to identify the tablespace for disk storage. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.FeatureGroup | The feature group metadata object. |
create_feature_view #
create_feature_view(
name: str,
query: Query,
version: int | None = None,
description: str | None = "",
labels: list[str] | None = None,
inference_helper_columns: list[str] | None = None,
training_helper_columns: list[str] | None = None,
transformation_functions: list[
TransformationFunction | HopsworksUdf
]
| None = None,
logging_enabled: bool | None = False,
extra_log_columns: list[feature.Feature]
| list[dict[str, str]]
| None = None,
) -> feature_view.FeatureView
Create a feature view metadata object and save it to Hopsworks.
Example
# connect to the Feature Store
fs = ...
# get the feature group instances
fg1 = fs.get_or_create_feature_group(...)
fg2 = fs.get_or_create_feature_group(...)
# construct the query
query = fg1.select_all().join(fg2.select_all())
# define the transformation function as a Hopsworks's UDF
@udf(int)
def plus_one(value):
return value + 1
# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_one("feature2")]
feature_view = fs.create_feature_view(
name='air_quality_fv',
version=1,
transformation_functions=transformation_functions,
query=query
)
Example
# get feature store instance
fs = ...
# define query object
query = ...
# define list of transformation functions
mapping_transformers = ...
# create feature view
feature_view = fs.create_feature_view(
name='feature_view_name',
version=1,
transformation_functions=mapping_transformers,
query=query
)
Warning
The as_of argument in the Query will be ignored because feature views do not support time travel queries.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature view to create. TYPE: |
query | Feature store TYPE: |
version | Version of the feature view to create, defaults to TYPE: |
description | A string describing the contents of the feature view to improve discoverability for Data Scientists. TYPE: |
labels | A list of feature names constituting the prediction label/feature of the feature view. When replaying a |
inference_helper_columns | A list of feature names that are not used in training the model itself but can be used during batch or online inference for extra information. Inference helper column name(s) must be part of the |
training_helper_columns | A list of feature names that are not the part of the model schema itself but can be used during training as a helper for extra information. Training helper column name(s) must be part of the |
transformation_functions | Model Dependent Transformation functions attached to the feature view. It can be a list of user defined functions defined using the hopsworks TYPE: |
logging_enabled | If true, enable feature logging for the feature view. TYPE: |
extra_log_columns | Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries that contain the name and type of the columns as keys. Defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_view.FeatureView | The feature view metadata object. |
create_on_demand_feature_group #
create_on_demand_feature_group(
name: str,
storage_connector: storage_connector.StorageConnector,
query: str | None = None,
data_format: str | None = None,
path: str | None = "",
options: dict[str, str] | None = None,
version: int | None = None,
description: str | None = "",
primary_key: list[str] | None = None,
foreign_key: list[str] | None = None,
features: list[feature.Feature] | None = None,
statistics_config: StatisticsConfig
| bool
| dict
| None = None,
event_time: str | None = None,
expectation_suite: expectation_suite.ExpectationSuite
| TypeVar("great_expectations.core.ExpectationSuite")
| None = None,
topic_name: str | None = None,
notification_topic_name: str | None = None,
data_source: ds.DataSource
| dict[str, Any]
| None = None,
online_enabled: bool = False,
ttl: float | timedelta | None = None,
ttl_enabled: bool | None = None,
) -> feature_group.ExternalFeatureGroup
Create an external feature group metadata object.
Deprecated
create_on_demand_feature_group method is deprecated. Use the create_external_feature_group method instead.
Lazy
This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to create. TYPE: |
storage_connector | The storage connector used to establish connectivity with the data source. TYPE: |
query | A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used. TYPE: |
data_format | If the external feature group refers to a directory with data, the data format to use when reading it. TYPE: |
path | The location within the scope of the storage connector, from where to read the data for the external feature group. TYPE: |
options | Additional options to be used by the engine when reading data from the specified storage connector. For example, |
version | Version of the external feature group to retrieve, defaults to TYPE: |
description | A string describing the contents of the external feature group to improve discoverability for Data Scientists. TYPE: |
primary_key | A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list |
foreign_key | A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list |
features | Optionally, define the schema of the external feature group manually as a list of |
statistics_config | A configuration object, or a dictionary with keys:
The values should be booleans indicating the setting. To fully turn off statistics computation pass TYPE: |
event_time | Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins. Note: Event time data type restriction The supported data types for the event time column are: TYPE: |
topic_name | Optionally, define the name of the topic used for data ingestion. If left undefined it defaults to using project topic. TYPE: |
notification_topic_name | Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent. TYPE: |
expectation_suite | Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion. TYPE: |
data_source | The data source specifying the location of the data. Overrides the path and query arguments when specified. |
online_enabled | Define whether it should be possible to sync the feature group to the online feature store for low latency access. TYPE: |
ttl | Optional time-to-live duration for features in this group. Can be specified as:
This ttl value is added to the event time of the feature group; when the system time (in UTC) exceeds the event time + ttl, the entries are automatically removed. By default, no TTL is set. |
ttl_enabled | Optionally, enable TTL for this feature group. Defaults to True if ttl is set. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.ExternalFeatureGroup | The external feature group metadata object. |
create_training_dataset #
create_training_dataset(
name: str,
version: int | None = None,
description: str | None = "",
data_format: str | None = "tfrecords",
coalesce: bool | None = False,
storage_connector: storage_connector.StorageConnector
| None = None,
splits: dict[str, float] | None = None,
location: str | None = "",
seed: int | None = None,
statistics_config: StatisticsConfig
| bool
| dict
| None = None,
label: list[str] | None = None,
transformation_functions: dict[
str, TransformationFunction
]
| None = None,
train_split: str = None,
) -> training_dataset.TrainingDataset
Create a training dataset metadata object.
Deprecated
TrainingDataset is deprecated, use FeatureView instead. From version 3.0, training datasets created with this API are not visible in the API anymore.
Lazy
This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save feature data along the metadata in the feature store, call the save() method with a DataFrame or Query.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the training dataset to create. TYPE: |
version | Version of the training dataset to retrieve, defaults to TYPE: |
description | A string describing the contents of the training dataset to improve discoverability for Data Scientists. TYPE: |
data_format | The data format used to save the training dataset. TYPE: |
coalesce | If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. TYPE: |
storage_connector | Storage connector defining the sink location for the training dataset, defaults to TYPE: |
splits | A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as |
location | Path to complement the sink storage connector with, e.g., if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to TYPE: |
seed | Optionally, define a seed to create the random splits with, in order to guarantee reproducibility. TYPE: |
statistics_config | A configuration object, or a dictionary with keys:
The values should be booleans indicating the setting. To fully turn off statistics computation pass TYPE: |
label | A list of feature names constituting the prediction label/feature of the training dataset. When replaying a |
transformation_functions | A dictionary mapping transformation functions to the features they should be applied to before writing out the training data and at inference time. Defaults to TYPE: |
train_split | If TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
training_dataset.TrainingDataset | The training dataset metadata object. |
create_transformation_function #
create_transformation_function(
transformation_function: HopsworksUdf,
version: int | None = None,
) -> TransformationFunction
Create a transformation function metadata object.
Example
# define the transformation function as a Hopsworks's UDF
@udf(int)
def plus_one(value):
return value + 1
# create transformation function
plus_one_meta = fs.create_transformation_function(
transformation_function=plus_one,
version=1
)
# persist transformation function in backend
plus_one_meta.save()
Lazy
This method is lazy and does not persist the transformation function in the feature store on its own. To materialize the transformation function and save call the save() method of the transformation function metadata object.
| PARAMETER | DESCRIPTION |
|---|---|
transformation_function | Hopsworks UDF. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
TransformationFunction | The TransformationFunction metadata object. |
get_external_feature_group #
get_external_feature_group(
name: str, version: int = None
) -> feature_group.ExternalFeatureGroup
Get an external feature group entity from the feature store.
Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
Example
# connect to the Feature Store
fs = ...
external_fg = fs.get_external_feature_group("external_fg_test")
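Once retrieved, the handle can be used to read the data or to build queries, for example (the joined feature group name is illustrative):
# read the external data into a dataframe
df = external_fg.read()
# or join it with another (hypothetical) feature group
other_fg = fs.get_feature_group("transactions", version=1)
query = external_fg.select_all().join(other_fg.select_all())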
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to get. TYPE: |
version | Version of the external feature group to retrieve, defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.ExternalFeatureGroup | The external feature group metadata object or |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_external_feature_groups #
get_external_feature_groups(
name: str | None = None,
) -> list[feature_group.ExternalFeatureGroup]
Get a list of all external feature groups from the feature store, or all versions of an external feature group.
Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
Example
# connect to the Feature Store
fs = ...
external_fgs_list = fs.get_external_feature_groups("external_fg_test")
Example
# connect to the Feature Store
fs = ...
# retrieve all external feature groups available in the feature store
external_fgs_list = fs.get_external_feature_groups()
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to get the versions of; by default it is TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
list[feature_group.ExternalFeatureGroup] | List of external feature group metadata objects. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_feature_group #
get_feature_group(
name: str, version: int = None
) -> (
feature_group.FeatureGroup
| feature_group.ExternalFeatureGroup
| feature_group.SpineGroup
)
Get a feature group entity from the feature store.
Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
Example
# connect to the Feature Store
fs = ...
fg = fs.get_feature_group(
name="electricity_prices",
version=1,
)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature group to get. TYPE: |
version | Version of the feature group to retrieve, defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup | The feature group metadata object or |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_feature_groups #
get_feature_groups(
name: str | None = None,
) -> list[
feature_group.FeatureGroup
| feature_group.ExternalFeatureGroup
| feature_group.SpineGroup
]
Get all feature groups from the feature store, or all versions of a feature group specified by its name.
Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
Example
# connect to the Feature Store
fs = ...
# retrieve all versions of electricity_prices feature group
fgs_list = fs.get_feature_groups(
name="electricity_prices"
)
Example
# connect to the Feature Store
fs = ...
# retrieve all feature groups available in the feature store
fgs_list = fs.get_feature_groups()
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature group to get the versions of; by default it is TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
list[feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup] | List of feature group metadata objects. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_feature_view #
get_feature_view(
name: str, version: int = None
) -> feature_view.FeatureView
Get a feature view entity from the feature store.
Getting a feature view from the Feature Store means getting its metadata.
Example
# get feature store instance
fs = ...
# get feature view instance
feature_view = fs.get_feature_view(
name='feature_view_name',
version=1
)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature view to get. TYPE: |
version | Version of the feature view to retrieve, defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_view.FeatureView | The feature view metadata object or |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_feature_views #
get_feature_views(
name: str,
) -> list[feature_view.FeatureView]
Get a list of all versions of a feature view entity from the feature store.
Getting a feature view from the Feature Store means getting its metadata.
Example
# get feature store instance
fs = ...
# get a list of all versions of a feature view
feature_view = fs.get_feature_views(
name='feature_view_name'
)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature view to get. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
list[feature_view.FeatureView] | List of feature view metadata objects. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_on_demand_feature_group #
get_on_demand_feature_group(
name: str, version: int = None
) -> feature_group.ExternalFeatureGroup
Get an external feature group entity from the feature store.
Deprecated
get_on_demand_feature_group method is deprecated. Use the get_external_feature_group method instead.
Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to get. TYPE: |
version | Version of the external feature group to retrieve, defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.ExternalFeatureGroup | The external feature group metadata object or |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_on_demand_feature_groups #
get_on_demand_feature_groups(
name: str,
) -> list[feature_group.ExternalFeatureGroup]
Get a list of all versions of an external feature group entity from the feature store.
Deprecated
get_on_demand_feature_groups method is deprecated. Use the get_external_feature_groups method instead.
Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the external feature group to get. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
list[feature_group.ExternalFeatureGroup] | List of external feature group metadata objects. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_online_storage_connector #
get_online_storage_connector() -> (
storage_connector.StorageConnector
)
Get the storage connector for the Online Feature Store of the respective project's feature store.
The returned storage connector depends on the project that you are connected to.
Example
# connect to the Feature Store
fs = ...
online_storage_connector = fs.get_online_storage_connector()
| RETURNS | DESCRIPTION |
|---|---|
storage_connector.StorageConnector | JDBC storage connector to the Online Feature Store. |
get_or_create_feature_group #
get_or_create_feature_group(
name: str,
version: int,
description: str | None = "",
online_enabled: bool | None = False,
time_travel_format: str | None = None,
partition_key: list[str] | None = None,
primary_key: list[str] | None = None,
foreign_key: list[str] | None = None,
embedding_index: EmbeddingIndex | None = None,
hudi_precombine_key: str | None = None,
features: list[feature.Feature] | None = None,
statistics_config: StatisticsConfig
| bool
| dict
| None = None,
expectation_suite: expectation_suite.ExpectationSuite
| TypeVar("great_expectations.core.ExpectationSuite")
| None = None,
event_time: str | None = None,
stream: bool | None = False,
parents: list[feature_group.FeatureGroup] | None = None,
topic_name: str | None = None,
notification_topic_name: str | None = None,
transformation_functions: list[
TransformationFunction | HopsworksUdf
]
| None = None,
online_config: OnlineConfig
| dict[str, Any]
| None = None,
offline_backfill_every_hr: int | str | None = None,
storage_connector: storage_connector.StorageConnector
| dict[str, Any] = None,
path: str | None = None,
data_source: ds.DataSource
| dict[str, Any]
| None = None,
ttl: float | timedelta | None = None,
ttl_enabled: bool | None = None,
online_disk: bool | None = None,
) -> (
feature_group.FeatureGroup
| feature_group.ExternalFeatureGroup
| feature_group.SpineGroup
)
Get a feature group metadata object or create a new one if it doesn't exist.
This method does not update an existing feature group metadata object.
Example
# connect to the Feature Store
fs = ...
fg = fs.get_or_create_feature_group(
name="electricity_prices",
version=1,
description="Electricity prices from NORD POOL",
primary_key=["day", "area"],
online_enabled=True,
event_time="timestamp",
transformation_functions=transformation_functions,
online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
online_disk=True, # Online data will be stored on disk instead of in memory
ttl=timedelta(days=30),
)
Lazy
This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the insert() method with a DataFrame.
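For example, continuing the electricity_prices example above, a sketch of inserting data (the column values are purely illustrative):
import pandas as pd
# illustrative rows matching the schema used above
df = pd.DataFrame({
    "day": ["2024-01-01"],
    "area": ["SE3"],
    "price": [45.2],
    "timestamp": ["2024-01-01 00:00:00"],
})
# create the feature group if it does not exist yet and write the data
fg.insert(df)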
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature group to create. TYPE: |
version | Version of the feature group to retrieve or create. TYPE: |
description | A string describing the contents of the feature group to improve discoverability for Data Scientists. TYPE: |
online_enabled | Define whether the feature group should be made available also in the online feature store for low latency access. TYPE: |
time_travel_format | Format used for time travel, defaults to TYPE: |
partition_key | A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list |
primary_key | A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list |
foreign_key | A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list |
embedding_index |
TYPE: |
hudi_precombine_key | A feature name to be used as a precombine key for the TYPE: |
features | Optionally, define the schema of the feature group manually as a list of |
statistics_config | A configuration object, or a dictionary with keys:
The values should be booleans indicating the setting. To fully turn off statistics computation pass TYPE: |
event_time | Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins. Note: Event time data type restriction The supported data types for the event time column are: TYPE: |
stream | Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled feature groups have a unified API for writing streaming features transparently to both the online and offline store. TYPE: |
expectation_suite | Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion. TYPE: |
parents | Optionally, define the parents of this feature group as the origin where the data is coming from. TYPE: |
topic_name | Optionally, define the name of the topic used for data ingestion. If left undefined it defaults to using project topic. TYPE: |
notification_topic_name | Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent. TYPE: |
transformation_functions | On-Demand Transformation functions attached to the feature group. It can be a list of user defined functions defined using the hopsworks TYPE: |
online_config | Optionally, define configuration which is used to configure online table. |
offline_backfill_every_hr | If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done. |
storage_connector | The storage connector used to establish connectivity with the data source. TYPE: |
path | The location within the scope of the storage connector, from where to read the data for the external feature group. TYPE: |
data_source | The data source specifying the location of the data. Overrides the path and query arguments when specified. |
ttl | Optional time-to-live duration for features in this group. Can be specified as:
This ttl value is added to the event time of the feature group and when the system time exceeds the event time + ttl, the entries will be automatically removed. The system time zone is in UTC. By default, no TTL is set. |
ttl_enabled | Optionally, enable TTL for this feature group. Defaults to True if ttl is set. TYPE: |
online_disk | Optionally, specify online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. Defaults to using the cluster-wide configuration 'featurestore_online_tablespace' to identify the tablespace for disk storage. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup | The feature group metadata object. |
get_or_create_feature_view #
get_or_create_feature_view(
name: str,
query: Query,
version: int,
description: str | None = "",
labels: list[str] | None = None,
inference_helper_columns: list[str] | None = None,
training_helper_columns: list[str] | None = None,
transformation_functions: dict[
str, TransformationFunction
]
| None = None,
logging_enabled: bool | None = False,
extra_log_columns: list[feature.Feature]
| list[dict[str, str]]
| None = None,
) -> feature_view.FeatureView
Get a feature view metadata object or create a new one if it doesn't exist.
This method does not update an existing feature view metadata object.
Example
# connect to the Feature Store
fs = ...
feature_view = fs.get_or_create_feature_view(
name='bitcoin_feature_view',
version=1,
transformation_functions=transformation_functions,
query=query
)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the feature view to create. TYPE: |
query | Feature store TYPE: |
version | Version of the feature view to create. TYPE: |
description | A string describing the contents of the feature view to improve discoverability for Data Scientists. TYPE: |
labels | A list of feature names constituting the prediction label/feature of the feature view. When replaying a |
inference_helper_columns | A list of feature names that are not used in training the model itself but can be used during batch or online inference for extra information. Inference helper column name(s) must be part of the |
training_helper_columns | A list of feature names that are not the part of the model schema itself but can be used during training as a helper for extra information. Training helper column name(s) must be part of the |
transformation_functions | Model Dependent Transformation functions attached to the feature view. It can be a list of user defined functions defined using the hopsworks TYPE: |
logging_enabled | If true, enable feature logging for the feature view. TYPE: |
extra_log_columns | Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries that contain the name and type of the columns as keys. Defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_view.FeatureView | The feature view metadata object. |
get_or_create_spine_group #
get_or_create_spine_group(
name: str,
version: int | None = None,
description: str | None = "",
primary_key: list[str] | None = None,
foreign_key: list[str] | None = None,
event_time: str | None = None,
features: list[feature.Feature] | None = None,
dataframe: pd.DataFrame
| TypeVar("pyspark.sql.DataFrame")
| TypeVar("pyspark.RDD")
| np.ndarray
| list[list] = None,
) -> feature_group.SpineGroup
Create a spine group metadata object.
Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels. A spine is essentially a metadata object similar to a feature group; however, the data is not materialized in the feature store. It only contains the needed metadata, such as the relevant event time column and primary key columns, to perform point-in-time correct joins.
Example
# connect to the Feature Store
fs = ...
spine_df = pd.DataFrame()
spine_group = fs.get_or_create_spine_group(
name="sales",
version=1,
description="Physical shop sales features",
primary_key=['ss_store_sk'],
event_time='sale_date',
dataframe=spine_df,
)
Note that you can inspect the dataframe in the spine group, or replace the dataframe:
spine_group.dataframe.show()
spine_group.dataframe = new_df
The spine can then be used to construct queries, with one special consideration:
Note
Spines can only be used on the left side of a feature join, as this is the base set of entities for which features are to be fetched and the left side of the join determines the event timestamps to compare against.
If you want the feature view's query to be usable for online serving, you can only select the label or target feature from the spine. Since the label is not required for the online lookup, it is important to select only the label from the left (spine) side, so that no spine needs to be provided for online serving.
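As a sketch (feature group and column names are illustrative), such a query could look like:
# the spine on the left supplies entities, event times and the label;
# only the label is selected from the spine
fg1 = fs.get_feature_group("store_features", version=1)
query = spine_group.select(["label"]).join(fg1.select_all())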
These queries can then be used to create feature views. Since the dataframe contained in the spine is not materialized, every time you use a feature view created with a spine to read data, you will have to provide a dataframe with the same structure again.
For example, to generate training data:
X_train, X_test, y_train, y_test = feature_view_spine.train_test_split(0.2, spine=training_data_entities)
Or to get batches of fresh data for batch scoring:
feature_view_spine.get_batch_data(spine=scoring_entities_df).show()
Here you have the chance to pass a different set of entities to generate the training dataset.
Sometimes it can be handy to create a feature view with a regular feature group containing the label, but then use a spine at serving time to fetch features only for a small set of primary key values, for example. To do this, you can pass the spine group instead of a dataframe. Just make sure it contains the needed primary key, event time and label column.
feature_view.get_batch_data(spine=spine_group)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the spine group to create. TYPE: |
version | Version of the spine group to retrieve, defaults to TYPE: |
description | A string describing the contents of the spine group to improve discoverability for Data Scientists. TYPE: |
primary_key | A list of feature names to be used as primary key for the spine group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list |
foreign_key | A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list |
event_time | Optionally, provide the name of the feature containing the event time for the features in this spine group. If event_time is set the spine group can be used for point-in-time joins. TYPE: |
features | Optionally, define the schema of the spine group manually as a list of Note: Event time data type restriction The supported data types for the event time column are: |
dataframe | Spine dataframe with primary key, event time and label column to use for point in time join when fetching features. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
feature_group.SpineGroup | The spine group metadata object. |
get_storage_connector #
get_storage_connector(
name: str,
) -> storage_connector.StorageConnector
Get a previously created storage connector from the feature store.
Storage connectors encapsulate all information needed for the execution engine to read and write to specific storage. This storage can be S3, a JDBC compliant database or the distributed filesystem HOPSFS.
If you want to connect to the online feature store, see the get_online_storage_connector method to get the JDBC connector for the Online Feature Store.
Example
# connect to the Feature Store
fs = ...
sc = fs.get_storage_connector("demo_fs_meb10000_Training_Datasets")
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the storage connector to retrieve. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
storage_connector.StorageConnector | Storage connector object. |
get_training_dataset #
get_training_dataset(
name: str, version: int = None
) -> training_dataset.TrainingDataset
Get a training dataset entity from the feature store.
Deprecated
TrainingDataset is deprecated, use FeatureView instead. You can still retrieve old training datasets using this method, but after upgrading the old training datasets will also be available under a Feature View with the same name and version.
It is recommended to use this method only for old training datasets that have been created directly from Dataframes and not with Query objects.
Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.
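For example (the training dataset name is illustrative):
# connect to the Feature Store
fs = ...
# get an old training dataset and read it into a dataframe
td = fs.get_training_dataset("sales_model_td", version=1)
df = td.read()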
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the training dataset to get. TYPE: |
version | Version of the training dataset to retrieve, defaults to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
training_dataset.TrainingDataset | The training dataset metadata object. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_training_datasets #
get_training_datasets(
name: str,
) -> list[training_dataset.TrainingDataset]
Get a list of all versions of a training dataset entity from the feature store.
Deprecated
TrainingDataset is deprecated, use FeatureView instead.
Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the training dataset to get. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
list[training_dataset.TrainingDataset] | List of training dataset metadata objects. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | If the backend encounters an error when handling the request. |
get_transformation_function #
get_transformation_function(
name: str, version: int | None = None
) -> TransformationFunction
Get transformation function metadata object.
Get transformation function by name
This will default to version 1.
# get feature store instance
fs = ...
# get transformation function metadata object
plus_one_fn = fs.get_transformation_function(name="plus_one")
Get built-in transformation function min max scaler
# get feature store instance
fs = ...
# get transformation function metadata object
min_max_scaler_fn = fs.get_transformation_function(name="min_max_scaler")
Get transformation function by name and version
# get feature store instance
fs = ...
# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=2)
You can attach transformation functions to a feature view as a list of transformation function instances, each applied to a specific feature. The transformation functions are then applied when you read training data, get batch data, or get feature vector(s).
Attach transformation functions to the feature view
# get feature store instance
fs = ...
# define query object
query = ...
# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=1)
# attach transformation functions
feature_view = fs.create_feature_view(
name='feature_view_name',
query=query,
labels=["target_column"],
transformation_functions=[min_max_scaler("feature1")]
)
Built-in transformation functions are attached in the same way. The only difference is that the necessary statistics for the specific function are computed in the background, for example the min and max values for min_max_scaler, or the mean and standard deviation for standard_scaler.
Attach built-in transformation functions to the feature view
# get feature store instance
fs = ...
# define query object
query = ...
# retrieve transformation functions
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
standard_scaler = fs.get_transformation_function(name="standard_scaler")
robust_scaler = fs.get_transformation_function(name="robust_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")
# attach built-in transformation functions while creating feature view
feature_view = fs.create_feature_view(
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions = [
label_encoder("category_column"),
robust_scaler("weight"),
min_max_scaler("age"),
standard_scaler("salary")
]
)
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of transformation function. TYPE: |
version | Version of the transformation function. Optional; if not provided, all functions that match the provided name will be retrieved. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
TransformationFunction | The TransformationFunction metadata object. |
get_transformation_functions #
get_transformation_functions() -> list[
TransformationFunction
]
Get all transformation functions metadata objects.
Get all transformation functions
# get feature store instance
fs = ...
# get all transformation functions
list_transformation_fns = fs.get_transformation_functions()
| RETURNS | DESCRIPTION |
|---|---|
list[TransformationFunction] | List of transformation function instances. |
sql #
sql(
query: str,
dataframe_type: Literal[
"default",
"spark",
"pandas",
"polars",
"numpy",
"python",
] = "default",
online: bool = False,
read_options: dict | None = None,
) -> pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame
Execute SQL command on the offline or online feature store database.
Example
# connect to the Feature Store
fs = ...
# construct the query and show head rows
query_res_head = fs.sql("SELECT * FROM `fg_1`").head()
| PARAMETER | DESCRIPTION |
|---|---|
query | The SQL query to execute. TYPE: |
dataframe_type | The type of the returned dataframe. Defaults to TYPE: |
online | Set to true to execute the query against the online feature store. TYPE: |
read_options | Additional options as key/value pairs to pass to the execution engine. For the spark engine: a dictionary of read options for Spark. For the python engine: if running queries on the online feature store, users can provide an entry TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame | DataFrame depending on the chosen type. |