
Feature Store#

You can retrieve the current feature store instance using Project.get_feature_store.

FeatureStore #

Feature Store class used to manage feature store entities, like feature groups and feature views.

id property #

id: int

Id of the feature store.

name property #

name: str

Name of the feature store.

offline_featurestore_name property #

offline_featurestore_name: str

Name of the offline feature store database.

online_enabled property #

online_enabled: bool

Indicator whether online feature store is enabled.

online_featurestore_name property #

online_featurestore_name: str | None

Name of the online feature store database.

project_id property #

project_id: int

Id of the project in which the feature store is located.

project_name property #

project_name: str

Name of the project in which the feature store is located.

create_external_feature_group #

create_external_feature_group(
    name: str,
    storage_connector: storage_connector.StorageConnector,
    query: str | None = None,
    data_format: str | None = None,
    path: str | None = "",
    options: dict[str, str] | None = None,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    online_enabled: bool = False,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> feature_group.ExternalFeatureGroup

Create an external feature group metadata object.

Example
# connect to the Feature Store
fs = ...

external_fg = fs.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    query=query,
    storage_connector=connector,
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    ttl=timedelta(days=30),
)
Lazy

This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.

You can enable online storage for external feature groups; however, the sync from the external storage to the Hopsworks online storage needs to be done manually:

external_fg = fs.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    query=query,
    storage_connector=connector,
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    online_enabled=True,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True, # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=30),
)
external_fg.save()

# read from external storage and filter data to sync to online
df = external_fg.read().filter(external_fg.customer_status == "active")

# insert to online storage
external_fg.insert(df)
PARAMETER DESCRIPTION
name

Name of the external feature group to create.

TYPE: str

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector

query

A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used.

TYPE: str | None DEFAULT: None

data_format

If the external feature group refers to a directory with data, the data format to use when reading it.

TYPE: str | None DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: ''

options

Additional options to be used by the engine when reading data from the specified storage connector. For example, {"header": True} when reading CSV files with column names in the first row.

TYPE: dict[str, str] | None DEFAULT: None

version

Version of the external feature group to create, defaults to None and will create the feature group with a version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the external feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

features

Optionally, define the schema of the external feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame resulting from executing the provided query against the data source.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this external feature group,
  • "correlations" to turn on feature correlation computation,
  • "histograms" to compute feature value frequencies, and
  • "exact_uniqueness" to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None
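The dictionary form of statistics_config uses the option names above as keys. A minimal sketch (the values here are illustrative, not defaults):

```python
# Dictionary form of statistics_config; keys mirror the options listed above.
statistics_config = {
    "enabled": True,            # compute descriptive statistics
    "correlations": False,      # skip feature correlation computation
    "histograms": True,         # compute feature value frequencies
    "exact_uniqueness": False,  # skip uniqueness, distinctness and entropy
}

# Passing the boolean False instead turns off statistics computation entirely.
no_stats = False
```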

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

online_enabled

Define whether it should be possible to sync the feature group to the online feature store for low latency access.

TYPE: bool DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

online_config

Optionally, define the configuration used for the online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group.

Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None
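Since ttl accepts either plain seconds or a timedelta, the two forms below express the same 30-day duration, matching the timedelta(days=30) used in the example above:

```python
from datetime import timedelta

# A 30-day TTL expressed in both forms accepted by the ttl parameter
ttl_as_seconds = 30 * 24 * 60 * 60   # integer/float seconds
ttl_as_delta = timedelta(days=30)    # timedelta object

# Both represent the same duration
assert ttl_as_delta.total_seconds() == ttl_as_seconds
```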

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify the online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. By default, the cluster-wide configuration 'featurestore_online_tablespace' identifies the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object.

create_feature_group #

create_feature_group(
    name: str,
    version: int | None = None,
    description: str = "",
    online_enabled: bool = False,
    time_travel_format: str | None = None,
    partition_key: list[str] | None = None,
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    hudi_precombine_key: str | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    stream: bool = False,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    parents: list[feature_group.FeatureGroup] | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    offline_backfill_every_hr: int | str | None = None,
    storage_connector: storage_connector.StorageConnector
    | dict[str, Any]
    | None = None,
    path: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> feature_group.FeatureGroup

Create a feature group metadata object.

Example
# connect to the Feature Store
fs = ...

# define the on-demand transformation functions
@udf(int)
def plus_one(value):
    return value + 1

@udf(int)
def plus_two(value):
    return value + 2

# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_two("feature2")]

fg = fs.create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['city','date'],
    online_enabled=True,
    event_time='date',
    transformation_functions=transformation_functions,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True,  # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=7)  # features will be deleted after 7 days
)
Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the save() method with a DataFrame.

PARAMETER DESCRIPTION
name

Name of the feature group to create.

TYPE: str

version

Version of the feature group to create, defaults to None and will create the feature group with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the feature group to improve discoverability for Data Scientists.

TYPE: str DEFAULT: ''

online_enabled

Define whether the feature group should be made available also in the online feature store for low latency access.

TYPE: bool DEFAULT: False

time_travel_format

Format used for time travel, defaults to "HUDI".

TYPE: str | None DEFAULT: None

partition_key

A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list [].

TYPE: list[str] | None DEFAULT: None

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

embedding_index

If an embedding index is provided, a vector database is used as the online feature store, enabling similarity search via FeatureGroup.find_neighbors.

TYPE: EmbeddingIndex | None DEFAULT: None

hudi_precombine_key

A feature name to be used as a precombine key for the "HUDI" feature group. If the feature group has time travel format "HUDI" and no precombine key is specified, the first primary key of the feature group is used as the Hudi precombine key.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame provided in the save method.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • enabled to generally enable descriptive statistics computation for this feature group,
  • correlations to turn on feature correlation computation,
  • histograms to compute feature value frequencies, and
  • exact_uniqueness to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. By default, it computes only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

stream

Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled feature groups have a single, unified API for transparently writing streaming features to both the online and offline store.

TYPE: bool DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

parents

Optionally, define the parents of this feature group as the origin where the data is coming from.

TYPE: list[feature_group.FeatureGroup] | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

transformation_functions

On-demand transformation functions attached to the feature group. It can be a list of user-defined functions defined using the Hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

online_config

Optionally, define the configuration used for the online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

offline_backfill_every_hr

If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done.

TYPE: int | str | None DEFAULT: None
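The two accepted value forms for offline_backfill_every_hr can be sketched as below. The cron dialect shown is an assumption for illustration; verify the exact expression format supported by the Hopsworks scheduler in its documentation:

```python
# offline_backfill_every_hr accepts either an hour interval or a cron string.
run_every_n_hours = 6           # integer: run the materialization job every 6 hours
run_on_schedule = "0 3 * * *"   # cron expression (dialect assumed for illustration;
                                # check the Hopsworks scheduler docs for the exact format)
```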

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector | dict[str, Any] DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group. Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify the online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. By default, the cluster-wide configuration 'featurestore_online_tablespace' identifies the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup

The feature group metadata object.

create_feature_view #

create_feature_view(
    name: str,
    query: Query,
    version: int | None = None,
    description: str | None = "",
    labels: list[str] | None = None,
    inference_helper_columns: list[str] | None = None,
    training_helper_columns: list[str] | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    logging_enabled: bool | None = False,
    extra_log_columns: list[feature.Feature]
    | list[dict[str, str]]
    | None = None,
) -> feature_view.FeatureView

Create a feature view metadata object and save it to Hopsworks.

Example
# connect to the Feature Store
fs = ...

# get the feature group instances
fg1 = fs.get_or_create_feature_group(...)
fg2 = fs.get_or_create_feature_group(...)

# construct the query
query = fg1.select_all().join(fg2.select_all())

# define the transformation function as a Hopsworks's UDF
@udf(int)
def plus_one(value):
    return value + 1

# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_one("feature2")]

feature_view = fs.create_feature_view(
    name='air_quality_fv',
    version=1,
    transformation_functions=transformation_functions,
    query=query
)
Example
# get feature store instance
fs = ...

# define query object
query = ...

# define list of transformation functions
mapping_transformers = ...

# create feature view
feature_view = fs.create_feature_view(
    name='feature_view_name',
    version=1,
    transformation_functions=mapping_transformers,
    query=query
)
Warning

The as_of argument in the Query will be ignored because feature views do not support time travel queries.

PARAMETER DESCRIPTION
name

Name of the feature view to create.

TYPE: str

query

Feature store Query.

TYPE: Query

version

Version of the feature view to create, defaults to None and will create the feature view with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the feature view to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

labels

A list of feature names constituting the prediction label/feature of the feature view. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

inference_helper_columns

A list of feature names that are not used in training the model itself but can provide extra information during batch or online inference. Inference helper column name(s) must be part of the Query object. If an inference helper column belongs to a feature group that is part of a Join with a prefix defined, this prefix needs to be prepended to the original column name in the inference_helper_columns list. When replaying a Query during model inference, the inference helper columns can optionally be omitted during batch inference (get_batch_data) and will be omitted during online inference (get_feature_vector(s)). To get inference helper column(s) during online inference, use the get_inference_helper(s) method. Defaults to [], no helper columns.

TYPE: list[str] | None DEFAULT: None
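When helper columns come from a feature group joined with a prefix, the prefix must be prepended to each column name. A minimal sketch of that bookkeeping, with hypothetical prefix and column names:

```python
# Hypothetical join prefix and raw column names, for illustration only.
join_prefix = "weather_"
raw_helper_cols = ["station_id", "report_time"]

# Helper columns from the prefixed feature group must carry the prefix.
inference_helper_columns = [join_prefix + col for col in raw_helper_cols]
# inference_helper_columns is now ["weather_station_id", "weather_report_time"]
```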

training_helper_columns

A list of feature names that are not part of the model schema itself but can be used during training as helpers for extra information. Training helper column name(s) must be part of the Query object. If a training helper column belongs to a feature group that is part of a Join with a prefix defined, this prefix needs to be prepended to the original column name in the training_helper_columns list. When replaying a Query during model inference, the training helper columns will be omitted during both batch and online inference. Training helper columns can optionally be fetched with training data. For more details, see the documentation for the feature view's get training data methods. Defaults to [], no training helper columns.

TYPE: list[str] | None DEFAULT: None

transformation_functions

Model-dependent transformation functions attached to the feature view. It can be a list of user-defined functions defined using the Hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

logging_enabled

If true, enable feature logging for the feature view.

TYPE: bool | None DEFAULT: False

extra_log_columns

Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries with the name and type of the columns as keys. Defaults to None, no extra log columns. Setting this argument implicitly enables feature logging.

TYPE: list[feature.Feature] | list[dict[str, str]] | None DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object.

create_on_demand_feature_group #

create_on_demand_feature_group(
    name: str,
    storage_connector: storage_connector.StorageConnector,
    query: str | None = None,
    data_format: str | None = None,
    path: str | None = "",
    options: dict[str, str] | None = None,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    online_enabled: bool = False,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
) -> feature_group.ExternalFeatureGroup

Create an external feature group metadata object.

Deprecated

The create_on_demand_feature_group method is deprecated. Use the create_external_feature_group method instead.

Lazy

This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.

PARAMETER DESCRIPTION
name

Name of the external feature group to create.

TYPE: str

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector

query

A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used.

TYPE: str | None DEFAULT: None

data_format

If the external feature group refers to a directory with data, the data format to use when reading it.

TYPE: str | None DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: ''

options

Additional options to be used by the engine when reading data from the specified storage connector. For example, {"header": True} when reading CSV files with column names in the first row.

TYPE: dict[str, str] | None DEFAULT: None

version

Version of the external feature group to create, defaults to None and will create the feature group with a version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the external feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

features

Optionally, define the schema of the external feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame resulting from executing the provided query against the data source.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this external feature group,
  • "correlations" to turn on feature correlation computation,
  • "histograms" to compute feature value frequencies, and
  • "exact_uniqueness" to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

online_enabled

Define whether it should be possible to sync the feature group to the online feature store for low latency access.

TYPE: bool DEFAULT: False

ttl

Optional time-to-live duration for features in this group.

Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object.

create_training_dataset #

create_training_dataset(
    name: str,
    version: int | None = None,
    description: str | None = "",
    data_format: str | None = "tfrecords",
    coalesce: bool | None = False,
    storage_connector: storage_connector.StorageConnector
    | None = None,
    splits: dict[str, float] | None = None,
    location: str | None = "",
    seed: int | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    label: list[str] | None = None,
    transformation_functions: dict[
        str, TransformationFunction
    ]
    | None = None,
    train_split: str | None = None,
) -> training_dataset.TrainingDataset

Create a training dataset metadata object.

Deprecated

TrainingDataset is deprecated; use FeatureView instead. From version 3.0, training datasets created with this API are no longer visible in the API.

Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save feature data along the metadata in the feature store, call the save() method with a DataFrame or Query.

Data Formats

The feature store currently supports the following data formats for training datasets:

  1. tfrecord
  2. csv
  3. tsv
  4. parquet
  5. avro
  6. orc

The petastorm, hdf5, and npy file formats are currently not supported.

PARAMETER DESCRIPTION
name

Name of the training dataset to create.

TYPE: str

version

Version of the training dataset to retrieve, defaults to None and will create the training dataset with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the training dataset to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

data_format

The data format used to save the training dataset.

TYPE: str | None DEFAULT: 'tfrecords'

coalesce

If true, the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split.

TYPE: bool | None DEFAULT: False

storage_connector

Storage connector defining the sink location for the training dataset; defaults to None, which materializes the training dataset on HopsFS.

TYPE: storage_connector.StorageConnector | None DEFAULT: None

splits

A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to empty dict {}, creating only a single training dataset without splits.

TYPE: dict[str, float] | None DEFAULT: None
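A splits dictionary for a 70/20/10 random split might look like this (the split names are illustrative; pick your own):

```python
# Split names map to the fraction of samples in each split.
splits = {"train": 0.7, "validation": 0.2, "test": 0.1}

# The fractions should sum to 1.0
assert abs(sum(splits.values()) - 1.0) < 1e-9
```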

location

Path to complement the sink storage connector with, e.g., if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.

TYPE: str | None DEFAULT: ''

seed

Optionally, define a seed to create the random splits with, in order to guarantee reproducibility.

TYPE: int | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this feature group,
  • "correlations" to turn on feature correlation computation, and
  • "histograms" to compute feature value frequencies.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

label

A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

transformation_functions

A dictionary mapping transformation functions to the features they should be applied to before writing out the training data and at inference time. Defaults to {}, no transformations.

TYPE: dict[str, TransformationFunction] | None DEFAULT: None

train_split

If splits is set, provide the name of the split that is going to be used for training. The statistics of this split will be used for transformation functions if necessary.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
training_dataset.TrainingDataset

The training dataset metadata object.
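As a library-free illustration of the splits argument described above: split names map to plain float fractions that should together cover the whole dataset. The split names and values below are arbitrary examples, not part of the API.

```python
# Illustrative split definition: split name -> fraction of samples
splits = {"train": 0.7, "validation": 0.1, "test": 0.2}

# The fractions should sum to 1.0 so every sample lands in exactly one split
total = sum(splits.values())
assert abs(total - 1.0) < 1e-9

# With a fixed seed, a random split of N samples yields roughly these sizes
n_samples = 1000
sizes = {name: round(frac * n_samples) for name, frac in splits.items()}
print(sizes)  # {'train': 700, 'validation': 100, 'test': 200}
```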

create_transformation_function #

create_transformation_function(
    transformation_function: HopsworksUdf,
    version: int | None = None,
) -> TransformationFunction

Create a transformation function metadata object.

Example
# define the transformation function as a Hopsworks UDF
@udf(int)
def plus_one(value):
    return value + 1

# create transformation function
plus_one_meta = fs.create_transformation_function(
        transformation_function=plus_one,
        version=1
    )

# persist transformation function in backend
plus_one_meta.save()
Lazy

This method is lazy and does not persist the transformation function in the feature store on its own. To materialize the transformation function and save call the save() method of the transformation function metadata object.
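Under the @udf decorator, plus_one from the example above is ordinary Python; its behavior can be checked directly before registering it with the feature store:

```python
# The body of the UDF from the example above, minus the Hopsworks decorator
def plus_one(value: int) -> int:
    return value + 1

# Sanity-check the logic locally before creating the transformation function
assert plus_one(41) == 42
```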

PARAMETER DESCRIPTION
transformation_function

Hopsworks UDF.

TYPE: HopsworksUdf

RETURNS DESCRIPTION
TransformationFunction

The TransformationFunction metadata object.

get_external_feature_group #

get_external_feature_group(
    name: str, version: int = None
) -> feature_group.ExternalFeatureGroup

Get an external feature group entity from the feature store.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

external_fg = fs.get_external_feature_group("external_fg_test")
PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

version

Version of the external feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_external_feature_groups #

get_external_feature_groups(
    name: str | None = None,
) -> list[feature_group.ExternalFeatureGroup]

Get a list of all external feature groups from the feature store, or all versions of an external feature group.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

external_fgs_list = fs.get_external_feature_groups("external_fg_test")
Example
# connect to the Feature Store
fs = ...

# retrieve all external feature groups available in the feature store
external_fgs_list = fs.get_external_feature_groups()
PARAMETER DESCRIPTION
name

Name of the external feature group to get the versions of; by default it is None and all external feature groups are returned.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[feature_group.ExternalFeatureGroup]

List of external feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_group #

get_feature_group(
    name: str, version: int = None
) -> (
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
)

Get a feature group entity from the feature store.

Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

fg = fs.get_feature_group(
        name="electricity_prices",
        version=1,
    )
PARAMETER DESCRIPTION
name

Name of the feature group to get.

TYPE: str

version

Version of the feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup

The feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_groups #

get_feature_groups(
    name: str | None = None,
) -> list[
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
]

Get all feature groups from the feature store, or all versions of a feature group specified by its name.

Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

# retrieve all versions of electricity_prices feature group
fgs_list = fs.get_feature_groups(
        name="electricity_prices"
    )
Example
# connect to the Feature Store
fs = ...

# retrieve all feature groups available in the feature store
fgs_list = fs.get_feature_groups()
PARAMETER DESCRIPTION
name

Name of the feature group to get the versions of; by default it is None and all feature groups are returned.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup]

List of feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_view #

get_feature_view(
    name: str, version: int = None
) -> feature_view.FeatureView

Get a feature view entity from the feature store.

Getting a feature view from the Feature Store means getting its metadata.

Example
# get feature store instance
fs = ...

# get feature view instance
feature_view = fs.get_feature_view(
    name='feature_view_name',
    version=1
)
PARAMETER DESCRIPTION
name

Name of the feature view to get.

TYPE: str

version

Version of the feature view to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_views #

get_feature_views(
    name: str,
) -> list[feature_view.FeatureView]

Get a list of all versions of a feature view entity from the feature store.

Getting a feature view from the Feature Store means getting its metadata.

Example
# get feature store instance
fs = ...

# get a list of all versions of a feature view
feature_view = fs.get_feature_views(
    name='feature_view_name'
)
PARAMETER DESCRIPTION
name

Name of the feature view to get.

TYPE: str

RETURNS DESCRIPTION
list[feature_view.FeatureView]

List of feature view metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_on_demand_feature_group #

get_on_demand_feature_group(
    name: str, version: int = None
) -> feature_group.ExternalFeatureGroup

Get an external feature group entity from the feature store.

Deprecated

get_on_demand_feature_group method is deprecated. Use the get_external_feature_group method instead.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

version

Version of the external feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_on_demand_feature_groups #

get_on_demand_feature_groups(
    name: str,
) -> list[feature_group.ExternalFeatureGroup]

Get a list of all versions of an external feature group entity from the feature store.

Deprecated

get_on_demand_feature_groups method is deprecated. Use the get_external_feature_groups method instead.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

RETURNS DESCRIPTION
list[feature_group.ExternalFeatureGroup]

List of external feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_online_storage_connector #

get_online_storage_connector() -> (
    storage_connector.StorageConnector
)

Get the storage connector for the Online Feature Store of the respective project's feature store.

The returned storage connector depends on the project that you are connected to.

Example
# connect to the Feature Store
fs = ...

online_storage_connector = fs.get_online_storage_connector()
RETURNS DESCRIPTION
storage_connector.StorageConnector

JDBC storage connector to the Online Feature Store.

get_or_create_feature_group #

get_or_create_feature_group(
    name: str,
    version: int,
    description: str | None = "",
    online_enabled: bool | None = False,
    time_travel_format: str | None = None,
    partition_key: list[str] | None = None,
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    hudi_precombine_key: str | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    event_time: str | None = None,
    stream: bool | None = False,
    parents: list[feature_group.FeatureGroup] | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    offline_backfill_every_hr: int | str | None = None,
    storage_connector: storage_connector.StorageConnector
    | dict[str, Any] = None,
    path: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> (
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
)

Get feature group metadata object or create a new one if it doesn't exist.

This method doesn't update an existing feature group metadata object.

Example
# connect to the Feature Store
fs = ...

fg = fs.get_or_create_feature_group(
    name="electricity_prices",
    version=1,
    description="Electricity prices from NORD POOL",
    primary_key=["day", "area"],
    online_enabled=True,
    event_time="timestamp",
    transformation_functions=transformation_functions,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True, # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=30),
)
Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the insert() method with a DataFrame.

PARAMETER DESCRIPTION
name

Name of the feature group to create.

TYPE: str

version

Version of the feature group to retrieve or create.

TYPE: int

description

A string describing the contents of the feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

online_enabled

Define whether the feature group should be made available also in the online feature store for low latency access.

TYPE: bool | None DEFAULT: False

time_travel_format

Format used for time travel, defaults to "HUDI".

TYPE: str | None DEFAULT: None

partition_key

A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list [].

TYPE: list[str] | None DEFAULT: None

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

embedding_index

If an embedding index is provided, a vector database is used as the online feature store. This enables similarity search using FeatureGroup.find_neighbors.

TYPE: EmbeddingIndex | None DEFAULT: None

hudi_precombine_key

A feature name to be used as a precombine key for the "HUDI" feature group. If the feature group has time travel format "HUDI" and no hudi precombine key was specified, the first primary key of the feature group will be used as the hudi precombine key.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame provided in the save method.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • enabled to generally enable descriptive statistics computation for this feature group,
  • correlations to turn on feature correlation computation,
  • histograms to compute feature value frequencies, and
  • exact_uniqueness to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. By default, it computes only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note: Event time data type restriction The supported data types for the event time column are: timestamp, date and bigint.

TYPE: str | None DEFAULT: None

stream

Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled Feature Groups have a unified single API for writing streaming features transparently to both the online and offline store.

TYPE: bool | None DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group, against which dataframes will be validated upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

parents

Optionally, define the parents of this feature group as the origin where the data is coming from.

TYPE: list[feature_group.FeatureGroup] | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to using the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

transformation_functions

On-Demand Transformation functions attached to the feature group. It can be a list of user-defined functions defined using the hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

online_config

Optionally, define configuration which is used to configure online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

offline_backfill_every_hr

If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done.

TYPE: int | str | None DEFAULT: None

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector | dict[str, Any] DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group. Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group, and when the system time exceeds event time + ttl, the entries are automatically removed. System time is evaluated in UTC.

By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. Defaults to using the cluster-wide configuration 'featurestore_online_tablespace' to identify the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup

The feature group metadata object.
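The ttl semantics described above (a row expires once UTC system time passes event time + ttl) reduce to plain datetime arithmetic. A minimal sketch with made-up values; only the timedelta itself corresponds to the ttl argument:

```python
from datetime import datetime, timedelta, timezone

ttl = timedelta(days=30)  # same form the ttl argument accepts

# A row's event time, in UTC
event_time = datetime(2024, 1, 1, tzinfo=timezone.utc)

# The row becomes eligible for removal once UTC system time passes this point
expires_at = event_time + ttl
print(expires_at.isoformat())  # 2024-01-31T00:00:00+00:00

# Equivalent TTL expressed as seconds (the float form of the argument)
assert ttl.total_seconds() == 30 * 24 * 3600
```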

get_or_create_feature_view #

get_or_create_feature_view(
    name: str,
    query: Query,
    version: int,
    description: str | None = "",
    labels: list[str] | None = None,
    inference_helper_columns: list[str] | None = None,
    training_helper_columns: list[str] | None = None,
    transformation_functions: dict[
        str, TransformationFunction
    ]
    | None = None,
    logging_enabled: bool | None = False,
    extra_log_columns: list[feature.Feature]
    | list[dict[str, str]]
    | None = None,
) -> feature_view.FeatureView

Get feature view metadata object or create a new one if it doesn't exist.

This method doesn't update an existing feature view metadata object.

Example
# connect to the Feature Store
fs = ...

feature_view = fs.get_or_create_feature_view(
    name='bitcoin_feature_view',
    version=1,
    transformation_functions=transformation_functions,
    query=query
)
PARAMETER DESCRIPTION
name

Name of the feature view to create.

TYPE: str

query

Feature store Query.

TYPE: Query

version

Version of the feature view to create.

TYPE: int

description

A string describing the contents of the feature view to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

labels

A list of feature names constituting the prediction label/feature of the feature view. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

inference_helper_columns

A list of feature names that are not used in training the model itself but can be used during batch or online inference for extra information. Inference helper column name(s) must be part of the Query object. If inference helper column name(s) belong to a feature group that is part of a Join with a prefix defined, then this prefix needs to be prepended to the original column name when defining the inference_helper_columns list. When replaying a Query during model inference, the inference helper columns can optionally be omitted during batch inference (get_batch_data) and will be omitted during online inference (get_feature_vector(s)). To get inference helper column(s) during online inference, use the get_inference_helper(s) method. Defaults to [], no helper columns.

TYPE: list[str] | None DEFAULT: None

training_helper_columns

A list of feature names that are not part of the model schema itself but can be used during training as helpers for extra information. Training helper column name(s) must be part of the Query object. If training helper column name(s) belong to a feature group that is part of a Join with a prefix defined, then this prefix needs to be prepended to the original column name when defining the training_helper_columns list. When replaying a Query during model inference, the training helper columns will be omitted during both batch and online inference. Training helper columns can optionally be fetched with training data. For more details see the documentation for the feature view's get training data methods. Defaults to [], no training helper columns.

TYPE: list[str] | None DEFAULT: None

transformation_functions

Model-Dependent Transformation functions attached to the feature view. It can be a list of user-defined functions defined using the hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: dict[str, TransformationFunction] | None DEFAULT: None

logging_enabled

If true, enable feature logging for the feature view.

TYPE: bool | None DEFAULT: False

extra_log_columns

Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries that contain the name and type of the columns as keys. Defaults to None, no extra log columns. Setting this argument implicitly enables feature logging.

TYPE: list[feature.Feature] | list[dict[str, str]] | None DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object.

get_or_create_spine_group #

get_or_create_spine_group(
    name: str,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    event_time: str | None = None,
    features: list[feature.Feature] | None = None,
    dataframe: pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list] = None,
) -> feature_group.SpineGroup

Create a spine group metadata object.

Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels. A Spine is essentially a metadata object similar to a feature group; however, the data is not materialized in the feature store. It only contains the needed metadata, such as the relevant event time column and primary key columns, to perform point-in-time correct joins.

Example
# connect to the Feature Store
fs = ...

spine_df = pd.DataFrame()

spine_group = fs.get_or_create_spine_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    dataframe=spine_df,
)

Note that you can inspect the dataframe in the spine group, or replace the dataframe:

spine_group.dataframe.show()

spine_group.dataframe = new_df

The spine can then be used to construct queries, with only one speciality:

Note

Spines can only be used on the left side of a feature join, as this is the base set of entities for which features are to be fetched and the left side of the join determines the event timestamps to compare against.

If you want the query of a feature view to be used for online serving, you can only select the label or target feature from the spine. For the online lookup, the label is not required; therefore, it is important to only select the label from the left feature group, so that you don't need to provide a spine for online serving.

These queries can then be used to create feature views. Since the dataframe contained in the spine is not being materialized, every time you use a feature view created with spine to read data you will have to provide a dataframe with the same structure again.

For example, to generate training data:

X_train, X_test, y_train, y_test = feature_view_spine.train_test_split(0.2, spine=training_data_entities)

Or to get batches of fresh data for batch scoring:

feature_view_spine.get_batch_data(spine=scoring_entities_df).show()

Here you have the chance to pass a different set of entities to generate the training dataset.

Sometimes it might be handy to create a feature view with a regular feature group containing the label, but then, at serving time, use a spine in order to fetch features for only a small set of primary key values, for example. To do this, you can pass the spine group instead of a dataframe. Just make sure it contains the needed primary key, event time and label columns.

feature_view.get_batch_data(spine=spine_group)
PARAMETER DESCRIPTION
name

Name of the spine group to create.

TYPE: str

version

Version of the spine group to retrieve or create; defaults to None and will create the spine group with the version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the spine group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the spine group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the spine group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this spine group. If event_time is set the spine group can be used for point-in-time joins.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the spine group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the provided DataFrame.

Note: Event time data type restriction The supported data types for the event time column are: timestamp, date and bigint.

TYPE: list[feature.Feature] | None DEFAULT: None

dataframe

Spine dataframe with primary key, event time and label column to use for point in time join when fetching features.

TYPE: pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list] DEFAULT: None

RETURNS DESCRIPTION
feature_group.SpineGroup

The spine group metadata object.

get_storage_connector #

get_storage_connector(
    name: str,
) -> storage_connector.StorageConnector

Get a previously created storage connector from the feature store.

Storage connectors encapsulate all information needed for the execution engine to read and write to specific storage. This storage can be S3, a JDBC compliant database or the distributed filesystem HopsFS.

If you want to connect to the online feature store, see the get_online_storage_connector method to get the JDBC connector for the Online Feature Store.

Example
# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("demo_fs_meb10000_Training_Datasets")
PARAMETER DESCRIPTION
name

Name of the storage connector to retrieve.

TYPE: str

RETURNS DESCRIPTION
storage_connector.StorageConnector

Storage connector object.

get_training_dataset #

get_training_dataset(
    name: str, version: int = None
) -> training_dataset.TrainingDataset

Get a training dataset entity from the feature store.

Deprecated

TrainingDataset is deprecated, use FeatureView instead. You can still retrieve old training datasets using this method, but after upgrading the old training datasets will also be available under a Feature View with the same name and version.

It is recommended to use this method only for old training datasets that have been created directly from DataFrames and not with Query objects.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

PARAMETER DESCRIPTION
name

Name of the training dataset to get.

TYPE: str

version

Version of the training dataset to retrieve, defaults to None and will return the version=1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
training_dataset.TrainingDataset

The training dataset metadata object.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_training_datasets #

get_training_datasets(
    name: str,
) -> list[training_dataset.TrainingDataset]

Get a list of all versions of a training dataset entity from the feature store.

Deprecated

TrainingDataset is deprecated, use FeatureView instead.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

PARAMETER DESCRIPTION
name

Name of the training dataset to get.

TYPE: str

RETURNS DESCRIPTION
list[training_dataset.TrainingDataset]

List of training dataset metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_transformation_function #

get_transformation_function(
    name: str, version: int | None = None
) -> TransformationFunction

Get transformation function metadata object.

Get transformation function by name

This will default to version 1.

# get feature store instance
fs = ...

# get transformation function metadata object
plus_one_fn = fs.get_transformation_function(name="plus_one")
Get the built-in min-max scaler transformation function
# get feature store instance
fs = ...

# get transformation function metadata object
min_max_scaler_fn = fs.get_transformation_function(name="min_max_scaler")
Get transformation function by name and version
# get feature store instance
fs = ...

# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=2)

You can attach transformation functions to a feature view as a list of transformation function instances, each applied to one or more features. The attached functions are then applied when you read training data, get batch data, or get feature vector(s).

Attach transformation functions to the feature view
# get feature store instance
fs = ...

# define query object
query = ...

# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=1)

# attach transformation functions
feature_view = fs.create_feature_view(
    name='feature_view_name',
    query=query,
    labels=["target_column"],
    transformation_functions=[min_max_scaler("feature1")]
)

Built-in transformation functions are attached in the same way. The only difference is that the necessary statistics for the specific function are computed in the background: for example, min and max values for min_max_scaler, and mean and standard deviation for standard_scaler.

Attach built-in transformation functions to the feature view
# get feature store instance
fs = ...

# define query object
query = ...

# retrieve transformation functions
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
standard_scaler = fs.get_transformation_function(name="standard_scaler")
robust_scaler = fs.get_transformation_function(name="robust_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# attach built-in transformation functions while creating feature view
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=query,
    labels=["fraud_label"],
    transformation_functions = [
        label_encoder("category_column"),
        robust_scaler("weight"),
        min_max_scaler("age"),
        standard_scaler("salary")
    ]
)
PARAMETER DESCRIPTION
name

Name of transformation function.

TYPE: str

version

Version of the transformation function. Optional; if not provided, all functions matching the provided name will be retrieved.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
TransformationFunction

The TransformationFunction metadata object.

get_transformation_functions #

get_transformation_functions() -> list[
    TransformationFunction
]

Get all transformation functions metadata objects.

Get all transformation functions
# get feature store instance
fs = ...

# get all transformation functions
list_transformation_fns = fs.get_transformation_functions()
RETURNS DESCRIPTION
list[TransformationFunction]

List of transformation function instances.

sql #

sql(
    query: str,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
    online: bool = False,
    read_options: dict | None = None,
) -> pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame

Execute SQL command on the offline or online feature store database.

Example
# connect to the Feature Store
fs = ...

# construct the query and show head rows
query_res_head = fs.sql("SELECT * FROM `fg_1`").head()
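A hedged sketch of querying the online feature store from an external Python client, using the read_options entry documented for the python engine; the table name `fg_1` is a placeholder:

```python
# connect to the Feature Store
fs = ...

# query the online feature store from outside the cluster
df = fs.sql(
    "SELECT * FROM `fg_1`",
    online=True,
    read_options={"external": True},
)
```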
PARAMETER DESCRIPTION
query

The SQL query to execute.

TYPE: str

dataframe_type

The type of the returned dataframe. Defaults to "default", which maps to a Spark DataFrame for the Spark engine and a Pandas DataFrame for the Python engine.

TYPE: Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python'] DEFAULT: 'default'

online

Set to True to execute the query against the online feature store.

TYPE: bool DEFAULT: False

read_options

Additional options as key/value pairs to pass to the execution engine.

For spark engine: Dictionary of read options for Spark.

For python engine: when running queries against the online feature store, users can provide the entry {'external': True}. This instructs the library to use the host parameter of hopsworks.login to establish the connection to the online feature store. If not set, or set to False, the online feature store storage connector is used, which relies on the private IP.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame

A DataFrame of the type requested via dataframe_type.