
Feature Store#

You can retrieve the current feature store instance using Project.get_feature_store.

FeatureStore #

Feature Store class used to manage feature store entities, like feature groups and feature views.

id property #

id: int

Id of the feature store.

name property #

name: str

Name of the feature store.

offline_featurestore_name property #

offline_featurestore_name: str

Name of the offline feature store database.

online_enabled property #

online_enabled: bool

Indicator whether online feature store is enabled.

online_featurestore_name property #

online_featurestore_name: str | None

Name of the online feature store database.

project_id property #

project_id: int

Id of the project in which the feature store is located.

project_name property #

project_name: str

Name of the project in which the feature store is located.

create_external_feature_group #

create_external_feature_group(
    name: str,
    storage_connector: storage_connector.StorageConnector,
    query: str | None = None,
    data_format: str | None = None,
    path: str | None = "",
    options: dict[str, str] | None = None,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    online_enabled: bool = False,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> feature_group.ExternalFeatureGroup

Create an external feature group metadata object.

Example
# connect to the Feature Store
fs = ...

external_fg = fs.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    query=query,
    storage_connector=connector,
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    ttl=timedelta(days=30),
)
Lazy

This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.

You can enable online storage for external feature groups; however, the sync from the external storage to the Hopsworks online storage needs to be done manually:

external_fg = fs.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    query=query,
    storage_connector=connector,
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    online_enabled=True,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True, # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=30),
)
external_fg.save()

# read from external storage and filter data to sync to online
df = external_fg.read().filter(external_fg.customer_status == "active")

# insert to online storage
external_fg.insert(df)
PARAMETER DESCRIPTION
name

Name of the external feature group to create.

TYPE: str

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector

query

A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used.

TYPE: str | None DEFAULT: None

data_format

If the external feature group refers to a directory with data, the data format to use when reading it.

TYPE: str | None DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: ''

options

Additional options to be used by the engine when reading data from the specified storage connector. For example, {"header": True} when reading CSV files with column names in the first row.

TYPE: dict[str, str] | None DEFAULT: None

version

Version of the external feature group to create, defaults to None and will create the feature group with a version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the external feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

features

Optionally, define the schema of the external feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame resulting from executing the provided query against the data source.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this external feature group,
  • "correlations" to turn on feature correlation computation,
  • "histograms" to compute feature value frequencies, and
  • "exact_uniqueness" to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None
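The dictionary form of statistics_config uses the option names above as keys. A minimal sketch (the values here are illustrative, not defaults):

```python
# Dictionary form of statistics_config; keys mirror the options listed above.
statistics_config = {
    "enabled": True,            # compute descriptive statistics
    "correlations": False,      # skip feature correlation computation
    "histograms": True,         # compute feature value frequencies
    "exact_uniqueness": False,  # skip uniqueness, distinctness and entropy
}

# Passing the boolean False instead turns off statistics computation entirely.
no_stats = False
```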

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

online_enabled

Define whether it should be possible to sync the feature group to the online feature store for low latency access.

TYPE: bool DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

online_config

Optionally, define the configuration used for the online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group.

Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None
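Since ttl accepts either plain seconds or a timedelta, the two forms below express the same 30-day duration, matching the timedelta(days=30) used in the example above:

```python
from datetime import timedelta

# A 30-day TTL expressed in both forms accepted by the ttl parameter
ttl_as_seconds = 30 * 24 * 60 * 60   # integer/float seconds
ttl_as_delta = timedelta(days=30)    # timedelta object

# Both represent the same duration
assert ttl_as_delta.total_seconds() == ttl_as_seconds
```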

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify the online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. By default, the cluster-wide configuration 'featurestore_online_tablespace' identifies the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object.

create_feature_group #

create_feature_group(
    name: str,
    version: int | None = None,
    description: str = "",
    online_enabled: bool = False,
    time_travel_format: str | None = None,
    partition_key: list[str] | None = None,
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    hudi_precombine_key: str | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    stream: bool = False,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    parents: list[feature_group.FeatureGroup] | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    offline_backfill_every_hr: int | str | None = None,
    storage_connector: storage_connector.StorageConnector
    | dict[str, Any]
    | None = None,
    path: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> feature_group.FeatureGroup

Create a feature group metadata object.

Example
# connect to the Feature Store
fs = ...

# define the on-demand transformation functions
@udf(int)
def plus_one(value):
    return value + 1

@udf(int)
def plus_two(value):
    return value + 2

# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_two("feature2")]

fg = fs.create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['city','date'],
    online_enabled=True,
    event_time='date',
    transformation_functions=transformation_functions,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True,  # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=7)  # features will be deleted after 7 days
)
Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the save() method with a DataFrame.

PARAMETER DESCRIPTION
name

Name of the feature group to create.

TYPE: str

version

Version of the feature group to create, defaults to None and will create the feature group with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the feature group to improve discoverability for Data Scientists.

TYPE: str DEFAULT: ''

online_enabled

Define whether the feature group should be made available also in the online feature store for low latency access.

TYPE: bool DEFAULT: False

time_travel_format

Format used for time travel, defaults to "HUDI".

TYPE: str | None DEFAULT: None

partition_key

A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list [].

TYPE: list[str] | None DEFAULT: None

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

embedding_index

If an embedding index is provided, a vector database is used as the online feature store, enabling similarity search via FeatureGroup.find_neighbors.

TYPE: EmbeddingIndex | None DEFAULT: None

hudi_precombine_key

A feature name to be used as a precombine key for the "HUDI" feature group. If the feature group has time travel format "HUDI" and no precombine key is specified, the first primary key of the feature group is used as the Hudi precombine key.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame provided in the save method.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • enabled to generally enable descriptive statistics computation for this feature group,
  • correlations to turn on feature correlation computation,
  • histograms to compute feature value frequencies, and
  • exact_uniqueness to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. By default, it computes only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

stream

Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled feature groups have a single, unified API for transparently writing streaming features to both the online and offline store.

TYPE: bool DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

parents

Optionally, define the parents of this feature group as the origin where the data is coming from.

TYPE: list[feature_group.FeatureGroup] | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

transformation_functions

On-demand transformation functions attached to the feature group. It can be a list of user-defined functions defined using the Hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

online_config

Optionally, define the configuration used for the online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

offline_backfill_every_hr

If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done.

TYPE: int | str | None DEFAULT: None
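The two accepted value forms for offline_backfill_every_hr can be sketched as below. The cron dialect shown is an assumption for illustration; verify the exact expression format supported by the Hopsworks scheduler in its documentation:

```python
# offline_backfill_every_hr accepts either an hour interval or a cron string.
run_every_n_hours = 6           # integer: run the materialization job every 6 hours
run_on_schedule = "0 3 * * *"   # cron expression (dialect assumed for illustration;
                                # check the Hopsworks scheduler docs for the exact format)
```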

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector | dict[str, Any] DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group. Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify the online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. By default, the cluster-wide configuration 'featurestore_online_tablespace' identifies the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup

The feature group metadata object.

create_feature_view #

create_feature_view(
    name: str,
    query: Query,
    version: int | None = None,
    description: str | None = "",
    labels: list[str] | None = None,
    inference_helper_columns: list[str] | None = None,
    training_helper_columns: list[str] | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    logging_enabled: bool | None = False,
    extra_log_columns: list[feature.Feature]
    | list[dict[str, str]]
    | None = None,
) -> feature_view.FeatureView

Create a feature view metadata object and save it to Hopsworks.

Example
# connect to the Feature Store
fs = ...

# get the feature group instances
fg1 = fs.get_or_create_feature_group(...)
fg2 = fs.get_or_create_feature_group(...)

# construct the query
query = fg1.select_all().join(fg2.select_all())

# define the transformation function as a Hopsworks's UDF
@udf(int)
def plus_one(value):
    return value + 1

# construct list of "transformation functions" on features
transformation_functions = [plus_one("feature1"), plus_one("feature2")]

feature_view = fs.create_feature_view(
    name='air_quality_fv',
    version=1,
    transformation_functions=transformation_functions,
    query=query
)
Example
# get feature store instance
fs = ...

# define query object
query = ...

# define list of transformation functions
mapping_transformers = ...

# create feature view
feature_view = fs.create_feature_view(
    name='feature_view_name',
    version=1,
    transformation_functions=mapping_transformers,
    query=query
)
Warning

The as_of argument in the Query will be ignored because feature views do not support time travel queries.

PARAMETER DESCRIPTION
name

Name of the feature view to create.

TYPE: str

query

Feature store Query.

TYPE: Query

version

Version of the feature view to create, defaults to None and will create the feature view with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the feature view to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

labels

A list of feature names constituting the prediction label/feature of the feature view. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

inference_helper_columns

A list of feature names that are not used in training the model itself but can provide extra information during batch or online inference. Inference helper column name(s) must be part of the Query object. If an inference helper column belongs to a feature group that is part of a Join with a prefix defined, this prefix needs to be prepended to the original column name in the inference_helper_columns list. When replaying a Query during model inference, the inference helper columns can optionally be omitted during batch inference (get_batch_data) and will be omitted during online inference (get_feature_vector(s)). To get inference helper column(s) during online inference, use the get_inference_helper(s) method. Defaults to [], no helper columns.

TYPE: list[str] | None DEFAULT: None
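When helper columns come from a feature group joined with a prefix, the prefix must be prepended to each column name. A minimal sketch of that bookkeeping, with hypothetical prefix and column names:

```python
# Hypothetical join prefix and raw column names, for illustration only.
join_prefix = "weather_"
raw_helper_cols = ["station_id", "report_time"]

# Helper columns from the prefixed feature group must carry the prefix.
inference_helper_columns = [join_prefix + col for col in raw_helper_cols]
# inference_helper_columns is now ["weather_station_id", "weather_report_time"]
```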

training_helper_columns

A list of feature names that are not part of the model schema itself but can be used during training as helpers for extra information. Training helper column name(s) must be part of the Query object. If a training helper column belongs to a feature group that is part of a Join with a prefix defined, this prefix needs to be prepended to the original column name in the training_helper_columns list. When replaying a Query during model inference, the training helper columns will be omitted during both batch and online inference. Training helper columns can optionally be fetched with training data. For more details, see the documentation for the feature view's get training data methods. Defaults to [], no training helper columns.

TYPE: list[str] | None DEFAULT: None

transformation_functions

Model-dependent transformation functions attached to the feature view. It can be a list of user-defined functions defined using the Hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

logging_enabled

If true, enable feature logging for the feature view.

TYPE: bool | None DEFAULT: False

extra_log_columns

Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries with the name and type of the columns as keys. Defaults to None, no extra log columns. Setting this argument implicitly enables feature logging.

TYPE: list[feature.Feature] | list[dict[str, str]] | None DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object.

create_on_demand_feature_group #

create_on_demand_feature_group(
    name: str,
    storage_connector: storage_connector.StorageConnector,
    query: str | None = None,
    data_format: str | None = None,
    path: str | None = "",
    options: dict[str, str] | None = None,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    event_time: str | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    online_enabled: bool = False,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
) -> feature_group.ExternalFeatureGroup

Create an external feature group metadata object.

Deprecated

The create_on_demand_feature_group method is deprecated. Use the create_external_feature_group method instead.

Lazy

This method is lazy and does not persist any metadata in the feature store on its own. To persist the feature group metadata in the feature store, call the save() method.

PARAMETER DESCRIPTION
name

Name of the external feature group to create.

TYPE: str

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector

query

A string containing a SQL query valid for the target data source. The query will be used to pull data from the data sources when the feature group is used.

TYPE: str | None DEFAULT: None

data_format

If the external feature group refers to a directory with data, the data format to use when reading it.

TYPE: str | None DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: ''

options

Additional options to be used by the engine when reading data from the specified storage connector. For example, {"header": True} when reading CSV files with column names in the first row.

TYPE: dict[str, str] | None DEFAULT: None

version

Version of the external feature group to create, defaults to None and will create the feature group with a version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the external feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

features

Optionally, define the schema of the external feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame resulting from executing the provided query against the data source.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this external feature group,
  • "correlations" to turn on feature correlation computation,
  • "histograms" to compute feature value frequencies, and
  • "exact_uniqueness" to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note (event time data type restriction): the supported data types for the event time column are timestamp, date, and bigint.

TYPE: str | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

expectation_suite

Optionally, attach an expectation suite to the feature group which dataframes should be validated against upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

online_enabled

Define whether it should be possible to sync the feature group to the online feature store for low latency access.

TYPE: bool DEFAULT: False

ttl

Optional time-to-live duration for features in this group.

Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group; when the system time exceeds event time + ttl, the entries are automatically removed. System time is in UTC. By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object.

create_training_dataset #

create_training_dataset(
    name: str,
    version: int | None = None,
    description: str | None = "",
    data_format: str | None = "tfrecords",
    coalesce: bool | None = False,
    storage_connector: storage_connector.StorageConnector
    | None = None,
    splits: dict[str, float] | None = None,
    location: str | None = "",
    seed: int | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    label: list[str] | None = None,
    transformation_functions: dict[
        str, TransformationFunction
    ]
    | None = None,
    train_split: str | None = None,
) -> training_dataset.TrainingDataset

Create a training dataset metadata object.

Deprecated

TrainingDataset is deprecated; use FeatureView instead. From version 3.0, training datasets created with this API are no longer visible in the API.

Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save feature data along the metadata in the feature store, call the save() method with a DataFrame or Query.

Data Formats

The feature store currently supports the following data formats for training datasets:

  1. tfrecord
  2. csv
  3. tsv
  4. parquet
  5. avro
  6. orc

The petastorm, hdf5, and npy file formats are currently not supported.

PARAMETER DESCRIPTION
name

Name of the training dataset to create.

TYPE: str

version

Version of the training dataset to retrieve, defaults to None and will create the training dataset with incremented version from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the training dataset to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

data_format

The data format used to save the training dataset.

TYPE: str | None DEFAULT: 'tfrecords'

coalesce

If true, the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split.

TYPE: bool | None DEFAULT: False

storage_connector

Storage connector defining the sink location for the training dataset; defaults to None, which materializes the training dataset on HopsFS.

TYPE: storage_connector.StorageConnector | None DEFAULT: None

splits

A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to empty dict {}, creating only a single training dataset without splits.

TYPE: dict[str, float] | None DEFAULT: None
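A splits dictionary for a 70/20/10 random split might look like this (the split names are illustrative; pick your own):

```python
# Split names map to the fraction of samples in each split.
splits = {"train": 0.7, "validation": 0.2, "test": 0.1}

# The fractions should sum to 1.0
assert abs(sum(splits.values()) - 1.0) < 1e-9
```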

location

Path to complement the sink storage connector with, e.g., if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.

TYPE: str | None DEFAULT: ''

seed

Optionally, define a seed to create the random splits with, in order to guarantee reproducibility.

TYPE: int | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • "enabled" to generally enable descriptive statistics computation for this feature group,
  • "correlations" to turn on feature correlation computation, and
  • "histograms" to compute feature value frequencies.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

label

A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

transformation_functions

A dictionary mapping transformation functions to the features they should be applied to before writing out the training data and at inference time. Defaults to {}, no transformations.

TYPE: dict[str, TransformationFunction] | None DEFAULT: None

train_split

If splits is set, provide the name of the split that is going to be used for training. The statistics of this split will be used for transformation functions if necessary.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
training_dataset.TrainingDataset

The training dataset metadata object.
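As a library-free illustration of the splits argument described above: split names map to plain float fractions that should together cover the whole dataset. The split names and values below are arbitrary examples, not part of the API.

```python
# Illustrative split definition: split name -> fraction of samples
splits = {"train": 0.7, "validation": 0.1, "test": 0.2}

# The fractions should sum to 1.0 so every sample lands in exactly one split
total = sum(splits.values())
assert abs(total - 1.0) < 1e-9

# With a fixed seed, a random split of N samples yields roughly these sizes
n_samples = 1000
sizes = {name: round(frac * n_samples) for name, frac in splits.items()}
print(sizes)  # {'train': 700, 'validation': 100, 'test': 200}
```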

create_transformation_function #

create_transformation_function(
    transformation_function: HopsworksUdf,
    version: int | None = None,
) -> TransformationFunction

Create a transformation function metadata object.

Example
# define the transformation function as a Hopsworks UDF
@udf(int)
def plus_one(value):
    return value + 1

# create transformation function
plus_one_meta = fs.create_transformation_function(
        transformation_function=plus_one,
        version=1
    )

# persist transformation function in backend
plus_one_meta.save()
Lazy

This method is lazy and does not persist the transformation function in the feature store on its own. To materialize the transformation function and save call the save() method of the transformation function metadata object.
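Under the @udf decorator, plus_one from the example above is ordinary Python; its behavior can be checked directly before registering it with the feature store:

```python
# The body of the UDF from the example above, minus the Hopsworks decorator
def plus_one(value: int) -> int:
    return value + 1

# Sanity-check the logic locally before creating the transformation function
assert plus_one(41) == 42
```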

PARAMETER DESCRIPTION
transformation_function

Hopsworks UDF.

TYPE: HopsworksUdf

RETURNS DESCRIPTION
TransformationFunction

The TransformationFunction metadata object.

get_external_feature_group #

get_external_feature_group(
    name: str, version: int = None
) -> feature_group.ExternalFeatureGroup

Get an external feature group entity from the feature store.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

external_fg = fs.get_external_feature_group("external_fg_test")
PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

version

Version of the external feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_external_feature_groups #

get_external_feature_groups(
    name: str | None = None,
) -> list[feature_group.ExternalFeatureGroup]

Get a list of all external feature groups from the feature store, or all versions of an external feature group.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

external_fgs_list = fs.get_external_feature_groups("external_fg_test")
Example
# connect to the Feature Store
fs = ...

# retrieve all external feature groups available in the feature store
external_fgs_list = fs.get_external_feature_groups()
PARAMETER DESCRIPTION
name

Name of the external feature group to get the versions of; by default it is None and all external feature groups are returned.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[feature_group.ExternalFeatureGroup]

List of external feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_group #

get_feature_group(
    name: str, version: int = None
) -> (
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
)

Get a feature group entity from the feature store.

Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

fg = fs.get_feature_group(
        name="electricity_prices",
        version=1,
    )
PARAMETER DESCRIPTION
name

Name of the feature group to get.

TYPE: str

version

Version of the feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup

The feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_groups #

get_feature_groups(
    name: str | None = None,
) -> list[
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
]

Get all feature groups from the feature store, or all versions of a feature group specified by its name.

Getting a feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

Example
# connect to the Feature Store
fs = ...

# retrieve all versions of electricity_prices feature group
fgs_list = fs.get_feature_groups(
        name="electricity_prices"
    )
Example
# connect to the Feature Store
fs = ...

# retrieve all feature groups available in the feature store
fgs_list = fs.get_feature_groups()
PARAMETER DESCRIPTION
name

Name of the feature group to get the versions of; by default it is None and all feature groups are returned.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup]

List of feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_view #

get_feature_view(
    name: str, version: int = None
) -> feature_view.FeatureView

Get a feature view entity from the feature store.

Getting a feature view from the Feature Store means getting its metadata.

Example
# get feature store instance
fs = ...

# get feature view instance
feature_view = fs.get_feature_view(
    name='feature_view_name',
    version=1
)
PARAMETER DESCRIPTION
name

Name of the feature view to get.

TYPE: str

version

Version of the feature view to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_feature_views #

get_feature_views(
    name: str,
) -> list[feature_view.FeatureView]

Get a list of all versions of a feature view entity from the feature store.

Getting a feature view from the Feature Store means getting its metadata.

Example
# get feature store instance
fs = ...

# get a list of all versions of a feature view
feature_view = fs.get_feature_views(
    name='feature_view_name'
)
PARAMETER DESCRIPTION
name

Name of the feature view to get.

TYPE: str

RETURNS DESCRIPTION
list[feature_view.FeatureView]

List of feature view metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_on_demand_feature_group #

get_on_demand_feature_group(
    name: str, version: int = None
) -> feature_group.ExternalFeatureGroup

Get an external feature group entity from the feature store.

Deprecated

get_on_demand_feature_group method is deprecated. Use the get_external_feature_group method instead.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

version

Version of the external feature group to retrieve; defaults to None and will return version 1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
feature_group.ExternalFeatureGroup

The external feature group metadata object or None if it does not exist.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_on_demand_feature_groups #

get_on_demand_feature_groups(
    name: str,
) -> list[feature_group.ExternalFeatureGroup]

Get a list of all versions of an external feature group entity from the feature store.

Deprecated

get_on_demand_feature_groups method is deprecated. Use the get_external_feature_groups method instead.

Getting an external feature group from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame or use the Query-API to perform joins between feature groups.

PARAMETER DESCRIPTION
name

Name of the external feature group to get.

TYPE: str

RETURNS DESCRIPTION
list[feature_group.ExternalFeatureGroup]

List of external feature group metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_online_storage_connector #

get_online_storage_connector() -> (
    storage_connector.StorageConnector
)

Get the storage connector for the Online Feature Store of the respective project's feature store.

The returned storage connector depends on the project that you are connected to.

Example
# connect to the Feature Store
fs = ...

online_storage_connector = fs.get_online_storage_connector()
RETURNS DESCRIPTION
storage_connector.StorageConnector

JDBC storage connector to the Online Feature Store.

get_or_create_feature_group #

get_or_create_feature_group(
    name: str,
    version: int,
    description: str | None = "",
    online_enabled: bool | None = False,
    time_travel_format: str | None = None,
    partition_key: list[str] | None = None,
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    embedding_index: EmbeddingIndex | None = None,
    hudi_precombine_key: str | None = None,
    features: list[feature.Feature] | None = None,
    statistics_config: StatisticsConfig
    | bool
    | dict
    | None = None,
    expectation_suite: expectation_suite.ExpectationSuite
    | TypeVar("great_expectations.core.ExpectationSuite")
    | None = None,
    event_time: str | None = None,
    stream: bool | None = False,
    parents: list[feature_group.FeatureGroup] | None = None,
    topic_name: str | None = None,
    notification_topic_name: str | None = None,
    transformation_functions: list[
        TransformationFunction | HopsworksUdf
    ]
    | None = None,
    online_config: OnlineConfig
    | dict[str, Any]
    | None = None,
    offline_backfill_every_hr: int | str | None = None,
    storage_connector: storage_connector.StorageConnector
    | dict[str, Any] = None,
    path: str | None = None,
    data_source: ds.DataSource
    | dict[str, Any]
    | None = None,
    ttl: float | timedelta | None = None,
    ttl_enabled: bool | None = None,
    online_disk: bool | None = None,
) -> (
    feature_group.FeatureGroup
    | feature_group.ExternalFeatureGroup
    | feature_group.SpineGroup
)

Get feature group metadata object or create a new one if it doesn't exist.

This method doesn't update an existing feature group metadata object.

Example
# connect to the Feature Store
fs = ...

fg = fs.get_or_create_feature_group(
    name="electricity_prices",
    version=1,
    description="Electricity prices from NORD POOL",
    primary_key=["day", "area"],
    online_enabled=True,
    event_time="timestamp",
    transformation_functions=transformation_functions,
    online_config={'online_comments': ['NDB_TABLE=READ_BACKUP=1']},
    online_disk=True, # Online data will be stored on disk instead of in memory
    ttl=timedelta(days=30),
)
Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To persist the feature group and save feature data along the metadata in the feature store, call the insert() method with a DataFrame.

PARAMETER DESCRIPTION
name

Name of the feature group to create.

TYPE: str

version

Version of the feature group to retrieve or create.

TYPE: int

description

A string describing the contents of the feature group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

online_enabled

Define whether the feature group should be made available also in the online feature store for low latency access.

TYPE: bool | None DEFAULT: False

time_travel_format

Format used for time travel, defaults to "HUDI".

TYPE: str | None DEFAULT: None

partition_key

A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list [].

TYPE: list[str] | None DEFAULT: None

primary_key

A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the feature group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

embedding_index

If an embedding index is provided, a vector database is used as the online feature store. This enables similarity search using FeatureGroup.find_neighbors.

TYPE: EmbeddingIndex | None DEFAULT: None

hudi_precombine_key

A feature name to be used as a precombine key for the "HUDI" feature group. If the feature group has time travel format "HUDI" and no hudi precombine key was specified, the first primary key of the feature group will be used as the hudi precombine key.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame provided in the save method.

TYPE: list[feature.Feature] | None DEFAULT: None

statistics_config

A configuration object, or a dictionary with keys:

  • enabled to generally enable descriptive statistics computation for this feature group,
  • correlations to turn on feature correlation computation,
  • histograms to compute feature value frequencies, and
  • exact_uniqueness to compute uniqueness, distinctness and entropy.

The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. By default, it computes only descriptive statistics.

TYPE: StatisticsConfig | bool | dict | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this feature group. If event_time is set the feature group can be used for point-in-time joins.

Note: Event time data type restriction The supported data types for the event time column are: timestamp, date and bigint.

TYPE: str | None DEFAULT: None

stream

Optionally, define whether the feature group should support real-time stream writing capabilities. Stream-enabled Feature Groups have a unified single API for writing streaming features transparently to both the online and offline store.

TYPE: bool | None DEFAULT: False

expectation_suite

Optionally, attach an expectation suite to the feature group, against which dataframes will be validated upon insertion.

TYPE: expectation_suite.ExpectationSuite | TypeVar('great_expectations.core.ExpectationSuite') | None DEFAULT: None

parents

Optionally, define the parents of this feature group as the origin where the data is coming from.

TYPE: list[feature_group.FeatureGroup] | None DEFAULT: None

topic_name

Optionally, define the name of the topic used for data ingestion. If left undefined, it defaults to using the project topic.

TYPE: str | None DEFAULT: None

notification_topic_name

Optionally, define the name of the topic used for sending notifications when entries are inserted or updated on the online feature store. If left undefined no notifications are sent.

TYPE: str | None DEFAULT: None

transformation_functions

On-Demand Transformation functions attached to the feature group. It can be a list of user-defined functions defined using the hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: list[TransformationFunction | HopsworksUdf] | None DEFAULT: None

online_config

Optionally, define configuration which is used to configure online table.

TYPE: OnlineConfig | dict[str, Any] | None DEFAULT: None

offline_backfill_every_hr

If specified, the materialization job will be scheduled to run periodically. The value can be either an integer representing the number of hours between each run or a string representing a cron expression. Set the value to None to avoid scheduling the materialization job. By default, no scheduling is done.

TYPE: int | str | None DEFAULT: None

storage_connector

The storage connector used to establish connectivity with the data source.

TYPE: storage_connector.StorageConnector | dict[str, Any] DEFAULT: None

path

The location within the scope of the storage connector, from where to read the data for the external feature group.

TYPE: str | None DEFAULT: None

data_source

The data source specifying the location of the data. Overrides the path and query arguments when specified.

TYPE: ds.DataSource | dict[str, Any] | None DEFAULT: None

ttl

Optional time-to-live duration for features in this group. Can be specified as:

  • An integer or float representing seconds
  • A timedelta object

This ttl value is added to the event time of the feature group, and when the system time exceeds event time + ttl, the entries are automatically removed. System time is evaluated in UTC.

By default, no TTL is set.

TYPE: float | timedelta | None DEFAULT: None

ttl_enabled

Optionally, enable TTL for this feature group. Defaults to True if ttl is set.

TYPE: bool | None DEFAULT: None

online_disk

Optionally, specify online data storage for this feature group. When set to True, data will be stored on disk instead of in memory. Overrides online_config.table_space. Defaults to using the cluster-wide configuration 'featurestore_online_tablespace' to identify the tablespace for disk storage.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
feature_group.FeatureGroup | feature_group.ExternalFeatureGroup | feature_group.SpineGroup

The feature group metadata object.
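The ttl semantics described above (a row expires once UTC system time passes event time + ttl) reduce to plain datetime arithmetic. A minimal sketch with made-up values; only the timedelta itself corresponds to the ttl argument:

```python
from datetime import datetime, timedelta, timezone

ttl = timedelta(days=30)  # same form the ttl argument accepts

# A row's event time, in UTC
event_time = datetime(2024, 1, 1, tzinfo=timezone.utc)

# The row becomes eligible for removal once UTC system time passes this point
expires_at = event_time + ttl
print(expires_at.isoformat())  # 2024-01-31T00:00:00+00:00

# Equivalent TTL expressed as seconds (the float form of the argument)
assert ttl.total_seconds() == 30 * 24 * 3600
```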

get_or_create_feature_view #

get_or_create_feature_view(
    name: str,
    query: Query,
    version: int,
    description: str | None = "",
    labels: list[str] | None = None,
    inference_helper_columns: list[str] | None = None,
    training_helper_columns: list[str] | None = None,
    transformation_functions: dict[
        str, TransformationFunction
    ]
    | None = None,
    logging_enabled: bool | None = False,
    extra_log_columns: list[feature.Feature]
    | list[dict[str, str]]
    | None = None,
) -> feature_view.FeatureView

Get feature view metadata object or create a new one if it doesn't exist.

This method doesn't update an existing feature view metadata object.

Example
# connect to the Feature Store
fs = ...

feature_view = fs.get_or_create_feature_view(
    name='bitcoin_feature_view',
    version=1,
    transformation_functions=transformation_functions,
    query=query
)
PARAMETER DESCRIPTION
name

Name of the feature view to create.

TYPE: str

query

Feature store Query.

TYPE: Query

version

Version of the feature view to create.

TYPE: int

description

A string describing the contents of the feature view to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

labels

A list of feature names constituting the prediction label/feature of the feature view. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

TYPE: list[str] | None DEFAULT: None

inference_helper_columns

A list of feature names that are not used in training the model itself but can be used during batch or online inference for extra information. Inference helper column name(s) must be part of the Query object. If inference helper column name(s) belong to a feature group that is part of a Join with a prefix defined, then this prefix needs to be prepended to the original column name when defining the inference_helper_columns list. When replaying a Query during model inference, the inference helper columns can optionally be omitted during batch inference (get_batch_data) and will be omitted during online inference (get_feature_vector(s)). To get inference helper column(s) during online inference, use the get_inference_helper(s) method. Defaults to [], no helper columns.

TYPE: list[str] | None DEFAULT: None

training_helper_columns

A list of feature names that are not part of the model schema itself but can be used during training as helpers for extra information. Training helper column name(s) must be part of the Query object. If training helper column name(s) belong to a feature group that is part of a Join with a prefix defined, then this prefix needs to be prepended to the original column name when defining the training_helper_columns list. When replaying a Query during model inference, the training helper columns will be omitted during both batch and online inference. Training helper columns can optionally be fetched with training data. For more details see the documentation for the feature view's get training data methods. Defaults to [], no training helper columns.

TYPE: list[str] | None DEFAULT: None

transformation_functions

Model-Dependent Transformation functions attached to the feature view. It can be a list of user-defined functions defined using the hopsworks @udf decorator. Defaults to None, no transformations.

TYPE: dict[str, TransformationFunction] | None DEFAULT: None

logging_enabled

If true, enable feature logging for the feature view.

TYPE: bool | None DEFAULT: False

extra_log_columns

Extra columns to be logged in addition to the features used in the feature view. It can be a list of Feature objects or a list of dictionaries that contain the name and type of the columns as keys. Defaults to None, no extra log columns. Setting this argument implicitly enables feature logging.

TYPE: list[feature.Feature] | list[dict[str, str]] | None DEFAULT: None

RETURNS DESCRIPTION
feature_view.FeatureView

The feature view metadata object.

get_or_create_spine_group #

get_or_create_spine_group(
    name: str,
    version: int | None = None,
    description: str | None = "",
    primary_key: list[str] | None = None,
    foreign_key: list[str] | None = None,
    event_time: str | None = None,
    features: list[feature.Feature] | None = None,
    dataframe: pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list] = None,
) -> feature_group.SpineGroup

Create a spine group metadata object.

Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels. A Spine is essentially a metadata object similar to a feature group; however, the data is not materialized in the feature store. It only contains the needed metadata, such as the relevant event time column and primary key columns, to perform point-in-time correct joins.

Example
# connect to the Feature Store
fs = ...

spine_df = pd.DataFrame()

spine_group = fs.get_or_create_spine_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    primary_key=['ss_store_sk'],
    event_time='sale_date',
    dataframe=spine_df,
)

Note that you can inspect the dataframe in the spine group, or replace the dataframe:

spine_group.dataframe.show()

spine_group.dataframe = new_df

The spine can then be used to construct queries, with only one speciality:

Note

Spines can only be used on the left side of a feature join, as this is the base set of entities for which features are to be fetched and the left side of the join determines the event timestamps to compare against.

If you want the query of a feature view to be used for online serving, you can only select the label or target feature from the spine. For the online lookup, the label is not required; therefore, it is important to only select the label from the left feature group, so that you don't need to provide a spine for online serving.

These queries can then be used to create feature views. Since the dataframe contained in the spine is not being materialized, every time you use a feature view created with spine to read data you will have to provide a dataframe with the same structure again.

For example, to generate training data:

X_train, X_test, y_train, y_test = feature_view_spine.train_test_split(0.2, spine=training_data_entities)

Or to get batches of fresh data for batch scoring:

feature_view_spine.get_batch_data(spine=scoring_entities_df).show()

Here you have the chance to pass a different set of entities to generate the training dataset.

Sometimes it might be handy to create a feature view with a regular feature group containing the label, but then, at serving time, use a spine in order to fetch features for only a small set of primary key values, for example. To do this, you can pass the spine group instead of a dataframe. Just make sure it contains the needed primary key, event time and label columns.

feature_view.get_batch_data(spine=spine_group)
PARAMETER DESCRIPTION
name

Name of the spine group to create.

TYPE: str

version

Version of the spine group to retrieve or create; defaults to None and will create the spine group with the version incremented from the last version in the feature store.

TYPE: int | None DEFAULT: None

description

A string describing the contents of the spine group to improve discoverability for Data Scientists.

TYPE: str | None DEFAULT: ''

primary_key

A list of feature names to be used as primary key for the spine group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the spine group won't have any primary key.

TYPE: list[str] | None DEFAULT: None

foreign_key

A list of feature names to be used as foreign key for the feature group. Foreign key is referencing the primary key of another feature group and can be used as joining key. Defaults to empty list [], and the feature group won't have any foreign key.

TYPE: list[str] | None DEFAULT: None

event_time

Optionally, provide the name of the feature containing the event time for the features in this spine group. If event_time is set the spine group can be used for point-in-time joins.

TYPE: str | None DEFAULT: None

features

Optionally, define the schema of the spine group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the provided DataFrame.

Note: Event time data type restriction The supported data types for the event time column are: timestamp, date and bigint.

TYPE: list[feature.Feature] | None DEFAULT: None

dataframe

Spine dataframe with primary key, event time and label column to use for point in time join when fetching features.

TYPE: pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list] DEFAULT: None

RETURNS DESCRIPTION
feature_group.SpineGroup

The spine group metadata object.

get_storage_connector #

get_storage_connector(
    name: str,
) -> storage_connector.StorageConnector

Get a previously created storage connector from the feature store.

Storage connectors encapsulate all information needed for the execution engine to read and write to specific storage. This storage can be S3, a JDBC compliant database or the distributed filesystem HopsFS.

If you want to connect to the online feature store, see the get_online_storage_connector method to get the JDBC connector for the Online Feature Store.

Example
# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("demo_fs_meb10000_Training_Datasets")
PARAMETER DESCRIPTION
name

Name of the storage connector to retrieve.

TYPE: str

RETURNS DESCRIPTION
storage_connector.StorageConnector

Storage connector object.

get_training_dataset #

get_training_dataset(
    name: str, version: int = None
) -> training_dataset.TrainingDataset

Get a training dataset entity from the feature store.

Deprecated

TrainingDataset is deprecated, use FeatureView instead. You can still retrieve old training datasets using this method, but after upgrading the old training datasets will also be available under a Feature View with the same name and version.

It is recommended to use this method only for old training datasets that have been created directly from DataFrames and not with Query objects.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

PARAMETER DESCRIPTION
name

Name of the training dataset to get.

TYPE: str

version

Version of the training dataset to retrieve, defaults to None and will return the version=1.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
training_dataset.TrainingDataset

The training dataset metadata object.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_training_datasets #

get_training_datasets(
    name: str,
) -> list[training_dataset.TrainingDataset]

Get a list of all versions of a training dataset entity from the feature store.

Deprecated

TrainingDataset is deprecated, use FeatureView instead.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

PARAMETER DESCRIPTION
name

Name of the training dataset to get.

TYPE: str

RETURNS DESCRIPTION
list[training_dataset.TrainingDataset]

List of training dataset metadata objects.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

get_transformation_function #

get_transformation_function(
    name: str, version: int | None = None
) -> TransformationFunction

Get transformation function metadata object.

Get transformation function by name

This will default to version 1.

# get feature store instance
fs = ...

# get transformation function metadata object
plus_one_fn = fs.get_transformation_function(name="plus_one")
Get the built-in min-max scaler transformation function
# get feature store instance
fs = ...

# get transformation function metadata object
min_max_scaler_fn = fs.get_transformation_function(name="min_max_scaler")
Get transformation function by name and version
# get feature store instance
fs = ...

# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=2)

You can attach transformation functions to a feature view as a list of transformation function instances, each applied to one or more features. The attached functions are then applied when you read training data, get batch data, or get feature vector(s).

Attach transformation functions to the feature view
# get feature store instance
fs = ...

# define query object
query = ...

# get transformation function metadata object
min_max_scaler = fs.get_transformation_function(name="min_max_scaler", version=1)

# attach transformation functions
feature_view = fs.create_feature_view(
    name='feature_view_name',
    query=query,
    labels=["target_column"],
    transformation_functions=[min_max_scaler("feature1")]
)

Built-in transformation functions are attached in the same way. The only difference is that the necessary statistics for the specific function are computed in the background: for example, min and max values for min_max_scaler, and mean and standard deviation for standard_scaler.

Attach built-in transformation functions to the feature view
# get feature store instance
fs = ...

# define query object
query = ...

# retrieve transformation functions
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
standard_scaler = fs.get_transformation_function(name="standard_scaler")
robust_scaler = fs.get_transformation_function(name="robust_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# attach built-in transformation functions while creating feature view
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=query,
    labels=["fraud_label"],
    transformation_functions = [
        label_encoder("category_column"),
        robust_scaler("weight"),
        min_max_scaler("age"),
        standard_scaler("salary")
    ]
)
PARAMETER DESCRIPTION
name

Name of transformation function.

TYPE: str

version

Version of the transformation function. Optional; if not provided, all functions matching the provided name will be retrieved.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
TransformationFunction

The TransformationFunction metadata object.

get_transformation_functions #

get_transformation_functions() -> list[
    TransformationFunction
]

Get all transformation functions metadata objects.

Get all transformation functions
# get feature store instance
fs = ...

# get all transformation functions
list_transformation_fns = fs.get_transformation_functions()
RETURNS DESCRIPTION
list[TransformationFunction]

List of transformation function instances.

sql #

sql(
    query: str,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
    online: bool = False,
    read_options: dict | None = None,
) -> pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame

Execute SQL command on the offline or online feature store database.

Example
# connect to the Feature Store
fs = ...

# construct the query and show head rows
query_res_head = fs.sql("SELECT * FROM `fg_1`").head()
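A hedged sketch of querying the online feature store from an external Python client, using the read_options entry documented for the python engine; the table name `fg_1` is a placeholder:

```python
# connect to the Feature Store
fs = ...

# query the online feature store from outside the cluster
df = fs.sql(
    "SELECT * FROM `fg_1`",
    online=True,
    read_options={"external": True},
)
```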
PARAMETER DESCRIPTION
query

The SQL query to execute.

TYPE: str

dataframe_type

The type of the returned dataframe. Defaults to "default", which maps to a Spark DataFrame for the Spark engine and a Pandas DataFrame for the Python engine.

TYPE: Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python'] DEFAULT: 'default'

online

Set to True to execute the query against the online feature store.

TYPE: bool DEFAULT: False

read_options

Additional options as key/value pairs to pass to the execution engine.

For spark engine: Dictionary of read options for Spark.

For python engine: when running queries against the online feature store, users can provide the entry {'external': True}. This instructs the library to use the host parameter of hopsworks.login to establish the connection to the online feature store. If not set, or set to False, the online feature store storage connector is used, which relies on the private IP.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.DataFrame | pd.Series | np.ndarray | pl.DataFrame

A DataFrame of the type requested via dataframe_type.