
Training Dataset#

You can create a TrainingDataset by calling FeatureStore.create_training_dataset and obtain an existing one by calling FeatureStore.get_training_dataset.
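For orientation, a minimal sketch of obtaining a handle to an existing training dataset; the project connection, dataset name, and version are hypothetical:

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
# (connection details are illustrative).
project = hopsworks.login()
fs = project.get_feature_store()

# Retrieve an existing training dataset by name and version
# (name and version are made up for this example).
td = fs.get_training_dataset("transactions_dataset", version=1)
```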

TrainingDataset #

Bases: TrainingDatasetBase

feature_store_id property #

feature_store_id: int

ID of the feature store to which this training dataset belongs.

feature_store_name property #

feature_store_name: str

Name of the feature store in which the training dataset is located.

id property writable #

id

Training dataset id.

label property writable #

label: str | list[str]

The label/prediction feature of the training dataset.

Can be a composite of multiple features.

query property #

query

Query used to generate this training dataset from the online feature store.

schema property writable #

schema

Training dataset schema.

serving_keys property #

serving_keys: set[str]

Set of primary key names used as keys in the input dict for the get_serving_vector method.

statistics property #

statistics

Get computed statistics for the training dataset.

RETURNS DESCRIPTION

Statistics. Object with statistics information.

write_options property writable #

write_options

User-provided options for writing the training dataset.

add_tag #

add_tag(name: str, value)

Attach a tag to a training dataset.

A tag consists of a name/value pair. Tag names are unique identifiers across the whole cluster. The value of a tag can be any valid JSON: primitives, arrays, or JSON objects.

PARAMETER DESCRIPTION
name

Name of the tag to be added.

TYPE: str

value

Value of the tag to be added.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to add the tag.
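A short usage sketch; the tag name and JSON value are illustrative:

```python
# Attach a JSON-valued tag to the training dataset
# (tag name and value are made up for this example).
td.add_tag("data_owner", {"team": "fraud-detection", "pii": False})
```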

compute_statistics #

compute_statistics()

Compute the statistics for the training dataset and save them to the feature store.

delete #

delete()

Delete training dataset and all associated metadata.

Drops only HopsFS data

Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.

Potentially dangerous operation

This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

In case of a server error.

delete_tag #

delete_tag(name: str)

Delete a tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to be removed.

TYPE: str

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to delete the tag.

get_query #

get_query(online: bool = True, with_label: bool = False)

Returns the query used to generate this training dataset.

PARAMETER DESCRIPTION
online

If True, return the query for the online storage; if False, for the offline storage. Defaults to True (online storage).

TYPE: bool DEFAULT: True

with_label

Whether the query should include the features that were marked as prediction label when the training dataset was created. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

str. Query string for the chosen storage used to generate this training dataset.
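For example, to retrieve the offline-storage query including the label features (argument values are illustrative):

```python
# Get the SQL string for the offline storage, including label columns.
sql = td.get_query(online=False, with_label=True)
print(sql)
```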

get_serving_vector #

get_serving_vector(
    entry: dict[str, Any], external: bool | None = None
)

Returns assembled serving vector from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with the primary key names of the training dataset's feature groups as keys and the lookup values provided by the serving application as values.

TYPE: dict[str, Any]

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION

list. List of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
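A minimal lookup sketch; "customer_id" is a hypothetical serving key:

```python
# Fetch a single feature vector from the online feature store.
# The key name must match one of the dataset's serving keys.
vector = td.get_serving_vector({"customer_id": 42})
```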

get_serving_vectors #

get_serving_vectors(
    entry: dict[str, list[Any]],
    external: bool | None = None,
)

Returns assembled serving vectors in batches from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with feature group primary key names as keys and lists of lookup values provided by the serving application as values.

TYPE: dict[str, list[Any]]

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION

List[list]. List of lists of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
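A batch-lookup sketch under the same hypothetical serving key:

```python
# Fetch several feature vectors at once; each key maps to a list of values.
vectors = td.get_serving_vectors({"customer_id": [42, 43, 44]})
```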

get_tag #

get_tag(name)

Get a single tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to get.

RETURNS DESCRIPTION

The value of the tag.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tag.

get_tags #

get_tags()

Returns all tags attached to a training dataset.

RETURNS DESCRIPTION

Dict[str, obj] of tags.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tags.
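A sketch of reading tags back, reusing the hypothetical tag name from the add_tag example above:

```python
# Retrieve a single tag value, then all tags attached to the dataset.
owner = td.get_tag("data_owner")
all_tags = td.get_tags()
```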

init_prepared_statement #

init_prepared_statement(
    batch: bool | None = None, external: bool | None = None
)

Initialise and cache parametrized prepared statements to retrieve feature vectors from the online feature store.

PARAMETER DESCRIPTION
batch

boolean, optional. If set to True, prepared statements will be initialised for retrieving serving vectors as a batch.

TYPE: bool | None DEFAULT: None

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None
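A sketch of warming up the prepared statements before serving, assuming an external client and the hypothetical serving key from the earlier examples:

```python
# Initialise batched prepared statements once, then serve lookups.
td.init_prepared_statement(batch=True, external=True)
vectors = td.get_serving_vectors({"customer_id": [42, 43]})
```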

insert #

insert(
    features: query.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    overwrite: bool,
    write_options: dict[Any, Any] | None = None,
)

Insert additional feature data into the training dataset.

Deprecated

insert method is deprecated.

This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.

This can also be used to overwrite all data in an existing training dataset.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

overwrite

Whether to overwrite the entire data in the training dataset.

TYPE: bool

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
* key wait_for_job and value True or False to configure whether the insert call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.
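A sketch of the deprecated call, appending a Pandas DataFrame whose columns are illustrative and must match the dataset schema:

```python
import pandas as pd

# Append additional rows (overwrite=False); the schema must match.
extra_df = pd.DataFrame({"amount": [12.5], "label": [0]})
td.insert(extra_df, overwrite=False, write_options={"wait_for_job": True})
```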

read #

read(split=None, read_options=None)

Read the training dataset into a dataframe.

It is also possible to read only a specific split.

PARAMETER DESCRIPTION
split

Name of the split to read, defaults to None, reading the entire training dataset. If the training dataset has splits, the split parameter is mandatory.

DEFAULT: None

read_options

Additional read options as key/value pairs, defaults to {}.

DEFAULT: None

RETURNS DESCRIPTION

DataFrame: The Spark dataframe containing the feature data of the training dataset.
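For example, to read a single split into a dataframe (the split name is hypothetical and must exist on the dataset):

```python
# Read only the "train" split.
train_df = td.read(split="train")
```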

save #

save(
    features: query.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    write_options: dict[Any, Any] | None = None,
)

Materialize the training dataset to storage.

This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. From v2.5 onward, filters are saved along with the Query.

Engine Support

Creating training datasets from dataframes is only supported using Spark as the engine.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
* key wait_for_job and value True or False to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.
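A materialization sketch; the feature group name and selected features are hypothetical:

```python
# Build a Query from a feature group and materialize it as this dataset.
fg = fs.get_feature_group("transactions", version=1)
query = fg.select(["amount", "label"])
td.save(query, write_options={"wait_for_job": True})
```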

show #

show(n: int, split: str | None = None)

Show the first n rows of the training dataset.

You can specify a split from which to retrieve the rows.

PARAMETER DESCRIPTION
n

Number of rows to show.

TYPE: int

split

Name of the split to show, defaults to None, showing the first rows across all splits combined.

TYPE: str | None DEFAULT: None
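For example (the split name is hypothetical):

```python
# Preview the first five rows of the "test" split.
td.show(5, split="test")
```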

update_statistics_config #

update_statistics_config()

Update the statistics configuration of the training dataset.

Change the statistics_config object and persist the changes by calling this method.

RETURNS DESCRIPTION

TrainingDataset. The updated metadata object of the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend encounters an issue.
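A sketch of the intended flow; the statistics_config field shown is an assumption about the config object:

```python
# Adjust the statistics configuration in place, then persist it.
td.statistics_config.histograms = True  # assumed field on the config
td = td.update_statistics_config()
```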