
Training Dataset#

You can create a TrainingDataset by calling FeatureStore.create_training_dataset and obtain an existing one by calling FeatureStore.get_training_dataset.
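For orientation, a minimal sketch of obtaining a handle to an existing training dataset; the project connection, dataset name, and version are hypothetical:

```python
import hopsworks

# Connect to Hopsworks and get the project's feature store
# (connection details are illustrative).
project = hopsworks.login()
fs = project.get_feature_store()

# Retrieve an existing training dataset by name and version
# (name and version are made up for this example).
td = fs.get_training_dataset("transactions_dataset", version=1)
```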

TrainingDataset #

Bases: TrainingDatasetBase

feature_store_id property #

feature_store_id: int

ID of the feature store to which this training dataset belongs.

feature_store_name property #

feature_store_name: str

Name of the feature store in which the training dataset is located.

id property writable #

id

Training dataset id.

label property writable #

label: str | list[str]

The label/prediction feature of the training dataset.

Can be a composite of multiple features.

query property #

query

Query used to generate this training dataset from the online feature store.

schema property writable #

schema

Training dataset schema.

serving_keys property #

serving_keys: set[str]

Set of primary key names used as keys in the input dict for the get_serving_vector method.

statistics property #

statistics

Get computed statistics for the training dataset.

RETURNS DESCRIPTION

Statistics. Object with statistics information.

write_options property writable #

write_options

User-provided options for writing the training dataset.

add_tag #

add_tag(name: str, value)

Attach a tag to a training dataset.

A tag consists of a name/value pair. Tag names are unique identifiers across the whole cluster. The value of a tag can be any valid JSON: primitives, arrays, or JSON objects.

PARAMETER DESCRIPTION
name

Name of the tag to be added.

TYPE: str

value

Value of the tag to be added.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to add the tag.
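A short usage sketch; the tag name and JSON value are illustrative:

```python
# Attach a JSON-valued tag to the training dataset
# (tag name and value are made up for this example).
td.add_tag("data_owner", {"team": "fraud-detection", "pii": False})
```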

compute_statistics #

compute_statistics()

Compute the statistics for the training dataset and save them to the feature store.

delete #

delete()

Delete training dataset and all associated metadata.

Drops only HopsFS data

Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.

Potentially dangerous operation

This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

In case of a server error.

delete_tag #

delete_tag(name: str)

Delete a tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to be removed.

TYPE: str

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to delete the tag.

get_query #

get_query(online: bool = True, with_label: bool = False)

Returns the query used to generate this training dataset.

PARAMETER DESCRIPTION
online

If True, return the query for the online storage; if False, for the offline storage. Defaults to True (online storage).

TYPE: bool DEFAULT: True

with_label

Whether the query should include the features that were marked as prediction label when the training dataset was created. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

str. Query string for the chosen storage used to generate this training dataset.
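For example, to retrieve the offline-storage query including the label features (argument values are illustrative):

```python
# Get the SQL string for the offline storage, including label columns.
sql = td.get_query(online=False, with_label=True)
print(sql)
```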

get_serving_vector #

get_serving_vector(
    entry: dict[str, Any], external: bool | None = None
)

Returns assembled serving vector from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with the primary key names of the training dataset's feature groups as keys and the lookup values provided by the serving application as values.

TYPE: dict[str, Any]

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION

list. List of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
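A minimal lookup sketch; "customer_id" is a hypothetical serving key:

```python
# Fetch a single feature vector from the online feature store.
# The key name must match one of the dataset's serving keys.
vector = td.get_serving_vector({"customer_id": 42})
```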

get_serving_vectors #

get_serving_vectors(
    entry: dict[str, list[Any]],
    external: bool | None = None,
)

Returns assembled serving vectors in batches from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with feature group primary key names as keys and lists of lookup values provided by the serving application as values.

TYPE: dict[str, list[Any]]

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION

List[list]. List of lists of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
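A batch-lookup sketch under the same hypothetical serving key:

```python
# Fetch several feature vectors at once; each key maps to a list of values.
vectors = td.get_serving_vectors({"customer_id": [42, 43, 44]})
```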

get_tag #

get_tag(name)

Get a single tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to get.

RETURNS DESCRIPTION

The value of the tag.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tag.

get_tags #

get_tags()

Returns all tags attached to a training dataset.

RETURNS DESCRIPTION

Dict[str, obj] of tags.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tags.
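A sketch of reading tags back, reusing the hypothetical tag name from the add_tag example above:

```python
# Retrieve a single tag value, then all tags attached to the dataset.
owner = td.get_tag("data_owner")
all_tags = td.get_tags()
```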

init_prepared_statement #

init_prepared_statement(
    batch: bool | None = None, external: bool | None = None
)

Initialise and cache parametrized prepared statements to retrieve feature vectors from the online feature store.

PARAMETER DESCRIPTION
batch

boolean, optional. If set to True, prepared statements will be initialised for retrieving serving vectors as a batch.

TYPE: bool | None DEFAULT: None

external

boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hopsworks.login() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None
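A sketch of warming up the prepared statements before serving, assuming an external client and the hypothetical serving key from the earlier examples:

```python
# Initialise batched prepared statements once, then serve lookups.
td.init_prepared_statement(batch=True, external=True)
vectors = td.get_serving_vectors({"customer_id": [42, 43]})
```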

insert #

insert(
    features: query.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    overwrite: bool,
    write_options: dict[Any, Any] | None = None,
)

Insert additional feature data into the training dataset.

Deprecated

insert method is deprecated.

This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.

This can also be used to overwrite all data in an existing training dataset.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

overwrite

Whether to overwrite the entire data in the training dataset.

TYPE: bool

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
* key wait_for_job and value True or False to configure whether the insert call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.
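A sketch of the deprecated call, appending a Pandas DataFrame whose columns are illustrative and must match the dataset schema:

```python
import pandas as pd

# Append additional rows (overwrite=False); the schema must match.
extra_df = pd.DataFrame({"amount": [12.5], "label": [0]})
td.insert(extra_df, overwrite=False, write_options={"wait_for_job": True})
```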

read #

read(split=None, read_options=None)

Read the training dataset into a dataframe.

It is also possible to read only a specific split.

PARAMETER DESCRIPTION
split

Name of the split to read, defaults to None, reading the entire training dataset. If the training dataset has splits, the split parameter is mandatory.

DEFAULT: None

read_options

Additional read options as key/value pairs, defaults to {}.

DEFAULT: None

RETURNS DESCRIPTION

DataFrame: The Spark dataframe containing the feature data of the training dataset.
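For example, to read a single split into a dataframe (the split name is hypothetical and must exist on the dataset):

```python
# Read only the "train" split.
train_df = td.read(split="train")
```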

save #

save(
    features: query.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    write_options: dict[Any, Any] | None = None,
)

Materialize the training dataset to storage.

This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. From v2.5 onward, filters are saved along with the Query.

Engine Support

Creating training datasets from dataframes is only supported using Spark as the engine.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
* key wait_for_job and value True or False to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.
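A materialization sketch; the feature group name and selected features are hypothetical:

```python
# Build a Query from a feature group and materialize it as this dataset.
fg = fs.get_feature_group("transactions", version=1)
query = fg.select(["amount", "label"])
td.save(query, write_options={"wait_for_job": True})
```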

show #

show(n: int, split: str | None = None)

Show the first n rows of the training dataset.

You can specify a split from which to retrieve the rows.

PARAMETER DESCRIPTION
n

Number of rows to show.

TYPE: int

split

Name of the split to show, defaults to None, showing the first rows across all splits combined.

TYPE: str | None DEFAULT: None
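For example (the split name is hypothetical):

```python
# Preview the first five rows of the "test" split.
td.show(5, split="test")
```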

update_statistics_config #

update_statistics_config()

Update the statistics configuration of the training dataset.

Change the statistics_config object and persist the changes by calling this method.

RETURNS DESCRIPTION

TrainingDataset. The updated metadata object of the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend encounters an issue.
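A sketch of the intended flow; the statistics_config field shown is an assumption about the config object:

```python
# Adjust the statistics configuration in place, then persist it.
td.statistics_config.histograms = True  # assumed field on the config
td = td.update_statistics_config()
```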