Training Dataset#
You can create a TrainingDataset by calling FeatureStore.create_training_dataset and obtain an existing one by calling FeatureStore.get_training_dataset.
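A minimal usage sketch, assuming a reachable Hopsworks deployment; the training dataset name and version below are placeholders:

```python
import hsfs

# Connect to the feature store; connection arguments depend on your deployment.
connection = hsfs.connection()
fs = connection.get_feature_store()

# Retrieve an existing training dataset (name and version are placeholders).
td = fs.get_training_dataset("sales_model_training_data", version=1)
```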
TrainingDataset #
Bases: TrainingDatasetBase
feature_store_id property #
feature_store_id: int
ID of the feature store to which this training dataset belongs.
feature_store_name property #
feature_store_name: str
Name of the feature store in which the training dataset is located.
label property writable #
The label/prediction feature of the training dataset.
Can be a composite of multiple features.
serving_keys property #
Set of primary key names used as keys in the input dict for the get_serving_vector method.
statistics property #
statistics
Get computed statistics for the training dataset.
| RETURNS | DESCRIPTION |
|---|---|
`Statistics` | The statistics metadata object computed for the training dataset. |
add_tag #
add_tag(name: str, value)
Attach a tag to a training dataset.
A tag consists of a name-value pair.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the tag to be added. TYPE: `str` |
value | Value of the tag to be added. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | in case the backend fails to add the tag. |
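For illustration, a sketch of attaching a tag, assuming `td` is a `TrainingDataset` handle and that a tag schema named `owner` has already been defined on the cluster:

```python
# The tag name must match a tag schema defined on the cluster beforehand.
td.add_tag("owner", "data-science-team")
```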
compute_statistics #
compute_statistics()
Compute the statistics for the training dataset and save them to the feature store.
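A short sketch combining `compute_statistics` with the `statistics` property:

```python
# Trigger statistics computation and persist the result in the feature store.
td.compute_statistics()

# Read back the computed statistics metadata afterwards.
stats = td.statistics
```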
delete #
delete()
Delete training dataset and all associated metadata.
Drops only HopsFS data
Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.
Potentially dangerous operation
This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | In case of a server error. |
delete_tag #
delete_tag(name: str)
Delete a tag attached to a training dataset.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the tag to be removed. TYPE: `str` |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | in case the backend fails to delete the tag. |
get_query #
Returns the query used to generate this training dataset.
| PARAMETER | DESCRIPTION |
|---|---|
online | Whether to return the query for the online storage; otherwise the query for the offline storage is returned. TYPE: `bool` |
with_label | Whether the query should contain features which were marked as prediction label/feature when the training dataset was created. TYPE: `bool` |
| RETURNS | DESCRIPTION |
|---|---|
`str` | The query string used to generate the training dataset. |
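A sketch of inspecting the generating query; the keyword arguments mirror the parameters above:

```python
# Query string for the offline storage, including the label feature(s).
query_str = td.get_query(online=False, with_label=True)
print(query_str)
```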
get_serving_vector #
Returns assembled serving vector from online feature store.
| PARAMETER | DESCRIPTION |
|---|---|
entry | Dictionary with the training dataset feature group primary key names as keys and the corresponding values provided by the serving application. |
external | If set to True, the connection to the online feature store is established using the same host as the connection to the Hopsworks instance. TYPE: `bool` |
| RETURNS | DESCRIPTION |
|---|---|
`list` | List of feature values related to the provided entry, ordered according to the positions of these features in the training dataset query. |
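A sketch of single-vector retrieval, assuming the underlying feature groups use a primary key called `customer_id` (a placeholder name):

```python
# Keys of the entry dict must match the training dataset's serving_keys.
feature_vector = td.get_serving_vector({"customer_id": 1234})
```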
get_serving_vectors #
Returns assembled serving vectors in batches from online feature store.
| PARAMETER | DESCRIPTION |
|---|---|
entry | Dictionary with the feature group primary key names as keys and lists of primary key values provided by the serving application. |
external | If set to True, the connection to the online feature store is established using the same host as the connection to the Hopsworks instance. TYPE: `bool` |
| RETURNS | DESCRIPTION |
|---|---|
`List[list]` | Batches of feature-value lists related to the provided entries, each ordered according to the positions of these features in the training dataset query. |
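The batch variant, again with `customer_id` as a placeholder primary key name:

```python
# One list of primary key values per serving key; one vector is returned per entry.
feature_vectors = td.get_serving_vectors({"customer_id": [1234, 5678, 9012]})
```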
get_tag #
get_tag(name)
Get a single tag attached to a training dataset by name.
| PARAMETER | DESCRIPTION |
|---|---|
name | Name of the tag to get. |
| RETURNS | DESCRIPTION |
|---|---|
| | The tag value. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | in case the backend fails to retrieve the tag. |
get_tags #
get_tags()
Returns all tags attached to a training dataset.
| RETURNS | DESCRIPTION |
|---|---|
`dict` | Dictionary of tags, with tag names as keys and tag values as values. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | in case the backend fails to retrieve the tags. |
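A sketch of reading and removing tags, continuing the `owner` tag example from `add_tag`:

```python
# Read a single tag value and all tags attached to the training dataset.
owner = td.get_tag("owner")
all_tags = td.get_tags()

# Remove the tag again.
td.delete_tag("owner")
```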
init_prepared_statement #
Initialise and cache parametrized prepared statements to retrieve feature vectors from the online feature store.
| PARAMETER | DESCRIPTION |
|---|---|
batch | If set to True, prepared statements will be initialised for retrieving serving vectors as a batch. TYPE: `bool` |
external | If set to True, the connection to the online feature store is established using the same host as the connection to the Hopsworks instance. TYPE: `bool` |
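A sketch of warming up the prepared statements before serving, so the first `get_serving_vector(s)` call does not pay the initialisation cost:

```python
# Initialise prepared statements for batch retrieval over an external connection.
td.init_prepared_statement(batch=True, external=True)
```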
insert #
insert(
features: query.Query
| pd.DataFrame
| TypeVar("pyspark.sql.DataFrame")
| TypeVar("pyspark.RDD")
| np.ndarray
| list[list],
overwrite: bool,
write_options: dict[Any, Any] | None = None,
)
Insert additional feature data into the training dataset.
Deprecated
The `insert` method is deprecated.
This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists, or NumPy ndarrays. The schemas must match for this operation.
This can also be used to overwrite all data in an existing training dataset.
| PARAMETER | DESCRIPTION |
|---|---|
features | Feature data to be materialized. TYPE: `query.Query`, `pd.DataFrame`, `pyspark.sql.DataFrame`, `pyspark.RDD`, `np.ndarray`, or `list[list]` |
overwrite | Whether to overwrite the entire data in the training dataset. TYPE: `bool` |
write_options | Additional write options as key-value pairs, defaults to `None`. TYPE: `dict` |
| RETURNS | DESCRIPTION |
|---|---|
`TrainingDataset` | The updated training dataset metadata object. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | Unable to create training dataset metadata. |
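Although deprecated, here is a sketch of appending rows from a Pandas DataFrame; the column names are placeholders and must match the training dataset schema:

```python
import pandas as pd

# Placeholder dataframe; in practice its schema must match the training dataset.
df = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0], "label": [0, 1]})

# Append the rows; pass overwrite=True to replace all existing data instead.
td.insert(df, overwrite=False)
```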
read #
read(split=None, read_options=None)
Read the training dataset into a dataframe.
It is also possible to read only a specific split.
| PARAMETER | DESCRIPTION |
|---|---|
split | Name of the split to read, defaults to `None`, reading the entire training dataset. DEFAULT: `None` |
read_options | Additional read options as key/value pairs, defaults to `None`. DEFAULT: `None` |
| RETURNS | DESCRIPTION |
|---|---|
`DataFrame` | The dataframe containing the feature data of the training dataset (or of the requested split). |
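A sketch of reading the materialized data back, assuming the training dataset was created with a split named `train` (a placeholder):

```python
# Read the full training dataset, or only a named split.
full_df = td.read()
train_df = td.read(split="train")
```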
save #
save(
features: query.Query
| pd.DataFrame
| TypeVar("pyspark.sql.DataFrame")
| TypeVar("pyspark.RDD")
| np.ndarray
| list[list],
write_options: dict[Any, Any] | None = None,
)
Materialize the training dataset to storage.
This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists, or NumPy ndarrays. From v2.5 onward, filters are saved along with the Query.
Engine Support
Creating training datasets from dataframes is only supported when using Spark as the engine.
| PARAMETER | DESCRIPTION |
|---|---|
features | Feature data to be materialized. TYPE: `query.Query`, `pd.DataFrame`, `pyspark.sql.DataFrame`, `pyspark.RDD`, `np.ndarray`, or `list[list]` |
write_options | Additional write options as key-value pairs, defaults to `None`. TYPE: `dict` |
| RETURNS | DESCRIPTION |
|---|---|
`TrainingDataset` | The updated training dataset metadata object. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | Unable to create training dataset metadata. |
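A sketch of materializing a training dataset from a feature store query, assuming `fs` is a feature store handle (as in the first sketch) and that a feature group named `transactions` exists; names and arguments are illustrative:

```python
# Select features from an existing feature group (names are placeholders).
fg = fs.get_feature_group("transactions", version=1)
query = fg.select_all()

# Create the training dataset metadata object and materialize the data.
td = fs.create_training_dataset(name="sales_model_training_data", version=1)
td.save(query)
```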
show #
Show the first rows of the training dataset.
update_statistics_config #
update_statistics_config()
Update the statistics configuration of the training dataset.
Change the statistics_config object and persist the changes by calling this method.
| RETURNS | DESCRIPTION |
|---|---|
`TrainingDataset` | The updated metadata object of the training dataset. |
| RAISES | DESCRIPTION |
|---|---|
hopsworks.client.exceptions.RestAPIError | in case the backend encounters an issue. |
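A sketch of adjusting the statistics configuration before persisting it; the attribute names on `statistics_config` are illustrative assumptions:

```python
# Adjust the in-memory statistics configuration (attribute names illustrative),
# then persist the change in the backend.
td.statistics_config.enabled = True
td.statistics_config.histograms = True
td.update_statistics_config()
```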