
feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord #919

Merged: 18 commits merged into googleapis:main on Jan 27, 2022

Conversation

morgandu (Contributor) commented Dec 17, 2021

  • added batch_serve_to_bq and batch_serve_to_gcs (see the sketch after this list)
  • added unit tests
  • added integration tests
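
For orientation, here is a minimal sketch of the GCS variant, using the argument names the review below eventually settles on; the project, featurestore ID, URIs, and the "csv-file"/"tfrecord-file" destination type strings are illustrative assumptions, not taken from this page:

from google.cloud import aiplatform

# Hypothetical setup; project, location, and IDs are illustrative.
aiplatform.init(project="my-project", location="us-central1")
fs = aiplatform.Featurestore(featurestore_name="my_featurestore")

fs.batch_serve_to_gcs(
    gcs_destination_output_uri_prefix="gs://my-bucket/served/",
    gcs_destination_type="csv-file",  # assumed value; "tfrecord-file" for TFRecord
    serving_feature_ids={"users": ["age", "gender"]},  # entity_type -> features
    read_instances_uri="bq://my-project.my_dataset.read_instances",
)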

@morgandu morgandu requested a review from sasha-gitg December 17, 2021 04:50
@morgandu morgandu changed the title feat: add batch_serve_to_bq and batch_serve_to_gcs supporting csv and tfrecord destinations in Featurestore class feat: add batch_serve_to_bq for bigquery table and batch_serve_to_gcs for csv and tfrecord files in Featurestore class Dec 17, 2021
self,
bq_destination_output_uri: str,
entity_type_ids: List[str],
entity_type_destination_fields: Optional[
Review comment:
This argument name is a bit confusing. It is meant to specify which features to read and, optionally, the output column names, right? Maybe change it to something related to features rather than entity types, e.g. features_to_read or the like.

Alternatively, we could have a separate struct that defines an entity type and which features to read within it, and pass a list of those to this function instead of the separate entity_type_ids and entity_type_destination_fields.

Another option is a flat list of features to read, using full resource path names. Internally, we can group them by entity type and construct the read request; a sketch of that grouping follows. This could be a little repetitive on the calling side, but I think the interface is more intuitive. We would need a different struct for read_instance though (e.g. Dict instead of List).
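
A minimal sketch of the internal grouping this option implies, assuming full resource names of the usual projects/.../entityTypes/{entity_type_id}/features/{feature_id} form; the helper is hypothetical, not the SDK's implementation:

from collections import defaultdict
from typing import Dict, List

def group_features_by_entity_type(
    feature_resource_names: List[str],
) -> Dict[str, List[str]]:
    """Groups full feature resource names by their entity type ID."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name in feature_resource_names:
        parts = name.split("/")
        # Expects ".../entityTypes/{entity_type_id}/features/{feature_id}".
        entity_type_id = parts[parts.index("entityTypes") + 1]
        feature_id = parts[parts.index("features") + 1]
        grouped[entity_type_id].append(feature_id)
    return dict(grouped)

# e.g. two features under entity type "e1" yield {"e1": ["f1", "f2"]}.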

morgandu (Contributor, Author) commented Jan 14, 2022

Isn't read_instances the BigQuery URI or GCS URI(s)? How would a Dict work here?

Regarding renaming entity_type_destination_fields: how about feature_destination_fields, to match feature_source_fields in the EntityType class's ingestion methods?

Regarding the data type and usage of entity_type_destination_fields:

  1. I think the current implementation is an attempt at the first alternative suggested here. The purpose of having separate entity_type_ids and entity_type_destination_fields is that when a user wants to batch serve all features of an entity type (without renaming any output columns), they don't need to include that entity type's ID in entity_type_destination_fields. However, we could still achieve this and drop the separate entity_type_ids field by requiring every entity type ID to appear as a key in entity_type_destination_fields, with a ["*"] value for entity types whose features should all be served without renaming. For example:
{
    "entity_type_1": ["*"],
    "entity_type_2": {"feature_1": "feature_1_col"},
    "entity_type_3": ["feature_1", "feature_2"],
}
  2. The second option is a bit redundant, but it is indeed very clear and straightforward:
{"...path/to/fs/fs1/et/e1/f/f1": "f11_col", "...path/to/fs/fs1/et/e2/f/f1": "f21_col"}

Would the modification of the current implementation in 1) be less confusing and more appealing to use?
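
For concreteness, a minimal sketch of how the overloaded mapping in option 1 could be normalized into a uniform {feature_id: output_column} form; the helper and its "*" handling are hypothetical, not part of the PR:

from typing import Dict, List, Union

def normalize_destination_fields(
    spec: Dict[str, Union[List[str], Dict[str, str]]],
    all_feature_ids: Dict[str, List[str]],  # entity_type_id -> all its feature IDs
) -> Dict[str, Dict[str, str]]:
    """Expands ["*"], plain lists, and rename dicts into explicit mappings."""
    normalized: Dict[str, Dict[str, str]] = {}
    for entity_type_id, value in spec.items():
        if isinstance(value, dict):
            # Explicit {feature_id: output_column} renames.
            normalized[entity_type_id] = dict(value)
        elif value == ["*"]:
            # Serve every feature under its own name.
            features = all_feature_ids[entity_type_id]
            normalized[entity_type_id] = {f: f for f in features}
        else:
            # Plain list: serve the listed features under their own names.
            normalized[entity_type_id] = {f: f for f in value}
    return normalized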

Reviewer reply:
Regarding [1], you also have the option of mapping an output column name to something different from the feature name. I feel this overloaded argument is hard for users to understand (though I agree it could be convenient once users get a good grasp of it).

I think it is easier to understand if this is split into two arguments: one to define the features to read, the other to define the (overridden) output column names. This also allows users to define output columns only for the features they want to rename (e.g. read */all features from entity_type_1 and output each to a column of the same name, except feature foo, which is output to column bar).

With the flat list, the entity type IDs can be extracted from the feature path names. However, unless we constrain the list to be grouped by entity type, the order of the entity types in the read_instance could be ambiguous. Thus I think using a Dict (e.g. {'entity_type_1': 'user_1234', 'entity_type_2': 'product_xyz'}) is clearer than using an array (['user_1234', 'product_xyz']). The flat-list option also requires every feature to be listed explicitly, i.e. "*" is not supported; that is its biggest con.

morgandu (Contributor, Author) replied

Removed entity_type_ids and entity_type_destination_fields to avoid confusion; added two dicts: serving_feature_ids for the entity_type / features mapping, and feature_destination_fields for the feature_id / feature_destination_field_name mapping.
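
Based on this description, a sketch of what a batch_serve_to_bq call might look like after the change; the project, IDs, URIs, and the read-instances argument name are assumptions, and whether feature_destination_fields is keyed by plain feature ID or by full resource name should be checked against the merged docstrings:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical
fs = aiplatform.Featurestore(featurestore_name="my_featurestore")  # hypothetical ID

fs.batch_serve_to_bq(
    bq_destination_output_uri="bq://my-project.my_dataset.served_features",
    # entity_type -> features to serve from it
    serving_feature_ids={
        "users": ["age", "gender"],
        "movies": ["genre", "average_rating"],
    },
    # feature -> overridden output column name (only needed for renames)
    feature_destination_fields={"average_rating": "avg_rating_col"},
    read_instances_uri="bq://my-project.my_dataset.read_instances",
)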


entity_type_ids = ['my_entity_type_id_1', 'my_entity_type_id_2', 'my_entity_type_id_3']

feature_source_fields = {
Review comment:

Do you mean entity_type_destination_fields?

morgandu (Contributor, Author) replied

fixed the docstring

@morgandu morgandu changed the title feat: add batch_serve_to_bq for bigquery table and batch_serve_to_gcs for csv and tfrecord files in Featurestore class feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord Jan 15, 2022
sasha-gitg (Member) left a comment

LGTM! Nice work!

(Five review threads on google/cloud/aiplatform/featurestore/featurestore.py were marked outdated and resolved.)
diemtvu left a comment

LGTM

@morgandu morgandu merged commit c840728 into googleapis:main Jan 27, 2022
ivanmkc pushed a commit to ivanmkc/python-aiplatform that referenced this pull request Jan 28, 2022
feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord (googleapis#919)

* feat: add batch_serve_to_bq for bigquery table and batch_serve_to_gcs for csv and tfrecord files in Featurestore class

* fix: change entity_type_ids and entity_type_destination_fields to serving_feature_ids and feature_destination_fields

* fix: remove white space

* Update google/cloud/aiplatform/featurestore/featurestore.py

Co-authored-by: sasha-gitg <[email protected]>

* Update google/cloud/aiplatform/featurestore/featurestore.py

Co-authored-by: sasha-gitg <[email protected]>

* Update google/cloud/aiplatform/featurestore/featurestore.py

Co-authored-by: sasha-gitg <[email protected]>

* Update google/cloud/aiplatform/featurestore/featurestore.py

Co-authored-by: sasha-gitg <[email protected]>

* Update google/cloud/aiplatform/featurestore/featurestore.py

Co-authored-by: sasha-gitg <[email protected]>

* fix: Featurestore create method example usage

* fix: get_timestamp_proto for millisecond precision cap

* fix: unit tests for get_timestamp_proto

Co-authored-by: sasha-gitg <[email protected]>
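
The "millisecond precision cap" fix in the commit list above can be illustrated with a short sketch; this is an assumed implementation of such a cap on a protobuf Timestamp, not the PR's actual code:

import datetime

from google.protobuf import timestamp_pb2

def get_timestamp_proto(time: datetime.datetime) -> timestamp_pb2.Timestamp:
    """Converts a datetime to a Timestamp proto, capped at millisecond precision."""
    t = timestamp_pb2.Timestamp()
    t.FromDatetime(time)
    # Per the commit message, precision is capped at milliseconds:
    # zero out any sub-millisecond component of the nanos field.
    t.nanos -= t.nanos % 1_000_000
    return t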