feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord #919
Conversation
… for csv and tfrecord files in Featurestore class
self,
bq_destination_output_uri: str,
entity_type_ids: List[str],
entity_type_destination_fields: Optional[
This argument name is a bit confusing. It specifies which features to read and, optionally, their output column names, right? Maybe rename it to something feature-related rather than entity-type-related, e.g. `feature_to_read` or some such.

Alternatively, we could have a separate struct that defines an entity type and the features to read within it, and pass a list of those structs to this function, instead of the separate `entity_type_ids` and `entity_type_destination_fields`.

Another option is a flat list of features to read, using full resource path names. Internally, we can group them by entity type and construct the read request. This could be a little repetitive on the calling side, but I think the interface is more intuitive. We'd need a different struct for `read_instance`, though (e.g. a Dict instead of a List).
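[Editor's note: to make the flat-list option concrete, here is a minimal sketch of how the SDK could group fully qualified feature paths by entity type. The helper name and path layout are assumptions for illustration, not the actual implementation.]

```python
# Illustrative sketch only: group a flat list of fully qualified feature
# resource names by entity type, so the SDK can build per-entity-type
# read requests internally.
from collections import defaultdict
from typing import Dict, List


def group_features_by_entity_type(feature_paths: List[str]) -> Dict[str, List[str]]:
    """Group paths of the (assumed) form
    projects/<p>/locations/<l>/featurestores/<fs>/entityTypes/<et>/features/<f>
    into {entity_type_id: [feature_id, ...]}."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in feature_paths:
        parts = path.split("/")
        entity_type_id = parts[parts.index("entityTypes") + 1]
        feature_id = parts[parts.index("features") + 1]
        grouped[entity_type_id].append(feature_id)
    return dict(grouped)


paths = [
    "projects/p/locations/l/featurestores/fs1/entityTypes/e1/features/f1",
    "projects/p/locations/l/featurestores/fs1/entityTypes/e1/features/f2",
    "projects/p/locations/l/featurestores/fs1/entityTypes/e2/features/f1",
]
print(group_features_by_entity_type(paths))
# {'e1': ['f1', 'f2'], 'e2': ['f1']}
```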
Isn't `read_instances` the BQ URI or GCS URI(s)? How would a Dict work here?

Regarding renaming `entity_type_destination_fields`: how about `feature_destination_fields`, to match `feature_source_fields` in the `EntityType` class's ingestion methods?
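[Editor's note: for context on the naming parallel, a hedged sketch of how `feature_source_fields` appears in the `EntityType` ingestion methods. The resource names and column values below are illustrative placeholders, not from this PR.]

```python
# Hedged sketch: in EntityType ingestion, feature_source_fields maps a
# feature ID to its column name in the source data. All IDs, URIs, and
# column names here are illustrative assumptions.
from google.cloud import aiplatform

entity_type = aiplatform.EntityType(
    "projects/my-project/locations/us-central1/featurestores/my_fs/entityTypes/users"
)

entity_type.ingest_from_gcs(
    feature_ids=["age", "gender"],
    feature_time="update_time",                 # column holding the feature timestamp
    gcs_source_uris="gs://my-bucket/users.csv",
    gcs_source_type="csv",
    feature_source_fields={
        # feature_id -> source column name (only needed where names differ)
        "age": "user_age",
    },
)
```

The proposed `feature_destination_fields` would mirror this shape on the output side.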
Regarding the data type and usage of `entity_type_destination_fields`:

1. I think the current implementation is trying to implement the first alternative suggested here. The purpose of having separate `entity_type_ids` and `entity_type_destination_fields` is that if a user wants to batch serve all features of some entity types (without different output feature names), they won't need to include those entity type IDs in `entity_type_destination_fields`. However, we could still achieve this and remove the separate `entity_type_ids` field by requiring every entity type ID to appear as a key in `entity_type_destination_fields`, with a `["*"]` value for entity types whose features are all served without renaming. For example (see the normalization sketch after this list):

   {
       "entity_type_1": ["*"],
       "entity_type_2": {"feature_1": "feature_1_col"},
       "entity_type_3": ["feature_1", "feature_2"],
   }

2. The second option is a bit redundant, but it is indeed very clear and straightforward:

   {"...path/to/fs/fs1/et/e1/f/f1": "f11_col", "...path/to/fs/fs1/et/e2/f/f1": "f21_col"}

Would modifying the current implementation as in (1) be less confusing and more appealing to use?
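[Editor's note: to make the "overloaded value" concern below concrete, a hypothetical normalizer that splits the option-(1) mapping into "features to read" plus "output column overrides". The helper name and the `"entity_type.feature"` key format are illustrative assumptions.]

```python
# Hypothetical sketch: normalize the overloaded option-(1) mapping, whose
# values are either a list of feature IDs (or ["*"]) or a dict of
# feature_id -> renamed output column.
from collections.abc import Mapping
from typing import Dict, List, Tuple, Union

FieldsSpec = Union[List[str], Dict[str, str]]


def normalize(spec: Dict[str, FieldsSpec]) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    features_to_read: Dict[str, List[str]] = {}
    destination_fields: Dict[str, str] = {}
    for entity_type_id, value in spec.items():
        if isinstance(value, Mapping):
            # dict value: feature_id -> overridden output column name
            features_to_read[entity_type_id] = list(value)
            for feature_id, column in value.items():
                destination_fields[f"{entity_type_id}.{feature_id}"] = column
        else:
            # list value: feature IDs, or ["*"] for all features, no renaming
            features_to_read[entity_type_id] = list(value)
    return features_to_read, destination_fields


features, overrides = normalize({
    "entity_type_1": ["*"],
    "entity_type_2": {"feature_1": "feature_1_col"},
    "entity_type_3": ["feature_1", "feature_2"],
})
print(features)   # {'entity_type_1': ['*'], 'entity_type_2': ['feature_1'], ...}
print(overrides)  # {'entity_type_2.feature_1': 'feature_1_col'}
```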
Regarding [1]: you also have the option to map the output column name to something different from the feature name. I feel this overloaded argument is hard for users to understand (though I agree it could be convenient once users get a good grasp of it). I think it is easier to understand to split it into two arguments: one to define the features to read, the other to define the (overridden) output column names. This also allows users to define the output column only for the features they want to, i.e. they can read all (`*`) features from `entity_type_1` into columns with the same names, except that feature `foo` will be output to column `bar`.

With the flat list, the entity type IDs can be extracted from the feature path names. However, unless we add the constraint that the list has to be grouped by entity type, the order of the entity types in `read_instance` could be ambiguous. Thus I think using a Dict (e.g. `{'entity_type_1': 'user_1234', 'entity_type_2': 'product_xyz'}`) is clearer than using an array (`['user_1234', 'product_xyz']`). The flat-list option also requires every feature to be explicitly listed, i.e. `"*"` is not supported. That is its biggest con.
Removed `entity_type_ids` and `entity_type_destination_fields` to avoid confusion; added two dicts: `serving_feature_ids` for an entity_type / features mapping, and `feature_destination_fields` for a feature_id / feature_destination_field_name mapping.
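[Editor's note: based on that final design, a hedged usage sketch. The names `serving_feature_ids` and `feature_destination_fields` come from this thread; all URIs, IDs, and the read-instances argument name are illustrative assumptions, not a definitive API reference.]

```python
# Hedged usage sketch of the final two-dict design for batch serving to
# BigQuery. Every project ID, table URI, and feature name below is a
# placeholder assumption.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

fs = aiplatform.Featurestore(featurestore_name="my_featurestore_id")

fs.batch_serve_to_bq(
    bq_destination_output_uri="bq://my-project.my_dataset.serving_output",
    serving_feature_ids={
        # entity_type_id -> feature IDs to serve
        "users": ["age", "gender"],
        "movies": ["genre", "average_rating"],
    },
    feature_destination_fields={
        # fully qualified feature name -> output column name (optional overrides)
        "projects/my-project/locations/us-central1/featurestores/my_featurestore_id/entityTypes/users/features/age": "user_age_col",
    },
    read_instances_uri="bq://my-project.my_dataset.read_instances",
)
```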
entity_type_ids = ['my_entity_type_id_1', 'my_entity_type_id_2', 'my_entity_type_id_3']

feature_source_fields = {
Do you mean `entity_type_destination_fields`?
fixed the docstring
LGTM! Nice work!
Co-authored-by: sasha-gitg <[email protected]>
LGTM
…d tfrecord (googleapis#919)

* feat: add batch_serve_to_bq for bigquery table and batch_serve_to_gcs for csv and tfrecord files in Featurestore class
* fix: change entity_type_ids and entity_type_destination_fields to serving_feature_ids and feature_destination_fields
* fix: remove white space
* Update google/cloud/aiplatform/featurestore/featurestore.py (Co-authored-by: sasha-gitg <[email protected]>)
* Update google/cloud/aiplatform/featurestore/featurestore.py (Co-authored-by: sasha-gitg <[email protected]>)
* Update google/cloud/aiplatform/featurestore/featurestore.py (Co-authored-by: sasha-gitg <[email protected]>)
* Update google/cloud/aiplatform/featurestore/featurestore.py (Co-authored-by: sasha-gitg <[email protected]>)
* Update google/cloud/aiplatform/featurestore/featurestore.py (Co-authored-by: sasha-gitg <[email protected]>)
* fix: Featurestore create method example usage
* fix: get_timestamp_proto for millisecond precision cap
* fix: unit tests for get_timestamp_proto

Co-authored-by: sasha-gitg <[email protected]>
`batch_serve_to_bq` and `batch_serve_to_gcs`
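[Editor's note: for completeness, a similar hedged sketch for the GCS path named in the PR title. The method and destination types follow the PR title; the exact parameter names and all values are illustrative assumptions.]

```python
# Hedged sketch for batch serving to GCS as csv (or tfrecord). Bucket
# names, feature IDs, and parameter names are placeholder assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

fs = aiplatform.Featurestore(featurestore_name="my_featurestore_id")

fs.batch_serve_to_gcs(
    gcs_destination_output_uri_prefix="gs://my-bucket/featurestore/output",
    gcs_destination_type="csv",  # or "tfrecord"
    serving_feature_ids={
        "users": ["age", "gender"],
    },
    read_instances_uri="gs://my-bucket/featurestore/read_instances.csv",
)
```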