feat: enable ingest from pd.DataFrame #977
Conversation
morgandu commented Jan 25, 2022:
- added ingest_from_df (see the usage sketch after this list)
- added unit tests
- added system tests
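For readers skimming the thread, here is a minimal usage sketch of `ingest_from_df`. It assumes the final signature mirrors the existing `ingest_from_bq`/`ingest_from_gcs` methods (`feature_ids`, `feature_time`, `df_source`, plus the optional arguments discussed below); the project, featurestore, entity type, and feature names are illustrative only, not taken from this PR.

```python
import datetime

import pandas as pd
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

# Source DataFrame whose columns correspond to feature IDs on the entity type.
df_source = pd.DataFrame(
    {
        "movie_id": ["movie_01", "movie_02"],
        "average_rating": [4.9, 4.2],
        "genres": ["Drama", "Horror"],
    }
)

# Hypothetical existing featurestore and entity type.
movie_entity_type = aiplatform.featurestore.EntityType(
    entity_type_name="movies", featurestore_id="movie_prediction"
)

# Ingest the DataFrame; per the discussion below, the call is expected to
# return only after ingestion completes.
movie_entity_type.ingest_from_df(
    feature_ids=["average_rating", "genres"],
    feature_time=datetime.datetime.now(datetime.timezone.utc),
    df_source=df_source,
    entity_id_field="movie_id",
)
```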
```
        'my_feature_id_1': 'my_feature_id_1_source_field',
    }
entity_id_field (str):
    Optional. Source column that holds entity IDs. If not provided, entity
```
nit: Does it make sense to have the default value be `entity_id` and drop the Optional?
Changing Optional to Required would also be a breaking change for `ingest_from_bq` and `ingest_from_gcs`. I'd prefer to rely on the service default over a manually defined default.
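As a hedged illustration of the two arguments in the quoted docstring (all names below are made up, and `entity_type` stands for an existing `aiplatform.featurestore.EntityType` handle): `feature_source_fields` maps a feature ID to the DataFrame column holding its values, and `entity_id_field` names the column holding entity IDs; omitting it falls back to the service default.

```python
import datetime

import pandas as pd

# Hypothetical DataFrame whose column names do not match the feature IDs.
df_source = pd.DataFrame(
    {
        "my_entity_id_source_field": ["entity_01", "entity_02"],
        "my_feature_id_1_source_field": [1.0, 2.0],
    }
)

entity_type.ingest_from_df(
    feature_ids=["my_feature_id_1"],
    feature_time=datetime.datetime.now(datetime.timezone.utc),
    df_source=df_source,
    # Feature ID -> source column, for columns that do not match feature IDs.
    feature_source_fields={"my_feature_id_1": "my_feature_id_1_source_field"},
    # Source column holding entity IDs; drop this argument to use the service default.
    entity_id_field="my_entity_id_source_field",
)
```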
```python
    worker_count: Optional[int] = None,
    request_metadata: Optional[Sequence[Tuple[str, str]]] = (),
) -> "EntityType":
    """Ingest feature values from DataFrame.
```
Should this also state that a temporary BQ table is created?
Added a Note: section to state that a temporary BQ dataset is created.
```python
temp_bq_dataset_name = f"temp_{featurestore_id}_{uuid.uuid4()}".replace(
    "-", "_"
)
temp_bq_dataset_id = f"{initializer.global_config.project}.{temp_bq_dataset_name}"[
```
It seems like `self.project` is more consistent here.
`self.project` returns the project number, but a BQ dataset needs a project ID.
Got it. This is an issue going forward because the project used to construct the `EntityType` might not match the project ID set in `init`, or, if the project is not set via `init`, the project ID returned by `google.auth.default`. Additionally, we don't stop the user from using a project number when setting `init`. We shouldn't block this PR since there is a workaround, but we should capture this issue in a ticket.
Sounds good, will address this in a separate ticket/PR.
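For context, a rough sketch of the staging flow implied by the quoted snippet: the DataFrame is loaded into a temporary BigQuery dataset named from the featurestore ID plus a UUID, ingestion then goes through the existing BigQuery path, and the dataset is removed afterwards. The helper below is hypothetical, and the `google.cloud.bigquery` calls are assumptions about the approach, not the exact code in this PR.

```python
import datetime
import uuid

from google.cloud import bigquery


def stage_df_and_ingest(entity_type, df_source, project_id, featurestore_id):
    """Hypothetical helper mirroring the staging flow discussed in this thread."""
    bq_client = bigquery.Client(project=project_id)

    # BigQuery dataset IDs cannot contain hyphens, hence the replace().
    temp_bq_dataset_name = f"temp_{featurestore_id}_{uuid.uuid4()}".replace("-", "_")
    temp_bq_dataset_id = f"{project_id}.{temp_bq_dataset_name}"  # needs a project ID, not a number
    temp_bq_table_id = f"{temp_bq_dataset_id}.entity_values"

    bq_client.create_dataset(temp_bq_dataset_id)
    try:
        # Load the DataFrame into a temporary table, then reuse the BQ ingestion path.
        bq_client.load_table_from_dataframe(df_source, temp_bq_table_id).result()
        entity_type.ingest_from_bq(
            feature_ids=[c for c in df_source.columns if c != "entity_id"],
            feature_time=datetime.datetime.now(datetime.timezone.utc),
            bq_source_uri=f"bq://{temp_bq_table_id}",
            entity_id_field="entity_id",
        )
    finally:
        # Clean up the temporary dataset regardless of the ingestion outcome.
        bq_client.delete_dataset(temp_bq_dataset_id, delete_contents=True)
    return entity_type
```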
```python
)

list_movie_features = movie_entity_type.list_features()
assert len(list_movie_features) == 3
assert "EntityType feature values imported." in caplog.text
```
Is there any way we can assert that the resource has been updated based on the ingest from the DataFrame?
Updated to use the `read` method, which produces a pandas DataFrame from an online read, to validate the feature values before and after ingestion against a DataFrame built from the raw data. This in turn validates both the `read` and `ingest_from_df` methods.
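A hedged sketch of that validation pattern, assuming `read` performs an online read that returns a pandas DataFrame keyed by an `entity_id` column (the entity type, IDs, and feature values below are illustrative, not the actual test data):

```python
import datetime

import pandas as pd
from google.cloud import aiplatform

movie_entity_type = aiplatform.featurestore.EntityType(
    entity_type_name="movies", featurestore_id="movie_prediction"  # hypothetical IDs
)

# Raw data used both as the ingestion source and as the expected result.
expected_df = pd.DataFrame(
    {
        "entity_id": ["movie_01", "movie_02"],
        "average_rating": [4.9, 4.2],
        "genres": ["Drama", "Horror"],
    }
)

movie_entity_type.ingest_from_df(
    feature_ids=["average_rating", "genres"],
    feature_time=datetime.datetime.now(datetime.timezone.utc),
    df_source=expected_df,
    entity_id_field="entity_id",
)

# Online read after ingestion, compared against the raw-data DataFrame.
actual_df = movie_entity_type.read(
    entity_ids=["movie_01", "movie_02"],
    feature_ids=["average_rating", "genres"],
)
pd.testing.assert_frame_equal(
    actual_df.sort_values("entity_id").reset_index(drop=True),
    expected_df.sort_values("entity_id").reset_index(drop=True),
    check_like=True,  # ignore column ordering
)
```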
…eleting bq dataset create, add featurestore_extra_require, update ic tests to use online read to validate feature value ingestion from df
```python
    feature_source_fields: Optional[Dict[str, str]] = None,
    entity_id_field: Optional[str] = None,
    request_metadata: Optional[Sequence[Tuple[str, str]]] = (),
) -> "EntityType":
```
Can you add a comment to mention whether this function waits for ingestion to complete (in other words, should users expect the data to be available once the function returns)?
Added.
lgtm. Thanks
* feat: enable ingest from pd.DataFrame
* fix: remove bq create_dataset, docstrings, mocks
* fix: e2e_base project
* fix: delete two optional args, add note for temp bq dataset, revert deleting bq dataset create, add featurestore_extra_require, update ic tests to use online read to validate feature value ingestion from df
* fix: add a comment that the call completes upon ingestion, update unit tests