Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enable ingest from pd.DataFrame #977

Merged
merged 6 commits into from
Jan 27, 2022

Conversation

morgandu
Copy link
Contributor

  • added ingest_from_df
  • added unit tests
  • added system tests

@morgandu morgandu requested a review from sasha-gitg January 25, 2022 21:05
google/cloud/aiplatform/featurestore/entity_type.py Outdated Show resolved Hide resolved
'my_feature_id_1': 'my_feature_id_1_source_field',
}
entity_id_field (str):
Optional. Source column that holds entity IDs. If not provided, entity
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Does this make sense to have the default value to entity_id and drop the optional?

Copy link
Contributor Author

@morgandu morgandu Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional->Required will also lead to breaking change for ingest_from_bq and ingest_from_gcs.

Prefer to use service default over manually defined default.

google/cloud/aiplatform/featurestore/entity_type.py Outdated Show resolved Hide resolved
@morgandu morgandu requested a review from ivanmkc January 26, 2022 00:25
worker_count: Optional[int] = None,
request_metadata: Optional[Sequence[Tuple[str, str]]] = (),
) -> "EntityType":
"""Ingest feature values from DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also state that a temporary BQ table is created.

Copy link
Contributor Author

@morgandu morgandu Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a Note: section to state the temporary BQ dataset

temp_bq_dataset_name = f"temp_{featurestore_id}_{uuid.uuid4()}".replace(
"-", "_"
)
temp_bq_dataset_id = f"{initializer.global_config.project}.{temp_bq_dataset_name}"[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like self.project is more consistent here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.project returns the project number, but bq dataset need a project ID

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. This is an issue moving forward because the project used to construct EntityType might not match the project ID set in init or, if project is not set using init, the project ID returned from google.auth.default.

Additionally, we don't stop the user from using a project number when setting init.

We shouldn't block this PR as there is a workaround but we should capture this issue in a ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, will address this in a separate ticket/PR

tests/system/aiplatform/e2e_base.py Outdated Show resolved Hide resolved
)

list_movie_features = movie_entity_type.list_features()
assert len(list_movie_features) == 3
assert "EntityType feature values imported." in caplog.text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway we can assert the resource has been updated based on the ingest from df?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with the read method which produce a pd from online reading, to validate the feature values before and after ingestion, using a raw data created dataframe. This in turn validated both read and ingest_from_dfmethod.

…eleting bq dataset create, add featurestore_extra_require, update ic tests to use online read to validate feature value ingestionfrom df
@morgandu morgandu requested a review from a team as a code owner January 26, 2022 23:43
feature_source_fields: Optional[Dict[str, str]] = None,
entity_id_field: Optional[str] = None,
request_metadata: Optional[Sequence[Tuple[str, str]]] = (),
) -> "EntityType":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to mention whether this function wait for ingestion complete (in other words, do users expected data is available once the function return)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link

@diemtvu diemtvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks

@morgandu morgandu merged commit 9289f2d into googleapis:main Jan 27, 2022
@morgandu morgandu deleted the mor--feature-store-ingest-from-df branch January 27, 2022 15:56
ivanmkc pushed a commit to ivanmkc/python-aiplatform that referenced this pull request Jan 28, 2022
* feat: enable ingest from pd.DataFrame

* fix: remove bq create_dataset, docstrings, mocks

* fix: e2e_base project

* fix: delete two optional args, add note for temp bq dataset, revert deleting bq dataset create, add featurestore_extra_require, update ic tests to use online read to validate feature value ingestionfrom df

* fix: add a comment of call complete upon ingestion, update unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants