feat: Add ability to filter GCP project ingestion based on project labels #10242
@@ -69,6 +69,9 @@ def __init__(self, **data: Any):
    def get_bigquery_client(self) -> bigquery.Client:
        client_options = self.extra_client_options
        return bigquery.Client(self.project_on_behalf, **client_options)

    def get_projects_client(self) -> resourcemanager_v3.ProjectsClient:
        return resourcemanager_v3.ProjectsClient()
Do you know whether this client needs any extra permissions?
If it does, can we make sure the client is not created unless the config is set to use it?
Good question. I initially implemented this PR using the `list_projects` method but landed on `search_projects` because of its flexibility in searching for projects based on parents AND labels. Looking at the GCP docs, it seems you'll need to ensure the SA has `resourcemanager.projects.get` on the projects you want to grab.
Is that your read, too?
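For context, here is a minimal sketch of how a list of labels might be turned into a `search_projects` query. The helper names `build_label_query` and `search_projects_by_labels`, and the `key:value` label format, are assumptions for illustration and not code from this PR. The `labels.<key>:<value>` term syntax follows the Resource Manager v3 search documentation.

```python
from typing import Iterable, List


def build_label_query(labels: Iterable[str]) -> str:
    """Join "key:value" labels into one OR-ed query string (hypothetical helper)."""
    terms = [f"labels.{label.strip()}" for label in labels if label.strip()]
    return " OR ".join(terms)


def search_projects_by_labels(labels: Iterable[str]) -> List[str]:
    """Return IDs of projects matching any label (hypothetical helper).

    Requires the google-cloud-resource-manager package and credentials.
    """
    query = build_label_query(labels)
    if not query:
        # Assumption: an empty label list should match nothing, not everything.
        return []
    # Imported lazily so build_label_query works without the GCP library.
    from google.cloud import resourcemanager_v3

    client = resourcemanager_v3.ProjectsClient()
    return [project.project_id for project in client.search_projects(query=query)]
```

Running `search_projects_by_labels` with a service account that lacks the permission in question would be one way to check what the call actually requires.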
@anaghshineh There is an easy way to test this: create a service account without this permission and try to run this code. If this does indeed require a new permission, then please:
- update the docs with the new roles/permissions required
- make this optional and disabled by default. We don't want every org's ingestion to break because of this new permission. Some orgs have heavier checks on granting new permissions, and we don't want them to get stuck on permissions when this gets in.
- add a note in https://github.com/datahub-project/datahub/blob/master/docs/how/updating-datahub.md that there is a new option that requires new permissions
As this is open source, we have to be careful not to break other orgs' ingestion as much as possible.
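One way the "optional, disabled by default" request could look is sketched below. `ConnectionConfig` and `client_factory` are hypothetical stand-ins rather than the actual DataHub config class. The assumption is that an empty `project_labels` list means the feature is off, so no Resource Manager client is created and no new permission is needed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ConnectionConfig:
    # Hypothetical stand-in for the BigQuery connection config in this PR.
    project_labels: List[str] = field(default_factory=list)
    # Injectable factory so the sketch can run without GCP credentials.
    client_factory: Optional[Callable[[], Any]] = None

    def get_projects_client(self) -> Optional[Any]:
        """Create the Resource Manager client only when labels are configured."""
        if not self.project_labels:
            # Feature is off by default: no client is created.
            return None
        if self.client_factory is not None:
            return self.client_factory()
        # Imported lazily so unconfigured users never touch the library.
        from google.cloud import resourcemanager_v3

        return resourcemanager_v3.ProjectsClient()
```

With this shape, existing recipes that do not set `project_labels` never reach the Resource Manager API.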
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
…gquery_config.py Co-authored-by: Tamas Nemeth <[email protected]>
…:anaghshineh/datahub into bq-filter-ingestion-based-on-proj-labels
@@ -1,16 +1,17 @@
# Generated requirements file. Run ./regenerate-base-requirements.sh to regenerate.
Please don't make changes in this file. This file is updated separately. This is mainly a caching mechanism for our builds.
proj_client: resourcemanager_v3.Client = config.get_projects_client()
assert proj_client
Please put this behind a config option that is disabled by default. We don't want folks' ingestion to break because of the new permissions required, as mentioned in the other comment.
Closing as #11169 was merged
What Changed?
We've wanted more control over project ingestion at my organization. We've been maintaining a "whitelist" of projects we'd like to ingest, but this is becoming unmanageable as we create more projects. We'd like to leverage consistent project labels in GCP to drive ingestion. This PR takes a first pass at allowing folks to define `project_labels` in their BigQuery ingestion recipes. It uses the GCP Resource Manager `search_projects` method to query for projects with the given labels. This does not introduce any breaking changes, as ingestion will continue to favor `project_ids` and will respect any name patterns.
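The precedence described above could be sketched roughly as follows. This is not the PR's implementation: `resolve_project_ids`, its parameters, and the choice to apply the name pattern only to label-discovered projects are assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, List


def resolve_project_ids(
    project_ids: List[str],
    project_labels: List[str],
    search_by_labels: Callable[[List[str]], Iterable[str]],
    allow_pattern: str = ".*",
) -> List[str]:
    """Pick the projects to ingest (hypothetical helper)."""
    if project_ids:
        # Explicit project_ids win; labels are ignored in that case.
        return list(project_ids)
    if not project_labels:
        # Existing discovery behaviour is out of scope for this sketch.
        return []
    allow = re.compile(allow_pattern)
    # Assumption: name patterns filter the label-discovered projects.
    return [p for p in search_by_labels(project_labels) if allow.match(p)]
```

Passing the search function in as a parameter keeps the sketch testable without GCP access.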