-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/bigquery): Generate platform resource entities for BigQuery labels #11602
feat(ingest/bigquery): Generate platform resource entities for BigQuery labels #11602
Conversation
@@ -55,6 +62,7 @@ | |||
METADATA_EXTRACTION, | |||
PROFILING, | |||
) | |||
from datahub.metadata._urns.urn_defs import TagUrn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler
from datahub.metadata.urns import TagUrn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, nice catch, my ide tried to be too smart.
if not self.graph: | ||
return None | ||
if self.platform_resource_cache.get(primary_key): | ||
return self.platform_resource_cache.get(primary_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe store the result of the first get in a variable - so you don't need to query the cache more than once
|
||
def get_platform_resource(self, primary_key: str) -> Optional[PlatformResource]: | ||
# if graph is not available we always create a new PlatformResource | ||
if not self.graph: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are the scenarios where this can be true? Can we recover from this? Should we assert and fail here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was, what if I wanted to write aspects to the file and load them to DataHub from the file?
In this case, even though this will always recreate the platform resources, it will not fail, and you can use it in an air-gapped environment.
) | ||
# platform_resource:PlatformResource = PlatformResource.from_datahub(key=platform_resource_key, graph_client=graph) | ||
platform_resource: Optional[PlatformResource] = ( | ||
platform_resources[0] if platform_resources else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not safe - since there might be multiple matches with different platforms and platform_instances - and if we want to use the primary key to get to the resource - we should check for match on platform and platform_instance - simplest check is filter on the key.id to match the platform resource key you would create otherwise.
Let's figure out why PlatformResource.from_datahub doesn't work when the resource is not present.. and fix that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes this.
) -> Iterable[MetadataChangeProposalWrapper]: | ||
platform_resource_key = PlatformResourceKey( | ||
platform="bigquery", | ||
resource_type="BigQueryLabelnfo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small typo - BigQueryLabelnfo
-> BigQueryLabelInfo
datahub_urn: str | ||
managed_by_datahub: bool | ||
key: str | ||
value: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be value: Optional[str] = None
if new_bq_label_info == existing_info: | ||
return | ||
|
||
logger.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you find a matching platform resource - but the mapped tag urn is different, then you have to do one of two things
- Mark this datahub tag as "unable to sync due to conflict" - and skip it from now on
- Create a new bigquery label based on the datahub tag using a guid algorithm - to ensure that you don't have a conflict.
- Continue on by mapping the BQ Label to this tag instead (what this code is doing)
3 is risky and will lead to flip flopping and confusing behavior.
I think you should do 2.
fd86a01
to
3a32b71
Compare
3a32b71
to
7f1d791
Compare
|
||
if new_bq_label_info.datahub_urn == existing_info.datahub_urn: | ||
logger.warning( | ||
f"Duplicate BigQueryLabelInfo found for {bigquery_label}. Skipping..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add more information
Add conflict resolution config to handle lowercased tags with mixed case urns
"aspectName": "dataPlatformInstance", | ||
"aspect": { | ||
"json": { | ||
"platform": "urn:li:dataPlatform:bigquery" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surprised that the golden file doesn't have the platformInstance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The smoke test error doesn't seem to be related to this change |
…ry labels (datahub-project#11602) Co-authored-by: Shirshanka Das <[email protected]>
…ry labels (datahub-project#11602) Co-authored-by: Shirshanka Das <[email protected]>
Checklist