Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion): adds platform instance capability to glue connector #4130

Merged

Conversation

sgomezvillamor
Copy link
Contributor

@sgomezvillamor sgomezvillamor commented Feb 13, 2022

This is enabling platform instance capability to the glue connector.

Additionally, underlying_platform field has been deprecated in favour of the standard platform one.

During the development of this task, it was noted that other connectors were badly setting the factory as the instance for the container key. This was fixed too and so the scope of the PR went a little bit beyond the glue connector.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@github-actions
Copy link

github-actions bot commented Feb 13, 2022

Unit Test Results (metadata ingestion)

       5 files         5 suites   45m 54s ⏱️
   366 tests    366 ✔️   0 💤 0
1 677 runs  1 646 ✔️ 31 💤 0

Results for commit 9a032b4.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Feb 13, 2022

Unit Test Results (build & test)

  92 files  +17    92 suites  +17   22m 3s ⏱️ +52s
674 tests +49  615 ✔️ +50  59 💤 ±0  0  - 1 

Results for commit 9a032b4. ± Comparison against base commit 2d82531.

♻️ This comment has been updated with latest results.

@sgomezvillamor sgomezvillamor marked this pull request as ready for review February 13, 2022 17:51
@sgomezvillamor
Copy link
Contributor Author

Just for the curious, the new test replicates the previous one with the simple addition of the platform_instance config. This is noted in the diff of the golden files:

(venv) ➜  sgomezvillamor-datahub git:(glue-platform-instance) ✗ diff metadata-ingestion/tests/unit/glue/glue_mces_golden.json metadata-ingestion/tests/unit/glue/glue_mces_platform_instance_golden.json 
6c6
<         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,flights-database.avro,PROD)",
---
>         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,some_instance_name.flights-database.avro,PROD)",
236c236
<         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)",
---
>         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,some_instance_name.test-database.test_jsons_markers,PROD)",
403c403
<         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_parquet,PROD)",
---
>         "urn": "urn:li:dataset:(urn:li:dataPlatform:glue,some_instance_name.test-database.test_parquet,PROD)",
734c734
<                 "urn:li:dataset:(urn:li:dataPlatform:glue,flights-database.avro,PROD)"
---
>                 "urn:li:dataset:(urn:li:dataPlatform:glue,some_instance_name.flights-database.avro,PROD)"
959c959
<                 "urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_parquet,PROD)"
---
>                 "urn:li:dataset:(urn:li:dataPlatform:glue,some_instance_name.test-database.test_parquet,PROD)"

@rslanka
Copy link
Contributor

rslanka commented Feb 17, 2022

Looks great overall! Please address the minor comments if possible.

@sgomezvillamor
Copy link
Contributor Author

Thanks for the review @rslanka, addressed all your comments in 0ecf83d
Merging master and fixing conflicts...

@rslanka
Copy link
Contributor

rslanka commented Feb 17, 2022

@sgomezvillamor, this is good to go once the merge with master is pushed.

…tform-instance

# Conflicts:
#	metadata-ingestion/source_docs/glue.md
#	metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
Copy link
Contributor

@rslanka rslanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sgomezvillamor
Copy link
Contributor Author

@rslanka I have merged master and it should pass tests now. However, I noted that domains and containers were added and I have some doubts. If you have a look at the glue_mces_platform_instance_golden.json, the generated mcp for containers look as follows:

  {
    "auditHeader": null,
    "entityType": "container",
    "entityUrn": "urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "containerProperties",
    "aspect": {
      "value": "{\"customProperties\": {\"platform\": \"glue\", \"instance\": \"PROD\", \"database\": \"flights-database\"}, \"name\": \"flights-database\"}",
      "contentType": "application/json"
    },
    "systemMetadata": null
  },
  {
    "auditHeader": null,
    "entityType": "container",
    "entityUrn": "urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "dataPlatformInstance",
    "aspect": {
      "value": "{\"platform\": \"urn:li:dataPlatform:glue\"}",
      "contentType": "application/json"
    },
    "systemMetadata": null
  },

In the first one it calls my attention instance: PROD because we usually refer that as factory, otherwise it may be confused with the platform instance. But my main concern is: I should probably add the plaltformInstance to that customProperties or at least include it in the name, WDYT?
I understand this issue should be fixed here https://github.com/sgomezvillamor/datahub/blob/master/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L93 However not sure where that dataclasses came from, can you help me to understand that?

In the second, it clearly states the aspect is dataPlatformInstance however it just mentions glue. I need to fix to the platform instance itself.
In this case, the issue is here https://github.com/sgomezvillamor/datahub/blob/master/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L106 Again I need some help to connect the dots here 😅

@sgomezvillamor
Copy link
Contributor Author

Actually I have found some other examples were IMO instance is badly set as the env 🤔 https://github.com/sgomezvillamor/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/athena.py#L107-L109

@sgomezvillamor
Copy link
Contributor Author

Solving this

  {
    "auditHeader": null,
    "entityType": "container",
    "entityUrn": "urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba",
    "entityKeyAspect": null,
    "changeType": "UPSERT",
    "aspectName": "dataPlatformInstance",
    "aspect": {
      "value": "{\"platform\": \"urn:li:dataPlatform:glue\"}",
      "contentType": "application/json"
    },
    "systemMetadata": null
  },

so it includes the platform instance, requires to change this https://github.com/sgomezvillamor/datahub/blob/585aad1aac09516c9e23ab96bf1a751c0c35cdb8/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L99-L107 so make_dataplatform_instance_urn is used instead. Not sure if you want me to fix that in this PR because that will impact many other sources. Maybe it's better to address it in a different PR. WDYT?

@sgomezvillamor
Copy link
Contributor Author

I think this is ready to merge (as soon as all tests pass)

I can address this and this in separated PRs

@sgomezvillamor
Copy link
Contributor Author

Well... I couldn't wait 😅 and solved last two issues with 26ef3ba and 59050d4

So with this I'm done for today. I would appreciate if you could review @rslanka Thanks!

@sgomezvillamor
Copy link
Contributor Author

@rslanka Just noted your approval got dismissed with latest updates, so if you could review again ☺️

@sgomezvillamor
Copy link
Contributor Author

@shirshanka If you could please have a look at this 🙏

…tform-instance

# Conflicts:
#	metadata-ingestion/tests/integration/sql_server/mssql_mces_golden.json
entityType="container",
changeType=ChangeTypeClass.UPSERT,
entityUrn=f"{container_urn}",
# entityKeyAspect=ContainerKeyClass(guid=schema_container_key.guid()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete?

Copy link
Contributor

@rslanka rslanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

# entityKeyAspect=ContainerKeyClass(guid=schema_container_key.guid()),
aspectName="dataPlatformInstance",
aspect=DataPlatformInstance(
platform=f"{make_dataplatform_instance_urn(container_key.platform, container_key.instance)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be changed to:

aspect=DataPlatformInstance(
  platform=f"{make_data_platform_urn(...)}",  # as before
  instance=f"{make_data_platform_instance_urn(container_key..."
...

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small change.. and another "wait until other PR is green-lighted"

@sgomezvillamor
Copy link
Contributor Author

Thanks @shirshanka ! Once PR #4279 is merged I will solve the conflicts and address your comment 👍

# Conflicts:
#	metadata-ingestion/setup.py
#	metadata-ingestion/src/datahub/emitter/mcp_builder.py
#	metadata-ingestion/src/datahub/ingestion/source/sql/athena.py
#	metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py
#	metadata-ingestion/tests/integration/clickhouse/clickhouse_mces_golden.json
#	metadata-ingestion/tests/unit/test_mcp_builder.py
@sgomezvillamor
Copy link
Contributor Author

@shirshanka As agreed: #4279 got merged and I updated the PR. So this is ready for review again.
CC'ing @treff7es

@shirshanka shirshanka assigned treff7es and unassigned rslanka and jjoyce0510 Mar 18, 2022
@treff7es treff7es self-requested a review March 21, 2022 21:35
@treff7es treff7es requested a review from shirshanka March 22, 2022 09:05
Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@shirshanka shirshanka merged commit bdf17f5 into datahub-project:master Mar 31, 2022
@sgomezvillamor sgomezvillamor deleted the glue-platform-instance branch March 31, 2022 09:57
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants