Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support orm in databuilder #10

Merged
merged 4 commits into from
Dec 3, 2020
Merged

Support orm in databuilder #10

merged 4 commits into from
Dec 3, 2020

Conversation

xuan616
Copy link
Member

@xuan616 xuan616 commented Oct 26, 2020

No description provided.

@xuan616 xuan616 requested a review from a team as a code owner October 26, 2020 17:02
@@ -0,0 +1,85 @@
- Feature Name: Support ORM in databuilder
- Start Date: 2020-10-26
- RFC PR: [amundsen-io/rfcs#0000](https://github.com/amundsen-io/rfcs/pull/0000) (after opening the RFC PR, update this with a link to it and update the file name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- RFC PR: [amundsen-io/rfcs#0000](https://github.com/amundsen-io/rfcs/pull/0000) (after opening the RFC PR, update this with a link to it and update the file name)
- RFC PR: [amundsen-io/rfcs#0010](https://github.com/amundsen-io/rfcs/pull/10)

@feng-tao
Copy link
Member

feng-tao commented Nov 4, 2020

@crazy-2020 Is the rfc ready to take a look (e.g how's ORM model look like compared to the graph model)? If so, I will take another look. Sorry for the late.

@xuan616
Copy link
Member Author

xuan616 commented Nov 4, 2020

Thanks @feng-tao yes, it is ready for review. Please let me know if you have any concern.

Copy link

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good draft! Will wait for Amundsen folks to weigh in.

- RFC PR: [amundsen-io/rfcs#0010](https://github.com/amundsen-io/rfcs/pull/10)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# <RFC title>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: needs a title

@xuan616 xuan616 changed the title Initial proposal to support orm in databuilder Support orm in databuilder Nov 6, 2020
Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also to make development easier, it would be good to put the ORM/alembic into a separate repo like amundsen-io/amundsenrds (name could be you decided)? I could add a team for you to have access.

N/A
## Reference-level Explanation (aka Technical Details)

- Still use the existing models but add model translators to support ORM schema definition and map node/relation data to the record that can be better consumed by relational DBs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI amundsen-io/amundsendatabuilder#380 which makes databuilder record more generic is just merged


- metric(metric_key, metric_name, metric_description_key, metric_description, metric_type_key, metric_type_name, dashboard_key, dashboard_name)

- table(table_key, database_key, database_name, cluster_key, cluster_name, schema_key, schema_name, schema_description_key, schema_description, schema_programmatic_description_key, schema_programmatic_description, table_name, table_is_view, table_description_key, table_description, table_programmatic_description_key, table_programmatic_description, last_updated_key, last_updated_timestamp, last_updated_timestamp_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the table single table with that schema? what is the primary key? also some of the fields may be gathered from other sources e.g programmatic descriptions?

I wonder how do you get all the attributes in single extraction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative is that instead of having a denormalized schema like this. We could have the same schema as graph model:
E.G table model will only have https://github.com/amundsen-io/amundsendatabuilder/blob/d91c0c562c6e560dbc2a687e5dfb0d4cbc4a1c88/databuilder/models/table_metadata.py#L408-L422 which the model stays the same as graph model.

Pro side: 1) less change for databuilder; We build PK on label like what we have in graph model, we put FK relationship when serde the graph relationship. 2) Easier to implement
Con side: it will have multiple joins with less performance. I think it should be ok given we only need all the metadata for single table / single entity every time.

Copy link
Member Author

@xuan616 xuan616 Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_key is the pk of 'table'. Metadata from multiple extraction will update table.
I agree with you on keeping orm model same as graph model for those Pros if the use case is only about a single entity every time. I will update the orm schema accordingly in rfc.

Do you mean treating each relationship as an entity to make implementation easier?(e.g., table(key, name, is_view), schema_table(schema_key, table_key), schema(key, name))? Thanks @feng-tao

Copy link
Member

@feng-tao feng-tao Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For relationship, I am thinking of just solving it with FK instead of a separate table.


- tag(tag_key, tag_type, entity_key)

- lineage(upstream_key, downstream_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't worry about this one for now as it is still being updated.


- badge(badge_key, badge_category, entity_key)

- Due to the reuse of graph db models, add new transformers for entities to convert 'node' and 'relationship' data to 'record' with model translators.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will you convert the graph relationship into FK?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will add FK from relationship

## Unresolved questions

- The initial relational DB schema plan was made according to the current graph DB models and it may need some polishment from Amundsen staff who are more familiar with real use cases.
- Regarding how to map graph db models to ORM models, it would be great if there is a dynamic way for the model conversion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, but aside from performance, some of the metadata can't be fetched/gathered in single extraction as they could come from different sources.

I am mostly curious on how to get all those info into single table row with just one extraction.

@feng-tao
Copy link
Member

feng-tao commented Nov 7, 2020

also the main unknown questions would be how do you gather all the metadata just with single extraction for the ORM model as it is pretty denormalized. Or what will be the PK for model? and how do you update / cleanup the tables.

@feng-tao
Copy link
Member

feng-tao commented Nov 7, 2020

another question is that will you run the db upgrade/db migration in metadata proxy?

@xuan616
Copy link
Member Author

xuan616 commented Nov 9, 2020

@feng-tao Thanks for view. I will update the orm model to make it same as the existing graph db models and it will be with marked PKs. For db migration, may I know the pro of running db upgrade in metadata proxy? or should it be in databuilder or the coming separate orm repo?

@feng-tao
Copy link
Member

normally you bootstrap the db upgrade in a service (e..g metadata) bootstrap stage while databuilder is used mostly as a library in Airflow or any other workflow ETL orchestration . I'am thinking of the ORM db repo as a library which we could reference it in both databuilder and metadata.

@feng-tao Thanks for view. I will update the orm model to make it same as the existing graph db models and it will be with marked PKs. For db migration, may I know the pro of running db upgrade in metadata proxy? or should it be in databuilder or the coming separate orm repo?

@feng-tao
Copy link
Member

feng-tao commented Nov 10, 2020

Based on offline discussion, we may also want to consider how to clean up the db ORM tables instead of pure upsert to handle stale metadata.

@feng-tao
Copy link
Member

https://github.com/amundsen-io/amundsenrds is created

Copy link
Member

@Golodhros Golodhros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Golodhros Golodhros merged commit b327e7d into amundsen-io:master Dec 3, 2020
@dorianj
Copy link
Contributor

dorianj commented Dec 3, 2020

Am I correct in reading that the overall intention is to support RDBMS's as a backend alongside neo4j, Atlas, and Neptune?

If so, is it possible to start RFCing the changes needed for the rest of the systems to support such a world, before we get deep into implementation on this particular portion? The databuilder piece is a totally reasonable way to start, but I think some discussion on how this fits into the architecture, and what commitments we're making by doing so, would be worth considering.

WDYT?

@feng-tao
Copy link
Member

feng-tao commented Dec 3, 2020

@Golodhros I think @crazy-2020 is still iterating the RFC, but in high level, we shouldn't merge it until it is finalized.

@xuan616
Copy link
Member Author

xuan616 commented Dec 3, 2020

yeah, I am still working on proving the viability of possible changes in databuilder to support mysql as backend store. According to the comments in this PR, I am making orm models similar to the existing graph db ones to make the implementation easier. Since the generic graph db relationship/node/serializer was recently merged, I will change the implementation details accordingly. I am going to add RFCs for different modules(e.g., orm models, new serializer/dataloader/publisher). @Golodhros Could you help undo the merge, or you can remove the merged file and I will add
a new RFC for orm models in next PR. No worries, I am okay with either way. Thanks.

@Golodhros
Copy link
Member

@Golodhros I think @crazy-2020 is still iterating the RFC, but in high level, we shouldn't merge it until it is finalized.

Oh, I see, my misunderstanding as I saw it as "Active".
I will revert it and reopen.

Golodhros pushed a commit that referenced this pull request Dec 3, 2020
Golodhros pushed a commit that referenced this pull request Dec 3, 2020
This reverts commit b327e7d.

Signed-off-by: Marcos Iglesias <[email protected]>
Golodhros added a commit that referenced this pull request Dec 3, 2020
This reverts commit b327e7d.

Signed-off-by: Marcos Iglesias <[email protected]>
@Golodhros
Copy link
Member

Sorry @crazy-2020, I have reverted the commit but I don't know how to restore this PR keeping the comments. I guess you would need to create a new PR. Sorry for the inconvenience!

@xuan616
Copy link
Member Author

xuan616 commented Dec 3, 2020

Sorry @crazy-2020, I have reverted the commit but I don't know how to restore this PR keeping the comments. I guess you would need to create a new PR. Sorry for the inconvenience!

Np, I can add a new one

allisonsuarez pushed a commit that referenced this pull request Feb 23, 2021
* Initial proposal to support orm in databuilder

* Add owner and usage in the schema section

* Add dashboard and badge in the schema section

* Fix some typos

Signed-off-by: Allison Suarez Miranda <[email protected]>
allisonsuarez pushed a commit that referenced this pull request Feb 23, 2021
This reverts commit b327e7d.

Signed-off-by: Marcos Iglesias <[email protected]>
Signed-off-by: Allison Suarez Miranda <[email protected]>
allisonsuarez pushed a commit that referenced this pull request May 5, 2021
* Initial proposal to support orm in databuilder

* Add owner and usage in the schema section

* Add dashboard and badge in the schema section

* Fix some typos

Signed-off-by: Allison Suarez Miranda <[email protected]>
allisonsuarez pushed a commit that referenced this pull request May 5, 2021
This reverts commit b327e7d.

Signed-off-by: Marcos Iglesias <[email protected]>
Signed-off-by: Allison Suarez Miranda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants