-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support orm in databuilder #10
Conversation
000-orm-in-databuilder.md
Outdated
@@ -0,0 +1,85 @@ | |||
- Feature Name: Support ORM in databuilder | |||
- Start Date: 2020-10-26 | |||
- RFC PR: [amundsen-io/rfcs#0000](https://github.com/amundsen-io/rfcs/pull/0000) (after opening the RFC PR, update this with a link to it and update the file name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- RFC PR: [amundsen-io/rfcs#0000](https://github.com/amundsen-io/rfcs/pull/0000) (after opening the RFC PR, update this with a link to it and update the file name) | |
- RFC PR: [amundsen-io/rfcs#0010](https://github.com/amundsen-io/rfcs/pull/10) |
@crazy-2020 Is the rfc ready to take a look (e.g how's ORM model look like compared to the graph model)? If so, I will take another look. Sorry for the late. |
Thanks @feng-tao yes, it is ready for review. Please let me know if you have any concern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good draft! Will wait for Amundsen folks to weigh in.
000-orm-in-databuilder.md
Outdated
- RFC PR: [amundsen-io/rfcs#0010](https://github.com/amundsen-io/rfcs/pull/10) | ||
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now) | ||
|
||
# <RFC title> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: needs a title
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also to make development easier, it would be good to put the ORM/alembic into a separate repo like amundsen-io/amundsenrds (name could be you decided)? I could add a team for you to have access.
N/A | ||
## Reference-level Explanation (aka Technical Details) | ||
|
||
- Still use the existing models but add model translators to support ORM schema definition and map node/relation data to the record that can be better consumed by relational DBs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI amundsen-io/amundsendatabuilder#380 which makes databuilder record more generic is just merged
|
||
- metric(metric_key, metric_name, metric_description_key, metric_description, metric_type_key, metric_type_name, dashboard_key, dashboard_name) | ||
|
||
- table(table_key, database_key, database_name, cluster_key, cluster_name, schema_key, schema_name, schema_description_key, schema_description, schema_programmatic_description_key, schema_programmatic_description, table_name, table_is_view, table_description_key, table_description, table_programmatic_description_key, table_programmatic_description, last_updated_key, last_updated_timestamp, last_updated_timestamp_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the table
single table with that schema? what is the primary key? also some of the fields may be gathered from other sources e.g programmatic descriptions?
I wonder how do you get all the attributes in single extraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative is that instead of having a denormalized schema like this. We could have the same schema as graph model:
E.G table model will only have https://github.com/amundsen-io/amundsendatabuilder/blob/d91c0c562c6e560dbc2a687e5dfb0d4cbc4a1c88/databuilder/models/table_metadata.py#L408-L422 which the model stays the same as graph model.
Pro side: 1) less change for databuilder; We build PK on label like what we have in graph model, we put FK relationship when serde the graph relationship. 2) Easier to implement
Con side: it will have multiple joins with less performance. I think it should be ok given we only need all the metadata for single table / single entity every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table_key is the pk of 'table'. Metadata from multiple extraction will update table.
I agree with you on keeping orm model same as graph model for those Pros if the use case is only about a single entity every time. I will update the orm schema accordingly in rfc.
Do you mean treating each relationship as an entity to make implementation easier?(e.g., table(key, name, is_view), schema_table(schema_key, table_key), schema(key, name))? Thanks @feng-tao
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For relationship, I am thinking of just solving it with FK instead of a separate table.
|
||
- tag(tag_key, tag_type, entity_key) | ||
|
||
- lineage(upstream_key, downstream_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't worry about this one for now as it is still being updated.
|
||
- badge(badge_key, badge_category, entity_key) | ||
|
||
- Due to the reuse of graph db models, add new transformers for entities to convert 'node' and 'relationship' data to 'record' with model translators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will you convert the graph relationship into FK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, will add FK from relationship
## Unresolved questions | ||
|
||
- The initial relational DB schema plan was made according to the current graph DB models and it may need some polishment from Amundsen staff who are more familiar with real use cases. | ||
- Regarding how to map graph db models to ORM models, it would be great if there is a dynamic way for the model conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, but aside from performance, some of the metadata can't be fetched/gathered in single extraction as they could come from different sources.
I am mostly curious on how to get all those info into single table row with just one extraction.
also the main unknown questions would be how do you gather all the metadata just with single extraction for the ORM model as it is pretty denormalized. Or what will be the PK for model? and how do you update / cleanup the tables. |
another question is that will you run the db upgrade/db migration in metadata proxy? |
@feng-tao Thanks for view. I will update the orm model to make it same as the existing graph db models and it will be with marked PKs. For db migration, may I know the pro of running db upgrade in metadata proxy? or should it be in databuilder or the coming separate orm repo? |
normally you bootstrap the db upgrade in a service (e..g metadata) bootstrap stage while databuilder is used mostly as a library in Airflow or any other workflow ETL orchestration . I'am thinking of the ORM db repo as a library which we could reference it in both databuilder and metadata.
|
Based on offline discussion, we may also want to consider how to clean up the db ORM tables instead of pure upsert to handle stale metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Am I correct in reading that the overall intention is to support RDBMS's as a backend alongside neo4j, Atlas, and Neptune? If so, is it possible to start RFCing the changes needed for the rest of the systems to support such a world, before we get deep into implementation on this particular portion? The databuilder piece is a totally reasonable way to start, but I think some discussion on how this fits into the architecture, and what commitments we're making by doing so, would be worth considering. WDYT? |
@Golodhros I think @crazy-2020 is still iterating the RFC, but in high level, we shouldn't merge it until it is finalized. |
yeah, I am still working on proving the viability of possible changes in databuilder to support mysql as backend store. According to the comments in this PR, I am making orm models similar to the existing graph db ones to make the implementation easier. Since the generic graph db relationship/node/serializer was recently merged, I will change the implementation details accordingly. I am going to add RFCs for different modules(e.g., orm models, new serializer/dataloader/publisher). @Golodhros Could you help undo the merge, or you can remove the merged file and I will add |
Oh, I see, my misunderstanding as I saw it as "Active". |
This reverts commit b327e7d.
This reverts commit b327e7d. Signed-off-by: Marcos Iglesias <[email protected]>
This reverts commit b327e7d. Signed-off-by: Marcos Iglesias <[email protected]>
Sorry @crazy-2020, I have reverted the commit but I don't know how to restore this PR keeping the comments. I guess you would need to create a new PR. Sorry for the inconvenience! |
Np, I can add a new one |
* Initial proposal to support orm in databuilder * Add owner and usage in the schema section * Add dashboard and badge in the schema section * Fix some typos Signed-off-by: Allison Suarez Miranda <[email protected]>
This reverts commit b327e7d. Signed-off-by: Marcos Iglesias <[email protected]> Signed-off-by: Allison Suarez Miranda <[email protected]>
* Initial proposal to support orm in databuilder * Add owner and usage in the schema section * Add dashboard and badge in the schema section * Fix some typos Signed-off-by: Allison Suarez Miranda <[email protected]>
This reverts commit b327e7d. Signed-off-by: Marcos Iglesias <[email protected]> Signed-off-by: Allison Suarez Miranda <[email protected]>
No description provided.