Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metadata query functionality #122

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

buggythepirate
Copy link

@buggythepirate buggythepirate commented Mar 14, 2022

Extends #111

Purpose:

  • Querying metadata from delta-sharing directly without using the rest_client.py
  • Keeping Table origin information in metadata query results (e.g. for processing in data catalogs, etc.)

Implemented:

  1. 2 functions in Sharing client (list metadata in table, list metadata for all tables)
  2. Changes to rest client classes (QueryTableMetadataResponse, QueryTableVersionResponse, ListFilesInTableResponse) in order to have table origin information in the responses
  3. updates to tests.

Signed-off-by: Stefan Salzl [email protected]

@buggythepirate
Copy link
Author

buggythepirate commented Mar 23, 2022

Hi @zsxwing , what do think of these changes to delta-sharing? Any feedback?

@zsxwing
Copy link
Member

zsxwing commented Mar 24, 2022

Thanks for the contribution. Adding a new metadata method looks good. Could you elaborate this Keeping Table origin information in metadata query results (e.g. for processing in data catalogs, etc.)? If you pass the table object into the method, you should already know it?

@buggythepirate
Copy link
Author

buggythepirate commented Mar 25, 2022

@zsxwing
Sorry English is not my first language, so I might not get my point across exactly. But I'll try 😭

TL;DR:
My reasoning for adding the table class to the return value is:

  • A nice return value contains all the information needed for table metadata and origin
  • For a user of the library the convenience function query_all_table_metadata hides details away.

Long version:
When querying for the metadata, I would like to return a complete class that contains all necessary information for pushing the metadata to a data catalog, i.e.

  • Table origin (i.e. the Table class (containing Table name, schema, share)
  • Metadata information

So taking the extended return class, I can immediately push the information to a data catalog, without creating a object combining Table origin & metadata information. Everything is already in place.

This is especially important for the query_all_table_metadata function, where you don't query for a specific table but you ask for all tables. If you do not return the table in the response, you lose the table origin information. Then you either stuck with doing something like

  • querytablesmetadata = [self.query_table_metadata(table=table) for table in tables] or
  • running list_all_tables after running query_all_table_metadata

Thus, the query_all_table_metadata function would be completely pointless.

P.S: I have also added a small convenience function for querying version history as well.
P.P.S: If you are ok with the changes, would you mind releasing new version to PyPi, I would like to use the delta-sharing library in another open source project for ingesting metadata.

@buggythepirate
Copy link
Author

Hi @zsxwing

Did you have time to look at my answer? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants