Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch all table/view names across schemas with one Hive metastore call (when Hive views enabled) #17127

Merged

Conversation

huberty89
Copy link
Contributor

Description

This PR replaces #16847 but drops last commits because of a regression - I will try to reintroduce that with a fix in separate PR.

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 19, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels Apr 19, 2023
@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from c356de6 to b6c1819 Compare April 20, 2023 08:13
@huberty89 huberty89 requested a review from atanasenko April 20, 2023 10:04
@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch 4 times, most recently from de9770a to 6c27aa7 Compare April 24, 2023 08:10
@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from 6c27aa7 to 049acd8 Compare April 24, 2023 11:36
@huberty89
Copy link
Contributor Author

@findepi Could you take a look?

@findepi
Copy link
Member

findepi commented Apr 25, 2023

discussed offline

@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch 2 times, most recently from de05a3b to 143d40c Compare May 2, 2023 15:19
@huberty89
Copy link
Contributor Author

I feel I need to run TestHiveMetastoreMetadataQueriesAccessOperations for both cases:

  • HiveMetastore with implemented getAllTables() and getAllViews() - so this will cover ThriftMetastore
  • the backward compatible test - so those changes does not brake/slowdown for example GlueHiveMetastore

WDYT? @kokosing @Praveen2112 @findepi

@huberty89
Copy link
Contributor Author

Discussed offline

@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from 143d40c to d3c6433 Compare May 4, 2023 11:11
Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Need to run another round of review on TestHiveMetastoreMetadataQueriesAccessOperations

Comment on lines 65 to 68
/**
* @return List of table names from all schemas or Optional.empty if operation is not supported
*/
Optional<List<SchemaTableName>> getAllTables();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there is any specific reason for placing this below List<String> getAllTables(String databaseName); can we swap the places ?

@@ -1117,6 +1145,70 @@ public void testDropTable()
assertEquals(mockClient.getAccessCount(), 5);
}

@Test
public void testAllDatabases()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract it as a different commit.

@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from d3c6433 to e5f6141 Compare May 5, 2023 09:09
List<String> getTablesWithParameter(String databaseName, String parameterKey, String parameterValue);

List<String> getAllViews(String databaseName);

/**
* @return List of view names from all schemas or Optional.empty if operation is not supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this include materialized views? let's document

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines +890 to +894
// Without translateHiveViews, Hive views are represented as tables in Trino,
// and they should not be returned from ThriftHiveMetastore.getAllViews() call
if (!translateHiveViews) {
return Optional.empty();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could remove this IF and adjust the code elsewhere.
We should follow-up on this.

@findepi
Copy link
Member

findepi commented May 5, 2023

the fact that TestHiveMetastoreMetadataQueriesAccessOperations.java is refactored in the commit introducing the improvement makes it hard to see what is the effect of the improvement on the invocation counts.

let's maybe introduce the dataprovider in TestHiveMetastoreMetadataQueriesAccessOperations.java as a prep commit (with some TODO / commit comment saying this is just a preparation step)

@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch 2 times, most recently from eb807cc to 934fd05 Compare May 9, 2023 16:52
@huberty89
Copy link
Contributor Author

@findepi discussed offline: about filtering material views from views - this PR is only about performance improvement, I would not change current behavior even if it not correct - we can do it as a followup
cc @kokosing

@huberty89 huberty89 force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from 934fd05 to d15cdae Compare May 11, 2023 12:33
@findepi findepi force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from d15cdae to 9282667 Compare May 11, 2023 19:43
Hubert Łojek and others added 4 commits May 11, 2023 21:44
So that they don't clash with subsequent commits new variables

Co-authored-by: skrzypo987 <[email protected]>
Introduce trueFalse data provider so improvement will be easy to see.
Big calls listing metadata, like `SELECT * FROM information_schema.tables`, are
common in BI tools, especially at the start. Trino then fetches all schemas
and iterated over them to list tables and views. With n schemas it has to make
2*n calls to the underlying metastore. For a large number of schemas this may
take several minutes if metastore communication overhead is big.

This commit introduces a way to fetch all this metadata in one call, using
`getTableMeta` thrift call. Other metastore implementations do not support
this feature. This thrift call is hidden behind an alternative call.

HiveMetastore interface now features two new methods - for fetching all tables
and all views. The `CachingHiveMetastore` has two additional singleton cache
objects containing full list of tables and views.

Co-authored-by: skrzypo987 <[email protected]>
@findepi findepi force-pushed the huberty89/batch-get-table-in-thrift-metastore branch from 9282667 to b44a4a5 Compare May 11, 2023 19:44
@findepi findepi merged commit d585eee into trinodb:master May 12, 2023
@github-actions github-actions bot added this to the 418 milestone May 12, 2023
@mosabua
Copy link
Member

mosabua commented May 17, 2023

@huberty89 and @findepi we had to invent our own release notes entry since none was suggested in the PR description or the release notes ticket or the release notes PR. Please provide more clarity in future PRs. We ended up with

## Hive connector

* Improve performance of querying `information_schema.tables` when using the
  Hive metastore. ({issue}`17127`)

However we are not clear if this applies to all other connectors that use the Hive metastore and therefore need to add this entry to Delta Lake, Iceberg and Hudi connector sections.

And we are also not sure about your mention about "when Hive views enabled" .. does that limit the scope to only the Hive connector after all?

cc @martint @colebow

@huberty89
Copy link
Contributor Author

huberty89 commented May 18, 2023

@mosabua

However we are not clear if this applies to all other connectors that use the Hive metastore and therefore need to add this entry to Delta Lake, Iceberg and Hudi connector sections.

This optimization applies in Hive with HMS, for the rest of connectors which uses HMS I need to verify if further work is needed.

Comment on lines +890 to +894
// Without translateHiveViews, Hive views are represented as tables in Trino,
// and they should not be returned from ThriftHiveMetastore.getAllViews() call
if (!translateHiveViews) {
return Optional.empty();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the impl of this method differs from getAllViews(String databaseName). that one seems to return tables with PRESTO_VIEW_FLAG set to true when translation is disabled instead of empty result.

was this intentional? Or am I misreading the code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the initial implementation was supposed to work only when translateHiveViews or when !translateHiveViews (not sure which one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

6 participants