CachingHiveMetastore.getTableColumnStatistics not effective for some queries #21081

Open
losipiuk opened this issue Mar 14, 2024 · 1 comment

Comments

Member

losipiuk commented Mar 14, 2024

After 52a17f1 we key cache entries in CachingHiveMetastore on the set of columns (previously, stats for all columns were pulled from the metastore).
As a result, we may end up with more round trips to the metastore for a query that consults HiveMetastore multiple times for different sets of columns of a single table.
When communication with the metastore is costly, this causes a performance regression.

Edit: actually, the caching was already on a per-column basis before 52a17f1, since #16203. However, 52a17f1 changes the call pattern, so we sometimes observe more calls to CachingHiveMetastore. E.g. for the query:

    CREATE TABLE test_self_join_table AS SELECT 2 AS age, 0 AS parent, 3 AS id;
    SELECT child.age, parent.age FROM test_self_join_table child JOIN test_self_join_table parent ON child.parent = parent.id;
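
To illustrate why separate per-column requests can cost extra round trips, here is a toy model of a per-column stats cache. It is only a sketch and not Trino's `CachingHiveMetastore`: the class `PerColumnStatsCacheSketch`, its methods, and the assumption that the two sides of the self-join ask for `{age, parent}` and `{age, id}` respectively are all hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical names): a cache keyed per column that
// fetches whatever is missing with one simulated metastore call.
class PerColumnStatsCacheSketch {
    private final Map<String, String> cache = new HashMap<>(); // column -> stats placeholder
    private int roundTrips;

    // Returns stats for the requested columns, fetching only uncached ones.
    Map<String, String> getColumnStatistics(Set<String> columns) {
        Set<String> missing = new LinkedHashSet<>(columns);
        missing.removeAll(cache.keySet());
        if (!missing.isEmpty()) {
            roundTrips++; // one simulated round trip per batch of missing columns
            for (String column : missing) {
                cache.put(column, "stats(" + column + ")");
            }
        }
        Map<String, String> result = new HashMap<>();
        for (String column : columns) {
            result.put(column, cache.get(column));
        }
        return result;
    }

    // Replays a sequence of requests against a fresh cache and counts round trips.
    static int countRoundTrips(List<Set<String>> requests) {
        PerColumnStatsCacheSketch sketch = new PerColumnStatsCacheSketch();
        for (Set<String> request : requests) {
            sketch.getColumnStatistics(request);
        }
        return sketch.roundTrips;
    }

    public static void main(String[] args) {
        // Assumed call pattern: each side of the join asks for its own columns.
        System.out.println(countRoundTrips(List.of(Set.of("age", "parent"), Set.of("age", "id"))));
        // Alternative: the first call loads every column of the table.
        System.out.println(countRoundTrips(List.of(Set.of("age", "parent", "id"), Set.of("age", "id"))));
    }
}
```

In this model the second request misses on `id` even though `age` is already cached, so it still pays a round trip; loading all three columns up front avoids it.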

cc: @dain @findepi

Member

dain commented Mar 14, 2024

Without knowing that a table is used multiple times in a query, the only solution would be to always load all columns. The problem with loading all columns is: what happens if the table has 10k columns and you are now pulling a ton of data?

Maybe we can say something like: if the table has fewer than 50 columns, we fetch all of them. For reference, with Glue we batch load up to 100 column stats in one shot. The design is similar to what we do in ORC/Parquet, where if the file is < 8MB we just read the whole thing in one shot to avoid the extra IOs.
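
A minimal sketch of that heuristic, assuming the full column list of the table is already known when stats are requested. The class `StatsFetchPlanSketch`, the method `columnsToFetch`, and the constant are hypothetical names for illustration, not existing Trino code; the value 50 is just the number floated above.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: decides which column stats to request from the metastore.
class StatsFetchPlanSketch {
    // Assumed threshold, taken from the "fewer than 50 columns" suggestion.
    static final int FETCH_ALL_THRESHOLD = 50;

    static Set<String> columnsToFetch(List<String> allTableColumns, Set<String> requested) {
        if (allTableColumns.size() < FETCH_ALL_THRESHOLD) {
            // Narrow table: load stats for every column in one shot so later
            // requests for other columns of the same table hit the cache.
            return new LinkedHashSet<>(allTableColumns);
        }
        // Wide table: only load what was asked for to avoid pulling a ton of data.
        return new LinkedHashSet<>(requested);
    }
}
```

For the three-column `test_self_join_table`, a request for `age` alone would load `age`, `parent`, and `id` together, while a 10k-column table would still load only the requested columns.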
