After 52a17f1, cache entries in CachingHiveMetastore are keyed on the set of requested columns (previously, stats for all the columns were pulled from the metastore).
As a result, we may end up with more roundtrips to the metastore for a query that consults HiveMetastore multiple times for different sets of columns of a single table.
When communication with the metastore is costly, this causes a performance regression.
Edit: actually, the caching was already on a per-column basis before 52a17f1, since #16203. However, 52a17f1 changes the call pattern, so we sometimes observe more calls to CachingHiveMetastore. E.g. for the queries:
CREATE TABLE test_self_join_table AS SELECT 2 AS age, 0 parent, 3 AS id;
SELECT child.age, parent.age FROM test_self_join_table child JOIN test_self_join_table parent ON child.parent = parent.id;
Without knowing that a table is used multiple times in a query, the only solution would be to always load all columns. The problem with loading all columns is what happens when the table has 10k columns: now you are pulling a ton of data.
Maybe we can say something like: if the table has fewer than 50 columns, we fetch all of them. For reference, with Glue we batch load up to 100 column stats in one shot. The design is similar to what we do in ORC/Parquet, where if the file is < 8MB we just read the whole thing in one shot to avoid the extra IOs.
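To make the threshold idea concrete, here is a rough sketch of how it could behave. It is not the actual CachingHiveMetastore code: `ColumnStatsCacheSketch`, `StatsLoader`, `fetchAllThreshold`, and the string-valued "stats" are all hypothetical names and simplifications. Each `load` call stands in for one metastore roundtrip.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names come from the Trino code base.
class ColumnStatsCacheSketch {
    /** Hypothetical metastore call: every invocation counts as one roundtrip. */
    interface StatsLoader {
        Map<String, String> load(String table, Set<String> columns);
    }

    private final StatsLoader loader;
    private final Map<String, List<String>> tableColumns; // stands in for table metadata
    private final int fetchAllThreshold;                  // e.g. 50, as suggested above
    private final Map<String, Map<String, String>> cache = new HashMap<>();

    ColumnStatsCacheSketch(StatsLoader loader, Map<String, List<String>> tableColumns, int fetchAllThreshold) {
        this.loader = loader;
        this.tableColumns = tableColumns;
        this.fetchAllThreshold = fetchAllThreshold;
    }

    Map<String, String> getColumnStats(String table, Set<String> requested) {
        Map<String, String> cached = cache.computeIfAbsent(table, t -> new HashMap<>());
        Set<String> missing = new HashSet<>(requested);
        missing.removeAll(cached.keySet());
        if (!missing.isEmpty()) {
            List<String> allColumns = tableColumns.get(table);
            Set<String> toLoad;
            if (allColumns.size() < fetchAllThreshold) {
                // Narrow table: fetch every not-yet-cached column in one roundtrip.
                toLoad = new HashSet<>(allColumns);
                toLoad.removeAll(cached.keySet());
            }
            else {
                // Wide table: fetch only what this call asked for.
                toLoad = missing;
            }
            cached.putAll(loader.load(table, toLoad));
        }
        // Return only the requested columns, served from the cache.
        Map<String, String> result = new HashMap<>();
        for (String column : requested) {
            if (cached.containsKey(column)) {
                result.put(column, cached.get(column));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] roundtrips = {0};
        StatsLoader countingLoader = (table, columns) -> {
            roundtrips[0]++;
            Map<String, String> stats = new HashMap<>();
            for (String column : columns) {
                stats.put(column, "stats(" + column + ")");
            }
            return stats;
        };
        Map<String, List<String>> schema = Map.of("test_self_join_table", List.of("age", "parent", "id"));
        ColumnStatsCacheSketch cache = new ColumnStatsCacheSketch(countingLoader, schema, 50);
        // Two lookups for different column sets of the same table.
        cache.getColumnStats("test_self_join_table", Set.of("parent"));
        cache.getColumnStats("test_self_join_table", Set.of("id", "age"));
        System.out.println("roundtrips: " + roundtrips[0]); // prints "roundtrips: 1"
    }
}
```

With the threshold set to 50, the three-column table is loaded once and the second lookup is served from the cache. With a threshold at or below the column count, the same two lookups would each trigger their own load, which is the extra-roundtrip pattern described in this issue.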
cc: @dain @findepi