Get stats from hive metastore only for the necessary columns #16203

lukasz-stec
Member

@lukasz-stec lukasz-stec commented Feb 21, 2023

Description

HiveTableHandle contains the list of projected columns. Since only projected columns will be available to the engine, HiveMetadata.getTableStatistics can return statistics limited to those columns.
This also requires CachingHiveMetastore to support caching statistics for a subset of table columns.
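
To illustrate what caching statistics for a subset of columns could involve, here is a minimal Java sketch. It is not the CachingHiveMetastore code from this PR: the class `PartialColumnStatsCache`, its `loader` callback, and the use of a `String` as a stand-in for real column statistics are all assumptions made for illustration. The idea shown is that the cache tracks statistics per column, so a request for already cached columns needs no metastore call, and a request for a wider set loads only the missing columns.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical sketch; none of these names are taken from Trino
final class PartialColumnStatsCache
{
    // table name -> (column name -> cached statistics placeholder)
    private final Map<String, Map<String, String>> cachedByTable = new HashMap<>();
    // Hypothetical loader: fetches statistics from the metastore for exactly the given columns
    private final BiFunction<String, Set<String>, Map<String, String>> loader;

    PartialColumnStatsCache(BiFunction<String, Set<String>, Map<String, String>> loader)
    {
        this.loader = loader;
    }

    Map<String, String> getColumnStatistics(String table, Set<String> requestedColumns)
    {
        Map<String, String> cached = cachedByTable.computeIfAbsent(table, ignored -> new HashMap<>());

        // Only columns that are not cached yet need a metastore round trip
        Set<String> missing = new LinkedHashSet<>(requestedColumns);
        missing.removeAll(cached.keySet());
        if (!missing.isEmpty()) {
            cached.putAll(loader.apply(table, missing));
        }

        // Return statistics for the requested columns only
        Map<String, String> result = new HashMap<>();
        for (String column : requestedColumns) {
            if (cached.containsKey(column)) {
                result.put(column, cached.get(column));
            }
        }
        return result;
    }
}
```

A real cache would also need expiry (for example, honoring hive.metastore-stats-cache-ttl), thread safety, and a way to remember columns that have no statistics, all of which this sketch omits.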

Benchmarks

I tested the change against the Glue metastore using a table with 100 columns (details below), with stats caching disabled (hive.metastore-stats-cache-ttl=0s), and a simple query where planning dominates (select count(c1) from test_col_stats_part).

Planning time, and with it elapsed time, improves visibly: 2.251s on average for the baseline versus 1.853s when only the required stats are fetched, a reduction of roughly 18% (about 0.4s).

This would have a potentially higher impact on a metastore under heavy load.

The table used:

create table test_col_stats_part (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, c79, c80, c81, c82, c83, c84, c85, c86, c87, c88, c89, c90, c91, c92, c93, c94, c95, c96, c97, c98, c99, c100, shipdate)
 WITH (                                              
    format = 'ORC',                                  
    partitioned_by = ARRAY['shipdate']               
 )
 as select 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, shipdate from hive.tpch_sf10_dec_orc.lineitem;

I also ran the standard TPC-H/TPC-DS benchmarks at scale factor 1000 on partitioned ORC tables. The results show mostly no change, apart from some run-to-run variability.

image

sf1k-orc-part-240223.pdf

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve query planning performance for queries scanning tables with a large number of columns. ({issue}`16203`)

@cla-bot cla-bot bot added the cla-signed label Feb 21, 2023
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch 5 times, most recently from 7a75b11 to 29b2138 Compare February 23, 2023 10:46
@lukasz-stec lukasz-stec changed the title DRAFT Get stats from hive metastore only for the necessary columns Get stats from hive metastore only for the necessary columns Feb 23, 2023
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch 3 times, most recently from 5d763ff to ee3f5fb Compare February 23, 2023 23:36
@lukasz-stec lukasz-stec marked this pull request as ready for review February 24, 2023 07:49
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from ee3f5fb to 5cfb83d Compare February 28, 2023 11:52
Member Author

@lukasz-stec lukasz-stec left a comment

CA

@lukasz-stec lukasz-stec requested a review from Dith3r February 28, 2023 11:52
@github-actions github-actions bot added the hive Hive connector label Feb 28, 2023
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from 5cfb83d to fede049 Compare March 13, 2023 10:49
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from fede049 to d777bb7 Compare March 13, 2023 13:49
Member Author

@lukasz-stec lukasz-stec left a comment

ca

@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from d777bb7 to d954002 Compare March 13, 2023 15:00
@lukasz-stec
Member Author

Rebased on the latest master (the automatic rebase was causing compilation errors).

Before applyProjection is called or when projection pushdown is disabled,
we should assume that all table columns are to be projected for populating
HiveTableHandle#projectedColumns
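
The default described in this commit message can be sketched in Java as follows. This is an illustration only, not the HiveTableHandle code: the class `TableHandleSketch`, its methods, and the use of plain column names instead of column handles are assumptions. A freshly created handle treats every table column as projected, and only an explicit projection narrows the set.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a table handle; names are not taken from Trino
final class TableHandleSketch
{
    private final List<String> allColumns;
    private final Set<String> projectedColumns;

    // Before any projection is applied, every table column counts as projected
    TableHandleSketch(List<String> allColumns)
    {
        this(allColumns, new LinkedHashSet<>(allColumns));
    }

    private TableHandleSketch(List<String> allColumns, Set<String> projectedColumns)
    {
        this.allColumns = List.copyOf(allColumns);
        this.projectedColumns = Collections.unmodifiableSet(new LinkedHashSet<>(projectedColumns));
    }

    // Models the effect of a projection pushdown: a new handle with a narrower column set
    TableHandleSketch withProjectedColumns(Set<String> columns)
    {
        return new TableHandleSketch(allColumns, columns);
    }

    Set<String> getProjectedColumns()
    {
        return projectedColumns;
    }
}
```

With this default, code that requests statistics for the projected columns behaves as before whenever projection pushdown is disabled, because the projected set is simply the full column list.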
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from d954002 to 7341c8f Compare March 14, 2023 08:00
Member Author

@lukasz-stec lukasz-stec left a comment

ca

HiveTableHandle contains the list of projected columns. Since only
projected columns will be available to the engine,
HiveMetadata#getTableStatistics can return statistics limited to those
columns. This also requires CachingHiveMetastore to support caching
statistics for a subset of table columns.
@lukasz-stec lukasz-stec force-pushed the ls/067-hive-only-necessary-columns-stats branch from 7341c8f to 6280a6d Compare March 14, 2023 08:44
List<Column> requestedColumns = table.getDataColumns().stream()
        .filter(column -> requestedColumnNames.contains(column.getName()))
        .collect(toImmutableList());
table = Table.builder(table).setDataColumns(requestedColumns).build();
Member

@findepi findepi Mar 23, 2023

We're building a fake table object, hoping the callee will trust the table and not, e.g., use knowledge about the column list from some other place.

That's an opaque design; something explicit would probably be better.
Why not pass Set<String> requestedColumns (or an Optional thereof) down the call chain?

Member Author

> Why not pass Set<String> requestedColumns (or an Optional thereof) down the call chain?

That's an excellent idea. It also applies to getPartitionStatistics, right?

Member

potentially yes
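
For readers following this thread, the explicit alternative could look roughly like the Java sketch below. It is a hedged illustration, not code from the PR: the class `RequestedColumnsSketch`, the method `selectStatistics`, and the `Long` placeholder for column statistics are assumptions. The requested columns travel as an explicit `Optional<Set<String>>` argument, where an empty `Optional` means "all columns", instead of being encoded in a rebuilt table object.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; names and types are not taken from Trino
final class RequestedColumnsSketch
{
    private RequestedColumnsSketch() {}

    // availableStatistics: column name -> statistics placeholder
    // requestedColumns: explicit column filter; Optional.empty() means every column
    static Map<String, Long> selectStatistics(
            Map<String, Long> availableStatistics,
            Optional<Set<String>> requestedColumns)
    {
        Map<String, Long> result = new LinkedHashMap<>();
        availableStatistics.forEach((column, statistics) -> {
            boolean wanted = requestedColumns
                    .map(requested -> requested.contains(column))
                    .orElse(true);
            if (wanted) {
                result.put(column, statistics);
            }
        });
        return result;
    }
}
```

Because the filter is a parameter, each method in the call chain states in its signature which columns it was asked about, and the table object keeps its real column list.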
