Read spark generated statistics in hive connector #16120

Merged: 2 commits into trinodb:master from ke/spark-stats on Feb 28, 2023

Conversation

Dith3r
Member

@Dith3r Dith3r commented Feb 15, 2023

Description

Use Spark-generated statistics as a fallback when Hive statistics are missing. This helps in environments where table statistics are already generated by Apache Spark and users either don't want to re-run statistics generation through ANALYZE in Trino or don't want to give Trino write permissions on HMS.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Use table statistics generated by Apache Spark when statistics have not been generated by the Trino hive connector during cost based optimization of queries. The catalog configuration property `hive.metastore.thrift.use-spark-table-statistics-fallback` can be set to `false` to disable this feature. ({issue}`16120`)
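
As a minimal sketch of how the property named in the release note could be set, a Hive catalog file might look like the fragment below. The file name `etc/catalog/hive.properties` and the metastore URI are placeholders assumed for illustration; only the last property comes from this PR.

```properties
# etc/catalog/hive.properties (hypothetical file name and metastore host)
connector.name=hive
hive.metastore.uri=thrift://metastore.example.net:9083
# Ignore Spark-written statistics and rely on Hive/Trino statistics only
hive.metastore.thrift.use-spark-table-statistics-fallback=false
```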

@cla-bot cla-bot bot added the cla-signed label Feb 15, 2023
@Dith3r Dith3r force-pushed the ke/spark-stats branch 3 times, most recently from c272e12 to 2e8cd62 Compare February 16, 2023 08:54
@Dith3r Dith3r marked this pull request as ready for review February 16, 2023 08:54
@Dith3r Dith3r force-pushed the ke/spark-stats branch 3 times, most recently from 68fa9f2 to a3a91f6 Compare February 17, 2023 12:29
@findinpath findinpath added the needs-docs This pull request requires changes to the documentation label Feb 17, 2023
@findinpath
Contributor

Are there potential downsides in adding this fallback functionality?

cc: @raunaqmorarka

@Dith3r Dith3r force-pushed the ke/spark-stats branch 2 times, most recently from 60cf4ae to 8fb947d Compare February 17, 2023 14:33
@raunaqmorarka
Member

> Are there potential downsides in adding this fallback functionality?
>
> cc: @raunaqmorarka

We don't make any extra metadata calls, and the parsing logic is very simple. The only downside is that the Spark SQL generated stats might be wrong or incomplete. In that case, the new flag can be used to disable the fallback, or the usual Trino/Hive statistics can be generated so that the fallback is never needed.
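
To illustrate why no extra metadata call is needed, here is a rough Java sketch of a row-count fallback that reads only the table parameters already returned with the table. It is not the code merged in this PR: the class and method names are hypothetical, and the parameter keys (`numRows` for Hive/Trino, `spark.sql.statistics.numRows` for Spark SQL) are assumptions about how each engine stores its statistics in the metastore.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper for illustration only; not the class added by this PR.
final class SparkStatsFallbackSketch
{
    // Assumed parameter keys: the first is written by Hive/Trino,
    // the second by Spark SQL's ANALYZE TABLE.
    static final String HIVE_NUM_ROWS = "numRows";
    static final String SPARK_NUM_ROWS = "spark.sql.statistics.numRows";

    private SparkStatsFallbackSketch() {}

    // Prefer the Hive/Trino row count; consult the Spark value only when
    // the former is missing and the fallback flag is enabled.
    static OptionalLong rowCount(Map<String, String> tableParameters, boolean useSparkFallback)
    {
        OptionalLong hiveRowCount = parseNonNegative(tableParameters.get(HIVE_NUM_ROWS));
        if (hiveRowCount.isPresent() || !useSparkFallback) {
            return hiveRowCount;
        }
        return parseNonNegative(tableParameters.get(SPARK_NUM_ROWS));
    }

    // Treat missing, malformed, or negative values as "statistic not available".
    static OptionalLong parseNonNegative(String value)
    {
        if (value == null) {
            return OptionalLong.empty();
        }
        try {
            long parsed = Long.parseLong(value.trim());
            return parsed >= 0 ? OptionalLong.of(parsed) : OptionalLong.empty();
        }
        catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

In this sketch, statistics generated through Trino or Hive always take precedence, which matches the point above that generating the usual statistics avoids the fallback.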

@findinpath
Contributor

Build is red

Error: src/main/java/io/trino/plugin/hive/metastore/thrift/ThriftSparkMetastoreUtil.java:[48] (regexp) RegexpMultiline: Line has trailing whitespace

")", tableName1));

onSpark().executeQuery(format("INSERT INTO %s VALUES " +
"(120, 32760, 2147483640, 9223372036854775800, 123.340, 234.560, CAST(343.0 AS DECIMAL(10, 0)), CAST(345.670 AS DECIMAL(10, 5)), TIMESTAMP '2015-05-10 12:15:30', DATE '2015-05-08', 'p1 varchar', CAST('varchar10' AS VARCHAR(10)), CAST('p1 char10' AS CHAR(10)), false, CAST('p1 binary' as BINARY))," +
Contributor

It would be beneficial to add another row with non-null values to this test.

As `io.trino.plugin.hive.metastore.thrift.ThriftSparkMetastoreUtil#fromMetastoreDistinctValuesCount(long, long, long)` shows, the NDV count is set to 1, so a future maintainer of the code may be puzzled to see that the NDV doesn't correspond to the actual number of distinct values.
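
For readers unfamiliar with this kind of adjustment, the sketch below shows one normalization that would produce the behaviour described in this comment. The two rules (floor the NDV at 1 when non-null rows exist, cap it at the non-null row count) are assumptions for illustration, and the class and method names are hypothetical; this is not a copy of the method referenced above.

```java
// Hypothetical illustration of NDV normalization; the rules are assumed, not taken from the PR.
final class DistinctValuesSketch
{
    private DistinctValuesSketch() {}

    static long normalizeDistinctValuesCount(long distinctValuesCount, long nullsCount, long rowCount)
    {
        // Guard against inconsistent statistics where nulls exceed rows
        long nonNullsCount = Math.max(0, rowCount - nullsCount);

        // If at least one non-null row exists, there is at least one distinct value
        if (nonNullsCount > 0 && distinctValuesCount == 0) {
            distinctValuesCount = 1;
        }

        // A stored NDV may be an estimate, so cap it at the number of non-null rows
        return Math.min(distinctValuesCount, nonNullsCount);
    }
}
```

With a single-row test table, such a normalization reports an NDV of 1 whatever value was stored, which is why an extra row with different non-null values makes the expected statistics easier to interpret.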

@Dith3r Dith3r force-pushed the ke/spark-stats branch 3 times, most recently from 79261aa to 1d4b117 Compare February 23, 2023 09:14
@raunaqmorarka raunaqmorarka force-pushed the ke/spark-stats branch 2 times, most recently from add8439 to 4d36e4c Compare February 28, 2023 03:48
@raunaqmorarka raunaqmorarka merged commit 5dc41d4 into trinodb:master Feb 28, 2023
@github-actions github-actions bot added this to the 409 milestone Feb 28, 2023
@Dith3r Dith3r deleted the ke/spark-stats branch March 27, 2023 11:37
Labels
cla-signed docs needs-docs This pull request requires changes to the documentation performance