Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HiveQueryRunner#main more user friendly #11151

Merged
merged 4 commits into from
Feb 26, 2022

Conversation

arhimondr
Copy link
Contributor

Description

Change HiveQueryRunner#main configuration

  • Provision tpcds connector by default
  • Disable security to avoid unnecessary complexity of role management when trying to debug simple things
  • Use standard column names and types in tpch schema to allow running unmodified tpch queries

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Tests

How would you describe this change to a non-technical end user or system administrator?

Non end user visible change

Related issues, pull requests, and links

-

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@@ -358,6 +400,9 @@ public static void main(String[] args)
.setHiveProperties(ImmutableMap.of())
.setInitialTables(TpchTable.getTables())
.setBaseDataDir(baseDataDir)
.setSecurity(ALLOW_ALL)
.setTpchColumnNaming(STANDARD)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we converge everywhere to standard, this is useful.
until then, it's a no go to be default in query runner's main.
Query runner's main must allow running test queries from the tests themselves, and tests currently rely on the simplified names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Yeah, let's keep the defaults. What do you think about commenting out these two lines so the standard naming can be easily enabled?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with commented out lines.

queryRunner.createCatalog("tpch", "tpch", tpchCatalogProperties);

queryRunner.installPlugin(new TpcdsPlugin());
queryRunner.createCatalog("tpcds", "tpcds");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we think it's generally useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveQueryRunner#main is a convenient way to start pseudo distributed cluster for testing. Having tpcds connector deploy allows to run tpcds queries and play with the tpcds dataset. Is the concern here that tpcds will be using unnecessary resources when running integration tests? If that's the case and if we think that the resources used are significant I can make this conditional and only enable it in HiveQueryRunner#main

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am more concerned about startup time. but probably it doesn't cost much.

(i almost never use tpcds data set, so i am surprised this is useful outside of benchmarks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are planning to use tpcds dataset for project Tardigrade testing

@arhimondr arhimondr force-pushed the hive-query-runner-tests branch from 0591a49 to 8671cd0 Compare February 23, 2022 19:54
@arhimondr
Copy link
Contributor Author

Updated

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except "Support decimal columns in Tpch statistics provider"

@@ -356,6 +357,9 @@ private static ColumnStatistics toColumnStatistics(ColumnStatisticsData stats, T
if (min.isEmpty() || max.isEmpty()) {
return Optional.empty();
}
if (columnType instanceof DecimalType) {
return Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a weird change. The commit says "Support decimal columns in Tpch statistics provider", but it looks like removing support for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this line it fails on DecimalType being not supported. Perhaps a better solution would be to try to return values, but that requires a bigger refactor. I need this change just to unblock running tpch queries with decimal columns enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a better solution would be to try to return values

agreed

but that requires a bigger refactor.

i hoped not. The actual ranges should be the same, and io.trino.spi.statistics.ColumnStatistics is already "normalized" into doubles, so we should be able use the min/max values we have both for doubles and decimals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you are probably right, let me take a closer look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually yeah, the stats for decimal columns are available and it was super easy to propagate them. Thanks for the suggestion. Updated.

It is useful to have tpcds connector enabled when trying to debug tpcds
queries locally
To avoid dealing with roles and permissions when debugging
To allow running unmodified tpch queries when debugging
@arhimondr arhimondr force-pushed the hive-query-runner-tests branch from 8671cd0 to 69a6841 Compare February 25, 2022 22:37
@findepi findepi merged commit ebf2852 into trinodb:master Feb 26, 2022
@github-actions github-actions bot added this to the 372 milestone Feb 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants