Improve planning time for Iceberg simple SELECT queries with optional LIMIT #17347

Merged
merged 8 commits into from
May 11, 2023

Conversation

findepi
Member

@findepi findepi commented May 4, 2023

No description provided.

@cla-bot cla-bot bot added the cla-signed label May 4, 2023
@findepi findepi force-pushed the findepi/select-limit branch 2 times, most recently from f27f169 to 1c646c2 Compare May 4, 2023 11:20
@github-actions github-actions bot added delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector tests:hive labels May 4, 2023
@findepi findepi force-pushed the findepi/select-limit branch 3 times, most recently from bb48796 to 5e8ead4 Compare May 4, 2023 12:39
Member

@alexjo2144 alexjo2144 left a comment

Couple nit-picks, and test failures look relevant. Neat improvement though

@alexjo2144
Member

I noticed you had to override the default executor for the table scan in the tests. Is the change still useful with the default behavior?

@findepi
Member Author

findepi commented May 5, 2023

> I noticed you had to override the default executor for the table scan in the tests. Is the change still useful with the default behavior?

Yes, I am using a direct executor for determinism.

Without that there is still an improvement, but because ParallelIterable is used, the number of opened files was not deterministic.

I am still concerned that we're opening 20 manifests when 3 should suffice.
I haven't tracked down where this number comes from yet, but it seems constant.
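
To illustrate the determinism point, here is a minimal sketch, not the test code from this PR. It assumes a hypothetical `manifestsOpened` helper that "opens" one manifest per submitted task and stops once enough files were found. With a direct executor each task runs on the calling thread before the loop condition is checked again, so the count is repeatable. With a thread pool the loop may submit further tasks before earlier ones finish, so the count can vary between runs.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; none of these names come from Trino or Iceberg.
final class DirectExecutorSketch {
    // A direct executor runs each task synchronously on the calling thread.
    static final Executor DIRECT = Runnable::run;

    // Submits one "open manifest" task at a time until `limit` files were found,
    // and returns how many manifests were opened.
    static int manifestsOpened(Executor executor, int manifestCount, int filesPerManifest, int limit) {
        AtomicInteger opened = new AtomicInteger();
        AtomicInteger filesFound = new AtomicInteger();
        for (int i = 0; i < manifestCount && filesFound.get() < limit; i++) {
            executor.execute(() -> {
                opened.incrementAndGet();
                filesFound.addAndGet(filesPerManifest);
            });
        }
        // With DIRECT, every submitted task has already completed at this point.
        return opened.get();
    }

    public static void main(String[] args) {
        // 20 manifests, 1 file each, 3 files needed
        System.out.println(manifestsOpened(DIRECT, 20, 1, 3)); // prints 3
    }
}
```

Passing an asynchronous executor to the same helper would make the returned value depend on thread scheduling, which is why a test asserting on exact counts wants the direct one.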

@findepi
Member Author

findepi commented May 5, 2023

> and test failures look relevant

I will drop the "Abstain from loading stats if they were not needed" commit (move to #17347).
It is not needed for the SELECT flow (unless someone enables collect_plan_statistics_for_all_queries, but then it's wrong).

I will add an EXPLAIN ANALYZE test case too, because of this line:

return plan(analysis, stage, analysis.getStatement() instanceof ExplainAnalyze || isCollectPlanStatisticsForAllQueries(session));

@findepi
Member Author

findepi commented May 5, 2023

AC

@findepi findepi force-pushed the findepi/select-limit branch from bdd89a6 to 50409fc Compare May 5, 2023 08:31
@findepi
Member Author

findepi commented May 5, 2023

> I am still concerned about the fact we're opening 20 manifests, while 3 should suffice.
> I didn't track yet where this number comes from, but it seems constant.

Found it. See the code comment just added (https://github.com/trinodb/trino/compare/bdd89a6efc32f7fef1ba29eeb7e60cbab8b52761..50409fc8c53c99bc4f7d5810e7a346730bd84f4c)

@findepi findepi force-pushed the findepi/select-limit branch from 50409fc to 987c8d8 Compare May 5, 2023 12:50
@findepi
Member Author

findepi commented May 5, 2023

rebased to resolve conflict

@findepi
Member Author

findepi commented May 5, 2023

check-commit (54302eb0c86a8130968406c38c4c0352e1610f10)

Fetching the repository
  /usr/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
  Error: fatal: unable to access 'https://github.com/trinodb/trino/': The requested URL returned error: 429
  The process '/usr/bin/git' failed with exit code 128
  Waiting 10 seconds before trying again
  /usr/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
  Error: fatal: unable to access 'https://github.com/trinodb/trino/': The requested URL returned error: 429
  The process '/usr/bin/git' failed with exit code 128
  Waiting 16 seconds before trying again
  /usr/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
  Error: fatal: unable to access 'https://github.com/trinodb/trino/': The requested URL returned error: 429
  Error: The process '/usr/bin/git' failed with exit code 128

ignoring

@findepi
Member Author

findepi commented May 5, 2023

We would probably get advantages similar to those achieved here without modifying the connector, by tuning query.min-schedule-split-batch-size down. I think it's still a good change to have in, without any downsides that I can see.

findepi added 7 commits May 10, 2023 17:21
When `assertThat(...).containsExactlyInAnyOrderElementsOf` fails, it
prints a diff, and the reader has to count elements in that diff. Reuse the
friendlier assertions previously used for metastore access tests only.

There were two defaults -- one in the query runner (used by some tests)
and one in TEST_SESSION (used by most tests).

Previously we called `Scan.planFiles` and transformed it with
`TableScanUtil.splitFiles`. The new code iterates over the output from
`Scan.planFiles` and splits each one individually. As an advantage, this
introduces a scope within which we know a file has been processed.
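
The per-file approach described in the last commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `FileTask`, `Split`, `splitOne` and `process` are hypothetical stand-ins, not the Iceberg `FileScanTask` or `TableScanUtil` API and not the code merged here. The point is that splitting inside the loop gives a place where all splits of one file are known at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; not the Iceberg or Trino classes.
final class PerFileSplitSketch {
    static final class FileTask {
        final String path;
        final long length;
        final boolean hasDeletes;

        FileTask(String path, long length, boolean hasDeletes) {
            this.path = path;
            this.length = length;
            this.hasDeletes = hasDeletes;
        }
    }

    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Splits a single file task into chunks of at most targetSize bytes.
    static List<Split> splitOne(FileTask task, long targetSize) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < task.length; offset += targetSize) {
            splits.add(new Split(task.path, offset, Math.min(targetSize, task.length - offset)));
        }
        return splits;
    }

    // Iterates over planned files and splits each one individually.
    // Returns path -> split count for files that had no deletions.
    static Map<String, Integer> process(Iterator<FileTask> plannedFiles, long targetSize) {
        Map<String, Integer> wholeFilesCovered = new LinkedHashMap<>();
        while (plannedFiles.hasNext()) {
            FileTask file = plannedFiles.next();
            List<Split> splits = splitOne(file, targetSize);
            // Within this scope the file is fully processed: every split of it is in `splits`.
            if (!file.hasDeletes) {
                wholeFilesCovered.put(file.path, splits.size());
            }
        }
        return wholeFilesCovered;
    }
}
```

For example, a 250-byte file without deletions and a 100-byte target size yields three splits, and the file is recorded as fully covered, while a file with deletions is left out of the result.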
@findepi findepi force-pushed the findepi/select-limit branch from 987c8d8 to 0f82b0f Compare May 10, 2023 15:23
@findepi
Member Author

findepi commented May 10, 2023

(rebased to resolve conflicts)

if (!fileTasksIterator.hasNext()) {
    // This is the last task for this file
    if (!fileHasAnyDeletions) {
        // There were no deletions, so we produces splits covering the whole file
Member

Suggested change
- // There were no deletions, so we produces splits covering the whole file
+ // There were no deletions, so we produce splits covering the whole file

Member Author

thanks! changing to "produced"

@findepi findepi force-pushed the findepi/select-limit branch from 0f82b0f to 6dc58d1 Compare May 11, 2023 20:00
@findepi findepi merged commit a30ee90 into trinodb:master May 11, 2023
@findepi findepi deleted the findepi/select-limit branch May 11, 2023 20:06
@github-actions github-actions bot added this to the 418 milestone May 11, 2023
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector
5 participants