Improve estimation of row count from partition samples #11333

Merged: 1 commit, Mar 8, 2022

raunaqmorarka
Member

Description

Reduce the possibility of estimation errors in averageRowsPerPartition
and rowCount due to a couple of outliers by excluding the
min and max rowCount values from the calculation of
avg rows per partition.
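
As an illustration of the idea described above, here is a minimal sketch in Python of a trimmed average over sampled partition row counts. It is not the connector's actual code (which is Java): the function names `average_rows_per_partition` and `estimate_row_count` are hypothetical, and falling back to a plain mean when there are two or fewer samples is an assumption of this sketch, not something stated in this PR.

```python
from typing import Optional, Sequence


def average_rows_per_partition(row_counts: Sequence[int]) -> Optional[float]:
    """Average rows per sampled partition, ignoring one minimum and one maximum.

    Hypothetical helper for illustration only.
    """
    if not row_counts:
        # No sampled partition has a known row count.
        return None
    if len(row_counts) <= 2:
        # Assumption of this sketch: with too few samples to trim,
        # fall back to the plain mean.
        return sum(row_counts) / len(row_counts)
    # Drop exactly one smallest and one largest value, then average the rest.
    trimmed_total = sum(row_counts) - min(row_counts) - max(row_counts)
    return trimmed_total / (len(row_counts) - 2)


def estimate_row_count(row_counts: Sequence[int], total_partitions: int) -> Optional[float]:
    """Extrapolate a table row count from the trimmed per-partition average."""
    average = average_rows_per_partition(row_counts)
    if average is None:
        return None
    return average * total_partitions


if __name__ == "__main__":
    # One empty partition and one very large partition among the samples.
    samples = [100, 110, 90, 1_000_000, 0]
    print(sum(samples) / len(samples))           # plain mean, dominated by the outlier
    print(average_rows_per_partition(samples))   # trimmed mean
    print(estimate_row_count(samples, 50))       # extrapolated to 50 partitions
```

With the sample values above, the plain mean is pulled far from the typical partition size by a single large partition, while the trimmed mean stays close to it. Trimming only one value at each end keeps the calculation cheap, but it would not help if several sampled partitions were skewed in the same direction.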

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

hive connector statistics

How would you describe this change to a non-technical end user or system administrator?

improves estimates for partitioned hive tables

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@raunaqmorarka
Member Author

TPC benchmark results for partitioned sf1000 orc
Rowcount skew fix sf1000 orc partitioned.pdf

Member

@sopel39 sopel39 left a comment

lgtm % comments

Member

@skrzypo987 skrzypo987 left a comment

Not an expert here, but seems legit.

Member

@lukasz-stec lukasz-stec left a comment

lgtm

Reduce the possibility of estimation errors in averageRowsPerPartition
and rowCount due to a couple of outliers by excluding the
min and max rowCount values from the calculation of
avg rows per partition.
@sopel39
Member

sopel39 commented Mar 8, 2022

lgtm % mind automation

@raunaqmorarka
Member Author

Test failure due to #11368

@sopel39 sopel39 merged commit 84f0b69 into trinodb:master Mar 8, 2022
@raunaqmorarka raunaqmorarka deleted the rowcount-skew branch March 8, 2022 12:47
@sopel39 sopel39 mentioned this pull request Mar 8, 2022
@github-actions github-actions bot added this to the 373 milestone Mar 8, 2022
branimir-vujicic added a commit to axiomq/presto that referenced this pull request Mar 20, 2022
branimir-vujicic added a commit to axiomq/presto that referenced this pull request Apr 22, 2022
Reduce the possibility of estimation errors in averageRowsPerPartition
and rowCount due to a couple of outliers by excluding the
min and max rowCount values from the calculation of
avg rows per partition.

Cherry-pick of trinodb/trino#11333

Co-authored-by: Raunaq Morarka <[email protected]>
4 participants