Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: fix buckets for INT2 and INT4 #88083

Merged
merged 1 commit into from
Sep 20, 2022
Merged

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Sep 17, 2022

Previously, if we needed to create "outer" histogram buckets (which is the case when minimum and maximum values in the column weren't sampled yet they contributed to the distinct count) for INT2 and INT4 types, we would use the values that exceeded the supported range for those types. This could lead to incorrect estimation later on when those "outer" buckets are used during the costing as well as the histograms would need to be manually edited to be injected. This is now fixed by handling these two types separately.

Fixes: #76887.

Epic: CRDB-14510

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thank you. Sorry I didn't fix this!

Whenever we get around to doing #83730 we can remove the outer bucket stuff altogether. But this is a good fix until then.

I worry the other users of Datum.IsMin, IsMax, Min, and Max are also buggy. Seems like those methods should all take a types.T argument. I keep meaning to investigate that.

Reviewed 4 of 11 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


-- commits line 12 at r1:
nit: add "Fixes: #76887." here, too.


pkg/sql/stats/histogram.go line 356 at r1 (raw file):

		// INT2 and INT4 require special handling.
		// TODO(yuzefovich): other types might need it too, but it's less
		// pressing to fix that.

I think FLOAT4 is also affected. The TODO seems fine.

Previously, if we needed to create "outer" histogram buckets (which is
the case when minimum and maximum values in the column weren't sampled
yet they contributed to the distinct count) for INT2 and INT4 types, we
would use the values that exceeded the supported range for those types.
This could lead to incorrect estimation later on when those "outer"
buckets are used during the costing as well as the histograms would need
to be manually edited to be injected. This is now fixed by handling
these two types separately.

Release note: None
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry the other users of Datum.IsMin, IsMax, Min, and Max are also buggy. Seems like those methods should all take a types.T argument. I keep meaning to investigate that.

I feel like ints are the most common (only?) types for which we have this kind of problem - they all have the same type family with three different widths, yet internally we more often than not just use int64. For other types we either use the precise datum (e.g. for decimals) or the "largest" datum internally (float64 for both FLOAT4 and FLOAT8). There definitely can be something else, but I don't recall encountering other issues. (As an aside, at some point I remember talking to Jordan and discussing whether we should just define separate type families for INT2 and INT4 as well as separate DInt2 and DInt4 datums - for example, this would making things cleaner in the generated code for the vectorized engine and - possibly - improve our compatibility story with Postgres. I think Oliver looked into something related too, like two years ago.)

I'm hopeful this should alleviate vast majority of the cases.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)


-- commits line 12 at r1:

Previously, michae2 (Michael Erickson) wrote…

nit: add "Fixes: #76887." here, too.

I actually deliberately don't do this - if the issue is mentioned in the commit message, then every time the commit is amended and force-pushed, a new "mention" line appears on the issue itself. This probably doesn't matter for this case, but for some commits I might have dozens of force-pushes, so those "mentions" just clutter the issue. Having the "fixes" line on the PR description is sufficient IMO - it'll close the issue accordingly and then it's simple to go from commit to PR to issue if necessary.


pkg/sql/stats/histogram.go line 356 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I think FLOAT4 is also affected. The TODO seems fine.

Probably for FLOAT4 we are safe because under the hood we still use FLOAT8.

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @rytaft, and @yuzefovich)


-- commits line 12 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I actually deliberately don't do this - if the issue is mentioned in the commit message, then every time the commit is amended and force-pushed, a new "mention" line appears on the issue itself. This probably doesn't matter for this case, but for some commits I might have dozens of force-pushes, so those "mentions" just clutter the issue. Having the "fixes" line on the PR description is sufficient IMO - it'll close the issue accordingly and then it's simple to go from commit to PR to issue if necessary.

TIL!

@craig
Copy link
Contributor

craig bot commented Sep 20, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 20, 2022

Build succeeded:

@craig craig bot merged commit 5862e9b into cockroachdb:master Sep 20, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 20, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 906cc48 to blathers/backport-release-22.1-88083: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt: histogram upper bounds should not be outside of min/max values for a type
3 participants