-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: fix buckets for INT2 and INT4 #88083
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Sorry I didn't fix this!
Whenever we get around to doing #83730 we can remove the outer bucket stuff altogether. But this is a good fix until then.
I worry the other users of Datum.IsMin
, IsMax
, Min
, and Max
are also buggy. Seems like those methods should all take a types.T
argument. I keep meaning to investigate that.
Reviewed 4 of 11 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)
-- commits
line 12 at r1:
nit: add "Fixes: #76887." here, too.
pkg/sql/stats/histogram.go
line 356 at r1 (raw file):
// INT2 and INT4 require special handling. // TODO(yuzefovich): other types might need it too, but it's less // pressing to fix that.
I think FLOAT4
is also affected. The TODO seems fine.
Previously, if we needed to create "outer" histogram buckets (which is the case when minimum and maximum values in the column weren't sampled yet they contributed to the distinct count) for INT2 and INT4 types, we would use the values that exceeded the supported range for those types. This could lead to incorrect estimation later on when those "outer" buckets are used during the costing as well as the histograms would need to be manually edited to be injected. This is now fixed by handling these two types separately. Release note: None
2f5a09f
to
906cc48
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry the other users of
Datum.IsMin
,IsMax
,Min
, andMax
are also buggy. Seems like those methods should all take atypes.T
argument. I keep meaning to investigate that.
I feel like ints are the most common (only?) types for which we have this kind of problem - they all have the same type family with three different widths, yet internally we more often than not just use int64
. For other types we either use the precise datum (e.g. for decimals) or the "largest" datum internally (float64
for both FLOAT4 and FLOAT8). There definitely can be something else, but I don't recall encountering other issues. (As an aside, at some point I remember talking to Jordan and discussing whether we should just define separate type families for INT2 and INT4 as well as separate DInt2
and DInt4
datums - for example, this would making things cleaner in the generated code for the vectorized engine and - possibly - improve our compatibility story with Postgres. I think Oliver looked into something related too, like two years ago.)
I'm hopeful this should alleviate vast majority of the cases.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)
Previously, michae2 (Michael Erickson) wrote…
nit: add "Fixes: #76887." here, too.
I actually deliberately don't do this - if the issue is mentioned in the commit message, then every time the commit is amended and force-pushed, a new "mention" line appears on the issue itself. This probably doesn't matter for this case, but for some commits I might have dozens of force-pushes, so those "mentions" just clutter the issue. Having the "fixes" line on the PR description is sufficient IMO - it'll close the issue accordingly and then it's simple to go from commit to PR to issue if necessary.
pkg/sql/stats/histogram.go
line 356 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I think
FLOAT4
is also affected. The TODO seems fine.
Probably for FLOAT4
we are safe because under the hood we still use FLOAT8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @rytaft, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I actually deliberately don't do this - if the issue is mentioned in the commit message, then every time the commit is amended and force-pushed, a new "mention" line appears on the issue itself. This probably doesn't matter for this case, but for some commits I might have dozens of force-pushes, so those "mentions" just clutter the issue. Having the "fixes" line on the PR description is sufficient IMO - it'll close the issue accordingly and then it's simple to go from commit to PR to issue if necessary.
TIL!
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 906cc48 to blathers/backport-release-22.1-88083: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Previously, if we needed to create "outer" histogram buckets (which is the case when minimum and maximum values in the column weren't sampled yet they contributed to the distinct count) for INT2 and INT4 types, we would use the values that exceeded the supported range for those types. This could lead to incorrect estimation later on when those "outer" buckets are used during the costing as well as the histograms would need to be manually edited to be injected. This is now fixed by handling these two types separately.
Fixes: #76887.
Epic: CRDB-14510
Release note: None