Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: fix another floating point precision error in statisticsBuilder #38730

Merged
merged 1 commit into from
Jul 9, 2019

Conversation

rytaft
Copy link
Collaborator

@rytaft rytaft commented Jul 8, 2019

This commit fixes a floating point precision error in the
statisticsBuilder code for estimating the distinct count of an
unconstrained column in a SELECT or JOIN expression.

Prior to this commit, the code was estimating that the probability
of a row being filtered out was 1-selectivity. If the selectivity is
very small, however, this results in probability=1. This commit changes
the logic so now we set the probability equal to 0.9999999 if it would
otherwise be equal to 1. This ensures that the estimated distinct count
is always greater than 0 if the row count is greater than 0.

Fixes #38375

Release note: None

@rytaft rytaft requested review from justinj and RaduBerinde July 8, 2019 15:36
@rytaft rytaft requested a review from a team as a code owner July 8, 2019 15:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@justinj justinj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, if this problem is happening often maybe it would make sense to have a SetDistinctCount which includes this kind of safeguard automatically? I guess it's pretty situational, though

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @justinj and @RaduBerinde)

@rytaft
Copy link
Collaborator Author

rytaft commented Jul 8, 2019

TFTR! Yea, when @RaduBerinde and I discussed adding this assertion a while ago we decided that each operator would have a better idea about how to set the distinct count than a global function. It is a bit annoying to keep getting these bugs, but I think it's helped fix a bunch of issues. If they don't disappear before the release I can a SetDistinctCount function so customers don't run into this....

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)


pkg/sql/opt/props/statistics.go, line 197 at r1 (raw file):

		// 1 to avoid setting the distinct count to 0 (since the row count is
		// non-zero).
		p = 0.9999999

Would it make more sense to come up with a reasonable "final value" for DistinctCount instead of passing this value through the formula below?

@rytaft
Copy link
Collaborator Author

rytaft commented Jul 8, 2019


pkg/sql/opt/props/statistics.go, line 197 at r1 (raw file):

Previously, RaduBerinde wrote…

Would it make more sense to come up with a reasonable "final value" for DistinctCount instead of passing this value through the formula below?

Not sure what that value would be... do you have any ideas? This approach at least seemed likely to produce something in the correct range. If it's larger than the row count it will get truncated to the row count in finalizeFromRowCount.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/opt/props/statistics.go, line 197 at r1 (raw file):

Previously, rytaft wrote…

Not sure what that value would be... do you have any ideas? This approach at least seemed likely to produce something in the correct range. If it's larger than the row count it will get truncated to the row count in finalizeFromRowCount.

If this case is logically a case where selectivity is effectively 0, then DistinctCount should be effectively 0, so it would be set to a very small value (I think we used 1e-7 in other places).

By the way, should the check be if p > 0.9999999 ? It seems conceivable that we might get a value under 1 but very very close, and it would get treated as 1 below. Also, seems odd that selectivity=0 would get a bigger value than selectivity=1e-8

This commit fixes a floating point precision error in the
statisticsBuilder code for estimating the distinct count of an
unconstrained column in a SELECT or JOIN expression.

Prior to this commit, the code was estimating that the probability
of a row being filtered out was 1-selectivity. If the selectivity is
very small, however, this results in probability=1 and estimated
distinct count=0. This commit changes the logic so now we set the
distinct count equal to 1e-10 if it would otherwise be equal to 0.

Fixes cockroachdb#38375

Release note: None
@rytaft
Copy link
Collaborator Author

rytaft commented Jul 9, 2019


pkg/sql/opt/props/statistics.go, line 197 at r1 (raw file):

Previously, RaduBerinde wrote…

If this case is logically a case where selectivity is effectively 0, then DistinctCount should be effectively 0, so it would be set to a very small value (I think we used 1e-7 in other places).

By the way, should the check be if p > 0.9999999 ? It seems conceivable that we might get a value under 1 but very very close, and it would get treated as 1 below. Also, seems odd that selectivity=0 would get a bigger value than selectivity=1e-8

Yea that makes sense (I decided to directly check distinct count instead, but same idea). I also changed the other spot to check < epsilon and for good measure made them both 1e-10.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/props/statistics.go, line 197 at r1 (raw file):

Previously, rytaft wrote…

Yea that makes sense (I decided to directly check distinct count instead, but same idea). I also changed the other spot to check < epsilon and for good measure made them both 1e-10.

Nice, I like it.


pkg/sql/opt/props/statistics.go, line 192 at r2 (raw file):

	// when d << n.
	c.DistinctCount = d - d*math.Pow(1-selectivity, n/d)
	const epsilon = 1e-10

[nit] this could be reused across this file, e.g. distinctCountEpsilon

@rytaft
Copy link
Collaborator Author

rytaft commented Jul 9, 2019


pkg/sql/opt/props/statistics.go, line 192 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] this could be reused across this file, e.g. distinctCountEpsilon

The other case is for selectivity, not distinct count. How about StatsEpsilon (needs to be exported since these are different packages)?

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/props/statistics.go, line 192 at r2 (raw file):

Previously, rytaft wrote…

The other case is for selectivity, not distinct count. How about StatsEpsilon (needs to be exported since these are different packages)?

Ah, right. Never mind then, they can stay separate.

Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


pkg/sql/opt/props/statistics.go, line 192 at r2 (raw file):

Previously, RaduBerinde wrote…

Ah, right. Never mind then, they can stay separate.

Sound good.

craig bot pushed a commit that referenced this pull request Jul 9, 2019
38730: opt: fix another floating point precision error in statisticsBuilder r=rytaft a=rytaft

This commit fixes a floating point precision error in the
statisticsBuilder code for estimating the distinct count of an
unconstrained column in a `SELECT` or `JOIN` expression.

Prior to this commit, the code was estimating that the probability
of a row being filtered out was 1-selectivity. If the selectivity is
very small, however, this results in probability=1. This commit changes
the logic so now we set the probability equal to 0.9999999 if it would
otherwise be equal to 1. This ensures that the estimated distinct count
is always greater than 0 if the row count is greater than 0.

Fixes #38375

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jul 9, 2019

Build succeeded

@craig craig bot merged commit 2c1e193 into cockroachdb:master Jul 9, 2019
rytaft added a commit to rytaft/cockroach that referenced this pull request Jul 10, 2019
…lCounts

A previous PR (cockroachdb#38730) updated the logic in selectivityFromNullCounts
to compare the result of floating point arithmetic with a small constant
epsilon. For consistency, this commit adds similar logic in
joinSelectivityFromNullCounts.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 10, 2019
38795: opt: update joinSelectivityFromNullCounts to match selectivityFromNullCounts r=rytaft a=rytaft

A previous PR (#38730) updated the logic in `selectivityFromNullCounts`
to compare the result of floating point arithmetic with a small constant
`epsilon`. For consistency, this commit adds similar logic in
`joinSelectivityFromNullCounts`.

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: internal error: estimated distinct count must be non-zero
4 participants