builtins: use Youngs-Cramer algorithm for aggregation functions #55268
Conversation
cc @yuzefovich
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
The tests for multiple nodes and distributed sql mode "on" are failing because the correlation calculation depends on the order of the values. Consider the following example:
My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number). @yuzefovich could you please confirm that my understanding is correct and advise how I can define the order on the …
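To illustrate the order dependence, a minimal Go sketch (mine, not from the PR): float64 addition is not associative, so the same values combined in different orders can produce different results.

```go
package main

import "fmt"

func main() {
	// The same three values combined in two different orders.
	a := (1e16 + 1.0) - 1e16 // 1e16 + 1 rounds back to 1e16: the 1 is lost
	b := (1e16 - 1e16) + 1.0 // the big terms cancel first: the 1 survives
	fmt.Println(a, b)        // prints: 0 1
}
```

With distributed execution, the merge order of per-node partial results varies from run to run, hence the flaky comparisons.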
Thanks for taking this on and for the thorough investigation of this precision issue! I agree with your assessment that the order of float operands seems to matter, so we need to think of some remedy.
My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number).
I think your understanding is correct, but I don't like the idea of requiring an ordering just to get a deterministic result when talking about such tiny precision (beyond 15 digits as documented by Postgres). I'll confirm with my colleagues tomorrow, but I think we should just modify the tests to allow for some deviation, i.e. instead of something like
query R
SELECT corr(y, x)::decimal FROM statistics_agg_test
----
0.045228963191363145
we would have something like
query B
SELECT abs(corr(y, x)::decimal-0.045228963191363145) < power(10, -15) FROM statistics_agg_test
----
true
My hesitation about imposing an ordering comes from the fact that it'll likely have performance implications (we might need to plan a sort before an aggregator), and we're already satisfying the precision requirements mentioned by Postgres. I believe some non-determinism is acceptable in this case. What do you think?
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mneverov)
pkg/sql/sem/builtins/aggregate_builtins.go, line 1739 at r1 (raw file):
} /*
nit: we tend to use // for all comments, not /* and */.
pkg/sql/sem/builtins/aggregate_builtins.go, line 1758 at r1 (raw file):
* to minimize code space instead. */
nit: we usually don't have an empty line between a comment and a struct.
pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):
*/ type regressionAccumulate struct {
super nit: I'd probably do s/accumulate/accumulator/.
pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):
} err = a.regAcc.add(y, x)
nit: could combine these two lines into one.
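Concretely, the combined form would look something like the following (a sketch; that the enclosing method simply returns the error is my assumption, not shown in the diff):

```go
// Fold the pair into the accumulator and check the error in one statement.
if err := a.regAcc.add(y, x); err != nil {
	return err
}
```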
pkg/sql/sem/builtins/aggregate_builtins.go, line 1910 at r1 (raw file):
// Reset implements tree.AggregateFunc interface.
func (a *corrAggregate) Reset(context.Context) {
	a.regAcc.n = 0
nit: could replace all the lines with this:
a.regAcc = regressionAccumulate{}
Another idea that was suggested by @asubiotto is to adjust the logic test harness itself in order to introduce a separate type marker for floats (separate from decimals) that allows for 10e-15 deviation; I'll look into that shortly.
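As a rough illustration of what such a float matcher could do (a minimal sketch assuming a compare-by-formatting approach; the actual harness in #55530 may differ):

```go
package main

import (
	"fmt"
	"strconv"
)

// matchFloats reports whether two float64 values agree in their first
// 15 significant decimal digits. Formatting both values in scientific
// notation with 14 digits after the point (15 significant digits total)
// and comparing the strings is one simple approach; the formatting
// itself rounds, which is the kind of edge case the commit message
// below warns about.
func matchFloats(expected, actual float64) bool {
	return strconv.FormatFloat(expected, 'e', 14, 64) ==
		strconv.FormatFloat(actual, 'e', 14, 64)
}

func main() {
	fmt.Println(matchFloats(1.0/3.0, 0.333333333333333)) // true: equal to 15 digits
	fmt.Println(matchFloats(1.0/3.0, 0.333333333333))    // false: differs within 15 digits
}
```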
I opened up #55530, and I would be interested if you could try it out (note that you'll need to modify the …)
@yuzefovich thanks!
I haven't ever needed 16 digits of precision, so I'm ok with this. Maybe it is worth mentioning the deviation in the docs?
55530: logictest: introduce a separate matcher for floats r=yuzefovich a=yuzefovich

This commit adds another type annotation `F` intended to be used for floating point numbers which have a separate matcher that allows for some deviation in the precision. Namely, the matcher checks that the expected and the actual numbers match in 15 significant decimal digits. Note that I think the matching algorithm introduces some rounding errors, so it might not work perfectly in all scenarios, yet it should be good enough in rare test cases when we need it.

Informs: #55268.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
#55530 has just merged.
I think that floating point precision is a pretty confusing topic, and talking about it in the docs might only add to the confusion. I'd expect most people not to care whether there is non-determinism (depending on the distribution of the execution) in the result with …
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Force-pushed from d5d8ab1 to 6b24abc
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):
Previously, yuzefovich wrote…
super nit: I'd probably do s/accumulate/accumulator/.
Sorry, I didn't get it. Could you please explain this one if it is still relevant with the latest changes?
pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):
Previously, yuzefovich wrote…
nit: could combine these two lines into one.
👍
@yuzefovich could you please have another look? The PR got quite large, so I split it in two: this one ports the algorithm from Postgres to reuse in other aggregation functions later. I found some tests for …
I think it's reasonable to have this PR as a standalone, and then work on covar_pop in a different PR.
nit: you should update the PR title to the commit title.
Reviewed 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mneverov)
pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):
Previously, mneverov (Max Neverov) wrote…
Sorry, I didn't get it. Could you please explain this one if it is still relevant with the latest changes?
The notation s/abc/xyz/ means "replace abc with xyz" (it comes from the sed command).
My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.
pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):
// Reset implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Reset(context.Context) {
	a.n = 0
nit: all these lines could be replaced with *a = regressionAccumulateBase{}.
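Applied to the quoted snippet, the suggestion reads roughly as follows (a sketch of the suggested shape, not the exact diff):

```go
// Reset implements tree.AggregateFunc interface. Reassigning the zero
// value clears all accumulated fields in one statement instead of
// zeroing each one individually.
func (a *regressionAccumulateBase) Reset(context.Context) {
	*a = regressionAccumulateBase{}
}
```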
pkg/sql/sem/builtins/aggregate_builtins.go, line 1802 at r3 (raw file):
// Size implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Size() int64 {
	return sizeOfCorrAggregate
nit: we should define a separate sizeOfRegressionAccumulateBase and use it here.
pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):
}

func pivotArgs(args ...[]tree.Datum) [][]tree.Datum {
nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".
This commit replaces the existing algorithm for correlation calculation with the Youngs-Cramer algorithm ported from PostgreSQL to reduce rounding errors. It introduces the base structure for statistical aggregate functions. It also amends the aggregate builtins tests so that functions with several arguments can be tested.

Release note: None
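For context, the heart of the change is the Youngs-Cramer update step, shown below as a minimal Go sketch modeled on PostgreSQL's float8_regr_accum (names and the corr finalizer are illustrative; the PR's actual code differs in structure and error handling). Instead of accumulating raw sums of squares, each incoming value is centered against the running sums, which avoids the catastrophic cancellation behind the rounding errors:

```go
package main

import (
	"fmt"
	"math"
)

// regressionAccumulator holds the Youngs-Cramer transition state:
// a count plus centered sums of squares and cross-products.
type regressionAccumulator struct {
	n, sx, sxx, sy, syy, sxy float64
}

// add folds one (y, x) pair into the accumulator.
func (a *regressionAccumulator) add(y, x float64) {
	a.n++
	a.sx += x
	a.sy += y
	if a.n > 1 {
		// Center the new values against the running sums before
		// squaring; this is the step that reduces rounding error.
		tmpX := x*a.n - a.sx
		tmpY := y*a.n - a.sy
		scale := 1.0 / (a.n * (a.n - 1))
		a.sxx += tmpX * tmpX * scale
		a.syy += tmpY * tmpY * scale
		a.sxy += tmpX * tmpY * scale
	}
}

// corr finalizes the Pearson correlation coefficient.
func (a *regressionAccumulator) corr() float64 {
	return a.sxy / math.Sqrt(a.sxx*a.syy)
}

func main() {
	var a regressionAccumulator
	for _, p := range [][2]float64{{10, 1}, {20, 2}, {25, 3}} {
		a.add(p[0], p[1])
	}
	fmt.Println(a.corr()) // ≈ 0.982
}
```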
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):
Previously, yuzefovich wrote…
The notation s/abc/xyz/ means "replace abc with xyz" (it comes from the sed command). My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.
got it, thanks.
pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):
Previously, yuzefovich wrote…
nit: all these lines could be replaced with *a = regressionAccumulateBase{}.
missed that, fixed, thanks
pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):
Previously, yuzefovich wrote…
nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".
Thanks, "flattenArgs" is indeed describes better what it does
bors r=yuzefovich
Reviewed 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Build succeeded:
builtins: use Youngs-Cramer algorithm for aggregation functions
This commit replaces the existing algorithm for correlation calculation with the Youngs-Cramer algorithm ported from PostgreSQL to reduce rounding errors. It introduces the base structure for statistical aggregate functions.
It also amends the aggregate builtins tests so that functions with several arguments can be tested.
Release note: None
Relates to #41274