Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

builtins: use Youngs-Crammer algorithm for aggregation functions #55268

Merged
merged 1 commit into from
Oct 17, 2020

Conversation

mneverov
Copy link
Contributor

@mneverov mneverov commented Oct 6, 2020

builtins: use Youngs-Crammer algorithm for aggregation functions

This commit replaces existing algorithm for correlation calculation with the Youngs-Crammer algorithm ported from Postgresql to reduce rounding errors. It introduces the base structure for aggregate functions for statistics.
It also amends aggregates builtin tests so functions with several arguments can be tested.

Release note: None

Relates to #41274

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 6, 2020

CLA assistant check
All committers have signed the CLA.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Oct 6, 2020
@blathers-crl blathers-crl bot requested a review from rohany October 6, 2020 20:13
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rohany
Copy link
Contributor

rohany commented Oct 6, 2020

cc @yuzefovich

@rohany rohany requested review from yuzefovich and removed request for rohany October 6, 2020 20:36
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Oct 6, 2020
@blathers-crl
Copy link

blathers-crl bot commented Oct 7, 2020

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@yuzefovich yuzefovich requested a review from a team October 7, 2020 21:05
@mneverov
Copy link
Contributor Author

The tests for multiple nodes and distributed sql mode "on" are failing because correlation calculation depends on the order of the values.
This is true for the old method as well as for the Youngs-Crammer algorithm ported from the Postgres.

Consider the following example:

-- Postgres

CREATE TABLE aggtest
(
    a double precision,
    b double precision
);

insert into aggtest (a, b)
values (0, 0.09561),
       (12, 124.78),
       (42, 324.78),
       (42, 324.78),
       (42, 384.78),
       (42, 324.78),
       (56, 7.8),
       (56, 7.8),
       (56, 7.8),
       (56, 7.8),
       (56, 324.78),
       (87, 127.001),
       (100, 99.097),
       (100, 99.097),
       (100, 99.097);

SELECT corr(b, a)
FROM aggtest;

-- PG:             -0.14162191454732073
-- Old algorithm:  -0.14162191454732112
-- Youngs-Crammer: -0.14162191454732073

truncate aggtest;

-- the same values in different order
insert into aggtest (a, b)
values (56, 7.8),
       (56, 7.8),
       (42, 324.78),
       (56, 7.8),
       (100, 99.097),
       (42, 324.78),
       (56, 7.8),
       (100, 99.097),
       (100, 99.097),
       (42, 324.78),
       (56, 324.78),
       (42, 384.78),
       (12, 124.78),
       (0, 0.09561),
       (87, 127.001)
;

SELECT corr(b, a)
FROM aggtest;

-- PG:             -0.1416219145473208
-- Old algorithm:  -0.14162191454732093
-- Youngs-Crammer: -0.1416219145473208

My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number).

@yuzefovich could you please confirm that my understanding is correct and advice how I can define the order on the corr op?

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on and for the thorough investigation of this precision issue! I agree with your assessment that the order of float operands seems to matter, so we need to think of some remedy.

My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number).

I think your understanding is correct, but I don't like the idea of requiring some order in order to get the deterministic result when talking about such tiny precision (beyond 15 digits as documented by Postgres). I'll confirm with my colleagues tomorrow, but I think we should just modify the tests to allow for some deviation, i.e. instead of something like

query R
SELECT corr(y, x)::decimal FROM statistics_agg_test
----
0.045228963191363145

we would have something like

query B
SELECT abs(corr(y, x)::decimal-0.045228963191363145) < power(10, -15) FROM statistics_agg_test
----
true

My hesitation of imposing an ordering comes from the fact that it'll likely have performance implications (we might need to plan a sort before an aggregator), and we're already satisfying the precision requirements mentioned by Postgres. I believe some non-determinism is acceptable in this case. What do you think?

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mneverov)


pkg/sql/sem/builtins/aggregate_builtins.go, line 1739 at r1 (raw file):

}

/*

nit: we tend to use // for all comments, not /* and */.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1758 at r1 (raw file):

 * to minimize code space instead.
 */

nit: we usually don't have an empty line between a comment and a struct.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

 */

type regressionAccumulate struct {

super nit: I'd probably do s/accumulate/accumulator/.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):

	}

	err = a.regAcc.add(y, x)

nit: could combine these two lines into one.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1910 at r1 (raw file):

// Reset implements tree.AggregateFunc interface.
func (a *corrAggregate) Reset(context.Context) {
	a.regAcc.n = 0

nit: could replace all the lines with this

a.regAcc = regressionAccumulate{}

@yuzefovich
Copy link
Member

Another idea that was suggested by @asubiotto is to adjust the logic test harness itself in order to introduce a separate type marker for floats (separate from decimals) that allows for 10e-15 deviation, I'll look into that shortly.

@yuzefovich
Copy link
Member

I opened up #55530, and I would be interested if you could try it out (note that you'll need to modify the query invocation to use F type instead of R for floating point numbers that are non-deterministic).

@mneverov
Copy link
Contributor Author

I opened up #55530, and I would be interested if you could try it out (note that you'll need to modify the query invocation to use F type instead of R for floating point numbers that are non-deterministic).

@yuzefovich thanks!
Just tried it and all logic tests for aggregate testcase pass. Will update this PR when #55530 is merged into the master.

My hesitation of imposing an ordering comes from the fact that it'll likely have performance implications (we might need to plan a sort before an aggregator), and we're already satisfying the precision requirements mentioned by Postgres. I believe some non-determinism is acceptable in this case. What do you think?

I haven't ever needed 16 digits precision, so I'm ok with this. Maybe, it is worth to mention the deviation in the docs?

craig bot pushed a commit that referenced this pull request Oct 15, 2020
55530: logictest: introduce a separate matcher for floats r=yuzefovich a=yuzefovich

This commit adds another type annotation `F` intended to be used for
floating point numbers which have a separate matcher that allows for
some deviation in the precision. Namely, the matcher checks that the
expected and the actual numbers match in 15 significant decimal
digits. Note that I think the matching algorithm introduces some
rounding errors, so it might not work perfectly in all scenarios, yet it
should be good enough in rare test cases when we need it.

Informs: #55268.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich
Copy link
Member

#55530 has just merged.

I haven't ever needed 16 digits precision, so I'm ok with this. Maybe, it is worth to mention the deviation in the docs?

I think that floating point precision is a pretty confusing topic, and talking about it in the docs might only add to the confusion. I'd expect most people to not care whether there is non-determinism (depending on the distribution of the execution) in the result with float type, so I'd probably not call it out.

@blathers-crl
Copy link

blathers-crl bot commented Oct 16, 2020

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@mneverov mneverov force-pushed the pop_covar branch 2 times, most recently from d5d8ab1 to 6b24abc Compare October 16, 2020 19:28
Copy link
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, yuzefovich wrote…

super nit: I'd probably do s/accumulate/accumulator/.

sorry didn't get it, could you please explain this one if it is still relevant with the latest changes?


pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):

Previously, yuzefovich wrote…

nit: could combine these two lines into one.

👍

@mneverov
Copy link
Contributor Author

@yuzefovich could you please have another look?

The PR got quite large so I split it in two: this one ports the algorithm from Postgres to reuse in other aggregation functions later. I found some tests for corr missing.
The actual implementation of covar_pop I can do as a separate PR or an another commit to this one and then squash after the review.

@mneverov mneverov changed the title WIP Support aggregate functions Support aggregate functions Oct 17, 2020
@mneverov mneverov marked this pull request as ready for review October 17, 2020 05:47
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :lgtm: with some nits.

I think it's reasonable to have this PR as a standalone, and then work on covar_pop in a different PR.

nit: you should update the PR title to the commit title.

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mneverov)


pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, mneverov (Max Neverov) wrote…

sorry didn't get it, could you please explain this one if it is still relevant with the latest changes?

The notation s/abc/xyz/ means "replace abc with xyz" (comes from sed command).

My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for the struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):

// Reset implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Reset(context.Context) {
	a.n = 0

nit: all these lines could be replaced with *a = regressionAccumulateBase{}.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1802 at r3 (raw file):

// Size implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Size() int64 {
	return sizeOfCorrAggregate

nit: we should define a separate sizeOfRegressionAccumulateBase and use it here.


pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):

}

func pivotArgs(args ...[]tree.Datum) [][]tree.Datum {

nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".

This commit replaces existing algorithm for correlation calculation with the Youngs-Crammer algorithm ported from Postgresql to reduce rounding errors. It introduces the base structure for aggregate functions for statistics.
It also amends aggregates builtin tests so functions with several arguments can be tested.

Release note: None
@mneverov mneverov changed the title Support aggregate functions builtins: use Youngs-Crammer algorithm for aggregation functions Oct 17, 2020
Copy link
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)


pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, yuzefovich wrote…

The notation s/abc/xyz/ means "replace abc with xyz" (comes from sed command).

My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for the struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.

got it, thanks.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):

Previously, yuzefovich wrote…

nit: all these lines could be replaced with *a = regressionAccumulateBase{}.

missed that, fixed, thanks


pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):

Previously, yuzefovich wrote…

nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".

Thanks, "flattenArgs" is indeed describes better what it does

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r=yuzefovich

Reviewed 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@craig
Copy link
Contributor

craig bot commented Oct 17, 2020

Build succeeded:

@craig craig bot merged commit d752fa2 into cockroachdb:master Oct 17, 2020
@mneverov mneverov deleted the pop_covar branch October 18, 2020 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants