Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] avg() is not pushed down where other aggreagates are #13336

Closed
FranckPachot opened this issue Jul 15, 2022 · 10 comments
Closed

[YSQL] avg() is not pushed down where other aggreagates are #13336

FranckPachot opened this issue Jul 15, 2022 · 10 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@FranckPachot
Copy link
Contributor

FranckPachot commented Jul 15, 2022

Jira Link: DB-2969

Description

The aggregate avg() is not pushed down but could be because sum() and count() are.

Example on 2.15:

create table demo as select n from generate_series(1,100000) n union all select null from generate_series(1,100000) ;
select min(n),max(n),avg(n) from demo;
explain (analyze, verbose, summary off) select min(n),max(n),avg(n) from demo;

Result:

yugabyte=> create table demo as select n from generate_series(1,100000) n union all select null from generate_series(1,100000) ;
SELECT 200000
yugabyte=> select min(n),max(n),avg(n) from demo;
 min |  max   |        avg
-----+--------+--------------------
   1 | 100000 | 50000.500000000000
(1 row)

yugabyte=> explain (analyze, verbose, summary off) select min(n),max(n),avg(n) from demo;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=107.50..107.51 rows=1 width=40) (actual time=727.498..727.498 rows=1 loops=1)
   Output: min(n), max(n), avg(n)
   ->  Seq Scan on public.demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=4.368..708.866 rows=200000 loops=1)
         Output: n
(4 rows)

A workaround is replacing with sum() and avg():

select min(n),max(n),sum(n)/count(n)::numeric n from demo;
explain (analyze, verbose, summary off) select min(n),max(n),sum(n)/count(n)::numeric n from demo;

This is aggregated on the tservers:

yugabyte=> select min(n),max(n),sum(n)/count(n)::numeric n from demo;
 min |  max   |         n
-----+--------+--------------------
   1 | 100000 | 50000.500000000000
(1 row)

yugabyte=> explain (analyze, verbose, summary off) select min(n),max(n),sum(n)/count(n)::numeric n from demo;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=110.00..110.02 rows=1 width=40) (actual time=214.047..214.047 rows=1 loops=1)
   Output: min(n), max(n), ((sum(n))::numeric / (count(n))::numeric)
   ->  Seq Scan on public.demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=214.029..214.033 rows=3 loops=1)
         Output: n
         Partial Aggregate: true
(5 rows)
@tedyu
Copy link
Contributor

tedyu commented Jul 16, 2022

Currently yb_agg_pushdown_supported() doesn't accept avg for pushdown.
Even if it does,

        /* Aggtranstype is a supported YB key type and is not INTERNAL or NUMERIC. */
        if (!YbDataTypeIsValidForKey(aggref->aggtranstype) ||
            aggref->aggtranstype == INTERNALOID ||
            aggref->aggtranstype == NUMERICOID)
            return;

aggref->aggtranstype for the above SQL is INT8ARRAYOID which would be invalid for key, resulting in yb_agg_pushdown_supported returning aggstate->yb_pushdown_supported as false.

@tedyu
Copy link
Contributor

tedyu commented Jul 16, 2022

Currently DocExprExecutor uses yb::QLValuePB to store intermediate result (such as partial sum).
yb::QLValuePB represents a single value.

If avg is pushed down, it seems we need to use another protobuf message (Maybe QLSeqValuePB) for storing partial result:
current count and current sum

Though refactoring is needed in PgsqlReadOperation::PopulateResultSet

  QLExprResult result;
  for (const PgsqlExpressionPB& expr : request_.targets()) {

where

using QLExprResult = ExprResult<QLValuePB>;

@tedyu
Copy link
Contributor

tedyu commented Jul 16, 2022

Another option (without resorting to QLSeqValuePB) is to keep track of partial sum and partial row count in the loop of PgsqlReadOperation::ExecuteScalar using local QLValuePB.
Coming out of the loop, compute the average and store it in result_buffer.

@tedyu
Copy link
Contributor

tedyu commented Jul 16, 2022

POC showing the idea written above:
push-avg-map.txt

@tedyu
Copy link
Contributor

tedyu commented Jul 17, 2022

cc @sushantrmishra

@myang2021
Copy link
Contributor

It could be that AVG has different semantics if done by each tablet. If a 3-tablet table has 30 rows, (n1 + n2 + ... + n30)/30 may not be the same as ((n1 + n2 + ... + n10)/10 + (n11 + n12 + ... + n20) / 10 + (n21 + n22 + ... + n30) / 10) / 3, especially when floating point numbers are considered that can involve rounding. In contrast, MIN, COUNT, MAX, SUM do not have such issue. To ensure same result, we may need to compute SUM and COUNT first, and then when coming back to postgres, compute AVG by doing SUM/COUNT. But I am not sure even that can ensure same result as done by PG considering potential SUM overflow case.

@myang2021
Copy link
Contributor

myang2021 commented Jul 18, 2022

For example, for the given test case, the following yield the same.

yugabyte=# \timing
Timing is on.
yugabyte=# select avg(n) from demo;
 50000.500000000000

Time: 13254.918 ms (00:13.255)
yugabyte=# select sum(n)::numeric/count(n) from demo;
 50000.500000000000

Time: 7538.618 ms (00:07.539)

If we can prove that avg(n) == sum(n)::numeric/count(n) always holds, then we can do a PG parse tree level rewrite to work around the problem that AVG cannot be pushed down.

@andrei-mart
Copy link
Contributor

((n1 + n2 + ... + n10)/10 + (n11 + n12 + ... + n20) / 10 + (n21 + n22 + ... + n30) / 10) / 3

In that particular example avg is correct, as ((n1+...+n10)/10+(n11+...+n20)/10+(n21+...+n30)/10)/3 == (n1+...+n30)/30, however, in general case of uneven distribution it is not, as ((n1+...+n9)/9+(n10+...+n20)/11+(n21+...+n30)/10)/3 != (n1+...+n30)/30

That means sums and counts must be kept separate and be combined by summing of separate components until very end where one is divided by other: ((n1+...+n9)+(n10+...+n20)+(n21+...+n30)/(9+11+10)) == (n1+...+n30)/30

@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jul 20, 2022
FranckPachot added a commit to FranckPachot/ClickBench that referenced this issue Jul 29, 2022
@tedyu
Copy link
Contributor

tedyu commented Aug 28, 2022

It seems we can modify src/postgres/src/backend/parser/gram.y by specializing avg with sum over count.
This would achieve pushdown with minimal code changes.

If this proposal is adopted, I can send out a patch.

@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Feb 6, 2023
yzhong-yb added a commit that referenced this issue Mar 27, 2023
Summary:
Avg is more difficult to push down than other aggregates because it takes
two values in its transition function--the running count, and the running sum.
These two values must be maintained and sent back to postgres separately,
as sending back only the average is not enough for postgres to aggregate
the results of each tablet. Postgres uses INT8ARRAYOID as the transition
type for avg (of smaller ints) because of this. This means that avg cannot
be pushed down in a straightforward manner, as DocDB currently cannot
send back data of type INT8ARRAYOID.

There are several possible solutions, and some discussion of them can be found
in the issue: #13336

This revision substitutes avg with a count and a sum at the pggate level, both
of which can be pushed down, and merges the results back together afterwards.
This only pushes down the avg of int2s and int4s, as larger types use a different
transition type to handle overflow.

Test Plan:
```
./yb_build.sh --java-test org.yb.pgsql.TestPgRegressAggregates
```
is an existing test for aggregates, including avg.

```
./yb_build.sh --java-test org.yb.pgsql.TestPgSelect#testAggregatePushdowns
```
checks for aggregate pushdown happening in the right situations.

Basic demo of pushdown:
```
yugabyte=# create table aggtest (a int, b int);
CREATE TABLE
yugabyte=# insert into aggtest values (1, 2), (2, 4), (3, 10);
INSERT 0 3
yugabyte=# select avg(a), avg(b) from aggtest;
        avg         |        avg
--------------------+--------------------
 2.0000000000000000 | 5.3333333333333333
(1 row)

yugabyte=# explain (analyze, dist) select avg(a), avg(b) from aggtest;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=105.00..105.02 rows=1 width=64) (actual time=3.811..3.812 rows=1 loops=1)
   ->  Seq Scan on aggtest  (cost=0.00..100.00 rows=1000 width=8) (actual time=3.754..3.763 rows=1 loops=1)
         Partial Aggregate: true
         Storage Table Read Requests: 1
         Storage Table Execution Time: 2.858 ms
 Planning Time: 0.238 ms
 Execution Time: 4.412 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 2.858 ms
 Peak Memory Usage: 22 kB
(11 rows)
```

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Differential Revision: https://phabricator.dev.yugabyte.com/D22976
premkumr pushed a commit to premkumr/yugabyte-db that referenced this issue Apr 10, 2023
Summary:
Avg is more difficult to push down than other aggregates because it takes
two values in its transition function--the running count, and the running sum.
These two values must be maintained and sent back to postgres separately,
as sending back only the average is not enough for postgres to aggregate
the results of each tablet. Postgres uses INT8ARRAYOID as the transition
type for avg (of smaller ints) because of this. This means that avg cannot
be pushed down in a straightforward manner, as DocDB currently cannot
send back data of type INT8ARRAYOID.

There are several possible solutions, and some discussion of them can be found
in the issue: yugabyte#13336

This revision substitutes avg with a count and a sum at the pggate level, both
of which can be pushed down, and merges the results back together afterwards.
This only pushes down the avg of int2s and int4s, as larger types use a different
transition type to handle overflow.

Test Plan:
```
./yb_build.sh --java-test org.yb.pgsql.TestPgRegressAggregates
```
is an existing test for aggregates, including avg.

```
./yb_build.sh --java-test org.yb.pgsql.TestPgSelect#testAggregatePushdowns
```
checks for aggregate pushdown happening in the right situations.

Basic demo of pushdown:
```
yugabyte=# create table aggtest (a int, b int);
CREATE TABLE
yugabyte=# insert into aggtest values (1, 2), (2, 4), (3, 10);
INSERT 0 3
yugabyte=# select avg(a), avg(b) from aggtest;
        avg         |        avg
--------------------+--------------------
 2.0000000000000000 | 5.3333333333333333
(1 row)

yugabyte=# explain (analyze, dist) select avg(a), avg(b) from aggtest;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=105.00..105.02 rows=1 width=64) (actual time=3.811..3.812 rows=1 loops=1)
   ->  Seq Scan on aggtest  (cost=0.00..100.00 rows=1000 width=8) (actual time=3.754..3.763 rows=1 loops=1)
         Partial Aggregate: true
         Storage Table Read Requests: 1
         Storage Table Execution Time: 2.858 ms
 Planning Time: 0.238 ms
 Execution Time: 4.412 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 2.858 ms
 Peak Memory Usage: 22 kB
(11 rows)
```

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Differential Revision: https://phabricator.dev.yugabyte.com/D22976
@jasonyb
Copy link
Contributor

jasonyb commented May 25, 2023

Close by d01a6ed

@jasonyb jasonyb closed this as completed May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

7 participants