Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COALESCE appears to unconditionally evaluate all expressions in some cases #82498

Closed
bnaecker opened this issue Jun 7, 2022 · 7 comments
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-todo O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@bnaecker
Copy link

bnaecker commented Jun 7, 2022

Describe the problem

The documentation for the COALESCE function states:

Arguments to the right of the first non-null argument are not evaluated.

It appears that this is not always the case, and that all arguments may be evaluated in some situations.

To Reproduce

To reproduce this, we can start a basic CockroachDB shell with:

$ cockroach demo

Then at the SQL shell, run:

[email protected]:26257/movr> select coalesce((select 1), (select max(x) from generate_series(1, 10000000) as x));
  coalesce
------------
         1
(1 row)


Time: 1.395s total (execution 1.394s / network 0.001s)

[email protected]:26257/movr>

That query takes significantly longer than one would expect, and appears to be evaluating the entire select max(x) .... argument, despite the fact that the first argument is non-null.

We can see that it is indeed evaluating that subquery by looking at the explain output:

[email protected]:26257/movr> explain analyze select coalesce((select 1), (select max(x) from generate_series(1, 10000000) as x));
                                           info
------------------------------------------------------------------------------------------
  planning time: 357µs
  execution time: 1.2s
  distribution: local
  vectorized: true
  maximum memory usage: 102 KiB
  network usage: 0 B (0 messages)
  regions: us-east1

  • root
  │
  ├── • values
  │     nodes: n1
  │     regions: us-east1
  │     actual row count: 1
  │     size: 1 column, 1 row
  │
  ├── • subquery
  │   │ id: @S1
  │   │ original sql: (SELECT 1)
  │   │ exec mode: one row
  │   │
  │   └── • values
  │         nodes: n1
  │         regions: us-east1
  │         actual row count: 1
  │         size: 1 column, 1 row
  │
  └── • subquery
      │ id: @S2
      │ original sql: (SELECT max(x) FROM ROWS FROM (generate_series(1, 10000000)) AS x)
      │ exec mode: one row
      │
      └── • group (scalar)
          │ nodes: n1
          │ regions: us-east1
          │ actual row count: 1
          │ estimated row count: 1
          │
          └── • project set
              │ nodes: n1
              │ regions: us-east1
              │ actual row count: 10,000,000
              │ estimated row count: 10
              │
              └── • emptyrow
                    nodes: n1
                    regions: us-east1
                    actual row count: 1
(48 rows)


Time: 1.244s total (execution 1.244s / network 0.000s)

[email protected]:26257/movr>

Expected behavior

I would expect the second subquery not to run at all, based on the documentation. It looks like that is indeed true for a "simpler" version of the statement:

[email protected]:26257/movr> explain analyze select coalesce(1, (select max(x) from generate_series(1, 10000000) as x));
               info
-----------------------------------
  planning time: 192µs
  execution time: 370µs
  distribution: local
  vectorized: true
  maximum memory usage: 10 KiB
  network usage: 0 B (0 messages)
  regions: us-east1

  • values
    nodes: n1
    regions: us-east1
    actual row count: 1
    size: 1 column, 1 row
(13 rows)


Time: 1ms total (execution 1ms / network 0ms)

[email protected]:26257/movr>

The only difference here is the first argument to COALESCE, which is SELECT 1 in the first case, and just the literal 1 in the
latter. I would expect this query to run in the same time in both cases, based on the documentation.

On the other hand, if there is a more nuanced description of when such expressions are evaluated, it would be good to update the documentation to reflect that.

Additional data / screenshots

N/A

Environment:

  • CockroachDB version 21.2.9
  • Server OS: illumos
  • Client app is cockroach sql

The full buildinfo for us is:

$ cockroach version
Build Tag:        v21.2.9
Build Time:       2022/04/28 04:02:42
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.16.10
C Compiler:       gcc 10.3.0
Build Commit ID:  11787edfcfc157a0df951abc34684e4e18b3ef20
Build Type:       release

Additional context

We're using queries with the COALESCE statement to conditionally run certain subqueries. Those queries may be expensive, so we'd like to only run them if required. By placing them in the later arguments to COALESCE, the documentation suggests that should be achievable.

As an additional piece of context, it does appear that such conditional evaluation, even when both arguments are subqueries, does work in other cases:

[email protected]:26257/movr> explain analyze select if(1 = 1, (select 1), (select max(x) from generate_series(1, 10000000) as x));
               info
-----------------------------------
  planning time: 224µs
  execution time: 661µs
  distribution: local
  vectorized: true
  maximum memory usage: 10 KiB
  network usage: 0 B (0 messages)
  regions: us-east1

  • root
  │
  ├── • values
  │     nodes: n1
  │     regions: us-east1
  │     actual row count: 1
  │     size: 1 column, 1 row
  │
  └── • subquery
      │ id: @S1
      │ original sql: (SELECT 1)
      │ exec mode: one row
      │
      └── • values
            nodes: n1
            regions: us-east1
            actual row count: 1
            size: 1 column, 1 row
(26 rows)


Time: 1ms total (execution 1ms / network 0ms)

However, I should note that the IF function also appears to evaluate both arguments, so long as the condition itself is a subquery:

[email protected]:26257/movr> explain analyze select if((select 1 = 1), 1, (select max(x) from generate_series(1, 10000000) as x));
                                           info
------------------------------------------------------------------------------------------
  planning time: 358µs
  execution time: 1.7s
  distribution: local
  vectorized: true
  maximum memory usage: 102 KiB
  network usage: 0 B (0 messages)
  regions: us-east1

  • root
  │
  ├── • values
  │     nodes: n1
  │     regions: us-east1
  │     actual row count: 1
  │     size: 1 column, 1 row
  │
  ├── • subquery
  │   │ id: @S1
  │   │ original sql: (SELECT 1 = 1)
  │   │ exec mode: one row
  │   │
  │   └── • values
  │         nodes: n1
  │         regions: us-east1
  │         actual row count: 1
  │         size: 1 column, 1 row
  │
  └── • subquery
      │ id: @S2
      │ original sql: (SELECT max(x) FROM ROWS FROM (generate_series(1, 10000000)) AS x)
      │ exec mode: one row
      │
      └── • group (scalar)
          │ nodes: n1
          │ regions: us-east1
          │ actual row count: 1
          │ estimated row count: 1
          │
          └── • project set
              │ nodes: n1
              │ regions: us-east1
              │ actual row count: 10,000,000
              │ estimated row count: 10
              │
              └── • emptyrow
                    nodes: n1
                    regions: us-east1
                    actual row count: 1
(48 rows)


Time: 1.749s total (execution 1.749s / network 0.000s)

[email protected]:26257/movr>

Jira issue: CRDB-16422

@bnaecker bnaecker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 7, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jun 7, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-queries (found keywords: vectorized,plan)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jun 7, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 7, 2022
@mgartner mgartner self-assigned this Jun 7, 2022
@mgartner
Copy link
Collaborator

mgartner commented Jun 7, 2022

Thanks for the report @bnaecker. I agree that the documentation is misleading. As an immediate step, we'll work to make it more clear.

I'll add some more context to explain the behaviors you're seeing.

In your first example, select coalesce((select 1), (select max(x) from generate_series(1, 10000000) as x));, the inner select clauses are uncorrelated subqueries which do not reference any columns from the outer part of the query. In many cases, the optimizer can inline uncorrelated subqueries into the query plan. When uncorrelated subqueries aren't inlined, they are eagerly evaluated and the results are cached before the execution of the outer query. In this example, select 1 and select max(x) from generate_series(1, 10000000) as x are executed first, then the select coalesce(...) outer query is executed, accessing the cached subquery results when necessary.

In your later example, select if(1 = 1, (select 1), (select max(x) from generate_series(1, 10000000) as x)), the conditional part of the if operator is an expression that can be evaluated before execution-time. The optimizer folds 1 = 1 to true, and then folds the if operator to simply (select 1), eliminating the expensive subquery from the query plan.

When you change the if conditional to a subquery, select if((select 1 = 1), 1, (select max(x) from generate_series(1, 10000000) as x));, the optimizer can no longer fold the conditional to true, so the if and the expensive subquery remain.

As far as I can tell, there is nothing incorrect about this behavior, but it would be great to improve the performance of queries like this. Maybe there's a way to inline these subqueries in a way that ensures the expensive part of the query won't run unless necessary. Alternatively, we could explore evaluating subqueries lazily instead of eagerly.

@mgartner
Copy link
Collaborator

mgartner commented Jun 7, 2022

Alternatively, we could explore evaluating subqueries lazily instead of eagerly.

It appears that Postgres behaves this way:

marcus=# EXPLAIN ANALYZE SELECT max(x) FROM generate_series(1, 10000000) AS x;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=125000.00..125000.01 rows=1 width=4) (actual time=2626.653..2626.654 rows=1 loops=1)
   ->  Function Scan on generate_series x  (cost=0.00..100000.00 rows=10000000 width=4) (actual time=873.188..1844.573 rows=10000000 loops=1)
 Planning Time: 0.034 ms
 Execution Time: 2633.763 ms
(4 rows)



marcus=# EXPLAIN ANALYZE SELECT COALESCE((SELECT 1), (SELECT max(x) FROM generate_series(1, 10000000) AS x));
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Result  (cost=125000.02..125000.03 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=125000.00..125000.01 rows=1 width=4) (never executed)
           ->  Function Scan on generate_series x  (cost=0.00..100000.00 rows=10000000 width=4) (never executed)
 Planning Time: 0.053 ms
 Execution Time: 0.025 ms
(8 rows)

@bnaecker
Copy link
Author

bnaecker commented Jun 7, 2022

Thanks @mgartner! I agree with you that this doesn't seem to be a correctness issue, and I didn't mean to imply that. I was mostly confused about the documentation, which doesn't match the implementation. Updating that to indicate that non-chosen branches are probably not executed, and also a warning to always verify the final query, would be great.

Thanks also for the context about correlated vs. uncorrelated subqueries, that's very helpful. Would you be able to recommend another way to write this query that's more likely to result in pruned subqueries? It seems like the optimizer I've experimented with both IF and CASE expressions, all of which appear to fully evaluate the subqueries prior to evaluating the actual conditional or branch.

@mgartner
Copy link
Collaborator

mgartner commented Jun 7, 2022

I can rewrite your particular query to be faster:

-- Executes in <1ms.
SELECT * FROM (
  SELECT 1
  UNION ALL
  SELECT max(x) FROM generate_series(1, 10000000) AS x
) AS s(i)
WHERE i IS NOT NULL
LIMIT 1;

-- Executes in ~1.5s.
SELECT * FROM (
  SELECT NULL
  UNION ALL
  SELECT max(x) FROM generate_series(1, 10000000) AS x
) AS s(i)
WHERE i IS NOT NULL
LIMIT 1;

But this approach doesn't generalize to all queries, so I don't think it's really what you are looking for.

@bnaecker
Copy link
Author

bnaecker commented Jun 7, 2022

Thanks, that's an interesting approach I hadn't considered. You're right it won't generalize, but that's my problem! Thanks again, and let me know if there's any further context or data you need about the issue at hand. Appreciate your time!

@mgartner
Copy link
Collaborator

Docs PR: cockroachdb/docs#14329
Issue tracking lazy evaluation of subqueries: #20298

mgartner added a commit to mgartner/cockroach that referenced this issue Jun 22, 2022
craig bot pushed a commit that referenced this issue Jun 23, 2022
82703: builtins: remove certain overloads for to_timestamp r=mgartner a=otan

Paving the way for this to be backported - remove some overloads that
were added to remove ambiguity for the purpose of better backwards
compat.

Release note: None

83196: opt: clarify exception to conditional evaluation guarantee r=mgartner a=mgartner

See #20298 and #82498 for additional context.

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-todo O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner
Projects
Archived in project
Development

No branches or pull requests

2 participants