
opt: avoid estimated row counts of 0 #32578

Closed
jordanlewis opened this issue Nov 23, 2018 · 8 comments · Fixed by #37729
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@jordanlewis
Member

Given the simple table:

[email protected]:65097/defaultdb> create table a (a int);
CREATE TABLE

Consider the following two queries. I expect the plan to always look like the first one, never the second.

Good plan (only a single sort):

[email protected]:65097/defaultdb> explain select * from a order by a limit 1 offset 100;
       tree      | field  | description
+----------------+--------+-------------+
  limit          |        |
   │             | count  | 1
   │             | offset | 100
   └── sort      |        |
        │        | order  | +a
        └── scan |        |
                 | table  | a@primary
                 | spans  | ALL
(8 rows)

Bad plan (two sorts):

[email protected]:65097/defaultdb> explain select * from a order by a limit 1 offset 9999;
            tree           | field  | description
+--------------------------+--------+-------------+
  limit                    |        |
   │                       | count  | 1
   └── sort                |        |
        │                  | order  | +a
        └── limit          |        |
             │             | offset | 9999
             └── sort      |        |
                  │        | order  | +a
                  └── scan |        |
                           | table  | a@primary
                           | spans  | ALL
(11 rows)

As far as I can tell, the rows should already be sorted after the offset, so there should be no reason to insert another sort between the offset and the limit.

@jordanlewis jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-optimizer SQL logical planning and optimizations. labels Nov 23, 2018
@andy-kimball
Contributor

@RaduBerinde, can you take a look?

@RaduBerinde
Member

The estimated row count after "offset 9999" is 0, which makes the sort "free", and the plan with the sort is considered first. There are two fixes (we probably want both):

  • add a constant cpu cost per operator (which would reflect the overhead of setting up the execution for the operator)
  • make the row count small but not 0; though I don't know exactly where this value would come from.
  limit                                        
   ├── columns: a:1(int)                       
   ├── internal-ordering: +1                   
   ├── cardinality: [0 - 1]                    
   ├── stats: [rows=0]                         
   ├── cost: 1249.31569                        
   ├── key: ()                                 
   ├── fd: ()-->(1)                            
   ├── sort                                    
   │    ├── columns: a:1(int)                  
   │    ├── stats: [rows=0]                    
   │    ├── cost: 1249.31569                   
   │    ├── ordering: +1                       
   │    └── offset                             
   │         ├── columns: a:1(int)             
   │         ├── internal-ordering: +1         
   │         ├── stats: [rows=0]               
   │         ├── cost: 1249.31569              
   │         ├── sort                          
   │         │    ├── columns: a:1(int)        
   │         │    ├── stats: [rows=1000]       
   │         │    ├── cost: 1249.31569         
   │         │    ├── ordering: +1             
   │         │    ├── prune: (1)               
   │         │    └── scan a                   
   │         │         ├── columns: a:1(int)   
   │         │         ├── stats: [rows=1000]  
   │         │         ├── cost: 1030          
   │         │         └── prune: (1)          
   │         └── const: 9999 [type=int]        
   └── const: 1 [type=int]                     

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Nov 26, 2018
Add a one-time cpuCostFactor to all operators. This reflects the time
taken to set up execution for the operator, and will result in plans
with fewer operators all else being equal (e.g. when the estimated row
count is 0).

Informs cockroachdb#32578.

Release note: None
craig bot pushed a commit that referenced this issue Nov 27, 2018
32616: opt: add one-time cost for all operators r=RaduBerinde a=RaduBerinde

Add a one-time cpuCostFactor to all operators. This reflects the time
taken to set up execution for the operator, and will result in plans
with fewer operators all else being equal (e.g. when the estimated row
count is 0).

Informs #32578.

Release note: None

Note: while this fixes the specific issue mentioned and seems like a reasonable idea on its own, I think having row count = 0 is also problematic because everything above that operator won't be optimized properly. 0 should be reserved for the case where we know for sure there are 0 rows. I don't know how to make that happen without making the "count" something more complicated (e.g. a count + a variance, or a confidence interval).

Co-authored-by: Radu Berinde <[email protected]>
@RaduBerinde RaduBerinde changed the title opt: sorts, limits and offsets sometimes produce an inefficient double-sort plan opt: avoid estimated row counts of 0 Nov 27, 2018
@RaduBerinde
Member

The specific case reported here is fixed; however, I'm going to repurpose the issue to track the larger problem of 0 row counts.

We should avoid estimated row counts of 0 except in situations where we know the count is exactly 0. A row count of 0 prevents the optimizer from making reasonable choices, since all plans above that point end up with the same cost.

Note that in this example, without stats, we are subtracting a definite count from an "arbitrary unit" which is also problematic.

@andy-kimball
Contributor

I think we could fix this without causing ripples by always adding 1 to the estimated row count. That way, it's always > 0, but still smoothly preserves relative costs between plans.

@RaduBerinde
Member

Every operator would add 1? So for example, a project would have 1 more row than its input? Or maybe just in those cases where we are making an estimation?

Related to this, I think that whenever we are using real table stats that say 0 rows, we should override them to 1 row (or a few rows).

@andy-kimball
Contributor

I was mostly thinking about table stats. I'm suggesting we could always add 1 to them so we never get an estimate of 0 rows. I'm assuming that most (all?) other estimates will be non-zero as long as they have non-zero inputs. The only operators that should return 0 rows would be those where we know there are zero rows, like a Select with a False filter.

@RaduBerinde
Member

RaduBerinde commented May 13, 2019

Oh, yeah, that makes sense.

@RaduBerinde
Member

@rytaft - putting this on your plate. It is possible that #37611 makes the problem of 0 row stats worse. We should do the simple fix of adding 1 to table stats that are zero.

rytaft added a commit to rytaft/cockroach that referenced this issue May 22, 2019
This commit improves our statistics estimates so that we never estimate
zero rows unless the row count is provably zero (e.g., SELECT ... WHERE false).
We want to avoid estimating zero rows since the stats may be stale, and
we can end up with weird and inefficient plans if we estimate zero rows.
Therefore, this commit changes the logic in the statisticsBuilder so that
a row count of 0 is replaced with 1, unless that would be inconsistent with
the cardinality.

This commit also updates all estimates for distinct count and null count
to ensure that they are never larger than the row count. We also ensure
that there is at least one distinct or null value if row count > 0.

Fixes cockroachdb#32578

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue May 22, 2019
This commit improves our statistics estimates so that we never estimate
zero rows unless the row count is provably zero (e.g., SELECT ... WHERE false).
We want to avoid estimating zero rows since the stats may be stale, and
we can end up with weird and inefficient plans if we estimate zero rows.

This commit also updates all estimates for distinct count and null count
to ensure that they are never larger than the row count. We also ensure
that there is at least one distinct or null value if row count > 0.

Fixes cockroachdb#32578

Release note: None
craig bot pushed a commit that referenced this issue May 23, 2019
37729: opt: avoid estimating row count = 0 r=rytaft a=rytaft

This commit improves our statistics estimates so that we never estimate
zero rows unless the row count is provably zero (e.g., `SELECT ... WHERE false`).
We want to avoid estimating zero rows since the stats may be stale, and
we can end up with weird and inefficient plans if we estimate zero rows.

This commit also updates all estimates for distinct count and null count
to ensure that they are never larger than the row count. We also ensure
that there is at least one distinct or null value if row count > 0.

Fixes #32578

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
@craig craig bot closed this as completed in #37729 May 23, 2019