Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: interval type not using same comparison as sql equality for aggregation or insertion #79549

Open
michae2 opened this issue Apr 6, 2022 · 8 comments
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL A-sql-typing SQLtype inference, typing rules, type compatibility. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-sql-queries SQL Queries Team

Comments

@michae2
Copy link
Collaborator

michae2 commented Apr 6, 2022

It appears that the INTERVAL type is not using the same comparison for SQL equality as for aggregation. These queries illustrate the problem:

SELECT INTERVAL '1 day';
SELECT INTERVAL '24 hours';
SELECT INTERVAL '1 day' = INTERVAL '24 hours';
SELECT DISTINCT i FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i);
SELECT i, COUNT(*) FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i) GROUP BY i;
CREATE TABLE t (i INTERVAL PRIMARY KEY);
INSERT INTO t VALUES (INTERVAL '1 day');
INSERT INTO t VALUES (INTERVAL '24 hours');

In PostgreSQL, the two interval values are equal, group together in the same bucket, and count as the same key during insertion. PostgreSQL 14.2:

michae2=# SELECT INTERVAL '1 day';
 interval
----------
 1 day
(1 row)

michae2=# SELECT INTERVAL '24 hours';
 interval
----------
 24:00:00
(1 row)

michae2=# SELECT INTERVAL '1 day' = INTERVAL '24 hours';
 ?column?
----------
 t
(1 row)

michae2=# SELECT DISTINCT i FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i);
   i
-------
 1 day
(1 row)

michae2=# SELECT i, COUNT(*) FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i) GROUP BY i;
   i   | count
-------+-------
 1 day |     2
(1 row)

michae2=# CREATE TABLE t (i INTERVAL PRIMARY KEY);
CREATE TABLE

michae2=# INSERT INTO t VALUES (INTERVAL '1 day');
INSERT 0 1

michae2=# INSERT INTO t VALUES (INTERVAL '24 hours');
2022-04-06 15:07:42.453 PDT [27581] ERROR:  duplicate key value violates unique constraint "t_pkey"
2022-04-06 15:07:42.453 PDT [27581] DETAIL:  Key (i)=(24:00:00) already exists.
2022-04-06 15:07:42.453 PDT [27581] STATEMENT:  INSERT INTO t VALUES (INTERVAL '24 hours');
ERROR:  duplicate key value violates unique constraint "t_pkey"
DETAIL:  Key (i)=(24:00:00) already exists.

But in CockroachDB, the two interval values are equal, but do not group together in the same bucket or count as the same key during insertion. CockroachDB v22.1.0-beta.1, same for both vectorized engine and row-based engine:

[email protected]:26257/defaultdb> SELECT INTERVAL '1 day';
SELECT INTERVAL '24 hours';
SELECT INTERVAL '1 day' = INTERVAL '24 hours';
SELECT DISTINCT i FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i);
SELECT i, COUNT(*) FROM (VALUES (INTERVAL '1 day'), (INTERVAL '24 hours')) v(i) GROUP BY i;
  interval
------------
  1 day
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

  interval
------------
  24:00:00
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

  ?column?
------------
    true
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

     i
------------
  1 day
  24:00:00
(2 rows)


Time: 0ms total (execution 0ms / network 0ms)

     i     | count
-----------+--------
  1 day    |     1
  24:00:00 |     1
(2 rows)


Time: 1ms total (execution 1ms / network 0ms)

[email protected]:26257/defaultdb> CREATE TABLE t (i INTERVAL PRIMARY KEY);
CREATE TABLE


Time: 5ms total (execution 5ms / network 0ms)

[email protected]:26257/defaultdb> INSERT INTO t VALUES (INTERVAL '1 day');
INSERT 1


Time: 7ms total (execution 7ms / network 0ms)

[email protected]:26257/defaultdb> INSERT INTO t VALUES (INTERVAL '24 hours');
INSERT 1


Time: 1ms total (execution 1ms / network 0ms)

Jira issue: CRDB-14903

Epic: CRDB-20062

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-pgcompat Semantic compatibility with PostgreSQL A-sql-typing SQLtype inference, typing rules, type compatibility. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-sql-queries SQL Queries Team labels Apr 6, 2022
@rafiss
Copy link
Collaborator

rafiss commented Apr 7, 2022

Nice find! I wonder if the explanation of the problem in #57876 is related to this bug.

@mgartner
Copy link
Collaborator

mgartner commented Apr 12, 2022

This made me think of composite sensitivity:

  • // CanBeCompositeSensitive returns true if a scalar expression could return
    // logically different results because of non-logical differences in outer
    // columns with composite type.
    //
    // Composite values are values that contain more information than the logical
    // value (i.e. the key encoding). Examples are decimals (1.0 = 1.00) and
    // collated strings ('foo' COLLATE en_u_ks_level1 = 'FOO' COLLATE
    // en_u_ks_level1).
    //
    // An example of a composite-sensitive expression is `d::string`, where d is a
    // DECIMAL.
    //
    // A formal definition:
    // Let (c1,c2,...) be the outer columns of the scalar expression. Let
    // f(x1,x2,..) be the result of the scalar expression for the given outer
    // column values. The expression is composite insensitive if, for any two
    // sets of values (x1,x2,...) and (y1,y2,...)
    // (x1=y1 AND x2=y2 AND ...) => f(x1,x2,...) = f(y1,y2,...)
    //
    // Note that this doesn't mean that the final results are always *identical*
    // just that they are logically equal.
    //
    // This property is used to determine when a scalar expression can be copied,
    // with outer column variable references changed to refer to other columns that
    // are known to be equal to the original columns.
  • SELECT from a table with hash sharded PK has unexpected index join in query plan #67170 (comment)

And now I'm afraid that there's a number of other related bugs:

  1. The optimizer thinks an expression like i::STRING where i is an INTERVAL is composite insensitive, when it is actually composite sensitive. I'm still trying to work out what the implications of this are - I'll try to produce an incorrect result based on this.
  2. crdb_internal.datums_to_bytes returns different values for '1 day'::INTERVAL and '24 hours'::INTERVAL. This causes incorrect query results for tables with hash-sharded indexes on INTERVAL columns. cc @chengxiong-ruan @ajwerner

Example of (2):

defaultdb> CREATE TABLE t (k INT PRIMARY KEY, i INTERVAL, INDEX (i) USING HASH WITH bucket_count=8);
CREATE TABLE

defaultdb> INSERT INTO t VALUES (1, '24 hours');
INSERT 1

defaultdb> SELECT * FROM t WHERE i = '1 day'::INTERVAL;
  k | i
----+----
(0 rows)

defaultdb> SELECT * FROM t@primary WHERE i = '1 day'::INTERVAL;

  k |    i
----+-----------
  1 | 24:00:00
(1 row)

@mgartner
Copy link
Collaborator

The problem in (2) above is actually much broader than hash-sharded indexes. It looks like you can get incorrect results with any index on an interval column. On v21.2.7:

defaultdb> CREATE TABLE t (k INT PRIMARY KEY, i INTERVAL, INDEX (i));
CREATE TABLE

defaultdb> INSERT INTO t VALUES (1, '24 hours');
INSERT 1

defaultdb> SELECT * FROM t WHERE i = '1 day'::INTERVAL;
  k | i
----+----
(0 rows)

defaultdb> SELECT * FROM t@primary WHERE i = '1 day'::INTERVAL;

  k |    i
----+-----------
  1 | 24:00:00
(1 row)

@michae2
Copy link
Collaborator Author

michae2 commented Apr 12, 2022

More examples of different values that should both compare equal and also bucket together:

'1 MONTH 2 DAYS'::INTERVAL
'32 DAYS'::INTERVAL

'2 MONTHS'::INTERVAL
'60 DAYS'::INTERVAL
'86400 MINUTES'::INTERVAL

INTERVAL '1 year'
INTERVAL '518400 minutes'

(Anyone who has seen Rent might be scratching their head at that last one, but this bit from the postgres docs explains it: "Fractional parts of units greater than months are truncated to be an integer number of months, e.g. '1.5 years' becomes '1 year 6 mons'. Fractional parts of weeks and days are computed to be an integer number of days and microseconds, assuming 30 days per month and 24 hours per day, e.g., '1.75 months' becomes 1 mon 22 days 12:00:00.")

@michae2
Copy link
Collaborator Author

michae2 commented Apr 12, 2022

Based on EncodeDurationAscending these problems should only happen when using "improper" intervals in two forms:

  • seconds + minutes + hours >= 24 hours
  • seconds + minutes + hours + days >= 30 days

The improper interval doesn't necessarily have to be stored to get incorrect results, it could simply be used as a constant in a query. For example, this gives incorrect results:

[email protected]:26257/defaultdb> CREATE TABLE t (i INTERVAL PRIMARY KEY);
CREATE TABLE


Time: 8ms total (execution 8ms / network 0ms)

[email protected]:26257/defaultdb> INSERT INTO t VALUES ('1 day'::INTERVAL);
INSERT 1


Time: 6ms total (execution 5ms / network 0ms)

[email protected]:26257/defaultdb> SELECT * FROM t WHERE i = '24 hours'::INTERVAL;
  i
-----
(0 rows)


Time: 1ms total (execution 1ms / network 0ms)

@mgartner
Copy link
Collaborator

mgartner commented May 5, 2022

A related problem: #80999

@otan
Copy link
Contributor

otan commented May 24, 2022

Another related problem: #81577

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label May 24, 2022
@rail
Copy link
Member

rail commented May 25, 2022

Manually synced with Jira

@jlinder jlinder removed the sync-me-3 label May 27, 2022
@mgartner mgartner moved this to 23.2 Release in SQL Queries Jul 24, 2023
@mgartner mgartner moved this from 23.2 Release to 24.1 Release in SQL Queries Aug 3, 2023
@yuzefovich yuzefovich added T-sql-queries SQL Queries Team T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Nov 21, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 28, 2023
@mgartner mgartner moved this from 24.2 Release to Bugs to Fix in SQL Queries Mar 28, 2024
@mgartner mgartner added the S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors label Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL A-sql-typing SQLtype inference, typing rules, type compatibility. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-sql-queries SQL Queries Team
Projects
Status: Bugs to Fix
Status: Triage
Development

No branches or pull requests

7 participants