DISTINCT ON queries return incorrect results on Distributed Hypertables #3784

Closed
tylerfontaine opened this issue Nov 5, 2021 · 3 comments · Fixed by #3797
Comments

@tylerfontaine
Contributor

tylerfontaine commented Nov 5, 2021

Relevant system information:

  • OS: [e.g. Ubuntu 16.04, Windows 10 x64, etc]
  • PostgreSQL version (output of postgres --version): PostgreSQL 13.4 (Ubuntu 13.4-4.pgdg21.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, 64-bit
  • TimescaleDB version (output of \dx in psql): 2.5.0
  • Installation method: Timescale Cloud

Describe the bug
When issuing a DISTINCT ON query with an ORDER BY against a distributed hypertable, it produces an incorrect result.

When the same query is reformulated with a LIMIT 1, it returns the correct result.

To Reproduce
Steps to reproduce the behavior:
On a multi-node system (mine has 1 access node and 3 data nodes):
I was not able to reproduce this with a small amount of data, so here is a CSV of 100,000 rows which reliably reproduces the incorrect behavior.
distinct_on_repro.csv

Create two identical tables. Make one distributed, one a regular hypertable:

CREATE TABLE test_distributed (
    time timestamp without time zone NOT NULL,
    id text,
    val integer,
    label text
);
select create_distributed_hypertable('test_distributed', 'time');

CREATE TABLE test_hypertable (
    time timestamp without time zone NOT NULL,
    id text,
    val integer,
    label text
);
select create_hypertable('test_hypertable', 'time', 'label', 3);

Import the CSV above into both tables.
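For example, with psql from the directory containing the download (the file name and the presence of a header row are assumptions based on the attachment above; adjust as needed):

```sql
-- hedged sketch: assumes the attached file is named distinct_on_repro.csv
-- and includes a header row
\copy test_distributed from 'distinct_on_repro.csv' with (format csv, header true)
\copy test_hypertable from 'distinct_on_repro.csv' with (format csv, header true)
```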

You can observe the different results with these queries:

select distinct on (id) id, val, time from test_distributed
where id in ('a')
order by id asc, time desc;

select distinct on (id) id, val, time from test_hypertable
where id in ('a')
order by id asc, time desc;

select id, val, time from test_distributed
where id in ('a')
order by id asc, time desc limit 1;

select id, val, time from test_hypertable
where id in ('a')
order by id asc, time desc limit 1;
tsdb=> select distinct on (id) id, val, time from test_distributed
where id in ('a')
order by id asc, time desc;
 id | val  |        time
----+------+---------------------
 a  | 7524 | 2021-10-13 03:10:24
(1 row)


tsdb=> select distinct on (id) id, val, time from test_hypertable
where id in ('a')
order by id asc, time desc;
 id | val  |        time
----+------+---------------------
 a  | 2198 | 2021-11-05 23:03:57
(1 row)


tsdb=> select id, val, time from test_distributed
where id in ('a')
order by id asc, time desc limit 1;
 id | val  |        time
----+------+---------------------
 a  | 2198 | 2021-11-05 23:03:57
(1 row)


tsdb=> select id, val, time from test_hypertable
where id in ('a')
order by id asc, time desc limit 1;
 id | val  |        time
----+------+---------------------
 a  | 2198 | 2021-11-05 23:03:57
(1 row)

Expected behavior
These queries would all return the same row.
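Since the filter restricts id to a single value, the DISTINCT ON form and the LIMIT 1 form should pick the same row: the one with the latest time. A third equivalent phrasing (a sketch for cross-checking, not from the report) makes the "latest row per id" selection explicit with a window function:

```sql
-- cross-check sketch: pick the latest row per id explicitly
select id, val, time
from (
    select id, val, time,
           row_number() over (partition by id order by time desc) as rn
    from test_distributed
    where id in ('a')
) t
where rn = 1;
```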

Actual behavior
A different row is returned for the DISTINCT ON query against the distributed hypertable.

@akuzm
Member

akuzm commented Nov 9, 2021

test=# select distinct on (id) id, val, time from test_distributed
where id in ('a')
order by id asc, time desc;
ERROR:  [data1]: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Time: 249,513 ms

The query with DISTINCT on a distributed hypertable errors out for me on TS 2.5 & PG 13.4, maybe because I'm using a debug version.

@akuzm akuzm self-assigned this Nov 9, 2021
@akuzm
Member

akuzm commented Nov 9, 2021

One of the conditions seems to be that the planner is able to prove that id is constant, and so does not actually unique-ify on id. If you add more values to the IN list, or defeat the constant-folding with a no-op such as concatenating an empty string, the error doesn't reproduce:

# select distinct on (id) id, val, time from test_distributed
where id || '' in ('a')
order by id asc, time desc;
 id │ val  │        time         
────┼──────┼─────────────────────
 a  │ 2198 │ 2021-11-05 23:03:57
(1 row)
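Per the comment above, widening the IN list is the other way to keep the planner from treating id as a constant, so this variant (a sketch, assuming a second id value exists or is simply ignored by the filter) should also avoid the error:

```sql
-- sketch: a second value in the IN list prevents the planner from
-- const-folding id, so the error does not reproduce either
select distinct on (id) id, val, time from test_distributed
where id in ('a', 'b')
order by id asc, time desc;
```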

@akuzm
Member

akuzm commented Nov 10, 2021

After working around the error message, we still don't always get the correct result:

test=# select distinct on (id) id, val, time from test_distributed
where id || '' in ('a')
order by id asc, time desc;
 id │ val  │        time         
────┼──────┼─────────────────────
 a  │ 4377 │ 2021-09-07 21:51:37
(1 row)

The plan is incorrect: DISTINCT ON is pushed down to the data nodes, but the ORDER BY is not, which is why it gives nondeterministic results.
https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
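Per the PostgreSQL semantics linked above, DISTINCT ON keeps the first row of each set of duplicates, and "first" is only well-defined when the ORDER BY makes it so. For the pushdown to be correct, the remote SQL would therefore also need to carry the ordering, roughly like this (a sketch of the required shape, not the SQL that TimescaleDB actually generates):

```sql
-- sketch: a correct pushed-down query would need the ORDER BY that
-- makes "first row per id" well-defined on the data node
SELECT DISTINCT ON (id) "time", id, val
FROM public.test_distributed
WHERE ((id || ''::text) = 'a'::text)
ORDER BY id ASC, "time" DESC;
```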

test=# explain (analyze, verbose, costs off)
select distinct on (id) id, val, time from test_distributed
where id || '' in ('a')
order by id asc, time desc;
                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 Unique (actual time=162.544..162.554 rows=1 loops=1)
   Output: test_distributed.id, test_distributed.val, test_distributed."time"
   ->  Sort (actual time=162.541..162.546 rows=2 loops=1)
         Output: test_distributed.id, test_distributed.val, test_distributed."time"
         Sort Key: test_distributed.id, test_distributed."time" DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Custom Scan (AsyncAppend) (actual time=162.513..162.520 rows=2 loops=1)
               Output: test_distributed.id, test_distributed.val, test_distributed."time"
               ->  Append (actual time=0.009..0.015 rows=2 loops=1)
                     ->  Custom Scan (DataNodeScan) on public.test_distributed test_distributed_1 (actual time=0.006..0.007 rows=1 loops=1)
                           Output: test_distributed_1.id, test_distributed_1.val, test_distributed_1."time"
                           Data node: data1
                            Chunks: _dist_hyper_2_107_chunk, _dist_hyper_2_108_chunk, _dist_hyper_2_109_chunk, _dist_hyper_2_110_chunk, _dist_hyper_2_113_chunk, _dist_hyper_2_114_chunk, _dist_hyper_2_117_chunk, _dist_hyper_2_119_chunk, _dist_hyper_2_120_chunk, _dist_hyper_2_125_chunk, _dist_hyper_2_127_chunk, _dist_hyper_2_130_chunk, _dist_hyper_2_131_chunk, _dist_hyper_2_133_chunk, _dist_hyper_2_134_chunk, _dist_hyper_2_137_chunk, _dist_hyper_2_139_chunk, _dist_hyper_2_141_chunk, _dist_hyper_2_142_chunk, _dist_hyper_2_150_chunk, _dist_hyper_2_153_chunk, _dist_hyper_2_154_chunk, _dist_hyper_2_155_chunk, _dist_hyper_2_156_chunk, _dist_hyper_2_157_chunk, _dist_hyper_2_158_chunk, _dist_hyper_2_159_chunk
                            Remote SQL: SELECT DISTINCT ON (id) "time", id, val FROM public.test_distributed WHERE _timescaledb_internal.chunks_in(public.test_distributed.*, ARRAY[54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]) AND (((id || ''::text) = 'a'::text))
                     ->  Custom Scan (DataNodeScan) on public.test_distributed test_distributed_2 (actual time=0.002..0.003 rows=1 loops=1)
                           Output: test_distributed_2.id, test_distributed_2.val, test_distributed_2."time"
                           Data node: data2
                            Chunks: _dist_hyper_2_111_chunk, _dist_hyper_2_112_chunk, _dist_hyper_2_115_chunk, _dist_hyper_2_116_chunk, _dist_hyper_2_118_chunk, _dist_hyper_2_121_chunk, _dist_hyper_2_122_chunk, _dist_hyper_2_123_chunk, _dist_hyper_2_124_chunk, _dist_hyper_2_126_chunk, _dist_hyper_2_128_chunk, _dist_hyper_2_129_chunk, _dist_hyper_2_132_chunk, _dist_hyper_2_135_chunk, _dist_hyper_2_136_chunk, _dist_hyper_2_138_chunk, _dist_hyper_2_140_chunk, _dist_hyper_2_143_chunk, _dist_hyper_2_144_chunk, _dist_hyper_2_145_chunk, _dist_hyper_2_146_chunk, _dist_hyper_2_147_chunk, _dist_hyper_2_148_chunk, _dist_hyper_2_149_chunk, _dist_hyper_2_151_chunk, _dist_hyper_2_152_chunk
                            Remote SQL: SELECT DISTINCT ON (id) "time", id, val FROM public.test_distributed WHERE _timescaledb_internal.chunks_in(public.test_distributed.*, ARRAY[54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]) AND (((id || ''::text) = 'a'::text))
 Planning Time: 1137.144 ms
 Execution Time: 166.321 ms
(21 rows)

akuzm added a commit to akuzm/timescaledb that referenced this issue Nov 16, 2021
Previously, we would push DISTINCT ON down to the data nodes even when
the pathkeys of the resulting paths on the data nodes were not
compatible with the given DISTINCT ON columns. This commit disables
pushdown when the sorting is not compatible.

Fixes timescale#3784
akuzm added a commit to akuzm/timescaledb that referenced this issue Nov 17, 2021
akuzm added a commit that referenced this issue Nov 17, 2021
duncan-tsdb pushed a commit that referenced this issue Dec 2, 2021