This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Use a query that postgresql optimises better for get_events_around #906

Merged: 5 commits into develop from markjh/faster_events_around on Jul 5, 2016

Conversation

@NegativeMjark (Contributor)

No description provided.

@erikjohnston (Member)

LGTM

@NegativeMjark (Contributor, Author)

@erikjohnston Having looked at how postgres handles the UNION ALL version, I'm less sure that it's actually doing what we want here. It turns out that postgres really wants you to format comparisons as (a,b) < (x,y) to use multi-column indexes.

I've switched everything to use lower_bound/upper_bound from streams.py and made those functions switch on the database engine to decide which form of SQL to use :(
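As a rough illustration of that switch (a sketch only: the names, signature, and clause text here are assumptions for demonstration, not the actual streams.py implementation):

```python
from collections import namedtuple

# Hypothetical token type standing in for the real stream token.
Token = namedtuple("Token", ["topological", "stream"])

def upper_bound(token, is_postgres):
    """Return a WHERE fragment selecting rows strictly before `token`."""
    if is_postgres:
        # Postgres matches a row-value comparison like (a,b) < (x,y)
        # directly against a multi-column btree index.
        return "(topological_ordering, stream_ordering) < (%d, %d)" % (
            token.topological, token.stream,
        )
    # Other engines get the expanded OR form instead.
    return (
        "(topological_ordering < %d"
        " OR (topological_ordering = %d AND stream_ordering < %d))"
        % (token.topological, token.topological, token.stream)
    )
```

Both branches express the same predicate; only the spelling differs per engine.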

@erikjohnston (Member)

Having looked at how postgres handles the UNION ALL version I'm less sure that it's actually doing what we want here.

Can you elaborate?

  token.topological, "topological_ordering",
- token.stream, "stream_ordering",
+ token.stream, inclusive, "stream_ordering",
@erikjohnston (Member)

Is this not vulnerable to SQL injection?

@NegativeMjark (Contributor, Author)

It would be if inclusive was ever specified by the user.

@NegativeMjark (Contributor, Author)

I've made inclusive a boolean to make it a bit clearer. d44d11d
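The shape of that fix can be sketched as follows (a hypothetical helper, not the actual Synapse code): with `inclusive` as a bool, only one of two fixed operator strings can ever reach the SQL text, and the value itself travels as a bind parameter.

```python
def stream_bound(column, value, inclusive):
    """Build a bound clause; `inclusive` must be a bool, never user text.

    Only "<=" or "<" can appear in the generated SQL, and `value` is
    passed as a bind parameter, so no user-controlled string is ever
    interpolated into the query.
    """
    if not isinstance(inclusive, bool):
        raise TypeError("inclusive must be a bool")
    op = "<=" if inclusive else "<"
    return "%s %s ?" % (column, op), (value,)
```

For example, `stream_bound("stream_ordering", 42, False)` yields `("stream_ordering < ?", (42,))`.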

@NegativeMjark (Contributor, Author) commented Jul 5, 2016

Can you elaborate?

When I ran that query in a test harness it resulted in a scan over half the index.

DROP TABLE IF EXISTS t;
CREATE TABLE t (c INT, a INT, b INT);

INSERT INTO t (c,a,b) SELECT 0, data.*, data.* FROM
    (SELECT * FROM generate_series(1,1000000)) AS data;

CREATE INDEX t_c_a_b ON t USING btree (c,a,b);

ANALYZE t;

EXPLAIN ANALYSE
        SELECT a,b FROM t WHERE c = 0 AND a < 500000
    UNION ALL
        SELECT a,b FROM t WHERE c = 0 AND a = 500000 AND b < 100
    ORDER BY a DESC, b DESC LIMIT 1;

Resulted in:

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=24416.59..24416.59 rows=1 width=8) (actual time=123.093..123.094 rows=1 loops=1)
   ->  Sort  (cost=24416.59..25663.78 rows=498876 width=8) (actual time=123.093..123.093 rows=1 loops=1)
         Sort Key: t.a DESC, t.b DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Append  (cost=0.00..21922.21 rows=498876 width=8) (actual time=0.009..91.581 rows=499999 loops=1)
               ->  Seq Scan on t  (cost=0.00..16925.00 rows=498875 width=8) (actual time=0.009..71.667 rows=499999 loops=1)
                     Filter: (a < 500000)
                     Rows Removed by Filter: 500001
               ->  Index Only Scan using t_a_b on t t_1  (cost=0.42..8.45 rows=1 width=8) (actual time=0.009..0.009 rows=0 loops=1)
                     Index Cond: ((a = 500000) AND (b < 100))
                     Heap Fetches: 0
 Planning time: 0.050 ms
 Execution time: 123.109 ms
(13 rows)

@erikjohnston (Member)

This presumably still has terrible performance on SQLite?

@NegativeMjark (Contributor, Author)

This presumably still has terrible performance on SQLite?

Yes. Interestingly SQLite does a reasonable job of optimising the UNION ALL version of the query.

sqlite> EXPLAIN query plan  
   ...> SELECT a,b FROM t WHERE c = 0 AND (a < 50000 OR (a = 50000 AND b < 100))
   ...> ORDER BY a DESC, b DESC LIMIT 1;
0|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=?)
sqlite> EXPLAIN query plan
   ...>         SELECT a,b FROM t WHERE c = 0 AND a < 50000
   ...>     UNION ALL
   ...>         SELECT a,b FROM t WHERE c = 0 AND a = 50000 AND b < 100
   ...>     ORDER BY a DESC, b DESC LIMIT 1;
1|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a<?)
2|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a=? AND b<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)

@NegativeMjark (Contributor, Author)

So for a sqlite3 database only the UNION ALL form has the desired performance :(

sqlite> EXPLAIN query plan SELECT a,b FROM t WHERE c = 0 AND (a < 500000 OR (a = 500000 AND b < 100)) ORDER BY a DESC, b DESC LIMIT 1;
0|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=?)

sqlite> EXPLAIN query plan SELECT a,b FROM t WHERE (c = 0 AND a < 500000) OR (c = 0 AND a = 500000 AND b < 100) ORDER BY a DESC, b DESC LIMIT 1; 
0|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a<?)
0|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a=? AND b<?)
0|0|0|USE TEMP B-TREE FOR ORDER BY

sqlite> EXPLAIN query plan SELECT a,b FROM t WHERE (c = 0 AND a < 500000) UNION ALL SELECT a,b FROM t WHERE (c = 0 AND a = 500000 AND b < 100) ORDER BY a DESC, b DESC LIMIT 1;
1|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a<?)
2|0|0|SEARCH TABLE t USING COVERING INDEX t_c_a_b (c=? AND a=? AND b<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
In [47]: %time f("SELECT a,b FROM t WHERE c = 0 AND (a < 500000 OR (a = 500000 AND b < 100)) ORDER BY a DESC, b DESC LIMIT 1;", ())
CPU times: user 61.6 ms, sys: 7 ms, total: 68.6 ms
Wall time: 68 ms
Out[47]: [(499999, 499999)]

In [48]: %time f("SELECT a,b FROM t WHERE (c = 0 AND a < 500000) OR (c = 0 AND a = 500000 AND b < 100) ORDER BY a DESC, b DESC LIMIT 1;", ())
CPU times: user 279 ms, sys: 8.64 ms, total: 287 ms
Wall time: 287 ms
Out[48]: [(499999, 499999)]

In [49]: %time f("SELECT a,b FROM t WHERE (c = 0 AND a < 500000) UNION ALL SELECT a,b FROM t WHERE (c = 0 AND a = 500000 AND b < 100) ORDER BY a DESC, b DESC LIMIT 1;", ())
CPU times: user 368 us, sys: 75 us, total: 443 us
Wall time: 268 us
Out[49]: [(499999, 499999)]
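A quick way to sanity-check that the three formulations agree on their answer, scaled down to run anywhere (this uses SQLite's row-value comparison for the third form, which assumes SQLite >= 3.15; the row-value spelling is the one postgres prefers per the discussion above):

```python
import sqlite3

# Miniature version of the test harness: c is constant, a and b mirror
# each other, and we look for the last row strictly before (500, 100).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INT, a INT, b INT)")
conn.executemany("INSERT INTO t VALUES (0, ?, ?)",
                 [(i, i) for i in range(1, 1001)])
conn.execute("CREATE INDEX t_c_a_b ON t (c, a, b)")

queries = {
    "or":    "SELECT a,b FROM t WHERE c = 0 AND (a < 500 OR (a = 500 AND b < 100))"
             " ORDER BY a DESC, b DESC LIMIT 1",
    "union": "SELECT a,b FROM t WHERE c = 0 AND a < 500"
             " UNION ALL SELECT a,b FROM t WHERE c = 0 AND a = 500 AND b < 100"
             " ORDER BY a DESC, b DESC LIMIT 1",
    "row":   "SELECT a,b FROM t WHERE c = 0 AND (a, b) < (500, 100)"
             " ORDER BY a DESC, b DESC LIMIT 1",
}
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
assert all(r == [(499, 499)] for r in results.values())
```

All three return the same row; only the query plans differ between engines.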

@NegativeMjark (Contributor, Author)

Complete test harness for postgresql, FWIW:

btree_desc.sql.txt

@NegativeMjark (Contributor, Author)

@erikjohnston PTAL

@erikjohnston (Member)

LGTM

@NegativeMjark NegativeMjark merged commit 04dee11 into develop Jul 5, 2016
@richvdh richvdh deleted the markjh/faster_events_around branch December 1, 2016 14:09