
Improve performance of event_thread_relation background update for large databases #11375

Closed
bradtgmurray opened this issue Nov 17, 2021 · 11 comments
Labels
A-Performance Performance, both client-facing and admin-facing T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@bradtgmurray
Contributor

Description

Been debugging some performance issues on our RDS database and came across this ugly chart.
[chart omitted]

pg_stat_statements says it's this query here:
SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$iC2wL1TGv8qPaZXR8-KMSNuKxxgLLkhru3HeUeetEPI' AND event_relations.event_id IS NULL ORDER BY event_id LIMIT 100

This is the top query on our DB by a factor of 3, as it needs to do a surprisingly slow and large scan of the index on event_relations to check whether any relations exist for the events we'd like to process.

Limit  (cost=1.12..90.44 rows=100 width=1064) (actual time=2295.323..2295.979 rows=100 loops=1)
  ->  Merge Anti Join  (cost=1.12..20178406.57 rows=22592023 width=1064) (actual time=2295.322..2295.968 rows=100 loops=1)
        Merge Cond: (event_json.event_id = event_relations.event_id)
        ->  Index Scan using event_json_event_id_key on event_json  (cost=0.57..19154088.46 rows=24493063 width=1064) (actual time=0.022..0.630 rows=108 loops=1)
              Index Cond: (event_id > '$iC2wL1TGv8qPaZXR8-KMSNuKxxgLLkhru3HeUeetEPI'::text)
        ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..926766.22 rows=6923535 width=44) (actual time=0.003..1908.093 rows=5003206 loops=1)
              Heap Fetches: 778015
Planning Time: 0.153 ms
Execution Time: 2296.006 ms

2 seconds to scan through the whole event_relations_id index looking for entries that don't exist.

Note that our database is quite large: our event_json table alone is about 145 GB, we're only up to the $i events in this background update, and we already have 7 million rows in our event_relations table.

It feels like the batch size calculation here is actually hurting us: https://github.com/matrix-org/synapse/blob/develop/synapse/storage/background_updates.py#L274

The query takes so long (2s+ against a 100ms target) that we only ever do the minimum batch size (100 entries). However, the nature of the query means we end up scanning the same index over and over again to find pretty small sets of entries.

As a test, running the same query with LIMIT 1000 instead of LIMIT 100 only takes half a second more.

Limit  (cost=1.12..894.30 rows=1000 width=1064) (actual time=2291.379..2778.058 rows=1000 loops=1)
  ->  Merge Anti Join  (cost=1.12..20178846.86 rows=22592349 width=1064) (actual time=2291.378..2777.892 rows=1000 loops=1)
        Merge Cond: (event_json.event_id = event_relations.event_id)
        ->  Index Scan using event_json_event_id_key on event_json  (cost=0.57..19154443.67 rows=24493419 width=1064) (actual time=0.011..485.704 rows=1088 loops=1)
              Index Cond: (event_id > '$iC2wL1TGv8qPaZXR8-KMSNuKxxgLLkhru3HeUeetEPI'::text)
        ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..926849.84 rows=6923643 width=44) (actual time=0.003..1904.018 rows=5003394 loops=1)
              Heap Fetches: 778134
Planning Time: 0.140 ms
Execution Time: 2778.188 ms

I'm a little scared to bump the minimum batch size across the board, as I'm not sure what impact that will have on different background updates.

Is there a way to do this work in a smarter way that doesn't hit the database so hard?

Should we have a way of tuning batch sizes on a per background update job basis so we avoid badly performing cases like this?

Version information

  • Homeserver:
    matrix.beeper.com

  • Version:
    Lightly modified version of v1.46

  • Install method:
    Rolling our own Docker container

  • Platform:
    Kubernetes on AWS + RDS for the database

@richvdh
Member

richvdh commented Nov 17, 2021

As noted in chat, there are various related issues here: #11261, #11260, #11259.

A high number of "Heap Fetches" sounds like your event_relations might be in need of a VACUUM (or that you've got a very old transaction holding onto old row versions, or a secondary database replica which is lagging). It might be worth trying a VACUUM VERBOSE event_relations;.

https://blog.makandra.com/2018/11/investigating-slow-postgres-index-only-scans/ has quite a lot of detail on this.
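
A quick way to check whether autovacuum is keeping up with that table (just a sketch using the standard pg_stat_user_tables view; nothing Synapse-specific here):

-- Check when event_relations was last (auto)vacuumed/analyzed and how many
-- dead tuples it has accumulated.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'event_relations';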

@richvdh
Member

richvdh commented Nov 17, 2021

It's also worth noting that this update has been running on matrix.org for three weeks and is still well under halfway through the namespace (it's up to $N - note that A-Z comes before a-z); however the query plan there looks different to yours, implying a different bottleneck:

matrix=> explain SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY' AND event_relations.event_id IS NULL ORDER BY event_id LIMIT 515;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.27..1567.66 rows=515 width=1128)
   ->  Nested Loop Anti Join  (cost=1.27..1452399695.84 rows=477521350 width=1128)
         ->  Index Scan using evjson_uniq_idx on event_json  (cost=0.70..313557813.19 rows=486852034 width=1128)
               Index Cond: (event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY'::text)
         ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..2.34 rows=1 width=42)
               Index Cond: (event_id = event_json.event_id)
(6 rows)

Still, I wonder if it would end up being quicker for us to select all rows in event_json (regardless of their presence in event_relations), and upsert into event_relations for those with an io.element.thread relationship.
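
Very roughly, that might look something like the following (a sketch only: the JSON paths assume the standard m.relates_to event format, batching by event_id range is omitted, and in practice Synapse does this extraction in Python rather than in SQL):

-- Sketch: populate event_relations for thread relations straight from
-- event_json, skipping the anti-join entirely. Assumes the standard
-- m.relates_to layout inside the event JSON.
INSERT INTO event_relations (event_id, relates_to_id, relation_type, aggregation_key)
SELECT event_id,
       event_json.json::jsonb #>> '{content,m.relates_to,event_id}',
       event_json.json::jsonb #>> '{content,m.relates_to,rel_type}',
       event_json.json::jsonb #>> '{content,m.relates_to,key}'
FROM event_json
WHERE event_json.json::jsonb #>> '{content,m.relates_to,rel_type}' = 'io.element.thread'
ON CONFLICT DO NOTHING;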

@bradtgmurray
Contributor Author

A high number of "Heap Fetches" sounds like your event_relations might be in need of a VACUUM (or that you've got a very old transaction holding onto old row versions, or a secondary database replica which is lagging). It might be worth trying a VACUUM VERBOSE event_relations;.

https://blog.makandra.com/2018/11/investigating-slow-postgres-index-only-scans/ has quite a lot of detail on this.

Amazing, thank you! I guess the new rows are coming in too fast for our auto VACUUM to keep up.

[screenshot omitted]

Limit  (cost=1.12..87.14 rows=100 width=1064) (actual time=1581.591..1581.741 rows=100 loops=1)
  ->  Merge Anti Join  (cost=1.12..19153593.05 rows=22267649 width=1064) (actual time=1581.590..1581.731 rows=100 loops=1)
        Merge Cond: (event_json.event_id = event_relations.event_id)
        ->  Index Scan using event_json_event_id_key on event_json  (cost=0.57..18895339.00 rows=24141663 width=1064) (actual time=0.011..0.122 rows=109 loops=1)
              Index Cond: (event_id > '$i_c9f-SUHoqAN4T-BhUEv78FywJpHXZmmeV05NoG7eM'::text)
        ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..161825.59 rows=6933669 width=44) (actual time=0.004..1198.679 rows=5049281 loops=1)
              Heap Fetches: 15406
Planning Time: 0.139 ms
Execution Time: 1581.766 ms

Still ~1.5s, but speedier than before.

@richvdh
Member

richvdh commented Nov 17, 2021

hrm. It's still scanning the whole of event_relations_id, which it shouldn't be. Possibly it needs an ANALYZE?

@bradtgmurray
Contributor Author

ANALYZE didn't resolve it, still playing around. Temporarily disabling merge joins made it use the much faster plan, but that doesn't feel like a real solution. What is our postgres doing?
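
For anyone following along, disabling merge joins for a session looks roughly like this (diagnostic only, not a fix; the starting event_id and LIMIT are just the ones from the plan above):

-- Steer the planner away from a merge join for this session only, then re-run
-- the background update's query to compare plans.
SET enable_mergejoin = off;
EXPLAIN ANALYZE
SELECT event_id, json FROM event_json
LEFT JOIN event_relations USING (event_id)
WHERE event_id > '$i_c9f-SUHoqAN4T-BhUEv78FywJpHXZmmeV05NoG7eM'
  AND event_relations.event_id IS NULL
ORDER BY event_id LIMIT 100;
RESET enable_mergejoin;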

@DMRobertson DMRobertson added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Performance Performance, both client-facing and admin-facing labels Nov 18, 2021
@richvdh
Member

richvdh commented Nov 18, 2021

What is our postgres doing?

Good question!

I think the next step here is to look at the stats Postgres is using to plan queries on event_relations, to see why it is scanning the whole index. Could you do:

SELECT attname, inherited, n_distinct, correlation
FROM pg_stats 
WHERE tablename = 'event_relations';

If you need a quick fix to get your server responding again, you can play around with the constants in synapse.storage.background_updates.BackgroundUpdater. You can also do that while Synapse is running via the manhole, for example:

>>> hs.get_datastore().db_pool.updates.MINIMUM_BACKGROUND_BATCH_SIZE
100
>>> hs.get_datastore().db_pool.updates.MINIMUM_BACKGROUND_BATCH_SIZE=1

@richvdh
Member

richvdh commented Nov 23, 2021

#11391 replaces this update with one that will hopefully be more efficient, so I think this will probably solve itself in the next release.

@bradtgmurray
Contributor Author

SELECT attname, inherited, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'event_relations';
     attname     | inherited | n_distinct | correlation
-----------------+-----------+------------+---------------
 event_id        | false     |         -1 | -0.014066256
 relates_to_id   | false     |     182593 | 0.0046101464
 relation_type   | false     |          3 | 0.6919749
 aggregation_key | false     |       3820 | -0.018588943

@richvdh
Member

richvdh commented Nov 26, 2021

Thanks @bradtgmurray.

I've got to say, this has us completely mystified. I've actually tried this on my own homeserver (sw1v.org) and I see much the same results as you. (That server is on postgres 11, fyi).

For comparison, statistics for that table on matrix.org:

matrix=> SELECT attname, inherited, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'event_relations';
     attname     | inherited | n_distinct  | correlation 
-----------------+-----------+-------------+-------------
 event_id        | f         |          -1 |    0.277326
 relates_to_id   | f         | 2.05968e+06 |    0.268559
 relation_type   | f         |           3 |     0.29254
 aggregation_key | f         |         746 |  0.00632064

the correlation is higher on matrix.org, but I don't think that's particularly important for determining whether to do a merge join or a nested loop.

I also noticed that if I increase the LIMIT high enough (to 20000 or so), matrix.org also does a merge join. I therefore wondered if the difference might be about the relative size of the limit compared to the number of rows in the tables; however, even with LIMIT 1, sw1v.org still does a merge join, and matrix.org's tables are rather less than 20000 times bigger than sw1v.org's, so I don't think that explains it.


Comparing nested loop vs merge join query plan costs on matrix.org:

matrix=>  explain SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY' AND event_relations.event_id IS NULL ORDER BY event_id limit 1;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.27..4.35 rows=1 width=1128)
   ->  Nested Loop Anti Join  (cost=1.27..1480661697.22 rows=481205597 width=1128)
         ->  Index Scan using evjson_uniq_idx on event_json  (cost=0.70..316274965.99 rows=490709467 width=1128)
               Index Cond: (event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY'::text)
         ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..2.37 rows=1 width=42)
               Index Cond: (event_id = event_json.event_id)
(6 rows)

matrix=> set enable_nestloop=off;
SET
matrix=> explain SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY' AND event_relations.event_id IS NULL ORDER BY event_id limit 1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=28388.89..28389.56 rows=1 width=1128)
   ->  Merge Anti Join  (cost=28388.89..321389619.91 rows=481206018 width=1128)
         Merge Cond: (event_json.event_id = event_relations.event_id)
         ->  Index Scan using evjson_uniq_idx on event_json  (cost=0.70..316275273.42 rows=490709900 width=1128)
               Index Cond: (event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY'::text)
         ->  Index Only Scan using event_relations_id on event_relations  (cost=0.56..3672652.19 rows=47959820 width=42)
(6 rows)

And on sw1v.org:

synapse=# SET enable_mergejoin=off;
SET
synapse=# explain SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY' AND event_relations.event_id IS NULL ORDER BY event_id limit 1;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.98..6.11 rows=1 width=1067)
   ->  Nested Loop Anti Join  (cost=0.98..6947000.29 rows=1353831 width=1067)
         ->  Index Scan using event_json_event_id_key on event_json  (cost=0.56..2946092.40 rows=1441996 width=1067)
               Index Cond: (event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY'::text)
         ->  Index Only Scan using event_relations_id on event_relations  (cost=0.42..2.77 rows=1 width=42)
               Index Cond: (event_id = event_json.event_id)
(6 rows)

synapse=# SET enable_mergejoin=on;
SET
synapse=# explain SELECT event_id, json FROM event_json LEFT JOIN event_relations USING (event_id) WHERE event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY' AND event_relations.event_id IS NULL ORDER BY event_id limit 1;
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.98..3.42 rows=1 width=1067)
   ->  Merge Anti Join  (cost=0.98..3306039.07 rows=1353833 width=1067)
         Merge Cond: (event_json.event_id = event_relations.event_id)
         ->  Index Scan using event_json_event_id_key on event_json  (cost=0.56..2946100.35 rows=1441998 width=1067)
               Index Cond: (event_id > '$N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY'::text)
         ->  Index Only Scan using event_relations_id on event_relations  (cost=0.42..354732.35 rows=292746 width=42)
(6 rows)

The principal difference here appears to be the large "start-up cost" for the merge on matrix.org (28388.89). I don't really know what is causing this, though it's worth noting that the scan on event_relations appears to be scanning the entire index, rather than starting at $N98KdM_aW5GtgfdiyTKvyouvTyiFW6CSSdsqZZx1KTY as I think it should, so my guess would be that it's an estimated cost for the number of rows it reads and discards from event_relations_id. However, I have no idea why this does not apply on sw1v.org.


I'm sorry not to have better answers on this. It might be worth reaching out on the "pgsql-performance" mailing list (https://www.postgresql.org/list/pgsql-performance/2021-11/) asking if they have any ideas why postgres is favouring a merge join rather than a nested loop, despite the nested loop being much quicker.
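
If posting there, it may also be worth comparing the planner cost settings between the two servers first (just a guess at which knobs are most likely to shift the merge-join vs nested-loop decision, not a confirmed cause):

-- Planner settings most likely to influence the join choice (a guess, not a diagnosis).
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('random_page_cost', 'seq_page_cost', 'effective_cache_size',
               'work_mem', 'cpu_tuple_cost', 'cpu_index_tuple_cost');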

@bradtgmurray
Contributor Author

Still investigating this one and trying to get some outside help.

This may be a clue here (from pgmustard.com):

[pgmustard screenshot omitted]

For some reason it thinks we're going to get a large number of rows, even though the join is largely 1 to 1 and we have a LIMIT 100 on the outside.

@bradtgmurray
Contributor Author

I think we can probably close this out at this point. The backfill has completed, and the issue seems to come down to some kind of weird difference in database configuration between our DB and matrix.org's, but it's unlikely we'll get to the bottom of it.
