Very slow node parsing on updates (caused by PG's jit) #1045

cquest · 2020-01-02T04:44:36Z

On 2 worldwide rendering servers, I'm facing a very slow parsing of nodes during updates.
I've moved to postgresql 12 recently.

These servers are running 0.96 from the osmadmins ppa, with flatnodes and the whole PG database on SSD.

Here is a log extract:

Node stats: total(63541), max(7100611033) in 719s
Way stats: total(7330), max(760101527) in 15s
Relation stats: total(103), max(10525425) in 2s
...
Osm2pgsql took 767s overall

Node parsing takes 90% of the overall update time.
Am I the only one ?

On a third worldwide rendering server, I'm still running a very old 0.86 and the node parsing is way faster !

The text was updated successfully, but these errors were encountered:

pnorman · 2020-01-02T06:56:15Z

What is your osm2pgsql command line, and does it happen with recent versions?

cquest · 2020-01-02T06:59:59Z

I've checked the update logs we keep on each server which contains timing for each step (osmosis, osm2pgsql, tile expiration)
It seems to confirm the change since the PG upgrade from 10 to 12.

So... I've compared the postgresql.conf files and there's only minor differences and nothing that could explain the change.

arvindkartik · 2020-01-02T07:02:40Z

Slowdown potentially related to fix for #1014

I'd expect a slowdown because deletes are now done sequentially with the data insertion to avoid deadlocks. Before the deletes where done in a different thread via a different connection. I've down a quick measurement with a daily diff. Interestingly, I see quite a significant slowdown when fsync and synchronous commits are off but none when running postgres with fsync. I'll check nonetheless if there is room for optimization of the deletes.

cquest · 2020-01-02T07:06:38Z

The full osm2pgsql command line...

osm2pgsql diff.osc --number-processes=4 -a -C 64 -k -m -G -s -S ./osm2pgsql-choosen.style -d osm --flat-nodes /ssd/osm2pgsql/flat-nodes.raw --keep-coastlines --tag-transform-script ./config/activate-relation-type-waterway.lua --prefix=planet_osm

I'll do more tests on my workstation... the servers have enough troubles to test news things on them ;)
A quick test made yesterday with 0.96 and PG12 gave the same behaviour on my workstation.
I'll check with 1.2.1 and PG12, then with PG10...

cquest · 2020-01-02T09:47:20Z

Here are some test results, all done on the same machine (24 cores, nvme SSD) on a France extract with 1 daily diff:

osm2pgsql	PG	nodes	ways	rels
0.96	10	60s	33s	23s
1.2.1	10	11s	4s	6s
0.96	12	575s	43s	32s
1.2.1	12	276s	4s	6s

I've looked at postgres changelog, and maybe found something to look at.
PG12 introduced a new parameter (plan_cache_mode) which is said to "allow control over when generic plans are used for prepared statements".
Instead of the "auto" default, I set it to force_generic_plan, it was event slower, then to force_custom_plan and got the same result as auto/default.
So... it really seems linked and related to some internal change in PG 12 :(

lonvia · 2020-01-02T09:54:39Z

How did you upgrade form 10 to 12? Did you do a reimport or a database upgrade? Is this postgres 12.0 or 12.1? Did your tests use the same command line as above? If not do you mind running the same test against a standard import with the example lua to exclude that the issue is related to your specific database setup?

akara13 · 2020-01-02T10:08:04Z

Same problem here, seems to have something to do with changes introduced in PG 11, since problems with slow node parsing during updates were already reported on osm-dev mailing list after update from PG 10 to PG 11.

I was doing a full planet import for tile rendering in a fresh PG 12.1 instance, this import ran as expected and didn't show decreased node parsing speed as when compared to previous full imports.

Before we had osm2pgsql 0.9.4 and PG 9.6, daily updates took 2-3 hours, now they are up to more than 10 hours, due to slow node parsing.
Also tried the old osm2pgsql 0.9.4 for updates, still slow node parsing.
The parameters for osm2pgsql didn't change, using osm carto style and tag transforms.
The new PG 12.1 instance is using the same configuration as before, so seems to be related to some PG change introduced in PG 11.

cquest · 2020-01-02T10:12:40Z

I've just check what was different in my PG config files between 10 and 12 and I've found what causes the slowdown... JIT !

jit=on (by default)

just set it to

jit=off

And I got 11s for the node parsing, like with PG10 :)

cquest · 2020-01-02T10:18:58Z

How did you upgrade form 10 to 12? Did you do a reimport or a database upgrade? Is this postgres 12.0 or 12.1? Did your tests use the same command line as above? If not do you mind running the same test against a standard import with the example lua to exclude that the issue is related to your specific database setup?

On the rendering servers, I did the upgrade using pg_upgradecluster.
All the tests where done with a freshly imported extract file, exactly the same options for osm2pgsql in all cases...

But as you can see above, the cause it definitely not the upgrade, but the new jit feature of PG which has a very bad side effect on osm2pgsql prepared statements (I presume).

I'm checking in a few minutes on the rendering servers if jit=off solves the problem like on my workstation...

BINGO: confirmed :)

akara13 · 2020-01-02T11:10:57Z

Hm, I turned JIT off in our PG 12.1 instance, but couldn't see any positive impact on node parsing speed during updates, still at 0.1K/s.

cquest · 2020-01-02T15:18:06Z

Have you reloaded PG config ?

All my servers are ok since I changed+reloaded the config with jit=off
After searching PG bug mailing list, it looks that's quite a common issue.

akara13 · 2020-01-02T15:28:57Z

Sure reloaded the config.
Even restartet the server, just in case.

Anything else in PG configuration, that may affect update performance?
My config is:
jit = off
shared_buffers = 4GB
effective_cache_size = 48GB
work_mem = 128MB
maintenance_work_mem = 4GB
random_page_cost = 1.0
wal_buffers = 16MB
wal_level = minimal
max_wal_senders = 0

cquest · 2020-01-02T22:06:45Z

Make sure you ANALYZE the tables after a PG upgrade, this can really help the planner.

FYI, I've submit a bug to the postgres dev team... here is the thread on the bugs mailing list:

https://www.postgresql.org/message-id/[email protected]

lonvia · 2020-01-03T08:18:33Z

@cquest thanks for the follow up. I'm copying the query analysis over here:

osm=# PREPARE mark_ways_by_node(bigint) AS select id from 
planet_osm_ways WHERE nodes && ARRAY[$1];
PREPARE
osm=# explain analyze execute mark_ways_by_node(1836953770);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on planet_osm_ways  (cost=2468.37..305182.32 
rows=301467 width=8) (actual time=5.775..5.905 rows=2 loops=1)
    Recheck Cond: (nodes && '{1836953770}'::bigint[])
    Heap Blocks: exact=2
    ->  Bitmap Index Scan on planet_osm_ways_nodes (cost=0.00..2393.00 
rows=301467 width=0) (actual time=0.512..0.512 rows=2 loops=1)
          Index Cond: (nodes && '{1836953770}'::bigint[])
  Planning Time: 3.667 ms
  JIT:
    Functions: 4
    Options: Inlining false, Optimization false, Expressions true, 
Deforming true
    Timing: Generation 0.466 ms, Inlining 0.000 ms, Optimization 0.354 
ms, Emission 4.634 ms, Total 5.454 ms
  Execution Time: 30.393 ms
(11 rows)

My reading of this is that using prepared statements is not directly responsible for the slow down. The true culprit is the badly balanced way-node index. Even if we get rid of those and simply execute the SQL at the appropriate place, it still would try to use JIT when it shouldn't.

Two things we should do here:

Reduce the number of calls to mark_ways_by_node() by querying multiple nodes at once. This should be fairly easy to do and lead to a speed-up of updates in general.
Get rid of the way-node index and use a more efficient custom implementation instead. We've had some discussions about this in the past but so far none of the ideas was really practical. So this is really far off for now.

akara13 · 2020-01-03T09:17:36Z

As mentioned above, turning off jit doesn't improve node update performance on my system.
This is a fresh PG 12.1 installation with a full planet import.
When using the explain analyze above, my row estimates are even more off than in your example, so that the execution is even using two parallel workers:

osm=> PREPARE mark_ways_by_node(bigint) AS SELECT id FROM
planet_osm_ways WHERE nodes && ARRAY[$1];
PREPARE
osm=> EXPLAIN ANALYZE EXECUTE mark_ways_by_node(1836953770);
QUERY PLAN
---------------------------------------------------------------------------------------------
 Gather  (cost=26807.42..5395421.99 rows=3177861 width=8)
              (actual time=0.464..7.809 rows=2 loops=1)
   Output: id
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on public.planet_osm_ways
           (cost=25807.42..5076635.89 rows=1324109 width=8)
           (actual time=0.108..0.109 rows=1 loops=3)
         Output: id
         Recheck Cond: (planet_osm_ways.nodes && '{1836953770}'::bigint[])
         Heap Blocks: exact=2
         Worker 0: actual time=0.022..0.022 rows=0 loops=1
         Worker 1: actual time=0.022..0.022 rows=0 loops=1
         ->  Bitmap Index Scan on planet_osm_ways_nodes
                 (cost=0.00..25012.96 rows=3177861 width=0)
                 (actual time=0.266..0.267 rows=3 loops=1)
               Index Cond: (planet_osm_ways.nodes && '{1836953770}'::bigint[])
 Settings: effective_cache_size = '48GB', jit = 'off', random_page_cost = '1', work_mem = '128MB'
 Planning Time: 0.096 ms
 Execution Time: 7.851 ms
(15 rows)

This was with a daily update running in parallel.
After the update has finished, I will ANALYZE the tables and see if it helps.

cquest · 2020-01-03T11:10:18Z

My reading of this is that using prepared statements is not directly responsible for the slow down. The true culprit is the badly balanced way-node index. Even if we get rid of those and simply execute the SQL at the appropriate place, it still would try to use JIT when it shouldn't.

My understanding is that something is wrong at PG level: estimated plan cost.

In my EXPLAIN ANALYSIS (with or withour jit) rows returned by the index bitmap scan are estimated at 300k where we only got 2. That's a factor of 150 000 x !
If I understand the query, it returns the id of the ways connected to a single node. When would a node be connected to 300000 ways ?

The guess from statistics seems plain wrong 100% of the time. This is causing PG to take bad decisions in the query planning, and activate jit when it should not, etc... to me it looks like bad statistics maintained (or computed) on GIN index or something like that, but I don't know enough about PG internals to do more than a guess.

On @akara13 EXPLAIN, it's even worse (3.1M, 10x more) causing the planner to launch a parallel worker !
Launching a worker has a cost, and in this case, this may explain the slowdown despite the jit=off.
I think we have different statistics because I did my test on an extract (France), and he may be doing his EXPLAIN on a full planet db. My advice for @akara13 : try to disable parallel workers...

About improving the update process... I got a couple of ideas after looking at osm2pgsql source code to try to understand the original problem. Better to open a separate #1046 issue for that ;)

akara13 · 2020-01-03T11:36:29Z

Thanks @cquest for the suggestions.
I think I will first try to ANALYZE the table after the currently running daily update has finished and see what EXPLAIN ANALYZE says afterwards since planet_osm_ways was last analyzed in December. That may explain the totally wrong row estimate.

lonvia · 2020-01-03T12:26:46Z

The estimates for the way-node index have always been widely off. It The index is so huge and badly balanced that postgres cannot make sense of it. It just never mattered because the query is simple enough that postgres always figured that it should use the index as opposed to a sequential scan, which were the only choices then. We are running now into trouble only because postgres tries to be even more clever and use such advanced features as workers and jit.

There might be a way to influence the statistics. Newer versions of postgres have more advanced tools for that, see Postgresql Extended Statistics.

akara13 · 2020-01-03T13:30:02Z

I have analyzed planet_osm_ways, row estimate is still ridicously high at 3.1M.
With jit = 'off', max_parallel_workers_per_gather = '0' explain analyze now gives an execution time of 0.079 ms.
I've added env PGOPTIONS="-c jit=off -c max_parallel_workers_per_gather=0" to the osm2pgsql call to make shure jit and parallel query are turned off for updates.
Thanks everybody for helping.

cquest · 2020-01-03T13:46:25Z

The way_nodes index is definitely a problem... so slow to build, larger than the data itself, bloating more than others... but hard to replace by something more efficient.

I've not looked yet at how imposm deals with that.

rustprooflabs · 2020-01-13T00:46:39Z

The row estimates being off is definitely an issue. There was a simple patch proposed on the Postgres bugs list, I was able to test it today and found it does wonders for the estimated row counts.

https://www.postgresql-archive.org/BUG-16183-PREPARED-STATEMENT-slowed-down-by-jit-tp6117538p6118920.html

I tested on a much smaller data set of just the Colorado region, the estimates were showing >10,000 when actual was 1. With the patch applied it estimates 68. Assuming the improvements in estimates scale to the larger data set this should help the planner choose the correct path regarding JIT / parallel query.

They are known to have problems with the int arrays. See osm2pgsql-dev#1045

lonvia · 2020-07-27T16:22:53Z

#1256 disables JIT and parallel workers for the connections where intarray indexes are used. This should fix the slowness. We can't really require patches to Postgres for running osm2pgsql.

cquest changed the title ~~Very slow node parsing on updates ?~~ Very slow node parsing on updates (caused by PG's jit) Jan 2, 2020

lonvia mentioned this issue Jan 11, 2020

[EXPERIMENTAL] Reduced way node index #1058

Closed

lonvia mentioned this issue Feb 11, 2020

Import is slow on Postgresql 12 and/or Postgis 3.0 osm-search/Nominatim#1677

Closed

lonvia mentioned this issue Mar 21, 2020

New OpenStreetMap Carto release, v5.0.0 openstreetmap/chef#285

Closed

d3mon187 mentioned this issue Apr 14, 2020

Tile Generation - TTL Exceeded thesocialdev/kartodock#4

Closed

lonvia mentioned this issue Jun 28, 2020

Disable Postgresql jit and parallel processing for osm2pgsql updates osm-search/Nominatim#1852

Merged

joto added a commit to joto/osm2pgsql that referenced this issue Jul 26, 2020

Disable PostgreSQL JIT and parallel workers in middle

0e47aab

They are known to have problems with the int arrays. See osm2pgsql-dev#1045

joto mentioned this issue Jul 26, 2020

Disable PostgreSQL JIT and parallel workers in middle #1256

Merged

joto added a commit to joto/osm2pgsql that referenced this issue Jul 26, 2020

Disable PostgreSQL JIT and parallel workers in middle

ad45b13

They are known to have problems with the int arrays. See osm2pgsql-dev#1045

lonvia closed this as completed Jul 27, 2020

mmd-osm mentioned this issue Jan 9, 2021

Multiple VS/Postgres/Postgis versions on Appveyor #1391

Closed

dbahrdt mentioned this issue Feb 4, 2022

Check if postgres newer than 10 is still slow fmi-alg/openstreetmap-tile-server#6

Open

lonvia mentioned this issue Aug 26, 2023

Problem with query to get changed parent objects #2056

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very slow node parsing on updates (caused by PG's jit) #1045

Very slow node parsing on updates (caused by PG's jit) #1045

cquest commented Jan 2, 2020 •

edited

Loading

pnorman commented Jan 2, 2020

cquest commented Jan 2, 2020

arvindkartik commented Jan 2, 2020

cquest commented Jan 2, 2020

cquest commented Jan 2, 2020

lonvia commented Jan 2, 2020

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

cquest commented Jan 2, 2020 •

edited

Loading

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

lonvia commented Jan 3, 2020

akara13 commented Jan 3, 2020

cquest commented Jan 3, 2020 •

edited

Loading

akara13 commented Jan 3, 2020 •

edited

Loading

lonvia commented Jan 3, 2020

akara13 commented Jan 3, 2020

cquest commented Jan 3, 2020

rustprooflabs commented Jan 13, 2020 •

edited

Loading

lonvia commented Jul 27, 2020

Very slow node parsing on updates (caused by PG's jit) #1045

Very slow node parsing on updates (caused by PG's jit) #1045

Comments

cquest commented Jan 2, 2020 • edited Loading

pnorman commented Jan 2, 2020

cquest commented Jan 2, 2020

arvindkartik commented Jan 2, 2020

cquest commented Jan 2, 2020

cquest commented Jan 2, 2020

lonvia commented Jan 2, 2020

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

cquest commented Jan 2, 2020 • edited Loading

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

akara13 commented Jan 2, 2020

cquest commented Jan 2, 2020

lonvia commented Jan 3, 2020

akara13 commented Jan 3, 2020

cquest commented Jan 3, 2020 • edited Loading

akara13 commented Jan 3, 2020 • edited Loading

lonvia commented Jan 3, 2020

akara13 commented Jan 3, 2020

cquest commented Jan 3, 2020

rustprooflabs commented Jan 13, 2020 • edited Loading

lonvia commented Jul 27, 2020

cquest commented Jan 2, 2020 •

edited

Loading

cquest commented Jan 2, 2020 •

edited

Loading

cquest commented Jan 3, 2020 •

edited

Loading

akara13 commented Jan 3, 2020 •

edited

Loading

rustprooflabs commented Jan 13, 2020 •

edited

Loading