v4.12.0 has unusable performance #3280
#2640 is the PR that appears to have introduced the problems. |
Interesting thing is that on z16+ (where I usually test things) I have not seen so visible problems - it looked a bit suspicious, but not anything clear enough to claim performance problems. |
If reverting #2640 indeed fixes the performance issues i would suggest to do that and review the change regarding what causes this without pressure. There is also #3279 which i would assume to be caused by the same change so a more thorough review overall might not be such a bad idea. The documentation now written by @sommerluk (https://github.com/sommerluk/roadpatternrendering) should help with that. If a 20x performance loss is not recognized during the ~6 weeks this change was already in master this should however also be an indicator that some changes in procedure for better change evaluation and pre-release checks might not be such a bad idea. |
Sounds OK to me. Could you take care of the revert and the v4.12.1 release? As for performance: we have a dedicated ticket for this (#1287), yet nobody has come up with an idea for a performance testing framework. When you have 20 zoom levels and the problem is visible on only some of them, manual checking is not an option IMO. Do you have a suggestion or solution for this? |
Bad thing… Up to z9, the layer |
Our first priority now is to get master stable again. We can debug later, but for now we should revert entire #2640. |
@sommerluk - you could try the version i showed in #2640 (comment) or the CTE variant i sketched earlier in #2640 (comment). For just testing if the problem is indeed in that part you can also just test a static list of layer values. But you'd need to reliably check the performance in some way. And yes, i think it is better to do that without a hurry due to the need to prepare a bugfix release. @kocio-pl - No, i have not contributed or merged any changes for quite some time so i am not familiar with the more recent changes and would not be comfortable doing a revert & release. |
One thing that does stand out is also that this issue seems to be mainly on medium / low zoom. The more zoomed out you are, the less effective any spatial index is, and the more PostgreSQL will need to rely on its attribute indexes to prevent large scans over tables and many records. Looking at the list of default carto suggested indexes, it strikes me the current list of indexes is really conservative, with only an absolute minimal number of indexes defined considering the number of keys used in a style like this. That saves a lot of disk space, but might affect low zoom rendering performance. Maybe indexing a few more attributes / keys involved in the SQL, could really help boost performance. It currently seems that much of the rendering performance relies on the spatial index doing the main job, which is logical at zoomed in extent, but less so zoomed out. E.g. in a really bad hypothetical situation zoomed out to Z0-Z2, you might end up doing a full table scan on 300M records to only fetch 10 or so..., just because there is no attribute index defined that could help selecting those few records required by the SQL. |
On a general note, doing a full table scan on z0-z2 would be a negligible performance hit since these zoom levels only contain one single meta tile. On the other hand, having to keep a spatial index up to date affects every single database update, so that has to be kept in mind. In the concrete case at hand, as I wrote in openstreetmap/chef#168, there appears to be a full table scan on planet_osm_line going on on z10. Looking further into it, it becomes obvious that the |
Unless there's some kind of partial index (e.g. Slow rendering within reason doesn't matter for z0-z4, because there are a total of 33 metatiles to be rendered. There are 16k z10 metatiles. |
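The z10 figure can be checked with a little arithmetic. This is only a rough sketch: the 8x8 metatile size is an assumption (the thread does not state it), and `metatile_count` is a hypothetical helper name.

```python
def metatile_count(zoom: int, metatile_size: int = 8) -> int:
    """Number of metatiles needed to cover the whole world at one zoom level.

    Assumes square metatiles of metatile_size x metatile_size tiles
    (8 is an assumption here, not something stated in this thread).
    """
    tiles_per_side = 2 ** zoom
    # Ceiling division: a level smaller than one metatile still needs one.
    metatiles_per_side = -(-tiles_per_side // metatile_size)
    return metatiles_per_side ** 2


print(metatile_count(10))
```

Under that assumption, a slow query at z10 is paid for on every one of these metatiles, which is why a full table scan there matters far more than at the lowest zooms.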
I will not oppose a revert… |
I can not confirm the suspicion that this is not happening on zoom-levels 16+
|
Thanks for checking it, Sven! This proves that an automated testing framework is needed anyway. I was able to detect such a problem before v4.10.0 (see #3159 (comment)), but not this time. I think the tiles I was using were not that slow to render, but we still need something more solid than just my gut feeling. |
I could not reproduce this on z19. This was with a small database and old indexes. Perhaps the issue only shows up on databases of reasonable size? |
I did not change anything database related. I just called the single tile rendering script with both versions of the style. However this might be related to my hstore-only database layout. |
@pnorman Thanks. |
Does it mean we can release v4.12.1 now? Could anyone do the actual release? I'm a bit busy with other things lately. |
Current master works fine again. |
I think this is a bit too generic a statement. A sequential scan has its cost (and a considerable one) as well. You do not need very specific partial indexes to take advantage of indexing. In my experience, one important thing to look for is OR statements in your queries, and all the keys separated by them. E.g. look at the query below, which is quite typical for situations you may encounter in OSM, where due to tagging conflation and changes in tagging schemes, multiple keys may be needed to extract a specific thematic layer.
Now look at the image below, which shows the query execution plan with only the osm_man_made field indexed. Unsurprisingly, this ends up in a sequential scan, because the OR clauses force PostgreSQL to evaluate the entire table for the osm_disused_58_man_made and osm_abandoned_58_man_made keys as well, which is fastest in a sequential scan if no other indexes are available. This does, however, come at a cost (this is the entire Europe extract, by the way). Extracting just 612 records from a >165M record materialized view takes 165 seconds, so almost 3 minutes, using the sequential scan (on SSD storage).
Now watch what happens when I add indexes for the osm_disused_58_man_made and osm_abandoned_58_man_made fields as well: the time is now down to just 152 ms, so well under one second to extract the 612 records, with PostgreSQL using all three key indexes instead of a sequential scan.
Of course, there are costs to indexing all of these fields in terms of disk space and maintenance, especially in the environment with constant diff updates that openstreetmap-carto seems to operate in. So this is just an illustration that in specific cases it may still be beneficial to add indexes to solve a performance issue, although this needs to be seen in the context of a primarily spatially accessed database as well, which complicates the situation even more. Ultimately, though, the best thing to do is simply to test a specific query if it has a performance issue. Add the index(es) you think are relevant, and see how it does. |
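The effect described above can be reproduced in miniature. The following is only a sketch under loud assumptions: it uses SQLite from the Python standard library as a stand-in, not PostgreSQL, so the planner, plan wording and costs all differ; the table `features` and its column names are hypothetical and merely mimic the three keys mentioned.

```python
import sqlite3

# Hypothetical column names, loosely modelled on the three keys discussed.
KEYS = ["man_made", "disused_man_made", "abandoned_man_made"]

# One OR term per key, as in the kind of query described above.
QUERY = "SELECT id FROM features WHERE " + " OR ".join(
    f"{key} = 'tower'" for key in KEYS
)


def plan(conn):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (id INTEGER PRIMARY KEY, "
    + ", ".join(f"{key} TEXT" for key in KEYS)
    + ")"
)

# Only the first key is indexed: the other OR terms force a full scan.
conn.execute("CREATE INDEX idx_man_made ON features (man_made)")
plan_one_index = plan(conn)

# Index the remaining keys so every OR term can be answered from an index.
for key in KEYS[1:]:
    conn.execute(f"CREATE INDEX idx_{key} ON features ({key})")
plan_all_indexes = plan(conn)

print("one index:  ", plan_one_index)
print("all indexes:", plan_all_indexes)
```

On PostgreSQL the equivalent check would be to run the real query through EXPLAIN ANALYZE before and after creating the indexes, and to weigh the result against the disk space and update cost mentioned above.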
Okay, I’ve played around a little bit with the code, and honestly I do not understand the results. I’ve fixed #3279 in the code for unpaved rendering. Then I loaded a current Ivory Coast extract into the database and rendered it with Kosmtik. I measured the time until all tiles in the browser window were loaded. Rendering from z19 through to z0 gives these results:
The results are not very exact. Repeating the same steps, they can vary by up to 100%. Anyway, the real problem seems to start with z10 and higher. Given that the reports say the SQL is slower by a factor of 20 and that I see only a factor of 5 here at z10, I suppose that the SQL execution time is less important and the Mapnik rendering more important if the database is small (I have only Ivory Coast loaded), but that this will change when the database is big (whole planet)? Now I have tried to isolate what is so time-consuming. Up to z9, we use the roads-low-zoom SQL query. From z10 upwards, we use the roads-fill SQL query and the corresponding bridges and tunnels SQL queries. All of these (roads-low-zoom, roads-fill, bridges…) are caught by the same code in roads.mss. I’ve called fixed coordinates at z20 in Kosmtik. Then I switched directly to z10.
|
I created two sketches for possible solutions based on the 4.12.0 code: https://github.com/imagico/osm-carto-alternative-colors/tree/surface-fix1 The first led to a ~80 times performance improvement for the query alone on a small test database, the second was about 100 times faster. Note i did not test if this is functionally the same, i.e. if it leads to the same rendering results. It should, but nonetheless this should be tested. I have also not tested if there are other performance issues - this is just based on #3280 (comment) (which is pretty convincing) |
The statement you quoted is about z3 and z4 rendering. The rest of your comment is not really relevant to roads.
Diffs are |
Okay. At https://github.com/sommerluk/openstreetmap-carto/tree/unpaved20 is the code for unpaved rendering; it fixes #110, does not re-introduce #3279, and has imagico:surface-fix2 applied. It still feels slow on z10-z13, but faster on the higher zoom levels. How could I measure the performance in a more objective way? |
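Since single stopwatch measurements were reported above to vary by up to 100%, one simple way to get steadier numbers is to repeat the same action several times and compare the median rather than one run. A minimal sketch, assuming the action to measure can be wrapped in a Python callable; `time_runs` is a hypothetical helper, and the `sum(...)` call is only a placeholder for rendering one tile or running one query.

```python
import statistics
import time


def time_runs(action, repeats=5):
    """Run action() several times and summarise the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Placeholder workload; replace with the render call or query to compare.
print(time_runs(lambda: sum(range(100_000)), repeats=5))
```

Running the same tile through both versions of the style and comparing the medians gives a fairer comparison than a single pass, and the spread between min and max shows how noisy the setup is.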
Easiest way to benchmark queries (AFAIK) is to enable extended logging in postgres - as explained in: https://gis.stackexchange.com/questions/60448/making-mapnik-verbose You can also take specific queries from those logs and |
As an addition to this comment: |
Maybe |
Okay, I need some more help with SQL setup. I don’t get working Usually, I start PostGres with |
You need to run postgres as the dedicated user created for that (usually called
|
@imagico Thanks a lot! That works! The result of EXPLAIN ANALYSE for the query with fix2 from @imagico (bbox is copy&paste from the Postgres log from a Kosmtik run at z10) is:
|
The database of the above example had loaded Ivory Coast only. |
I think you got the wrong query, probably something Kosmtik is running as an advance test. You need a query from rendering an actual tile with a real bounding box and without a zero limit. Preferably from a tile where the query returns some data. And then you can compare the performance with and without the change at various zoom levels. This will not be quantitatively representative for a full database but it will tell you if there are any serious issues. |
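To get a real bounding box for a specific tile, the standard slippy-map tile arithmetic can be used. This is a sketch under assumptions: it presumes the database geometries are in Web Mercator (EPSG:3857), and `tile_bounds_3857` and `envelope_sql` are hypothetical helper names. `ST_MakeEnvelope` is the PostGIS function for building a rectangle from four coordinates and an SRID.

```python
import math

# Half the width of the Web Mercator world in metres (about 20037508.34).
HALF_WORLD = math.pi * 6378137.0


def tile_bounds_3857(z, x, y):
    """Return (minx, miny, maxx, maxy) of tile z/x/y in EPSG:3857 metres."""
    size = 2 * HALF_WORLD / 2 ** z
    minx = -HALF_WORLD + x * size
    maxy = HALF_WORLD - y * size  # tile rows count downwards from the north
    return (minx, maxy - size, minx + size, maxy)


def envelope_sql(z, x, y):
    """Format the tile bounds as a PostGIS ST_MakeEnvelope expression."""
    minx, miny, maxx, maxy = tile_bounds_3857(z, x, y)
    return f"ST_MakeEnvelope({minx}, {miny}, {maxx}, {maxy}, 3857)"


print(envelope_sql(10, 511, 511))
```

The resulting expression can be put in place of the bounding box of a logged query to test a tile that is known to contain data.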
Maybe my render_single_tile.py https://github.com/giggls/openstreetmap-carto-de/blob/master/scripts/render_single_tile.py script might be useful for you. I use this script in combination with Makefile targets to do manual tests when merging upstream changes. |
My makefile will probably also be useful for you then: |
I'm curious what the numbers are in your case? Are they comparable with #3355? |
Running "make test", which renders one single tile per zoom level in my setup, took 5m10s on both the master and unpaved20 branches (second pass, thus query cache filled). Whether there are full table scans or not should be more meaningful than rendering times anyway. |
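Checking for full table scans can be partly automated by searching the text output of EXPLAIN for sequential scan nodes. A minimal sketch, assuming the plan is available in PostgreSQL's default text format; `tables_with_seq_scan` is a hypothetical helper, and the sample plan is made up for illustration, not taken from this thread.

```python
import re

# Matches "Seq Scan on <table>" and "Parallel Seq Scan on <table>" plan nodes.
SEQ_SCAN = re.compile(r"(?:Parallel )?Seq Scan on (\S+)")


def tables_with_seq_scan(plan_text):
    """Return the sorted names of tables read by a sequential scan."""
    return sorted(set(SEQ_SCAN.findall(plan_text)))


# Made-up example plan, for illustration only.
SAMPLE_PLAN = """\
Sort  (cost=12.50..12.75 rows=100 width=64)
  ->  Seq Scan on planet_osm_line  (cost=0.00..9.00 rows=100 width=64)
        Filter: (highway IS NOT NULL)
  ->  Index Scan using planet_osm_roads_index on planet_osm_roads  (cost=0.29..8.30 rows=1 width=64)
"""

print(tables_with_seq_scan(SAMPLE_PLAN))
```

A sequential scan on a small table is harmless, so the output is only a starting point for deciding which queries to look at more closely.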
v4.12.0 was 93 minutes |
@Giggles Thanks a lot for testing this! |
From openstreetmap/chef#168
This is slow enough to the point of unusable, and right now I'm recommending that people do not upgrade to v4.12.0.
The issue is one with the surface changes. If we can't reasonably quickly figure out what the problem is, we should revert the PR and release v4.12.1.
I will not have time to step through the SQL to identify problems in the near future.