-
-
Notifications
You must be signed in to change notification settings - Fork 474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use libosmium for geometry building #684
Conversation
Relations and nodes can be const, as their buffer is read only. Ways need to be writable, so we can add node locations directly in the object. Hand them around as pointers to make it clear they are mutable.
Don't go through the conversation into a geos geometry but use our new wkb reader instead.
Role out our own wkb writer class instead of using libosmiums internal implementation details. Allows a bit of stream-lining and removes some of the hacks added to the libosmium copy.
Go back to printf conversion because std::to_string() does not have sufficient precision.
No longer needed. Using libosmium types now.
Get roles and ids directly from the relation member list, avoiding intermediate id lists and additional loops through the members.
Postgres returns wkbs in hexform whereas the parser expects a pure binary form. So convert before usage.
Regarding broken geometries: In the current state of libosmium w.r.t. assembling multipolygons, i.e. without osmcode/libosmium#152 being addressed, this will have a huge impact on users in map rendering. I have used libosmium quite extensively for data preparation for landcover rendering and there are areas where 20-50 percent of the big landcover polygons are broken in ways currently not recoverable by libosmium, mostly self intersections. The visual impact on maps will be boldly visible. That would be great for the standard style because it would bring down the number of broken geometries in the database quite quickly, faster than any QA tool can do, but it would certainly also be problematic for many osm2pgsql users. Ideally libosmium should be able to handle these errors and you should be able to choose within osm2pgsql how strict you want to be about them. Additional note: If you assemble polygons in geographic coordinates and reproject them into other coordinate systems you can get invalid geometries. This is rare practically with mercator but if you want to make sure you only have valid geometries you need a validity check afterwards. |
Reintroducing the validity check is something I'd like to avoid. It's expensive and we'd essentially have to reimplement whatever libgeos is doing. So the question is, does this have any practical implications for normal osm2pgsql use cases? If it only happens when using some obscure projections, then post-filtering is probably the better option. If there are practical cases for mercator projection we should check if we can catch these particular ones without going through the full-blown validation process. |
joto's area statistics report that approximately 50k out of 242.000k have intersections. @joto tells me that this figure actually includes polygons that are reported multiple times and the real figure is more like 20k self-intersecting polygons of which 6k are relations. That's 0.001% of polygons. It's probably true that some areas where the polygons are concentrated but with this small number, they cannot be large and deserve being fixed. I also think that it will make life easier for mappers and data consumers alike if osm2pgsql takes a more consistent approach to creating polygons.
Not necessarily. The average data consumer expects to get the data they see on osm.org when they process the planet. They don't need what they are not aware is not even there. And fixing broken data always comes with the danger of fixing it the wrong way. |
Regarding reprojection and validity - It is probably not a big deal in most applications - normal rendering for example does not really care if the geometry is valid or not but with
i wanted to point out that this would not imply osm2pgsql never creates invalid geometries with coordinate systems other than geographic. The most important thing is to make this clear in the documentation. Ultimately this is a more generic problem anyway, i.e. the need for a reprojection library that is capable of validity preserving reprojection of linestring/polygon geometries.
I was talking about visual impact here - large polygons have a much higher likeliness of being broken than small ones - a bit like ceramic components in engineering 😄. At the same time large polygons have a high visual impact in map rendering. The vast majority of polygons in the OSM database are buildings which are obviously rarely broken. Note i am not against being stricter with polygons in osm2pgsql, on the contrary. But you need to expect complaints from users because of this. To get an idea of the visual impact - if you take the difference in Jochen's comparison map with the removal of old style MPs: http://area.jochentopf.com/map/index.html - the impact of removing all invalid polygons osmium currently does not handle is larger than that at z6-8 (that is for water and wood/forest) in many areas, in particular northern Europe. |
One of the things I do at $DAYJOB is sell shape files generated from OSM data, where I typically discard invalid polygons. I'd like to support imagico's assessment - nobody cares if a house goes missing, but everyone notices if a 50km segment of the Rhone is lacking its riverbank area. So yes, it might be "only" a few thousand relations but there will be many large ones and there will be complaints. On the other hand people who post-process OSM data like me will finally be free from user complaints like "but this forest is visible on the OSM web site, why is it not in your shapefile?", so +1 for dropping invalid stuff. Incidentally I made an attempt to do that 5 years ago in 8be0cfc, maybe this time it gets through ;) |
Appveyor is failing with
@alex85k, could this be something that's missing in the binary blobs used to supply the osm2pgsql dependencies? |
How will this work with |
I've set up https://osmium.osm2pgsql.paulnorman.ca. The data on the right is from 170125 and imported with |
The missing dependency for Appveyor is something we probably can (and would want to) get rid of in the code. So it's okay to leave it for now. |
That's the regression. It will be fine for the lua transform, as polygon can be set according to value not only key but it won't work for the C transform anymore. If that is not acceptable than we need to reintroduce the old behaviour for ways. |
Regarding the fixing of broken multipolygons: The Mapbox data team is actively fixing broken multipolygons that have a large visible impact on the map. They have some kind of prioritized list. I am also working on Maproulette challenges to get the community involved in fixing. |
Numbers from a test with master (b9d7fcc) and this PR (0333e11) pr
master
This is saving over an hour on a 6 hour import. Interestingly, master is more parallel, probably because it has more work to do. |
That is an impressive reduction of CPU from 30k to 12k seconds. Can that be explained by the simpler geometry processing? |
Yes. Using geos functions to build OSM geometries was not a straightforward process while libosmiums algorithm is tailored for the peculiarities of OSM data. There is also a lot less memory allocation going on when preparing the data as more data is processed directly in place in the libosmium buffers.
Our biggest bottleneck is still postgres. Given that all queries (including inserts) are done synchronous, osm2pgsql probably spends a lot of time waiting for results. Looking at the numbers it might be worth testing if reducing the number of parallel threads won't actually be faster. |
Performance numbers for a planet import on a 32GB RAM machine with SSDs (-C 24000 --drop -s): PR
master
The PR database has about 66k polygons less coming from about 20k objects. Summing up the areas of all polygons, there is about 1% less. So as suspected, there are in particular large areas missing. |
On the planet the PR is
|
I have finished all testing. Code-wise this is good to go. |
I'm also okay with it, it's now just a question of release management. |
@joto informed me this morning that I might have broken the expiry code. It assumes that the WKBs that are returned from the database have native endianess. That is not necessarily the case. It is trivial to fix but needs to be done before the PR can be merged. |
Would this also be an issue in putting geoms into the DB? |
No. Each WKB geometry contains a flag that indicates the endiness used, so the DB must make sure to interpret the string correctly. |
It wasn't trivial after all, so I've only added a check that the endianess is as expected the native endianess of the machine. We can add an endian flip if somebody should ever hit the exception. |
wkb.hpp
Outdated
if (out[0] != Endian) | ||
throw std::runtime_error( | ||
"Endian setting of database WKB not supported."); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we say what the endianness of osm2pgsql is in the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or better yet, something like "Different endian settings of database and osm2pgsql are not supported. Database is big-endian, osm2pgsql is little-endian."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@pnorman is there a new release in the offing then? |
There aren't plans for an immediate release. First we want to let any bugs after a big change like this have a a chance to shake out, but we'd like to get any other changes that involve schema changes into the same release as #668 |
@pnorman thanks. Any suggestion on how we should handle the Homebrew formula at this point? See https://github.com/Homebrew/homebrew-core/blob/master/Formula/osm2pgsql.rb#L21-L26 It's using
but that seems somewhat inappropriate at this point since #636 will never be merged. |
osm2pgsql has no support for geos 3.6 in the 0.92.x branch and is not likely to get it with the need to maintain 3.5 compatibility and the need to verify a patch, particularly around memory leaks. Because we've moved away from geos, it's not something many people are likely to work on. |
@pnorman Understood. A new release would be much appreciated so we can drop the patch. |
This PR replaces the libgeos-based geometry builder with the libosmium area assembly. As libosmium approaches multipolygon assembly in a slightly different way, there are some visible changes to the polygon handling, as well as some secondary changes.
Changes in Polygon Handling
--exclude-invalid-polygon
removed. libosmium never creates invalid geometries.Other Changes
Todo