Rewrite rapidxml from scratch #158

Closed
mpadge opened this issue Nov 25, 2018 · 2 comments

mpadge commented Nov 25, 2018

I now know a lot more than when this package started, and can see that the rapidxml header could fairly easily be rewritten from scratch as a custom OSM XML parser that does the storage during the initial read. Interestingly, and excitingly for @mdsumner's silicate work, this direct store-on-read procedure is only really possible because of silicate. The entire OSM structure is in essence fully silicate-compliant, and can be stored directly, line for line, as it is read.

This should ultimately enable the entire package to be rewritten to simply dump directly to SC format, and then use silicate to convert outputs to other formats. (Plus some additional fiddling to insert the "hidden" but necessary row names containing OSM IDs.) Mike, I've done some preliminary comparisons of direct SC storage, and for the test data set (a chunk of about 1/3 of Melbourne streets), the current 15-16s reduces to about 0.4s. So we're looking at a speed-up of well over tenfold (roughly 40-fold on this test), which is well worth pursuing.

Related to general osmdata_sc issue #148.

@mpadge mpadge mentioned this issue Nov 25, 2018
This was referenced Nov 27, 2018

mpadge commented Nov 29, 2018

So it's not really rapidxml that needs rewriting: the xml2::read_xml() function takes almost all of the XML pre-processing time, with rapidxml taking only a tiny fraction of it. Instead, the commit linked above completes an entire rewrite of the C++ side of osmdata_sc, with the following results tested on a very large OSM document (50MB or so):

> rbenchmark::benchmark (
+                        x <- osmdata_sf (q, doc),
+                        x <- osmdata_sc (q, doc),
+                        replications = 10)
                     test replications elapsed relative user.self sys.self user.child sys.child
2 x <- osmdata_sc(q, doc)           10  17.070    1.000    17.017    0.030          0         0
1 x <- osmdata_sf(q, doc)           10  80.834    4.735    77.251    3.434          0         0

So osmdata_sc() is now around 4.7 times faster than osmdata_sf(), which leaves plenty of spare processing time: converting from sc to sf afterwards is still likely to be more efficient overall than the current osmdata_sf().

@mdsumner
wow ;)
