Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New "flex" backend #1036

Closed
joto opened this issue Dec 19, 2019 · 23 comments
Closed

New "flex" backend #1036

joto opened this issue Dec 19, 2019 · 23 comments

Comments

@joto
Copy link
Collaborator

joto commented Dec 19, 2019

In the past weeks I have started working on a new output backend. I am calling this the "flex" backend, because it is more flexible than any of the other backends and should eventually be able to support all use cases currently supported by the other backends and many more. This flexibility is mostly due to an increase in the use of Lua scripts which are used for configuration as well as more powerfull callback functions.

This work has mostly been triggered by #230, but it also touches many other issues like #901 .

Here are some design thoughts:

Osm2pgsql design for a more flexible backend and Lua integration

We write a new backend according to the ideas presented in this document. The
new backend should be able to do everything the current backends can do and
would replace them in the long-term. (This also means that the C++-only
transforms without Lua will not be supported forever.)

Table configuration

The new setup can work with any number of database tables. We provide Lua
functions define tables and their columns. Different tables can have different
column setups, for instance some tables might have lots of columns for tags
("old-style"), some have only an HSTORE or JSONB column for tags.

Lua callbacks

For every OSM object (node, way, or relation) that is processed by osm2pgsql, a
Lua callback is called with the data from this object as parameter. The Lua
callback can process the data of the OSM object in any way it likes and emits
zero, one, or more table entries (for any table) that should be added by
osm2pgsql. This is done by calling an add_row() function on those tables.

Processing in a Lua callback includes but is not limited to:

  • Decide which table or tables (if any) the data should go into.
  • Map attributes and tags from OSM objects to tables and their columns.
  • Convert data types (oneway=yes as boolean, width=10ft as integer in meters,
    put several tags into an HSTORE or JSON(B) column, etc.)
  • Convert geometries (in limited ways), like turning around line strings or
    calculating centroids.

Part of this functionality would be supported by C++ functions callable from
Lua (something like geom = MakeCentroid(geom)).

Getting data of related objects

For some use cases we need data from related objects that we want to add to the
currently processed object. We might, for instance, want the roles and tags
from all relations a way is a member of to be added to the way data in some
form.

This is solved by allowing Lua functions to

  • tell osm2pgsql that it should process certain OSM objects (mark the object)
  • store data in temporary storage in memory
  • retrieve data from temporary storage

So in the mentioned use case the following would happen:

  1. When processing a way with, say a highway tag, the Lua script tells
    osm2pgsql: Please find all relations having this way as member and call
    a Lua callback function later to re-process this way.
  2. In a later step osm2pgsql processes relations (either because they are in
    the input data and/or it gets relations out of the existing database) and
    calls the Lua callback function with the data of the way from step 1 and
    the relation data.
  3. This Lua callback can now use all of the data it gets and put them into
    any table. In the mentioned use case it could add a PostgreSQL ARRAY
    with relation tags to the way, or merge specific tags in a comma-separated
    string, or whatever is needed.

Note that this design means that finding related OSM objects is restricted to
one level of relatedness. So you can bring together ways with their nodes and
relations with their member nodes, ways, or relations. But further "nesting"
of relations can't be resolved. This simplifies the design here considerably
and should support enough uses cases. More complicated processing has to be
done somewhere else.

Drawbacks of the design

The design has its problems:

  • Reliance on Lua: osm2pgsql would long-term not work without Lua any more.
    If this is seen as a problem, we could always keep the C++ transforms and
    users of that just don't have all possibilities then.
  • Lua and performance: We would rely more heavily on Lua scripts which could
    have performance impacts. From the experience with current osm2pgsql it
    seems that this isn't too bad and we just have to be careful when
    implementing things. The bottleneck is probably always going to be the
    database anyway.
  • Because the database schema becomes more flexible and more processing is
    done in Lua, there might be more cases where you need a full re-import
    after some changes. On the other hand: We want pre-processing where it
    is cheap and easy instead of relying on complex SQL queries at rendering
    time. This is a tradeoff that every user has to make themselves. We allow
    lots of flexibility by having support for HSTORE and JSONB, so that's
    always an option.

Advanced Issues

Extra data

For some uses cases it might be useful to allow extra data to be added to
the processing step. For instance we might want to have height data from
somewhere outside OSM and add it to geometries. Or we want to add to each
highway the information whether it is in a country with left- or
right-hand-traffic.

The design of the Lua API should keep this in mind and allow this extra data
to be made available somehow, but this is for a future step.

Prefiltering in C++

It might be useful for better performance to allow some kind of optional
pre-filtering in C++ before Lua functions are called. So we might want to
be able to define that a Lua function is only called for all OSM objects with
a highway tag. This way we can save us the expensive Lua calls for everything
that's not a highway.

We can see how the performance looks before deciding whether we need this, but
it would change the configuration needed pretty fundamentally, so we have to do
this before finalizing the implementation.

Import vs. Update Mode and Separation of Passes

The Lua scripts should have the information whether we are in "create" or
in "append" mode. Or we have separate scripts for the modes.

We might also want to tell the Lua scripts whether they are run in pass 1
(when reading the data) or in pass 2 (when working on "marked" data).

It is currently unclear how the two passes needed for append mode and the
two passed needed for the more complex relationship processing will interact.

@mmd-osm
Copy link
Contributor

mmd-osm commented Dec 19, 2019

The complex interaction between Lua and C++ code (multiple phases, storing/retrieving data, marking objects for later processing) made me wonder if you had planned any kind of debugging/tracing/troubleshooting option for people writing Lua scripts?

@joto
Copy link
Collaborator Author

joto commented Dec 20, 2019

@mmd-osm You can use the full power of Lua to do that, in the simplest case just print out intermediate values etc. But you can also open a log file and write stuff there etc. If anything this becomes easier, because more of the magic isn't hidden in C++ code but plain to see in Lua code.

@openstreetmap-tiles
Copy link

openstreetmap-tiles commented Dec 20, 2019 via email

@joto
Copy link
Collaborator Author

joto commented Dec 20, 2019

@openstreetmap-tiles The whole point of this effort is that you can do things like proper highway shields efficiently without having to have extra tables in your database. So I think if this all pans out the way I envision it now, this should be able to solve your problem. Allowing tables without any geometry was a small change, so I have added this to the code.

I have added an example highway-shields.lua that demonstrates how to a) get refs from the route relations into the ways (in a comma-separated list) and b) put route relations into their own table and add their ids to the member ways (in an array). Here is a rendering of that data. The "shields" you are seeing are rendered from the way geometries, but have the refs from the relations. Note the E22 and A1 refs on the bottom. I'll leave it to you to come up with better rendering. :-)

route-refs-as-shields

@pnorman pnorman pinned this issue Dec 20, 2019
@joto
Copy link
Collaborator Author

joto commented Dec 21, 2019

@ImreSamu Please don't put everything and the kitchen sink into this issue. This is all not really related to the issue at hand and not useful to have everything in one huge issue.

@ImreSamu
Copy link
Contributor

@joto : sorry ; I have removed the "noise" ..

@nextstopsun
Copy link

@joto Could you please add an example with an additional geometry column and calculating a centroid (i.e. for boundary=administrative features with admin_level=6)?

@joto
Copy link
Collaborator Author

joto commented Dec 27, 2019

@nextstopsun Multiple geometry columns and centroids don't work yet. Supporting multiple geometries is difficult, especially because a geometry might be split into several geometries (multi-geometries or splitting long lines), and I don't know yet how exactly to do this.

@nextstopsun
Copy link

@joto So in theory with the new flex backend one can set up a table with two or more columns of type geometry, right?

Which lua library could be used to perform operations on geometry (i.e. calculation of a centroid)?

@joto
Copy link
Collaborator Author

joto commented Dec 27, 2019

So far this is all a rough design and we don't know yet what the code will look like. I would like to support several geometry columns, because this is basically a generalization of a "normal" geometry plus the way_area we currently have and I see other use cases where this could be useful (polygon and centroid for instance), but as mentioned it isn't clear how exactly this would look like in the code and it might turn out to be to complex to do. osm2pgsql still needs to be reasonably simple to use and understand, so there is a limit on how complex things should get.

I am pretty sure we are not going to give Lua access to the actual geometries and are not using any Lua libraries for this, because it would be quite a lot of overhead "shipping" geometries from C++ to Lua code and back etc. All this should be done in C++ code for performance, but this would, of course, put some limits on what can be done.

I'd be interested in hearing about use cases and ideas people might have, this might influence some design decisions.

@joto
Copy link
Collaborator Author

joto commented Jan 14, 2020

The current version of the PR is now in pretty good shape. It doesn't do everything we might ever want (no multiple geometry columns yet, for instance), but it is getting towards something that might me mergeable. There is even some documentation and lots of examples in the same directory.

There are still things to do and to decide though. The main points are:

  • Does it work on updates in all cases? My tests look all good, but I fear there might still be corner cases.
  • How to share data between first and second stage? This is currently done by serializing a magic lua variable (osm2pgsql.userdata) where you store everything you need in the second pass, and de-serializing it in all threads running the second pass. This can need quite a lot of memory, but it is very flexible. If we can narrow down typical use cases, we might be able to find a better solution here. (Also the current solution depends on the MessagePack Lua library which is not necessarily available everywhere.)

If you are interested in this issue, now would be a good time to test #1037 and give me feedback.

@pnorman
Copy link
Collaborator

pnorman commented Jan 14, 2020

I am pretty sure we are not going to give Lua access to the actual geometries and are not using any Lua libraries for this, because it would be quite a lot of overhead "shipping" geometries from C++ to Lua code and back etc. All this should be done in C++ code for performance, but this would, of course, put some limits on what can be done.

👍

  • Reliance on Lua: osm2pgsql would long-term not work without Lua any more.
    If this is seen as a problem, we could always keep the C++ transforms and
    users of that just don't have all possibilities then.

I don't see this as an issue. With osm-carto and styles based on it using Lua, the C transforms are not essential as long as we have reasonable default options.


I recall a mention of geometryless tables but can't find it again, had you discussed this?

@nextstopsun
Copy link

@joto could you please add an example where all tags are sent to jsonb column?

Is it possible to define postgresql db schema name for a table in lua config? (i.e. when you want to put tables to different schemas and tablespaces)

@joto
Copy link
Collaborator Author

joto commented Jan 14, 2020

@pnorman You can do geometryless tables simply by not defining a geometry. The flex-config/route-relations.lua has an example where a table simply stores relation tags.

@nextstopsun Filling a json(b) columns requires some Lua json library to create that json. You can then define a column with type json or jsonb and simply fill it with that data. If this is something a lot of people need, we might want to build it into osm2pgsql, but for the time being any json library in Lua should do.

There are some provisions in the flex backend code to set a schema for a table. But it doesn't work yet, because there are other changes that need to be done to the rest of the osm2pgsql code to make this work. I didn't want to do these changes before the flex backend is actually merged. So this is something for a bit later. Different tablespaces work as before, ie you can set a tablespace for all tables, but not individually.

@joto
Copy link
Collaborator Author

joto commented Jan 15, 2020

@nextstopsun I have added an example for how to deal with json here:
https://github.com/joto/osm2pgsql/blob/flex-backend/flex-config/places.lua

@mboeringa
Copy link

Hi @joto

A couple of thoughts about a more "flexible" backend that are not directly related to having multiple tables, but other options that might benefit usage:

  • It would be nice if there was an option to specify a custom prefix for column names. E.g.:

<CUSTOM_PREFIX><osm_key_name>, e.g. "<MY_PREFIX>highway" or "aaa_highway"

Why would I want this?
There are OpenStreetMap keys that are worth maintaining both as direct input by users, and as a tag converted column in the output table. By allowing to set prefixes for both original data columns and tag converted columns, you could maintain two sets of columns based on a single key, with one containing the original data and the other the tag converted value.

A classic example of a key that is worth maintaining both as an original value and as tag converted one, is the OSM direction key. It may contain both angles specified in degrees (0-359) and cardinal directions (NWSE).

You might then have two columns in the output table, one with the original values:
aaa_direction
'11'
'N'
'340'
'NE'

And the other a tag converted one:
bbb_direction
11
0
340
45

The tag converted column "bbb_direction" could then be used directly in applications to show e.g. the direction of viewpoints, while the original column "aaa_direction" still shows the original user input.

  • So, as per above, a separate prefix for tag converted columns would be desirable as well.

  • Added benefit of prefixes is that it allows easy detection of OSM key names in the SQL of a style.

  • It would also be nice to have a custom replacement string for the OSM colon ":" namespace separator character in OMS key names. Some applications and data formats do not allow exotic characters like colons in field names. Allowing to set a custom replacement string. would help with application compatibility and conversions to other data formats.

E.g. "addr<CUSTOM_COLON_REPLACEMENT>housenumber".

Of course, you could default to a single underscore, but allowing a custom string potentially allows setting a special marker and converting back to the original key. E.g., you might insert double underscore to signify the position of the colon versus the single underscore also used in openstreetmap keys.

E.g. "addr__housenumber" with double underscore could then be easily recognized as being "addr:housenumber" as key. This allows all kind of nice wizardry and automation of conversion between openstreetmap key names and database column names.

  • For maximum compatibility, prefixes and a replacement string should be limited to alphanumeric characters and underscore only: a-z / 0-9 / _. And preferably not starting with a numeric character.

@joto
Copy link
Collaborator Author

joto commented Jan 22, 2020

@mboeringa With the flex backend, all names of tables and columns are set in your Lua code, so this is all just a question of writing a little bit of Lua code that does whatever naming you want. There is no need for any special prefixes or colon replacement characters. I suggest you give it a try and if there is something you can't do with Lua code, open an issue for that.

@mboeringa
Copy link

mboeringa commented Jan 22, 2020

@mboeringa With the flex backend, all names of tables and columns are set in your Lua code, so this is all just a question of writing a little bit of Lua code that does whatever naming you want. There is no need for any special prefixes or colon replacement characters. I suggest you give it a try and if there is something you can't do with Lua code, open an issue for that.

Interesting. To be honest I have never messed with the Lua code up to now, so am unfamiliar with it yet. I do see references to what I presume are example table definitions in the pull's "files" view as linked below, e.g. in "simple.lua", that also include OpenStreetMap key names, but I don't see an example of a column being "renamed".

That is of course understandable, because the current system also simply uses the key names as column names since PostgreSQL doesn't have a an issue with things like colon in the field's name, but it leaves me wondering where of how to specify an "alias" or alternative name for the output column name.

https://github.com/openstreetmap/osm2pgsql/pull/1037/files

@nextstopsun
Copy link

Is it possible to setup "planarized" version of administrative borders table?
I.e. I need a table with LINESTRING geometry loaded from ways with boundary=administrative plus all ways that form outer rings of bounday=administrative relations.
Those ways must be distinct and have an array of admin_level values of rels to which they belong (similar to road shields example).

@joto Does flex backend have the power to solve this case?

@joto
Copy link
Collaborator Author

joto commented Jan 22, 2020

@mboeringa You simply name the columns whatever you want and later fill them:

buildings = osm2pgsql.define_way_table('buildings', {
    ....
    { column = 'addr_street', type = 'text' },
    ....
})
...
function osm2pgsql.process_way(object)
   ...
   buildings:add_row({ addr_street = object.tags['addr:street'] })
   )
   ...
end

@joto
Copy link
Collaborator Author

joto commented Jan 22, 2020

@nextstopsun Yes, in theory that should be possible. You'd have to store the admin_level information from the relations in memory and then, in stage 2, write all ways that either have those tags themselves or are members of relations that have those tags to the database. You will have to hold a lot of data in memory though, so I don't know how feasible this is. Also you have to handle all ways in stage 2 if you can't restrict this to ways with some specific tags. This will likely be too slow.

@mboeringa
Copy link

You simply name the columns whatever you want and later fill them:

@joto ,thanks. That explains it well and would work.

@joto
Copy link
Collaborator Author

joto commented Feb 8, 2020

The PR has now been merged and the flex backend is available. Please try it out and report problems or post ideas in new issues. Closing here.

@joto joto closed this as completed Feb 8, 2020
@pnorman pnorman unpinned this issue Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants