venue popularity #493

missinglink · 2019-06-12T14:08:21Z

This feature has been requested in pelias/pelias#171

This PR adds the ability to increase a document popularity score based on which tags it has.

I'd love to see some suggestions from the community on what they would like to see regarding 'importance' scoring in OSM.
Please suggest additional tags and scoring methodologies!

I will tune the final numbers in testing. These numbers are normalized using log1p before the scoring is applied in Elasticsearch.
We should only really use these numbers as a 'tie-breaker' for multiple venues with the same name, e.g. Eiffel Tower.

I also added some functionality to detect abandoned and disused places and give them negative popularity.
In the case where a document has negative popularity, it is discarded.
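
A minimal sketch of the tag-based scoring described above. The table shape follows the `_score` entries in stream/popularity_mapper.js, but the helper names (`popularityFromTags`, `shouldDiscard`, `normalize`), the `tourism`, `disused` and `abandoned` values, and the rule that presence and value scores are summed are assumptions for illustration, not the PR's actual code.

```javascript
// Hypothetical scoring table. '_score' at the top level applies when the tag
// is present; a nested key applies when the tag has that specific value.
const TAG_SCORES = {
  aerodrome: {
    international: { _score: 10000 },
    regional: { _score: 5000 }
  },
  iata: { _score: 5000, none: { _score: -5000 } },
  tourism: { _score: 1000 },       // illustrative value
  disused: { _score: -100000 },    // illustrative value
  abandoned: { _score: -100000 }   // illustrative value
};

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Sum the scores of every matching tag (assumed combination rule).
function popularityFromTags(tags) {
  let total = 0;
  for (const [key, value] of Object.entries(tags)) {
    if (!hasOwn(TAG_SCORES, key)) { continue; }
    const rule = TAG_SCORES[key];
    if (typeof rule._score === 'number') { total += rule._score; }
    const valueRule = hasOwn(rule, value) ? rule[value] : undefined;
    if (valueRule && typeof valueRule._score === 'number') {
      total += valueRule._score;
    }
  }
  return total;
}

// Documents with negative popularity are dropped at import time.
function shouldDiscard(popularity) {
  return popularity < 0;
}

// Mirrors the log1p normalization that is applied at query time.
function normalize(popularity) {
  return Math.log1p(Math.max(0, popularity));
}
```

With this table, a venue tagged `disused=yes` ends up negative and is discarded even if it also carries a positively scored tag.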

resolves pelias/pelias#171

bboure · 2019-06-13T10:44:13Z

Thank you @missinglink, I am willing to test this.
I am still quite new to Pelias. If I understand correctly, I would need to re-import OSM data using your branch?

How did you end up with these scores? Are they rather arbitrary?
I understand this is still very much WIP, but I suppose these could be configurable?

missinglink · 2019-06-13T10:59:40Z

Yep, it requires reindexing openstreetmap. If you are using pelias/docker for your builds, you can test this feature by changing the `openstreetmap.image` from `pelias/openstreetmap:master` to `pelias/openstreetmap:venue_popularity`.

Once you've done that, you can pull the new image from Docker Hub by executing `pelias compose pull`.

Finally, you can reimport the openstreetmap data with `pelias import osm` and you should see the results immediately.

The query logic should already be in place to take advantage of updated popularity data but the effect might either be too strong or too weak.
So we may need to either adjust the figures in this config file or adjust the weighting at query-time in order to get the balance right.

The scores are totally arbitrary, I just made them up :)
It will take a bit of trial-and-error during testing to figure out if they are decent or not. Once we get it nicely balanced we can commit these values to master.

It will also be possible for users to modify the config on their own Pelias setup by editing the file directly, or, using docker, by bind-mounting their own config over the one in the image.

Let me know how you get on

missinglink · 2019-06-13T11:04:46Z

@bboure you may find this useful pelias/docker#103
[edit] hmm that's still a draft and may contain omissions that might be tricky for a new Pelias user, I'd recommend one of the projects in the master branch, or making your own project based on the portland-metro project.

bboure · 2019-06-13T11:11:08Z

Thank you @missinglink
I was able to re-import OSM data and made a few tests (I was using Docker)
The effect seems to be the desired one. Good job! 🎉
I will keep playing with it and give some feedback.

Joxit

If it's possible to get the area of the object here, it would be nice to increase the score.
For example, a hospital with a large surface area should have a higher score than a point or a smaller hospital.
We could also do this with the type of the object, e.g. way versus node?

stream/popularity_mapper.js

orangejulius · 2019-06-13T14:27:06Z

Woah, this PR got real big real fast. I really liked the simple initial version, although I still think this is an important thing to add, and really can't believe we haven't done it till now :)

I'm a little hesitant with the many many additions that came in later commits, although I also see how they can be useful. I've been trying to think through ways this PR could cause problems or regressions. I can't come up with much, but I bet it's possible for a generically named venue (something like "Market") to drown out other more specific results.

It might be worth testing this PR on either a continent or planet scale before merging, to see if we can identify any cases like that, or if there's any unexpected behavior.

missinglink · 2019-06-13T17:29:48Z

One thing that came later was only setting popularity for the venue layer. If we allowed it on the address layer then OSM would always have a points advantage over OA!

I think it's a great feature to add, with a couple of hesitations on my part:

- We could possibly remove the 'contact information' scoring completely. I reduced the scoring already, but I think that the statement "venues tagged with a phone number are more popular than those without" isn't necessarily true.
- This will need to have some end-to-end testing done before release. The popularity scoring subqueries have been in Pelias for a long time but haven't been used since the Quattroshapes days (like 2 years ago), so it's very likely that the balance is off.

Overall I think it's a really nice feature to have and will make the product more professional feeling, we just need to be careful not to upset the balance as a result.

missinglink · 2019-06-13T17:39:09Z

> If it's possible to get the area of the object here, it would be nice to increase the score.
> For example, a hospital with a large surface area should have a higher score than a point or a smaller hospital.
> We could also do this with the type of the object, e.g. way versus node?

We have bounding boxes for way and relation records via pbf2json and also the option of using the version metadata to see how often it has been edited in osm.

In the case of a school I think you're right, a big school is probably more important than a small school, however the larger school could be mapped as a node and so would appear to be smaller.

The same is also true of monuments, a tower is physically small but could be a more popular tourist attraction than an old football stadium which is much larger.

Total edits could be interesting, it shows at least that the place is popular with mappers and so it's probably important enough to get correct, although I don't know what sort of score we would assign based on how many edits were made.

Thoughts?

bboure · 2019-06-15T20:23:03Z

> Total edits could be interesting, it shows at least that the place is popular with mappers and so it's probably important enough to get correct

I agree on that. A popular place will have many edits in OSM.

I am not sure about the size/area. For example, Manneken Pis is very small, but also very famous 😄

The number of translations (name.*) might also be a sign that the place is famous internationally? (similar/synonym to importance=international)

I have been playing a bit with the current implementation and it gives me pretty accurate results on a City scale map (Barcelona). It probably can improve though.
I will give more feedback later.

NickStallman · 2019-06-17T17:41:59Z

It might also be worthwhile to have a mechanism to allow for a proportional score.
E.g. start_date is more important the older it is; a date of yesterday and a date 200 years ago aren't equal. Some kind of exponential score here with an upper bound would be good.
And height would be another good one, taller things would often be more important than shorter things.
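
One way to read the "exponential score with an upper bound" suggestion is a saturating curve: the score grows quickly for the first decades and flattens out near a cap. The function name `ageScore` and the default cap and time scale below are hypothetical, chosen only to illustrate the shape.

```javascript
// Hypothetical helper: the score rises with age and approaches maxScore
// asymptotically, so very old dates cannot dominate other signals.
function ageScore(startYear, currentYear, maxScore = 5000, scaleYears = 100) {
  // A start date in the future is treated as zero age.
  const age = Math.max(0, currentYear - startYear);
  return maxScore * (1 - Math.exp(-age / scaleYears));
}
```

A place that opened this year scores zero, a 200-year-old place scores more than a 100-year-old one, and nothing exceeds `maxScore`. The same shape could be applied to a height tag by swapping years for metres.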

nvkelso · 2019-06-18T08:44:31Z

🎉 Super exciting progress!

For reference, we added collision_rank in v1.7.0 of Tilezen to solve similar "same but different" sort problem.

missinglink · 2019-06-18T09:14:43Z

Heya, nice to see a bunch of interest in this PR!

I've been thinking about this some more over the weekend and I think we all agree that the concept of popularity is still pretty vague and not anchored to anything in any logical way that would allow us to estimate the resulting behaviour when we assign popularity values to documents.

In order to deal with that subjectivity, I offer this explanation of popularity which we can adopt going forward.

- Popularity for 'admin' areas is the same as their population.
- In some admin areas, the popularity will be more or less than the population (such as increasing it in places such as Venice or San Francisco & decreasing it for large regions and states).
- All other layers are assigned popularity based on the statement: "Assuming there is a city with the same name, assigning higher popularity will rank this place higher than the city, and lower popularity will rank it lower." This will allow us to, for instance, score Newark International Airport higher than Newark, NJ in the results.
- Addresses have zero popularity. We may make some exceptions for things like 1600 Pennsylvania Ave NW, Washington, DC, but this is generally true.
- Streets have a popularity range (still to be decided) which is higher than zero but less than a small city (I'm thinking in the thousands).
- Venues have the most variable popularity, ranging from airports (which may be more popular than their locality) to a low minimum value, such as 100. As per previous discussions, tourist attractions would be scored in a way that replicas of 'real' places would have lower popularity.
- Postcodes would be assigned a default value similar to what a borough would have. (They would likely all have the same popularity?)
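
The per-layer rules above could be sketched roughly as follows. The street and postcode defaults are placeholders, since the thread leaves those ranges undecided, and `basePopularity`, `ADMIN_LAYERS` and the document shape are hypothetical names for illustration only.

```javascript
// A few admin layers, for illustration; not an exhaustive list.
const ADMIN_LAYERS = new Set(['country', 'region', 'county', 'locality', 'borough']);

// Placeholder defaults: the proposal only says addresses are zero, streets are
// "in the thousands", and postcodes sit near a borough-like value.
const LAYER_DEFAULTS = { address: 0, street: 1000, postalcode: 10000 };

// Lowest popularity a venue may have, per the "such as 100" suggestion.
const VENUE_MINIMUM = 100;

function basePopularity(doc) {
  // Admin areas: popularity is the population.
  if (ADMIN_LAYERS.has(doc.layer)) {
    return doc.population || 0;
  }
  // Venues: tag-derived popularity, but never below the minimum.
  if (doc.layer === 'venue') {
    return Math.max(VENUE_MINIMUM, doc.popularity || 0);
  }
  // Everything else falls back to a fixed per-layer default.
  const fallback = LAYER_DEFAULTS[doc.layer];
  return typeof fallback === 'number' ? fallback : 0;
}
```

Under this scheme an airport venue with a tag-derived popularity above its city's population would outrank the city, which is the Newark example from the list.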

Some advantages of adopting a consistent popularity score across all layers:

- We can score 'real' monuments higher than replicas.
- We can increase performance for short autocomplete inputs by filtering on popularity. See api#1215 ([on hold] add hard distance filter to short focus.point queries).
- We can better mix layers in results: queries using a focus.point will show a better mix of layers based on their relative popularity.
- It will be much easier (and bug-free) to exclude addresses from certain queries where they are not relevant.

Another way to think about it would be to ask, how many 'unique visitors' does this place see per year, and by that definition things like banks, train stations and tourist attractions naturally have higher popularity scores than private addresses.

Thoughts/feels?

missinglink · 2019-06-18T09:28:10Z

Couple notes on elasticsearch scoring based on global popularity values:

- We would no longer need a subquery to score on population. Popularity will serve the same purpose, albeit in a more flexible way.
- We will need to 'balance' the effects of the popularity bias relative to the textual matching scores. For example, 'Newark, NJ' should show the locality higher despite the airport having the larger popularity value.

I think the latter bullet point will be easier to achieve and more consistent when the values are more consistent.
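
As a rough illustration of that balancing act, the sketch below builds an Elasticsearch `function_score` body in which a log1p-damped popularity value is added to the text score with an adjustable weight. The function name `buildPopularityQuery` and the field name `name.default` are assumptions; this is not the query Pelias actually generates.

```javascript
// Returns a plain object suitable as an Elasticsearch search request body.
// popularityWeight is the knob that trades text relevance against popularity.
function buildPopularityQuery(text, popularityWeight = 1) {
  return {
    query: {
      function_score: {
        // Text relevance comes from an ordinary match query.
        query: { match: { 'name.default': text } },
        functions: [
          {
            // log1p keeps very large popularity values from swamping the text score.
            field_value_factor: {
              field: 'popularity',
              modifier: 'log1p',
              missing: 0
            },
            weight: popularityWeight
          }
        ],
        score_mode: 'sum',
        // Add the popularity contribution to the text score instead of multiplying.
        boost_mode: 'sum'
      }
    }
  };
}
```

Lowering `popularityWeight` lets a strong textual match such as 'Newark, NJ' win over a more popular but weaker match; raising it does the opposite.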

missinglink · 2019-06-18T09:40:53Z

I saw Sarah Hoffmann from the Nominatim Project last week and she said their popularity scoring is solely based on Wikipedia.

They compute the 'internal inbound link count' in Wikipedia for each OSM place with a concordance and use that value (ie. 'wiki page rank').

She said they were pretty happy with the results. The dump is available for download; it's about 6 years old but still pretty relevant. They plan to update the file this year.

bboure · 2019-06-22T18:20:49Z

After this change, would it make sense to add a way in the API to fetch the top n venues in a given area?
i.e. I am interested in showing popular places on a map (in a given bbox).

bboure · 2020-02-01T16:56:22Z

Hey,
What is the status on this?

> I'd love to see some suggestions from the community on what they would like to see regarding 'importance' scoring in OSM.
> Please suggest additional tags and scoring methodologies!

How about having a customizable scoring system?
i.e. being able to configure what score each specific tag should have (or not).

For example, if you are building a transportation system, you might want to boost train/bus stations higher, while still showing other results.
Example with https://www.openstreetmap.org/way/5013364 and https://www.openstreetmap.org/node/5682929172
You would rather show the bus stop first, and then the actual tower.

Which leads me to think that it would also be nice to have a query-time booster by layer or category on the API as well.

/search?text=Tour+Eiffel&prioritize.categories=transport

This would give a higher score to documents with the given categories, but would still show other results.
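
To make the proposal concrete, here is a small sketch of what such a boost could do to an already scored result list. The `prioritize.categories` parameter is only a suggestion in this thread, and the function name `prioritize`, the result shape and the default boost factor are all hypothetical.

```javascript
// Multiply the score of results whose categories intersect the requested ones,
// then re-sort. Non-matching results stay in the list with their original score.
function prioritize(results, categories, boost = 2) {
  const wanted = new Set(categories);
  return results
    .map((result) => {
      const hit = (result.categories || []).some((c) => wanted.has(c));
      return { ...result, score: hit ? result.score * boost : result.score };
    })
    .sort((a, b) => b.score - a.score);
}

// Usage: the bus stop overtakes the tower, but the tower is still returned.
const ranked = prioritize(
  [
    { name: 'Tour Eiffel', categories: ['tourism'], score: 10 },
    { name: 'Tour Eiffel (bus stop)', categories: ['transport'], score: 6 }
  ],
  ['transport']
);
```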

missinglink · 2020-04-28T12:03:40Z

We (Geocode Earth) are currently looking at this issue again and hope to merge some code which allows for improved venue scoring soon.

related: #385

blackmad · 2020-05-01T16:18:25Z

stream/popularity_mapper.js

+  aerodrome: {
+    international: { _score: 10000 },
+    regional: { _score: 5000 }
+  },


fwiw I found using the small/medium/large categories from https://ourairports.com/data/airports.csv to be more useful than "international" - there are some small and medium international airports that don't deserve such a boost.

blackmad · 2020-05-01T16:18:58Z

stream/popularity_mapper.js

+  // transportation
+  aerodrome: {
+    international: { _score: 10000 },
+    regional: { _score: 5000 }


also maybe want a downweight on aerodrome:type=military https://www.openstreetmap.org/node/369160593 ?

blackmad · 2020-05-01T16:19:30Z

stream/popularity_mapper.js

+    regional: { _score: 5000 }
+  },
+  iata: {
+    _score: 5000,


how much will this hurt a query for "CVS" that doesn't want the airport?

orangejulius · 2020-05-18T15:00:39Z

stream/popularity_mapper.js

+    supermarket: { _score: 2000 },
+    civic: { _score: 2000 },
+    government: { _score: 2000 },
+    hospital: { _score: 2000 },


Add historic here for Wrigley Field? https://www.openstreetmap.org/relation/1407988

orangejulius · 2020-05-18T15:24:34Z

stream/popularity_mapper.js

+  },
+
+  // transportation
+  aerodrome: {


Another tag that might make sense is aeroway:aerodrome. It looks like it's on most of the major international airports like Phoenix and Miami.

orangejulius · 2020-05-18T15:29:34Z

stream/popularity_mapper.js

+    _score: 5000,
+    none: { _score: -5000 }
+  },
+  railway: {


Looks like another related tag is public_transport=station:

> While the more common tag railway=station is used for all railway stations (i.e. including cargo), the public_transport=station tag is used only on the stations interesting for passenger transport.

This would hopefully help boost some more public transit stops like Park & Market, which currently does not do very well in our tests.

orangejulius · 2020-05-18T15:33:44Z

stream/popularity_mapper.js

+  architect: { _score: 5000 },
+  heritage: { _score: 5000 },
+  'heritage:operator': { _score: 2000 },
+  historic: {


How about adding historic=heritage for La Sagrada Familia?

As described in #537, the default set in #493, where all venues that have a calculated popularity below `0` are not imported, is a bit strict. This adds a config flag, `imports.openstreetmap.removeDisusedVenues` that controls whether or not that behavior is activated. In addition, when enabled, a `warning` is displayed for each removed record.
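
A minimal sketch of how such a flag could gate the removal. Only the config path `imports.openstreetmap.removeDisusedVenues` comes from the description above; the function name `filterVenue`, the document shape, the warning text, and treating only an explicit `true` as enabled are assumptions.

```javascript
// Returns the document unchanged, or null when it should be dropped.
// 'warn' is injectable so the behaviour can be tested without console output.
function filterVenue(doc, config, warn = console.warn) {
  const enabled =
    config?.imports?.openstreetmap?.removeDisusedVenues === true;

  if (enabled && doc.popularity < 0) {
    // One warning per removed record, as the description states.
    warn(`removing venue ${doc.id}: popularity ${doc.popularity} is below 0`);
    return null;
  }
  return doc;
}
```

With the flag off, a venue with negative popularity passes through untouched, which relaxes the strict default described above.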

missinglink mentioned this pull request Jun 12, 2019

Add popularity scores to landmarks pelias/pelias#171

Closed

missinglink force-pushed the venue_popularity branch 3 times, most recently from 508a1b3 to b285f38 Compare June 13, 2019 08:41

missinglink requested review from Joxit, orangejulius and NickStallman June 13, 2019 08:45

missinglink force-pushed the venue_popularity branch 4 times, most recently from 1fe6ee5 to 6afc4d8 Compare June 13, 2019 09:33

missinglink force-pushed the venue_popularity branch from db5e1c4 to 97595b9 Compare June 13, 2019 12:49

Joxit approved these changes Jun 13, 2019

missinglink mentioned this pull request Jun 14, 2019

display population & popularity in geojson when debug flag enabled pelias/api#1311

Merged

nvkelso mentioned this pull request Jun 18, 2019

Adjust POIs min_zoom by popularity tilezen/vector-datasource#1909

Open

missinglink mentioned this pull request Jun 18, 2019

remove population subqueries pelias/api#1318

Draft

orangejulius mentioned this pull request Jun 20, 2019

Focus point distance filter pelias/api#1323

Merged

orangejulius mentioned this pull request Aug 22, 2019

Autocomplete should prioritize exact matches pelias/api#1295

Closed

bboure mentioned this pull request Apr 20, 2020

Proposal: Explore API pelias/pelias#857

Open

missinglink added 5 commits April 24, 2020 07:59

feat(venue_popularity): foundations for popularity scoring module

0a5dbb0

feat(venue_popularity): improved popularity scores for transportation…

94351c8

… places

feat(venue_popularity): reduce popularity score for contact informati…

d39b9fb

…on tags

fix(test): fix typo

a3be09c

fix(venue_popularity): fix handling of disused & abandoned places

4d037ae

orangejulius force-pushed the venue_popularity branch from d496230 to 4d037ae Compare April 24, 2020 15:02

blackmad reviewed May 1, 2020

orangejulius reviewed May 18, 2020

orangejulius mentioned this pull request May 18, 2020

Venue popularity additions #531

Merged

orangejulius closed this in #531 Jun 1, 2020

orangejulius mentioned this pull request Jul 2, 2020

Allow configuring whether disused/abandoned venues are removed #537

Closed

orangejulius mentioned this pull request Jul 28, 2020

Add configuration option to control venue removal #539

Merged

This was referenced Aug 7, 2020

support for disused tag #228

Closed

discard closed venues #69

Closed

orangejulius mentioned this pull request Sep 14, 2020

Mark two passing tests passing pelias/acceptance-tests#532

Merged

missinglink mentioned this pull request Feb 10, 2021

set popularity on some address records #549

Merged

missinglink mentioned this pull request Mar 5, 2021

popularity scoring for "important" venues pelias/geonames#394

Merged

missinglink mentioned this pull request May 14, 2021

popularity bands pelias/model#138

Merged
