venue popularity #493

Closed
wants to merge 5 commits into from

missinglink
Member

@missinglink missinglink commented Jun 12, 2019

This feature has been requested in pelias/pelias#171

This PR adds the ability to increase a document popularity score based on which tags it has.

I'd love to see some suggestions from the community on what they would like to see regarding 'importance' scoring in OSM.
Please suggest additional tags and scoring methodologies!

I will tune the final numbers in testing. These numbers are normalized using log1p before the scoring is applied in elasticsearch.
We should only really use these numbers as a 'tie-breaker' for multiple venues with the same name, e.g. Eiffel Tower.
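
To illustrate why the raw numbers end up acting as a tie-breaker, here is a rough sketch of the compression. It assumes the normalization is Elasticsearch's `log1p` modifier, which takes the common logarithm of 1 + value. The helper name `normalize` and the sample values are made up for illustration.

```javascript
// Assumption: mirrors Elasticsearch's `log1p` modifier, i.e. log10(1 + value).
// Negative inputs are clamped to zero here purely to keep the sketch well-defined.
function normalize(popularity) {
  return Math.log10(1 + Math.max(0, popularity));
}

// Illustrative raw scores only; not the values from this PR's config.
const samples = [0, 100, 2000, 5000, 10000];
for (const value of samples) {
  console.log(value, normalize(value).toFixed(2));
}
```

Under this assumption, a raw score that is 100 times larger (10000 versus 100) only roughly doubles the normalized value, so large differences in the config translate into modest differences at query time.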

I also added some functionality to detect abandoned and disused places and give them negative popularity.
In the case where a document has negative popularity, it is discarded.
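
As a minimal sketch of the idea described above (not the code in this PR), tag-based scoring with a discard rule could look roughly like this. The table shape loosely follows the config fragments quoted in the review comments, but `SCORES`, `popularityFor`, `shouldImport`, the `abandoned`/`disused` entries and all numbers are hypothetical.

```javascript
// Hypothetical scoring table: tag key -> optional `_score` for any value,
// plus optional value-specific entries. Keys and numbers are placeholders.
const SCORES = {
  aerodrome: {
    international: { _score: 10000 },
    regional: { _score: 5000 }
  },
  iata: { _score: 5000 },
  abandoned: { yes: { _score: -10000 } },
  disused: { yes: { _score: -10000 } }
};

const own = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Sum the score of every matching tag key, plus any value-specific score.
function popularityFor(tags, scores = SCORES) {
  let total = 0;
  for (const [key, value] of Object.entries(tags)) {
    if (!own(scores, key)) continue;
    const rule = scores[key];
    if (typeof rule._score === 'number') total += rule._score;
    if (value !== '_score' && own(rule, value)) total += rule[value]._score || 0;
  }
  return total;
}

// Documents whose total popularity is negative are dropped.
function shouldImport(tags) {
  return popularityFor(tags) >= 0;
}
```

For example, `popularityFor({ aerodrome: 'international', iata: 'JFK' })` sums both matching rules, while a venue tagged `disused=yes` goes negative and `shouldImport` returns `false`.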

resolves pelias/pelias#171

@bboure
Member

bboure commented Jun 13, 2019

Thank you @missinglink I am willing to test this.
I am still quite new to Pelias. If I understand correctly, I would need to re-import OSM data using your branch?

How did you end up with these scores? Are they rather arbitrary?
I understand this is still very much WIP, but I suppose these could be configurable?

@missinglink
Member Author

missinglink commented Jun 13, 2019

Yep, it requires reindexing openstreetmap. If you are using pelias/docker for your builds, you can test this feature by changing the openstreetmap.image from pelias/openstreetmap:master to pelias/openstreetmap:venue_popularity.

Once you've done that you can pull the new image from dockerhub by executing pelias compose pull.

Finally you can reimport the openstreetmap data with pelias import osm and you should see the results immediately.

The query logic should already be in place to take advantage of updated popularity data but the effect might either be too strong or too weak.
So we may need to either adjust the figures in this config file or adjust the weighting at query-time in order to get the balance right.

The scores are totally arbitrary, I just made them up :)
It will take a bit of trial-and-error during testing to figure out if they are decent or not, once we get it nicely balanced we can commit these values to master.

It will also be possible for users to modify the config on their own Pelias setup by editing the file directly, or, using docker, by bind-mounting their own config over the one in the image.

Let me know how you get on

@missinglink
Member Author

missinglink commented Jun 13, 2019

@bboure you may find this useful pelias/docker#103
[edit] hmm that's still a draft and may contain omissions that might be tricky for a new Pelias user, I'd recommend one of the projects in the master branch, or making your own project based on the portland-metro project.

@bboure
Member

bboure commented Jun 13, 2019

Thank you @missinglink
I was able to re-import OSM data and made a few tests (I was using Docker)
The effect seems to be the desired one. Good job! 🎉
I will keep playing with it and give some feedback.

Member

@Joxit Joxit left a comment

If it's possible to get the area of the object here, it would be nice to increase the score.
For example, a hospital with a large surface area should have a higher score than a point or a smaller hospital.
We can also do this with the type of the object, e.g. way versus node?

stream/popularity_mapper.js
@orangejulius
Member

Woah, this PR got real big real fast. I really liked the simple initial version, although I still think this is an important thing to add, and really can't believe we haven't done it till now :)

I'm a little hesitant with the many many additions that came in later commits, although I also see how they can be useful. I've been trying to think through ways this PR could cause problems or regressions. I can't come up with much, but I bet it's possible for a generically named venue (something like "Market") to drown out other more specific results.

It might be worth testing this PR on either a continent or planet scale before merging, to see if we can identify any cases like that, or if there's any unexpected behavior.

@missinglink
Member Author

missinglink commented Jun 13, 2019

One thing that came later was only setting popularity for the venue layer, if we allowed it on the address layer then osm would always have a points advantage over oa!

I think it's a great feature to add, a couple of hesitations on my part:

  • We could possibly remove the 'contact information' scoring completely, I reduced the scoring already but I think that the statement "venues tagged with a phone number are more popular than those without" isn't necessarily true.
  • This will need to have some end-to-end testing done before release, the popularity scoring subqueries have been in Pelias for a long time but haven't been used since the Quattroshapes days (like 2 years ago). So it's very likely that the balance is off.

Overall I think it's a really nice feature to have and will make the product more professional feeling, we just need to be careful not to upset the balance as a result.

@missinglink
Member Author

If it's possible to get the area of the object here, it would be nice to increase the score.
For example, a hospital with a large surface area should have a higher score than a point or a smaller hospital.
We can also do this with the type of the object, e.g. way versus node?

We have bounding boxes for way and relation records via pbf2json and also the option of using the version metadata to see how often it has been edited in osm.

In the case of a school I think you're right, a big school is probably more important than a small school, however the larger school could be mapped as a node and so would appear to be smaller.

The same is also true of monuments, a tower is physically small but could be a more popular tourist attraction than an old football stadium which is much larger.

Total edits could be interesting, it shows at least that the place is popular with mappers and so it's probably important enough to get correct, although I don't know what sort of score we would assign based on how many edits were made.

Thoughts?

@bboure
Member

bboure commented Jun 15, 2019

Total edits could be interesting, it shows at least that the place is popular with mappers and so it's probably important enough to get correct

I agree on that. A popular place will have many edits in OSM.

I am not sure about the size/area. For example, Manneken Pis is very small, but also very famous 😄

The number of translations (name.*) might also be a sign that the place is famous internationally? (similar/synonym to importance=international)
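
A quick sketch of how the translation count could be derived from a tag set. It assumes translations are stored under OSM-style `name:<lang>` keys (for example `name:fr`); the function name `countTranslations` and the language-code pattern are my own simplification, not something from this PR.

```javascript
// Count tags that look like translated names, e.g. `name:fr` or `name:zh-Hans`.
// Assumption: a 2-3 letter language code with an optional script/region suffix.
function countTranslations(tags) {
  const pattern = /^name:[a-z]{2,3}(-[A-Za-z]+)?$/;
  return Object.keys(tags).filter((key) => pattern.test(key)).length;
}

const tags = {
  name: 'Manneken Pis',
  'name:fr': 'Manneken-Pis',
  'name:nl': 'Manneken Pis',
  alt_name: 'Petit Julien'
};
console.log(countTranslations(tags));
```

The resulting count would still need to be mapped onto a score, which is a separate tuning question.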

I have been playing a bit with the current implementation and it gives me pretty accurate results on a City scale map (Barcelona). It probably can improve though.
I will give more feedback later.

@NickStallman
Contributor

It might also be worthwhile to have a mechanism to allow for a proportional score.
E.g. start_date is more important the older it is, a date of yesterday and a date 200 years ago aren't equal. Some kind of exponential score here with an upper bound would be good.
And height would be another good one, taller things would often be more important than shorter things.
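
One possible shape for such a proportional score, as a sketch only: a saturating exponential that grows with the input but never exceeds a cap. The names `boundedScore` and `ageScore`, and the cap and half-way values, are hypothetical choices for illustration.

```javascript
// Score grows with `value` but never exceeds `max`.
// `halfway` is the input at which the score reaches max / 2.
function boundedScore(value, max, halfway) {
  if (!(value > 0)) return 0; // also covers NaN and missing values
  return max * (1 - Math.pow(0.5, value / halfway));
}

// Example: age in years derived from a start_date year.
// The cap (2000) and half-way point (100 years) are placeholders.
function ageScore(startYear, currentYear) {
  return boundedScore(currentYear - startYear, 2000, 100);
}

console.log(ageScore(2019, 2019)); // opened this year
console.log(ageScore(1819, 2019)); // 200 years old
```

The same helper could be reused for height by passing metres instead of years, with its own cap and half-way point.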

@nvkelso

nvkelso commented Jun 18, 2019

🎉 Super exciting progress!

For reference, we added collision_rank in v1.7.0 of Tilezen to solve a similar "same but different" sorting problem.

@missinglink
Member Author

missinglink commented Jun 18, 2019

Heya, nice to see a bunch of interest in this PR!

I've been thinking about this some more over the weekend and I think we all agree that the concept of popularity is still pretty vague and not anchored to anything in any logical way that would allow us to estimate the resulting behaviour when we assign popularity values to documents.

In order to deal with that subjectivity, I offer this explanation of popularity which we can adopt going forward.

  1. Popularity for 'admin' areas is the same as their population
  2. In some admin areas, the popularity will be more or less than the population (such as increasing it in places such as Venice or San Francisco & decreasing it for large regions and states)
  3. All other layers are assigned popularity based on the statement "Assuming there is a city with the same name, assigning higher popularity will rank this place higher than the city, and lower popularity will rank it lower." This will allow us to, for instance, score Newark International Airport higher than Newark, NJ in the results.
  4. Addresses have zero popularity. We may make some exceptions for things like 1600 Pennsylvania Ave NW, Washington, DC, but this is generally true.
  5. Streets have a popularity range (still to be decided) which is higher than zero but less than a small city (I'm thinking in the thousands).
  6. Venues have the most variable popularity, ranging from airports (which may be more popular than their locality) to a low minimum value, such as 100. As per previous discussions, tourist attractions would be scored in a way that replicas of 'real' places would have lower popularity.
  7. Postcodes would be assigned a default value similar to what a borough would have. (they would likely all have the same popularity?)
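
The points above could be sketched as a per-layer baseline, purely for discussion. Layer names follow Pelias conventions, but `basePopularity`, the defaults table and every number below are placeholders (the list itself says several ranges are still to be decided), and the example populations are invented.

```javascript
// Admin layers take their popularity from population (point 1).
const ADMIN_LAYERS = new Set(['country', 'region', 'county', 'locality', 'borough', 'neighbourhood']);

// Placeholder defaults for the remaining layers.
const LAYER_DEFAULTS = new Map([
  ['address', 0],       // point 4: addresses have zero popularity
  ['street', 1000],     // point 5: "in the thousands", still to be decided
  ['venue', 100],       // point 6: low minimum for venues
  ['postalcode', 5000]  // point 7: roughly borough-like (a guess)
]);

function basePopularity(doc) {
  // An explicit value wins, which covers the exceptions in points 2, 4 and 6.
  if (typeof doc.popularity === 'number') return doc.popularity;
  if (ADMIN_LAYERS.has(doc.layer)) return doc.population || 0;
  return LAYER_DEFAULTS.get(doc.layer) || 0;
}

// Invented numbers: an airport given an explicit popularity above its city.
const results = [
  { name: 'Newark', layer: 'locality', population: 300000 },
  { name: 'Newark International Airport', layer: 'venue', popularity: 500000 }
].sort((a, b) => basePopularity(b) - basePopularity(a));
console.log(results.map((doc) => doc.name));
```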

Some advantages of adopting a consistent popularity score across all layers:

  • We can score 'real' monuments higher than replicas
  • We can increase performance for short autocomplete inputs by filtering on popularity, see api#1215 ([on hold] add hard distance filter to short focus.point queries)
  • We can better mix layers in results, queries using a focus.point will show a better mix of layers based on their relative popularity
  • It will be much easier (and bug-free) to exclude addresses from certain queries where they are not relevant.

Another way to think about it would be to ask, how many 'unique visitors' does this place see per year, and by that definition things like banks, train stations and tourist attractions naturally have higher popularity scores than private addresses.

Thoughts/feels?

@missinglink
Member Author

missinglink commented Jun 18, 2019

Couple notes on elasticsearch scoring based on global popularity values:

  • We would no longer need a subquery to score on population. popularity will serve the same purpose albeit in a more flexible way.
  • We will need to 'balance' the effects of the popularity bias relative to the textual matching scores. For example 'Newark, NJ' should show the locality higher despite the airport having the larger popularity value.

I think the latter bullet point will be easier to achieve and more consistent when the values are more consistent.
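
To make the balancing act concrete, here is a minimal sketch of an Elasticsearch `function_score` query body that adds a weighted, log1p-normalized popularity term to the text score. This is not the actual Pelias query: the field names `name.default` and `popularity`, the builder name `buildQuery` and the `popularityWeight` knob are assumptions for illustration.

```javascript
// Build a query body where the final score is roughly:
//   text score + popularityWeight * log10(1 + popularity)
function buildQuery(text, popularityWeight) {
  return {
    query: {
      function_score: {
        query: { match: { 'name.default': text } },
        functions: [
          {
            // `missing: 0` makes documents without popularity contribute nothing.
            field_value_factor: { field: 'popularity', modifier: 'log1p', missing: 0 },
            weight: popularityWeight
          }
        ],
        score_mode: 'sum',
        boost_mode: 'sum'
      }
    }
  };
}

console.log(JSON.stringify(buildQuery('Newark, NJ', 0.5), null, 2));
```

Lowering `popularityWeight` lets the textual match dominate, so a query like 'Newark, NJ' can still favour the locality; raising it lets popularity win more often.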

@missinglink
Member Author

I saw Sarah Hoffmann from the Nominatim Project last week and she said their popularity scoring is solely based on Wikipedia.

They compute the 'internal inbound link count' in Wikipedia for each OSM place with a concordance and use that value (ie. 'wiki page rank').

She said they were pretty happy with the results, the dump is available for download, it's about 6 years old but still pretty relevant. They plan to update the file this year.

@bboure
Member

bboure commented Jun 22, 2019

After this change, would it make sense to add a way in the api to fetch the top n venues in a given area?
i.e: I am interested in showing popular places on a map (in a given bbox)

@bboure
Member

bboure commented Feb 1, 2020

Hey,
What is the status on this?

I'd love to see some suggestions from the community on what they would like to see regarding 'importance' scoring in OSM.
Please suggest additional tags and scoring methodologies!

How about having customizable scoring system?
i.e.: Being able to configure what score each specific tag should have (or not).

For example, if you are building a transportation system, you might want to boost train/bus stations higher, while still showing other results.
Example with https://www.openstreetmap.org/way/5013364 and https://www.openstreetmap.org/node/5682929172
You would rather show the bus stop first, and then the actual tower.

Which leads me to thinking that it would also be nice to have a query-time booster by layer or category on the API as well.

/search?text=Tour+Eiffel&prioritize.categories=transport

This would give a higher score to documents with the given categories, but would still show other results.
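
A rough sketch of how such a parameter might be handled, assuming it were added. `prioritize.categories` is only the proposal above and does not exist in the API; the helper names, the `category` and `name.default` field names and the default boost are likewise hypothetical.

```javascript
// Parse the proposed (hypothetical) `prioritize.categories` parameter.
function parsePrioritizedCategories(queryString) {
  const params = new URLSearchParams(queryString);
  const raw = params.get('prioritize.categories') || '';
  return raw.split(',').map((part) => part.trim()).filter(Boolean);
}

// Matching categories go in `should`, so they raise the score of matching
// documents without excluding the rest.
function buildBoostedQuery(text, categories, boost = 2) {
  return {
    bool: {
      must: [{ match: { 'name.default': text } }],
      should: categories.map((value) => ({ term: { category: { value, boost } } }))
    }
  };
}

const categories = parsePrioritizedCategories('?text=Tour+Eiffel&prioritize.categories=transport');
console.log(JSON.stringify(buildBoostedQuery('Tour Eiffel', categories), null, 2));
```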

@missinglink
Member Author

missinglink commented Apr 28, 2020

We (Geocode Earth) are currently looking at this issue again and hope to merge some code which allows for improved venue scoring soon.

related: #385

Comment on lines +124 to +127
aerodrome: {
international: { _score: 10000 },
regional: { _score: 5000 }
},

fwiw I found using the small/medium/large categories from https://ourairports.com/data/airports.csv to be more useful than "international" - there are some small and medium international airports that don't deserve such a boost.

// transportation
aerodrome: {
international: { _score: 10000 },
regional: { _score: 5000 }

also maybe want a downweight on aerodrome:type=military https://www.openstreetmap.org/node/369160593 ?

regional: { _score: 5000 }
},
iata: {
_score: 5000,

how much will this hurt a query for "CVS" that doesn't want the airport?

supermarket: { _score: 2000 },
civic: { _score: 2000 },
government: { _score: 2000 },
hospital: { _score: 2000 },
Member

add historic here for Wrigley field? https://www.openstreetmap.org/relation/1407988

},

// transportation
aerodrome: {
Member

Another tag that might make sense is aeroway:aerodrome. It looks like it's on most of the major international airports like Phoenix and Miami.

_score: 5000,
none: { _score: -5000 }
},
railway: {
Member

Looks like another related tag is public_transport=station:

While the more common tag railway=station is used for all railway stations (i.e. including cargo), the public_transport=station tag is used only on the stations interesting for passenger transport.

This would hopefully help boost some more public transit stops like Park & Market which currently does not do very well in our tests

architect: { _score: 5000 },
heritage: { _score: 5000 },
'heritage:operator': { _score: 2000 },
historic: {
Member

How about adding historic=heritage for La Sagrada Familia?

orangejulius added a commit that referenced this pull request Jul 28, 2020
As described in #537, the
default set in #493, where
all venues that have a calculated popularity below `0` are not imported,
is a bit strict.

This adds a config flag, `imports.openstreetmap.removeDisusedVenues`
that controls whether or not that behavior is activated.

In addition, when enabled, a `warning` is displayed for each removed
record.
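
As a sketch of the behavior this commit message describes (not the importer's actual code), the flag could gate the removal roughly as follows. The config path `imports.openstreetmap.removeDisusedVenues` comes from the message above; the function name `filterDisused`, the document shape, the warning text and treating a missing flag as disabled are assumptions.

```javascript
// Drop documents with negative popularity only when the flag is enabled,
// emitting one warning per removed record.
function filterDisused(docs, config, warn = console.warn) {
  // Assumption: a missing flag is treated as disabled in this sketch.
  const enabled = Boolean(config?.imports?.openstreetmap?.removeDisusedVenues);
  return docs.filter((doc) => {
    if (enabled && doc.popularity < 0) {
      warn(`removing disused venue ${doc.id} (popularity ${doc.popularity})`);
      return false;
    }
    return true;
  });
}

const docs = [
  { id: 'a', popularity: 100 },
  { id: 'b', popularity: -5000 },
  { id: 'c' }
];
const config = { imports: { openstreetmap: { removeDisusedVenues: true } } };
console.log(filterDisused(docs, config).map((doc) => doc.id));
```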
This was referenced Aug 7, 2020
Successfully merging this pull request may close these issues.

Add popularity scores to landmarks