Improve pelias api deduper and label generation to separate identical items #26
Merged
The final displayed name should be used as the basis for deduplication
This removes the kilo -> peshawar problem: kilon asema matches kilo, but peshawar does not.
The name then already matches the search string well in confidence scoring.
…pecify it New principle: confidence is 1 if every requested part matches the data. Kilo matches Kilo perfectly, regardless of missing postal codes or any other attributes.
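A minimal sketch of that principle, assuming a hypothetical `confidence(requested, doc)` helper and illustrative field names. It uses exact, case-insensitive comparison and a plain average; the actual scoring uses fuzzy matching, so treat this only as an illustration of "score only what was requested".

```javascript
// Hypothetical helper: score a document against the parts the user actually requested.
// Attributes that are absent from the query (e.g. no postal code given) are never checked,
// so they cannot lower the score.
function confidence(requested, doc) {
  const keys = Object.keys(requested);
  if (keys.length === 0) return 0;
  const norm = (s) => String(s).trim().toLowerCase();
  let matched = 0;
  for (const key of keys) {
    if (doc[key] !== undefined && norm(doc[key]) === norm(requested[key])) {
      matched += 1;
    }
  }
  return matched / keys.length;
}

// Illustrative sample document; the values are made up for the example.
const sampleDoc = { name: 'Kilo', locality: 'Espoo' };
console.log(confidence({ name: 'kilo' }, sampleDoc)); // 1
console.log(confidence({ name: 'kilo', locality: 'Helsinki' }, sampleDoc)); // 0.5
```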
- Do not consider lack of information as a difference, unless doc types differ
- Option to use a distance threshold for deduping
Dropping the strange z-scores and other unused/duplicate score computations made scoring predictable. There is no need to compensate for anomalies by weighting some known sensible properties.
This helps especially with autocomplete. Vanta matches Vantaa pretty well.
GeoJSON conversion included some important business logic, such as name translation and label generation. Name selection by language is now moved to the translate postprocessor, where it naturally belongs. Label generation code is placed in a new dedicated 'label' module, where more advanced labeling actions can be carried out.
For example: Postitie 2e, Kolho
…le underscore Currently the translate postprocessor translates all document values for which a translation is available. Some doc values, such as layer, should not be translated automatically in place but only by explicit calls (the layer value is used by the UI to assign an icon). Now there is a convention to use a predefined prefix '__' to protect doc sections from accidental translation.
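The prefix convention can be sketched as follows. The `translate` function, the dictionary shape, and the `__layer` key are assumptions made for illustration; only the '__' prefix itself comes from the description above.

```javascript
const PROTECTED_PREFIX = '__';

// Hypothetical in-place translator: replaces string values that have a dictionary entry,
// recursing into nested objects, but skips every key that starts with the protected prefix.
function translate(node, dictionary) {
  for (const key of Object.keys(node)) {
    if (key.startsWith(PROTECTED_PREFIX)) continue; // protected section, left untouched
    const value = node[key];
    if (typeof value === 'string') {
      if (Object.prototype.hasOwnProperty.call(dictionary, value)) {
        node[key] = dictionary[value];
      }
    } else if (value !== null && typeof value === 'object') {
      translate(value, dictionary);
    }
  }
  return node;
}

// Illustrative data: the same raw value appears in a translatable and a protected field.
const exampleDoc = { category: 'station', __layer: 'station' };
translate(exampleDoc, { station: 'asema' });
console.log(exampleDoc.category); // 'asema'
console.log(exampleDoc.__layer);  // 'station'
```

An explicit translation of a protected value would then be a separate, deliberate lookup rather than a side effect of the generic pass.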
On the logic side, the main changes are:
- Do not add layer expansion to names; the UI does that with an icon
- Still consider layer as a separating factor, so that names do not get expanded without a reason
The collection of name expanders can be defined in peliasconfig as an array of function names. Admin area expander can be configured by an array of admin attribute names such as ['neighbourhood'].
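A rough sketch of how configured function names could be resolved and applied. The config keys (`expanders`, `adminAttributes`), the expander names, and the rule "use the first expander that makes every label in a clashing group distinct" are all assumptions for illustration, not the project's actual config schema or algorithm.

```javascript
// Hypothetical config fragment; the real key names are not given in the description.
const config = {
  expanders: ['addressExpander', 'adminExpander'],
  adminAttributes: ['neighbourhood'],
};

// Registry mapping configured function names to implementations.
const registry = {
  addressExpander: (doc) =>
    doc.street ? [doc.street, doc.housenumber].filter(Boolean).join(' ') : null,
  adminExpander: (doc) => {
    for (const attr of config.adminAttributes) {
      if (doc[attr]) return doc[attr];
    }
    return null;
  },
};

// For each group of identical labels, append the output of the first expander
// that yields a distinct, non-empty value for every document in the group.
function separateLabels(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.label)) groups.set(doc.label, []);
    groups.get(doc.label).push(doc);
  }
  for (const group of groups.values()) {
    if (group.length < 2) continue;
    for (const name of config.expanders) {
      const expand = registry[name];
      if (!expand) continue; // unknown function name in config
      const extras = group.map((doc) => expand(doc));
      const distinct = extras.every(Boolean) && new Set(extras).size === extras.length;
      if (distinct) {
        group.forEach((doc, i) => { doc.label = `${doc.label}, ${extras[i]}`; });
        break;
      }
    }
  }
  return docs;
}

// Illustrative input: two results that would otherwise show the same label.
const results = separateLabels([
  { label: 'Alepa', neighbourhood: 'Kallio' },
  { label: 'Alepa', neighbourhood: 'Vallila' },
]);
console.log(results.map((doc) => doc.label)); // [ 'Alepa, Kallio', 'Alepa, Vallila' ]
```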
Each country has a fixed order for street and house number, so new per-country order specs should be added to the source code, not to the config file.
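As an illustration of keeping the ordering in source code, a small lookup table might look like this. The table contents, the default order, and the use of a `country_a` field as the lookup key are assumptions for the sketch, not the project's actual table.

```javascript
// Illustrative per-country ordering table, kept in source code rather than in config.
// Only the listed countries are covered; everything else falls back to DEFAULT_ORDER.
const ADDRESS_ORDER = {
  FIN: ['street', 'housenumber'],
  USA: ['housenumber', 'street'],
};
const DEFAULT_ORDER = ['street', 'housenumber']; // assumed fallback

function formatStreetAddress(doc) {
  const order = ADDRESS_ORDER[doc.country_a] || DEFAULT_ORDER;
  return order.map((field) => doc[field]).filter(Boolean).join(' ');
}

console.log(formatStreetAddress({ country_a: 'FIN', street: 'Postitie', housenumber: '2e' }));
// Postitie 2e
console.log(formatStreetAddress({ country_a: 'USA', street: 'Main St', housenumber: '12' }));
// 12 Main St
```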
With addresses, Turtolan citymarket is a very good match with K-Citymarket Turtola.
The API deduper no longer considers lack of data as a difference if the layer is the same. Of two such duplicates, the document with missing data is dropped. A distance exceeding a threshold is considered a difference.
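These rules can be sketched roughly as below. The compared field set, the flat document shape (`label`, `layer`, `lat`, `lon`), and the "keep the document with more filled fields" tie-break are assumptions; the haversine formula is a standard stand-in for whatever distance computation is actually used.

```javascript
const COMPARED_FIELDS = ['street', 'housenumber', 'postalcode', 'locality']; // assumed field set

// Great-circle distance in metres (haversine formula).
function distanceMeters(a, b) {
  const R = 6371000;
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

function isDuplicate(a, b, thresholdMeters) {
  if (a.label !== b.label) return false; // the displayed label is the basis for deduping
  if (thresholdMeters !== undefined && distanceMeters(a, b) > thresholdMeters) return false;
  const sameLayer = a.layer === b.layer;
  return COMPARED_FIELDS.every((field) => {
    const aMissing = a[field] == null;
    const bMissing = b[field] == null;
    if (aMissing && bMissing) return true;
    // Missing data counts as a difference only when the layers differ.
    if (aMissing || bMissing) return sameLayer;
    return a[field] === b[field];
  });
}

const filledCount = (doc) => COMPARED_FIELDS.filter((f) => doc[f] != null).length;

function dedupe(docs, thresholdMeters) {
  const kept = [];
  for (const doc of docs) {
    const i = kept.findIndex((k) => isDuplicate(k, doc, thresholdMeters));
    if (i === -1) kept.push(doc);
    else if (filledCount(doc) > filledCount(kept[i])) kept[i] = doc; // drop the sparser one
  }
  return kept;
}
```

Calling `dedupe(docs, 100)` would treat same-label, same-layer documents within 100 metres of each other as duplicates unless a field present in both disagrees.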
Confidence scoring is radically improved to ensure that the deduper keeps the best documents. Most original Pelias scoring features are disabled, and the score is computed by fuzzy matching all search elements against the respective document data. The required fuzzy matcher is now more advanced than before and works well with word order variations.
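One simple way to get word-order tolerance is to compare tokens rather than whole strings. The sketch below uses per-token Levenshtein similarity and averages the best match for each query token; this is a substituted, generic technique chosen for illustration, not necessarily the matcher used in this change, and the function names are hypothetical.

```javascript
// Classic Levenshtein edit distance, computed row by row.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical tokens.
function tokenSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Lowercase and split on anything that is not a letter or a digit.
const tokenize = (text) =>
  text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);

// Average, over the query tokens, of the best similarity to any document token.
// Because each query token picks its own best partner, word order does not matter.
function fuzzyMatch(query, text) {
  const queryTokens = tokenize(query);
  const textTokens = tokenize(text);
  if (queryTokens.length === 0 || textTokens.length === 0) return 0;
  let total = 0;
  for (const q of queryTokens) {
    total += Math.max(...textTokens.map((t) => tokenSimilarity(q, t)));
  }
  return total / queryTokens.length;
}

console.log(fuzzyMatch('Vanta', 'Vantaa'));
console.log(fuzzyMatch('Turtolan citymarket', 'K-Citymarket Turtola'));
console.log(fuzzyMatch('kilo', 'Peshawar'));
```

With this scheme the first two calls score high and the last one scores low, in line with the examples mentioned in the commit messages.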
Identical labels are separated by postfixing additional data, such as address details, neighbourhood or category. The extension set is configurable.