Improve pelias api deduper and label generation to separate identical items #26

Merged
merged 34 commits into from
Sep 21, 2016

vesameskanen
Member

The API deduper no longer treats missing data as a difference when the layer is the same. The document with missing data is dropped. A distance exceeding a threshold is treated as a difference.
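
The deduplication rule described above could be sketched roughly as follows. This is not the code from this pull request: the field list, the `center` property, the 100 m threshold and all function names are assumptions chosen for illustration.

```javascript
// Hypothetical sketch of the dedupe rule; names and values are assumptions.
const FIELDS = ['name', 'street', 'housenumber', 'postalcode'];
const DISTANCE_THRESHOLD_KM = 0.1; // assumed threshold

// Great-circle distance between two {lat, lon} points, in kilometres.
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

function isDifferent(a, b) {
  // Documents too far apart are never duplicates.
  if (haversineKm(a.center, b.center) > DISTANCE_THRESHOLD_KM) return true;
  const sameLayer = a.layer === b.layer;
  return FIELDS.some((f) => {
    if (a[f] === undefined || b[f] === undefined) {
      // A value missing on one side counts only when the layers differ.
      return !sameLayer && a[f] !== b[f];
    }
    return a[f] !== b[f];
  });
}

// Number of known fields, used to decide which duplicate to keep.
const filled = (doc) => FIELDS.filter((f) => doc[f] !== undefined).length;

function dedupe(docs) {
  const kept = [];
  for (const doc of docs) {
    const i = kept.findIndex((k) => !isDifferent(k, doc));
    if (i === -1) kept.push(doc);
    else if (filled(doc) > filled(kept[i])) kept[i] = doc; // drop the sparser one
  }
  return kept;
}
```

With this sketch, two `venue` documents named `Kilo` at the same location collapse into the one that carries a postal code, while a third `Kilo` a hundred kilometres away survives as a separate result.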

Confidence scoring is radically improved to ensure that the deduper keeps the best documents. Most original Pelias scoring features are disabled, and the score is computed by fuzzy matching all search elements with the respective document data. The required fuzzy matcher is now more advanced than before and works well with word order variations.
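
As an illustration of the idea only, the sketch below scores each requested search element against the matching document field with a token-based Levenshtein similarity, which tolerates word order changes. The matcher actually used in this pull request is not shown here; the technique, the function names and the rule that a requested but missing field scores 0 are all assumptions.

```javascript
// Hypothetical sketch; a stand-in for the real fuzzy matcher.

// Classic Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const up = row[j];
      row[j] = Math.min(up + 1, row[j - 1] + 1, diag + (a[i - 1] === b[j - 1] ? 0 : 1));
      diag = up;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1]; 1 means identical strings.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Lowercase and split on anything that is not a letter or a digit.
const tokenize = (text) => text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);

// Average, over query tokens, of the best match among the text tokens.
// Word order does not matter because every token is matched independently.
function tokenScore(query, text) {
  const q = tokenize(query);
  const t = tokenize(text);
  if (q.length === 0 || t.length === 0) return 0;
  const best = q.map((qt) => Math.max(...t.map((tt) => similarity(qt, tt))));
  return best.reduce((sum, s) => sum + s, 0) / best.length;
}

// Only the parts present in the request are scored, so attributes the user
// did not ask for cannot lower the confidence.
function confidence(requested, doc) {
  const keys = Object.keys(requested);
  if (keys.length === 0) return 0;
  const scores = keys.map((k) => (doc[k] === undefined ? 0 : tokenScore(requested[k], doc[k])));
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```

Under these assumptions `confidence({ name: 'Kilo' }, { name: 'Kilo' })` is 1 whether or not the document has a postal code, and `tokenScore('kilo', 'kilon asema')` is higher than `tokenScore('kilo', 'peshawar')`.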

Identical labels are separated by postfixing additional data, such as address details, neighbourhood or category. The extension set is configurable.
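
A minimal sketch of how such label separation might work is given below. The expander names, the config shape and the `, ` separator are hypothetical, not taken from this pull request. Items are grouped by layer and label, so identical labels in different layers are left alone.

```javascript
// Hypothetical sketch; expander names and config shape are assumptions.
// Each expander returns a string to append, or undefined if it has no data.
const expanders = {
  address: (doc) =>
    doc.street ? [doc.street, doc.housenumber].filter(Boolean).join(' ') : undefined,
  adminArea: (doc, cfg) => cfg.adminAttributes.map((attr) => doc[attr]).find(Boolean),
  category: (doc) => (doc.category || [])[0],
};

const defaultConfig = {
  expanders: ['address', 'adminArea', 'category'], // tried in this order
  adminAttributes: ['neighbourhood'],
};

function separateLabels(docs, cfg = defaultConfig) {
  // Group documents that would be displayed identically.
  const groups = new Map();
  for (const doc of docs) {
    const key = JSON.stringify([doc.layer, doc.label]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc);
  }
  for (const group of groups.values()) {
    if (group.length < 2) continue; // label is already unique
    for (const name of cfg.expanders) {
      const extras = group.map((doc) => expanders[name](doc, cfg));
      // Use an expander only if it tells every item in the group apart.
      if (extras.every(Boolean) && new Set(extras).size === group.length) {
        group.forEach((doc, i) => { doc.label = `${doc.label}, ${extras[i]}`; });
        break;
      }
    }
  }
  return docs;
}
```

Note that this sketch modifies the `label` of the documents in place and leaves a group untouched when no single expander separates all of its items.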

The final displayed name should be used as the basis for deduplication.
This makes the kilo -> peshawar problem go away: kilon asema matches kilo, but
peshawar does not.
The name already matches the search string well in confidence scoring.
…pecify it

The new principle is that confidence is 1 if every requested part matches
the data. Kilo matches Kilo perfectly, regardless of missing zipcodes
or any other attributes.
- Do not consider lack of information as a difference, unless doc types differ
- Option to use a distance threshold for deduping
Dropping the strange z-scores and other unused/duplicate score
computations made scoring predictable. No need to compensate for
anomalies by weighting some known sensible properties.
This helps especially with autocomplete. Vanta matches Vantaa pretty well.
GeoJSON conversion included some important business logic, such as
name translation and label generation. Name selection by language
is now moved to the translate postprocessor, where it naturally belongs.
Label generation code is placed in a new dedicated 'label' module,
where more advanced labeling actions can be carried out.
…le underscore

Currently the translate postprocessor translates all document values for
which there is a translation available. Some doc values, such as layer,
should not be translated automatically in place but by explicit calls
(the layer value is used by the UI to assign an icon). Now there's a convention
to use a certain predefined prefix '__' to protect doc sections from
accidental translation.
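
The prefix convention could look roughly like the sketch below. The function name, the dictionary shape and the example keys are assumptions made for illustration; only the '__' prefix comes from the commit message above.

```javascript
// Hypothetical sketch of prefix-protected translation.
const PROTECT_PREFIX = '__';

// Recursively replace string values that have a translation, skipping any
// section whose key starts with the protecting prefix.
function translate(value, dict) {
  if (typeof value === 'string') {
    return Object.prototype.hasOwnProperty.call(dict, value) ? dict[value] : value;
  }
  if (Array.isArray(value)) return value.map((v) => translate(v, dict));
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, v] of Object.entries(value)) {
      out[key] = key.startsWith(PROTECT_PREFIX) ? v : translate(v, dict);
    }
    return out;
  }
  return value; // numbers, booleans, null
}

// Usage: 'station' is translated in place, but not inside the '__raw' section.
const translated = translate(
  { category: ['station'], __raw: { layer: 'station' } },
  { station: 'asema' }
);
```

Code that needs a translated layer name would then call the translation explicitly on the protected value instead of relying on the automatic pass.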
On the logic side, the main changes are:
- Do not add layer expansion to names, UI does that with an icon
- Still consider layer as a separating factor so that names do not
  get expanded without a reason
The collection of name expanders can be defined in peliasconfig
as an array of function names. Admin area expander can be configured
by an array of admin attribute names such as ['neighbourhood'].
Each country has a fixed way of ordering street and house number.
So, new per-country order specs should be added to the source code,
not to the config file.
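
A per-country order spec kept in source code might be as small as the table below. The table name, the use of three-letter country codes in a `country_a` property and the default order are assumptions for illustration, not the contents of this pull request.

```javascript
// Hypothetical sketch; the table contents and property names are assumptions.
const ADDRESS_ORDER = {
  FIN: ['street', 'housenumber'], // e.g. "Mannerheimintie 5"
  USA: ['housenumber', 'street'], // e.g. "5 Main St"
};
const DEFAULT_ORDER = ['housenumber', 'street']; // assumed fallback

// Join the available address parts in the order used by the document's country.
function formatAddress(doc) {
  const order = ADDRESS_ORDER[doc.country_a] || DEFAULT_ORDER;
  return order.map((part) => doc[part]).filter(Boolean).join(' ');
}
```

Supporting a new country then means adding one entry to the table, which is a code change rather than a configuration change.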
With addresses, Turtolan citymarket is a very good match with
K-Citymaket Turtola.