Improve pelias api deduper and label generation to separate identical items #26

vesameskanen · 2016-09-16T08:07:22Z

The API deduper no longer treats missing data as a difference when the layers are the same; in that case the document with missing data is dropped. A distance exceeding a threshold counts as a difference.

Confidence scoring is radically improved to ensure that the deduper keeps the best documents. Most of the original Pelias scoring features are disabled, and the score is computed by fuzzy matching all search elements against the respective document data. The required fuzzy matcher is now more advanced than before and works well with word order variations.

Identical labels are separated by appending additional data, such as address details, neighbourhood or category. The set of extensions is configurable.

GeoJSON conversion included some important business logic, such as name translation and label generation. Name selection by language is now moved to the translate postprocessor, where it naturally belongs. Label generation code is placed in a new dedicated 'label' module, where more advanced labeling actions can be carried out.

…le underscore

Currently the translate postprocessor translates all document values for which a translation is available. Some doc values, such as layer, should not be translated automatically in place but only by explicit calls (the layer value is used by the UI to assign an icon). There is now a convention of using a predefined prefix '__' to protect doc sections from accidental translation.
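A minimal sketch of how such a prefix convention might work, assuming the prefix is applied to property keys. The function names, the document layout and the translation table are all hypothetical and not taken from this PR.

```javascript
// Hypothetical sketch: keys starting with '__' mark sections that the
// automatic translator must leave untouched.
const PROTECTED_PREFIX = '__';

// Translate a single value; strings are looked up in the translation table.
function translateValue(value, translations) {
  if (typeof value === 'string') {
    return Object.prototype.hasOwnProperty.call(translations, value)
      ? translations[value]
      : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => translateValue(item, translations));
  }
  if (value !== null && typeof value === 'object') {
    return translateDoc(value, translations);
  }
  return value; // numbers, booleans, null: nothing to translate
}

// Return a translated copy of `doc`, skipping protected sections.
function translateDoc(doc, translations) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    out[key] = key.startsWith(PROTECTED_PREFIX)
      ? value // protected: copied as-is, translated only by explicit calls
      : translateValue(value, translations);
  }
  return out;
}

// Made-up sample data for illustration only.
const sample = translateDoc(
  { category: 'retail', __meta: { layer: 'station' } },
  { retail: 'kauppa', station: 'asema' }
);
console.log(sample.category, sample.__meta.layer);
```

With this layout the UI can still read the untranslated layer value from the protected section to pick an icon.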

vesameskanen added 30 commits September 5, 2016 15:44

apply naming conventions before deduplication

09612d1

The final displayed name should be used as the basis for deduplication

try fuzzy name scoring

bb039af

This makes the kilo -> peshawar problem go away: 'kilon asema' matches 'kilo' but 'peshawar' does not.

Symmetric fuzzy name scoring (uses longer name as base score)

fbfb312

Do not remove street number from name in name confidence evaluation

7023c33

Split street/number flipping and language translation into separate modules

741a5f2

Flip street/number before conf. scoring

0371fd0

Then the name already matches the search string well in conf. scoring.

A new dedicated helper module for fuzzy matching

dd1af3b

Use the new fuzzy match helper to score all strings in conf. scoring

eeb22f1

Score address components whenever any part (not necessarily all) is defined

cb8e612

Do not drop scores because of missing information if query does not specify it

1359e4b

New principle is that confidence is 1 if every requested part matches the data. Kilo matches Kilo perfectly, regardless of missing zipcodes or whatever other attributes.
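The principle stated above could be sketched roughly as follows. This is an illustration only: the flat `query` and `doc` objects, the function names and the exact-match default are assumptions, and the PR itself scores with a fuzzy matcher rather than exact comparison.

```javascript
// Hypothetical sketch: only the parts present in the query are scored.

// Placeholder matcher: 1 for a case-insensitive exact match, else 0.
function exactMatch(a, b) {
  return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
}

// Average the match score over the requested parts only.
function confidence(query, doc, match = exactMatch) {
  const requested = Object.keys(query).filter((key) => query[key] !== undefined);
  if (requested.length === 0) {
    return 1; // nothing requested, so nothing can mismatch
  }
  let total = 0;
  for (const key of requested) {
    // Assumption: a requested part that the document lacks scores 0.
    total += doc[key] === undefined ? 0 : match(query[key], doc[key]);
  }
  return total / requested.length;
}

// The document has no postalcode, but the query did not ask for one.
console.log(confidence({ name: 'Kilo' }, { name: 'Kilo' }));
```

Attributes that the document carries but the query never mentions do not enter the average at all, which is why missing zipcodes cannot lower the score.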

Add second half of a split module (language translator) back to pipelines.

abc6be5

Improve deduper

62f4b0c

- Do not consider lack of information as a difference, unless doc types differ
- Option to use a distance threshold for deduping
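A rough sketch of these two rules, combined with the "document with missing data is dropped" behaviour from the PR description. The field names (`layer`, `name`, `postalcode`, `lat`, `lon`), the compared-field list and all function names are assumptions made for illustration, not the actual deduper code.

```javascript
// Hypothetical sketch of the dedupe rules; not the actual Pelias code.
const COMPARED_FIELDS = ['name', 'postalcode'];

// Great-circle distance in metres (haversine formula).
function distanceMeters(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const R = 6371000; // mean Earth radius in metres
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

function isDifferent(a, b, maxDistance) {
  if (a.layer !== b.layer) return true; // different doc types never merge
  for (const field of COMPARED_FIELDS) {
    // A missing value is not a difference when the layers match.
    if (a[field] === undefined || b[field] === undefined) continue;
    if (a[field] !== b[field]) return true;
  }
  // Optional distance threshold, in metres.
  return maxDistance !== undefined && distanceMeters(a, b) > maxDistance;
}

function countDefined(doc) {
  return COMPARED_FIELDS.filter((field) => doc[field] !== undefined).length;
}

// Keep one document per duplicate group, preferring the more complete one.
function dedupe(docs, maxDistance) {
  const kept = [];
  for (const doc of docs) {
    const i = kept.findIndex((k) => !isDifferent(k, doc, maxDistance));
    if (i === -1) {
      kept.push(doc);
    } else if (countDefined(doc) > countDefined(kept[i])) {
      kept[i] = doc; // the document with missing data is dropped
    }
  }
  return kept;
}
```

Calling `dedupe(docs)` without a second argument disables the distance check, which mirrors the "option" wording of the commit message.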

We do not need custom weights for certain properties any more

d5136a2

Dropping the strange z-scores and other unused/duplicate score computations made scoring predictable. No need to compensate for anomalies by weighting some known sensible properties.

Update unit tests to match new conf. scores

0c96c10

Cleanup: better variable name

e8d67b0

Bugfix: for some reason only first value of array addr property was checked

bb73c84

Apply fuzzy confidence scoring also to admin values

180f32c

This helps especially with autocomplete. Vanta matches Vantaa pretty well.

Configurable query size padding

f21932d

Update config file to match recent developments

c9cdde9

Locality should definitely be part of scored admin properties

89252b7

For example: Postitie 2e, Kolho

Add missing name conversion to translator processing

7518588

A new module for expanding identical labels with configurable qualifiers

9396781

Add new translations for label generation

fdd8624

labelSchema now processes non-jsonified document. Consider arrays accordingly.

164af91

New postprocessing module tested, initial bugs (lots of them!) fixed.

526a1ba

On logic side, the main changes are:
- Do not add layer expansion to names, UI does that with an icon
- Still consider layer as a separating factor so that names do not get expanded without a reason

Expand name by admin details (neighbourhood or locality)

500bf8e

Make name expander configurable

fa9406d

The collection of name expanders can be defined in peliasconfig as an array of function names. Admin area expander can be configured by an array of admin attribute names such as ['neighbourhood'].
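One way such a configuration could drive label expansion is sketched below. The config keys, expander names and document fields are hypothetical; only the idea of listing expander function names and admin attribute names in the config comes from the commit message. The sketch also treats layer as a separating factor, so equal labels in different layers are left alone.

```javascript
// Hypothetical config section; the real peliasconfig keys may differ.
const labelConfig = {
  expanders: ['expandByAdmin', 'expandByCategory'],
  adminAttributes: ['neighbourhood', 'locality'],
};

// Registry that resolves configured function names to implementations.
const expanderRegistry = {
  // Append the first configured admin attribute the document has.
  expandByAdmin(doc, config) {
    const attr = config.adminAttributes.find((name) => doc[name]);
    return attr ? `${doc.label}, ${doc[attr]}` : doc.label;
  },
  // Append the category in parentheses when one is available.
  expandByCategory(doc, config) {
    return doc.category ? `${doc.label} (${doc.category})` : doc.label;
  },
};

// Apply each configured expander in turn, but only to documents whose
// label still collides with another document in the same layer.
function separateLabels(docs, config) {
  const result = docs.map((doc) => ({ ...doc }));
  const keyOf = (doc) => `${doc.layer}|${doc.label}`;
  for (const name of config.expanders) {
    const counts = new Map();
    for (const doc of result) {
      counts.set(keyOf(doc), (counts.get(keyOf(doc)) || 0) + 1);
    }
    for (const doc of result) {
      if (counts.get(keyOf(doc)) > 1) {
        doc.label = expanderRegistry[name](doc, config);
      }
    }
  }
  return result;
}
```

Labels that are already unique within their layer pass through unchanged, so names are not expanded without a reason.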

Add retail to category translations

b9637a2

vesameskanen added 4 commits September 12, 2016 14:16

Fix bugs in geographic location labeling

0a3f28e

Address order should not be configurable

0ff8af7

Each country has a fixed way for ordering street and house number. So, new per country order specs should be added to source code, not to the config file.
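As an illustration of keeping the order specs in source code, a per-country table might look like the sketch below. The country codes, the fallback choice and the function name are assumptions and not taken from the PR.

```javascript
// Hypothetical per-country order specs, kept in source code rather than config.
const ADDRESS_ORDER = new Map([
  ['FIN', (street, number) => `${street} ${number}`], // e.g. "Postitie 2e"
  ['USA', (street, number) => `${number} ${street}`],
]);

// Assumption: unknown countries fall back to street-first ordering.
function formatAddress(street, number, country) {
  const order = ADDRESS_ORDER.get(country) || ADDRESS_ORDER.get('FIN');
  return order(street, number);
}

console.log(formatAddress('Postitie', '2e', 'FIN'));
```

Supporting a new country then means adding one entry to the table and shipping it with the code.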

Bugfix: check existence of config section before referring to its sub field

d2bcfb9

Fuzzy string match which understands changed word order

e1c936c

With addresses, Turtolan citymarket is a very good match with K-Citymaket Turtola.
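A word-order-tolerant matcher can be sketched by comparing tokens instead of whole strings. The version below is a generic token-based Levenshtein approach written for illustration; it is not the helper module added in this PR, and all function names are hypothetical.

```javascript
// Classic Levenshtein edit distance, computed with two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the two strings are identical.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Lowercase and split on anything that is not a letter or a digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Average, over the tokens of `from`, of the best similarity found in `to`.
function directional(from, to) {
  let total = 0;
  for (const token of from) {
    total += Math.max(...to.map((other) => similarity(token, other)));
  }
  return total / from.length;
}

// Symmetric score: the lower of the two directional scores.
function fuzzyMatch(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 || tb.length === 0) {
    return ta.length === tb.length ? 1 : 0;
  }
  return Math.min(directional(ta, tb), directional(tb, ta));
}

console.log(fuzzyMatch('Turtolan citymarket', 'K-Citymaket Turtola'));
```

Because each token looks for its best counterpart anywhere in the other string, swapping the word order does not change the score, while small spelling differences only lower it slightly.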

hannesj merged commit 9a6167d into master Sep 21, 2016

vesameskanen deleted the improve-deduper branch September 22, 2016 08:34

hannesj mentioned this pull request Nov 15, 2016

Idea: add more detail to labels based on other results pelias/labels#8

Open
