idea: subfield for `address_parts.number` with alpha tokens #502

missinglink · 2025-01-22T10:52:48Z

The peliasHousenumber analyzer strips non-numeric tokens.

As discussed in pelias/pelias#810, this is somewhat unintuitive but actually works very well.

Lines 124 to 128 in 41bd2d1

    
    "peliasHousenumber": {
      "type": "custom",
      "tokenizer": "standard",
      "char_filter": ["numeric"]
    },

The issue with this is that the original housenumber (including alpha characters) is lost to the document, meaning we can't do later fine-grained sorting on it.
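To make the loss concrete, here is a rough JavaScript approximation of what the analyzer does to a few housenumbers. It assumes the `numeric` char filter replaces every non-digit character with a space (an assumption, since the filter's definition isn't shown above), and it stands in for the `standard` tokenizer with a plain whitespace split, which is only adequate because nothing but digits and spaces remain at that point. `approximateHousenumberTokens` is a hypothetical helper, not part of Pelias.

```javascript
// Rough approximation of the peliasHousenumber analysis chain.
// ASSUMPTION: the "numeric" char_filter replaces each non-digit with a space.
// ASSUMPTION: a whitespace split approximates the standard tokenizer here,
// because only digits and spaces are left after the char filter runs.
function approximateHousenumberTokens(input) {
  const digitsOnly = String(input).replace(/[^0-9]/g, ' ');
  return digitsOnly.split(/\s+/).filter((token) => token.length > 0);
}

// '10a' and '10b' collapse to the same single token, so the indexed
// field alone can no longer tell them apart.
console.log(approximateHousenumberTokens('10a'));
console.log(approximateHousenumberTokens('10b'));

// A separator such as '/' splits the housenumber into two numeric tokens.
console.log(approximateHousenumberTokens('38/2'));
```

Under these assumptions `10a` and `10b` are indistinguishable at query time, which is why any fine-grained sorting needs the alpha tokens from somewhere else.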

As a workaround we're using the phrase.default field to get access to those tokens.

The disadvantage of phrase.default is that it will contain tokens from both the street and the housenumber, potentially producing undesirable matches. For non-address queries it will also contain additional tokens.

In this issue I would like to float the idea of having a 'subfield' of address_parts.number, called something like address_parts.number.raw, which uses a different analyzer such as peliasUnit (which doesn't strip the alpha chars).

This would remain backwards compatible while also adding an additional field address_parts.number.raw which contains both alpha and numeric tokens.
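As a sketch of what that could look like, the mapping below uses Elasticsearch's multi-field (`fields`) parameter, written as a plain JavaScript object. It assumes the existing field is of type `text` with the `peliasHousenumber` analyzer; the name `raw` and the choice of `peliasUnit` come from the proposal above, and `numberMapping` is just an illustrative variable name.

```javascript
// Hypothetical mapping for address_parts.number with a "raw" multi-field.
// ASSUMPTION: the existing field is type "text" analyzed with peliasHousenumber;
// only the "fields" block is new.
const numberMapping = {
  type: 'text',
  analyzer: 'peliasHousenumber', // unchanged, so existing queries keep working
  fields: {
    raw: {
      type: 'text',
      analyzer: 'peliasUnit' // proposed: does not strip the alpha characters
    }
  }
};

// Show where the field would sit inside the document mapping.
const mappingFragment = {
  address_parts: { properties: { number: numberMapping } }
};
console.log(JSON.stringify(mappingFragment, null, 2));
```

Because the parent field's type and analyzer are untouched, anything that queries `address_parts.number` today should see the same tokens as before; the subfield only adds a second analysis of the same source value.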

The benefit would be that we can then target this 'raw' field directly in our queries to do unit number sorting, etc.
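One way such a query might be shaped, purely as an illustration: keep the existing match on the numeric field as a hard requirement, and add a `should` clause on `address_parts.number.raw` so documents whose alpha tokens also match score higher. `buildHousenumberQuery` is a hypothetical helper and is not how the Pelias API currently builds its queries.

```javascript
// Hypothetical helper: build an Elasticsearch query body that requires a
// match on the numeric housenumber and boosts exact alpha-aware matches.
// ASSUMPTION: the address_parts.number.raw subfield from this proposal exists.
function buildHousenumberQuery(housenumber) {
  return {
    query: {
      bool: {
        // Existing behaviour: numeric tokens must match.
        must: [
          { match: { 'address_parts.number': housenumber } }
        ],
        // Proposed addition: prefer documents whose raw tokens also match.
        should: [
          { match: { 'address_parts.number.raw': housenumber } }
        ]
      }
    }
  };
}

console.log(JSON.stringify(buildHousenumberQuery('10a'), null, 2));
```

With this shape a search for `10a` would still return `10b`, but a document indexed as `10a` would be expected to rank above it.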

The only minor disadvantage would be that the new field would increase the index size on-disk, although I expect this to be insubstantial (<~1%).

Also, if we're not going to use it then there's no sense in adding it.

cc/ @orangejulius @ianthetechie @Joxit

orangejulius · 2025-01-22T15:27:39Z

Yeah, this is a really good idea. I can't remember if we've discussed it in GitHub issues before, but we should even consider expanding it and having a "strict" and a "loose" subfield for most of our fields.

This could help in a lot of cases, for example:

- Scoring when diacriticals matter, such as Sorting of Huttenstrasse vs Hüttenstrasse
- Scoring/matching apostrophes or plurals, as mentioned in Consider adding apostrophe tokenfilter #434
- Housenumbers with separating characters (like Via del Ponticello 38/2 Trieste italy). There's no issue for this yet AFAIK.

I'm sure there's more, right?

Using Elasticsearch subfields is pretty critical for this. We've known about them for a long time, and IIRC they're fairly efficient compared to adding an entirely new field.

missinglink added the enhancement label Jan 22, 2025
