idea: subfield for address_parts.number with alpha tokens #502

Open
missinglink opened this issue Jan 22, 2025 · 1 comment

missinglink commented Jan 22, 2025

The peliasHousenumber analyzer strips non-numeric tokens.

As discussed in pelias/pelias#810 this is somewhat unintuitive but actually works very well.

schema/settings.js

Lines 124 to 128 in 41bd2d1

"peliasHousenumber": {
  "type": "custom",
  "tokenizer": "standard",
  "char_filter": ["numeric"]
},

The issue with this is that the original housenumber (including alpha characters) is lost to the document, meaning we can't do later fine-grained sorting on it.
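To make the loss concrete, here is a rough plain-JavaScript approximation of what the analyzer does to a housenumber. It is not the real Elasticsearch analysis chain: it assumes the `numeric` char_filter replaces every non-digit character with a space, and the function name `approxHousenumberTokens` is made up for illustration.

```javascript
// Rough approximation of peliasHousenumber, NOT the real Elasticsearch analyzer.
// Assumption: the "numeric" char_filter replaces each non-digit with a space,
// after which the tokenizer splits the remaining text into tokens.
function approxHousenumberTokens(input) {
  return String(input)
    .replace(/[^0-9]/g, ' ') // drop alpha characters such as the "a" in "10a"
    .split(/\s+/)            // crude stand-in for the standard tokenizer
    .filter(Boolean);        // discard empty strings left by the split
}

console.log(approxHousenumberTokens('10a')); // [ '10' ]
```

Under that assumption "10" and "10a" produce the same token, so nothing in the indexed field can tell them apart.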

As a workaround we're using the phrase.default field to get access to those tokens.

The disadvantage of phrase.default is that it will contain tokens from both the street and the housenumber, potentially producing undesirable matches. For non-address queries it will also contain additional tokens.

In this issue I would like to float the idea of having a 'subfield' of address_parts.number, call it something like address_parts.number.raw and use a different analyzer on it, such as peliasUnit (which doesn't strip the alpha chars).

This would remain backwards compatible while also adding an additional field address_parts.number.raw which contains both alpha and numeric tokens.
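As a sketch of what this could look like, the fragment below uses the Elasticsearch `fields` mapping parameter (multi-fields) to add a `raw` subfield. It assumes the existing `address_parts.number` mapping is a `text` field using `peliasHousenumber`; the variable name is hypothetical and the real mapping file may be laid out differently.

```javascript
// Hypothetical mapping fragment for address_parts.number.
// Assumption: the parent field is currently { type: 'text', analyzer: 'peliasHousenumber' }.
const addressPartsNumber = {
  type: 'text',
  analyzer: 'peliasHousenumber', // unchanged, so existing queries behave as before
  fields: {
    // addressable in queries as address_parts.number.raw
    raw: {
      type: 'text',
      analyzer: 'peliasUnit'     // keeps both alpha and numeric tokens
    }
  }
};

console.log(JSON.stringify(addressPartsNumber, null, 2));
```

Because the parent field definition is untouched, only queries that explicitly name `address_parts.number.raw` would see the new tokens.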

The benefits would be that we can then target this 'raw' field directly in our queries to do unit number sorting, et al.
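One way such a query might look, purely as a sketch: require a match on the numeric field as today, and add an optional clause on the proposed `raw` subfield so that documents whose alpha suffix also matches score higher. The helper name `buildHousenumberQuery` and the boost value are assumptions for illustration, not existing Pelias code.

```javascript
// Hypothetical query builder; address_parts.number.raw is the proposed subfield.
function buildHousenumberQuery(housenumber) {
  return {
    bool: {
      // current behaviour: match on the numeric-only tokens
      must: [
        { match: { 'address_parts.number': { query: housenumber } } }
      ],
      // optional clause: rank documents higher when the raw tokens also match
      should: [
        { match: { 'address_parts.number.raw': { query: housenumber, boost: 2 } } }
      ]
    }
  };
}

console.log(JSON.stringify(buildHousenumberQuery('10a'), null, 2));
```

Keeping the `raw` clause in `should` rather than `must` means a query for "10a" can still return "10" when no exact match exists, just ranked lower.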

The only minor disadvantage would be that the new field would increase the index size on-disk, although I expect this to be insubstantial (<~1%).

Also, if we're not going to use it then there's no sense in adding it.

cc/ @orangejulius @ianthetechie @Joxit

orangejulius commented Jan 22, 2025

Yeah, this is a really good idea. I can't remember if we've discussed it in GitHub issues before, but we should even consider expanding it and having a "strict" and a "loose" subfield for most of our fields.

This could help in a lot of cases, for example:

I'm sure there's more, right?

Using Elasticsearch subfields is pretty critical for this. We've known about them for a long time, and IIRC they're fairly efficient compared to adding an entirely new field.
