Make truncation of keyword field values easier #60329
Comments
Pinging @elastic/es-search (:Search/Mapping)
Pinging @elastic/es-search (:Search/Analysis)
Pinging @elastic/es-core-features (:Core/Features/Ingest)
I forgot one option. Maybe we can also add a parameter to the 'keyword' tokenizer that allows truncating the input after a certain length.
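For context, the existing `truncate` token filter can already be combined with the `keyword` tokenizer in a custom analyzer; what it cannot do is appear in a normalizer, which is what keyword fields use. A rough sketch (the index name, filter name, and the length of 100 are invented for illustration):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "limit_to_100": {
          "type": "truncate",
          "length": 100
        }
      },
      "analyzer": {
        "truncated_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "limit_to_100" ]
        }
      }
    }
  }
}
```

An analyzer like this can only be attached to a `text` field, so it does not by itself give you a truncated, sortable `keyword` field.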
I created a GitHub account to post here... I have exactly the same issue. I want to index a lot of text fields but have them sortable via a keyword sub-field, as recommended in the documentation. Keeping fielddata enabled gives us all sorts of grief. It makes sense to me: search on the text, sort on the keyword. The text fields can easily exceed the limits, and keeping the full value in the keyword field is a waste of space. Using ignore_above means I lose the keyword (and the ability to sort on it). What I really want is to keep a (truncated) keyword so the field stays sortable even when the text exceeds the limit.
Taking the "ignore_above" field as an example, I added a parameter to KeywordFieldMapper named "truncate_above" that truncates the content when its length exceeds the configured limit. With the following mapping:

```json
{
  "properties": {
    "keyField": {
      "type": "keyword",
      "truncate_above": 5
    },
    "rawField": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword",
          "truncate_above": 5
        }
      }
    }
  }
}
```

the keyField and rawField.raw values are automatically truncated to at most 5 bytes.
In the KeywordFieldMapper, we can add a new config named "truncate_above".
@waitingF thanks for the PR. Unfortunately, we haven't decided yet whether adding a new parameter to the keyword field is the route we want to go here. I think allowing certain token filters in normalizers is still a valid option.
@cbuescher thanks for the reply. But maybe adding a new parameter is the easiest way in this case.
@cbuescher could you please paste a sample mapping that shows how token filters would truncate a keyword value? By the way, I think a filter lives at the analyzer level, and filters must be set on every field. However, truncate_above would be a property of the keyword field, like ignore_above, and its default value would apply to all keyword fields. It makes sense to add a truncate_above config to KeywordFieldMapper. This is the code in KeywordFieldMapper:

```java
private final Parameter<Integer> ignoreAbove
    = Parameter.intParam("ignore_above", true, m -> toType(m).ignoreAbove, Integer.MAX_VALUE);
private final Parameter<Integer> truncateAbove
    = Parameter.intParam("truncate_above", true, m -> toType(m).truncateAbove, MAX_TRUNCATE_LENGTH);
```
I just wanted to point out that 32766 may still be dangerous: Lucene's maximum is 32766 bytes, not characters. Ideally, truncate_above could be specified in bytes as well, but for now I'd be happy with just having a truncate_above with a default of 8191 (or less).
I came across the same problem. My workaround was to create the keyword as a separate field and truncate it in my own code, which is not very elegant. So any improvement on this would be very welcome, be it allowing the truncate token filter in normalizers or adding a specific option for keyword fields. What are the reasons the truncate filter is not allowed in normalizers in the first place?
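For reference, the kind of normalizer being asked about would look roughly like the sketch below (the names and the length value are invented). As of this discussion, Elasticsearch rejects such an index definition because `truncate` is not one of the filters permitted inside normalizers, which is exactly the limitation this issue is about:

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "limit_to_100": {
          "type": "truncate",
          "length": 100
        }
      },
      "normalizer": {
        "truncating_normalizer": {
          "type": "custom",
          "filter": [ "lowercase", "limit_to_100" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "keyword",
        "normalizer": "truncating_normalizer"
      }
    }
  }
}
```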
Please excuse the long silence on this one; we only recently got around to discussing the options available here.
This leaves the "allow the 'truncate' token filter in normalizers" and "add a truncation parameter to the keyword tokenizer" options.
We think that scripting is a relatively easy way to achieve the desired functionality and even adds more flexibility than the additional parameter suggested here. Typically this type of data cleaning should already happen client side, but when there is no control over that, scripting is a nice way to achieve something similar.
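As a concrete illustration of the scripting route, a keyword-typed runtime field can emit a truncated copy of a longer source field. This is only a sketch under assumptions: the source field `message`, the target name `message_sort`, and the 256-character limit are all invented, and the exact syntax depends on the Elasticsearch version in use:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "message": { "type": "text" }
    },
    "runtime": {
      "message_sort": {
        "type": "keyword",
        "script": {
          "source": "def m = params._source.message; if (m != null) { emit(m.length() > 256 ? m.substring(0, 256) : m); }"
        }
      }
    }
  }
}
```

In recent versions a similar `script` can also be set on a regular, indexed `keyword` field, which avoids paying the cost at search time.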
@cbuescher, in the OP you state:
I'm an ops/systems guy, so "learning a new programming language/environment", even if it is "relatively easy" for you, is a much greater hurdle for me, especially compared to "add a parameter to a query". If I'm the only one for whom that's the case, then this is obviously not a large issue and can remain closed... but if others are similarly inclined, they can re-open the issue.
Hi @dylan-tock, I understand your questions and have to admit that doing keyword truncation in an index-time script isn't as easy as having that operation sit somewhere in the analysis chain, but given the current options it's doable. We are already thinking about how to make this even easier on the scripting side, but index-time scripting, added together with runtime fields, is exactly meant for general, flexible pre-processing like this (if you cannot do it somewhere outside in your own application). To your specific questions, I don't think the answer to any of them would be different or simpler with one of the other options we mentioned (i.e. a tokenizer parameter or allowing the truncation filter in normalizers). Let me try to quickly explain why:
Yes, the script would need to be applied on a per-field basis, but the same goes for any mapping parameter the other options would need. The need for a "no explicit mapping" configuration is addressed by dynamic templates (see the sketch after these answers).
No
Same as above, this would be a template. In the reindex scenario you describe, detecting whether a field is currently a "keyword" type doesn't work with the template matching rules, though; they operate on the detected datatype of the input value (i.e. string). I understand the reindexing problem you describe, but it wouldn't be any different with a mapping parameter instead of a script.
This should be relatively lightweight, but yes, it's not a free operation. Then again, doing the same operation inside analysis via a parameter wouldn't be free either. I hope these answers give an idea of why an analysis parameter wouldn't be any easier in your case. However, we will keep thinking about this issue going forward, try to better document the scripting examples, and keep truncation of keywords in mind as an important use case while we make our index scripting feature easier to use.
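The dynamic template mentioned above could look roughly like the following sketch (this just reproduces the common text-plus-keyword pattern with `ignore_above`; the template name and limit are invented). The point is that whichever truncation mechanism is eventually chosen, a script or a parameter, it would go inside the `mapping` block so it applies to every new string field without listing each one explicitly:

```json
PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_with_sortable_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    ]
  }
}
```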
@dylan-tock you are not the only one. It has happened to me multiple times that I missed critical error messages in Kibana because of keywords that were too long. In my case (logging and monitoring), truncate is much more feasible than ignore_above, and setting it as the default behavior would be great. It seems that issue #91680 is not related to this. So is there still no easier solution for truncation in sight?
The time needed to attain the necessary knowledge and build a workable solution that handles edge cases and works reliably is beyond what an ops-focused user (such as myself) is likely to be able to allocate. Sadly, the docs and examples right now are, for me, similar to this:
Sometimes it's not "[my] own application" but someone else's that's sending me data to ingest, and I've got no control over that data (something you acknowledge in the first post as the targeted use case). And the "general flexibility" you refer to is much like the flexibility of a flat bit of wood and some really sharp engraving tools. I would say "paper and pencil", but to do the scripting you suggest requires that I use
After I originally read through your response, I spent another few hours trying to get something to work, then decided to find the documents with fields that were too long and delete them semi-manually. I'd like to have kept the documents and the data within them, but without the ability to dedicate the necessary time that was not a viable option. I am hopeful that in the future there will be better scripting documentation and/or keyword value truncation will be used as an example, but unless/until that happens this is something that I or a co-worker will need to handle manually.
Currently it seems difficult for users that are not completely in control of the data they ingest into a keyword field to truncate those values (see #57984).
Lucene enforces a maximum term length of 32766 bytes which, when exceeded, causes the indexed document to be rejected, so a user reading values they don't control, e.g. from a database, needs to prevent this somehow.
Here are some things that don't immediately work:
Using a 'script' ingest processor for truncation seems like a viable, but not the easiest, option.
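For illustration, such an ingest pipeline could look like the sketch below. The pipeline name, field name, and 256-character limit are assumptions, not anything prescribed by this issue:

```json
PUT _ingest/pipeline/truncate-message
{
  "description": "Truncate overly long message values before indexing",
  "processors": [
    {
      "script": {
        "source": "if (ctx.message != null && ctx.message.length() > 256) { ctx.message = ctx.message.substring(0, 256); }"
      }
    }
  ]
}
```

Documents then need to be indexed with `?pipeline=truncate-message` (or via the index's `default_pipeline` setting), which is part of what makes this less convenient than a mapping-level option.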
I'm opening this issue to discuss the following options: