Make truncation of keyword field values easier #60329

Closed

cbuescher opened this issue Jul 28, 2020 · 17 comments

Comments

@cbuescher
Member

Currently it seems difficult for users who are not completely in control of the data they ingest into a keyword field to truncate those values (see #57984).
Lucene enforces a maximum term length of 32766 bytes which, when exceeded, causes the indexed document to be rejected, so a user reading e.g. from a database with uncontrolled values needs some way to prevent this.

Here are some things that don't immediately work:

  • using the 'length' or 'truncate' token filters isn't currently allowed in keyword normalizers
  • using the keyword field's 'ignore_above' option prevents the document from being rejected, but it also ignores those values completely, even when it would otherwise be fine to store the truncated versions and e.g. sort on them

Using a 'script' ingest processor for truncation seems viable, but it is not the easiest option.
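For illustration, here is a rough sketch of what such a 'script' ingest processor could look like (the pipeline name, field name, and 256-character cut-off are made up):

PUT _ingest/pipeline/truncate_keywords
{
  "processors": [
    {
      "script": {
        "source":
        """
        def v = ctx.my_keyword_field;
        if (v instanceof String && v.length() > 256) {
          ctx.my_keyword_field = v.substring(0, 256);
        }
        """
      }
    }
  ]
}

PUT my-index/_doc/1?pipeline=truncate_keywords
{
  "my_keyword_field": "a possibly very long value ..."
}

If needed, such a pipeline could also be attached to the index via the index.default_pipeline setting instead of per request.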

I'm opening this issue to discuss the following options:

  • should we allow at least the 'truncate' token filter in normalizers?
  • should we add a keyword field option that safely truncates input values?
  • maybe this would also be a reason to introduce a dedicated 'truncate' ingest processor that's easier to use than a script?
@cbuescher cbuescher added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types :Search Relevance/Analysis How text is split into tokens :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jul 28, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@elasticmachine elasticmachine added Team:Search Meta label for search team Team:Data Management Meta label for data/management team labels Jul 28, 2020
@cbuescher
Member Author

I forgot one option. Maybe we can also add a parameter to the 'keyword_tokenizer' that allows truncating the input after a certain length.

@ennova007

ennova007 commented Oct 28, 2020

I created a GitHub account to post here... I have exactly the same issue. I want to index a lot of text fields, but have them sortable via a keyword sub-field as recommended in the documentation. Keeping fielddata enabled gives us all sorts of grief. It makes sense to me: search on the text and sort on the keyword. The text fields can easily exceed the limits and it is a waste of space. Using ignore_above means I lose my keyword (and the ability to sort on it).

What I really want to do is:

"normalizer": {
  "truncate_keyword_normalizer": {
    "type": "custom",
    "filter": [ "lowercase", "trunc_256" ]
  }
},
"filter": {
  "trunc_256": {
    "type": "truncate",
    "length": 256
  }
}

with field mappings like:

"C1": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "normalizer": "truncate_keyword_normalizer"
    }
  },
  "analyzer": "string_lowercase"
},

@waitingF

take the "ignore_above" field as example, I add a field in KeywordFieldMapper named "truncate_above" to truncate content when the length of content exceed the truncate length.

with the following mapping,

{
   "properties":{
      "keyField":{
         "type":"keyword",
         "truncate_above":5	// reserve 5 bytes at most
      },
      "rawField":{
         "type":"text",
         "fields":{
            "raw":{
               "type":"keyword",
               "truncate_above":5	// reserve 5 bytes at most
            }
         }
      }
   }
}

With this mapping, keyField and rawField.raw will automatically truncate their content.

@waitingF

In the KeywordFieldMapper, we can add a new config named "truncate_above".
The default value of truncate_above would be 32766, which is the maximum term length; this default guarantees that no document is rejected during indexing.

@cbuescher
Member Author

@waitingF thanks for the PR; unfortunately we haven't decided yet whether adding a new parameter to the keyword field is the route we want to take here. I think allowing certain token filters in normalizers is still a valid option.

@waitingF

@cbuescher thanks for the reply. But maybe adding a new parameter is the easiest way in this case.
I will also think about allowing certain token filters in normalizers.

@waitingF

waitingF commented Mar 2, 2022

@cbuescher could you please paste a sample mapping that shows how token filters would truncate a keyword value?

By the way, I think a filter works at the analyzer level, and filters must be configured on every field. However, truncate_above would be a property of the keyword field, like ignore_above, and its default value would apply to all keyword fields. It makes sense to add a truncate_above config to KeywordFieldMapper.

This is the code in KeywordFieldMapper:

        private final Parameter<Integer> ignoreAbove
            = Parameter.intParam("ignore_above", true, m -> toType(m).ignoreAbove, Integer.MAX_VALUE);
        private final Parameter<Integer> truncateAbove
            = Parameter.intParam("truncate_above", true, m -> toType(m).truncateAbove, MAX_TRUNCATE_LENGTH);

@EmilBode

In the KeywordFieldMapper, we can add a new config named "truncate_above". The default value of truncate_above would be 32766, which is the maximum term length; this default guarantees that no document is rejected during indexing.

I just wanted to point out that 32766 may still be dangerous. Lucene has a maximum of 32766 bytes, not characters. At least the ignore_above parameter accepts a length in characters, and I've found that in some cases I had to set it to 8191 to prevent failures when ingesting long, non-Latin texts.

The best case would be to specify the truncate_above value in bytes as well, but for now I'd be happy with just having a truncate_above with a default of 8191 (or less).
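To make the bytes-versus-characters mismatch concrete, here is a rough sketch (the index name and values are made up): a value well under 32766 characters, and therefore not caught by ignore_above, can still exceed 32766 UTF-8 bytes and be rejected.

PUT bytes_test
{
  "mappings": {
    "properties": {
      "kw": { "type": "keyword", "ignore_above": 32766 }
    }
  }
}

PUT bytes_test/_doc/1
{
  "kw": "日日日..."	// ~15000 CJK characters: below ignore_above, but roughly 45000 UTF-8 bytes, so Lucene still rejects the document
}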

@Lorenzschaef

I came across the same problem. My workaround was to create the keyword as a separate field and truncate it in my own code (roughly as sketched below). This is not very elegant, as it:

  • Makes my code more complex
  • Pollutes the _source
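A rough sketch of that workaround, with made-up field names; the client truncates the extra field before indexing, and the truncated copy also ends up in _source:

PUT my-index/_doc/1
{
  "description": "the full, possibly very long text ...",
  "description_sort": "the full, possibly ve"	// truncated client-side to e.g. 256 characters
}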

So any improvement on this would be very welcome, be it allowing the truncate token filter in normalizers or adding a specific option for keyword fields.

What are the reasons the truncate filter is not allowed in normalizers in the first place?

@cbuescher
Member Author

Please excuse the long silence on this one; we only recently got around to discussing the options available here.
We really don't want to put yet another parameter on the keyword field type. The reasons for this are:

  • field types already have too many parameters as it is, affecting maintenance, backward compatibility etc., and increasing the surface area for misconfiguration and bugs
  • truncating a value isn't really a data type property but a data preparation / cleaning operation

This leaves the "allow the 'truncate' token filter in normalizers" and "add a truncation parameter to the keyword tokenizer" options.
However, it was pointed out that the problem can already be solved with existing tooling using index-time scripts (see the "script" parameter in the field type definitions, or an example from the runtime fields use case). With this solution you wouldn't index or store the original keyword field at all (switch index/doc_values off in the mapping) and would do the truncation in the script at index time.
Here's a small example that can probably be improved, just to give you an idea:

PUT test
{
  "mappings": {
    "properties": {
      "kw": {
        "type": "keyword",
        "index": false,
        "doc_values": false
      },
      "kw1": {
        "type": "keyword",
        "script": {
          "source":
          """
          String kw = params._source.kw;
          if (kw != null) {
            if (kw.length() > 3) {
              emit(kw.substring(0, 3));
            } else {
              emit(kw);
            }
          }
          """
        }
      }
    }
  }
}

PUT /test/_doc/1
{
  "kw" : "1234567"
}

PUT /test/_doc/2
{
  "kw" : ""
}

PUT /test/_doc/3
{
  "foo" : "12"
}

GET /test/_search
{
  "query": {"term": {
    "kw1": {
      "value": "123"
    }
  }}
}

We think this is a relatively easy way to achieve the desired functionality, and it even adds more flexibility than the additional parameter suggested here. Typically this type of data cleaning should already happen client-side, but when there is no control over that, scripting is a nice way to achieve something similar.
I will close the issue for now since we don't plan to work on any of the other options mentioned in the near term, but feel free to reopen if the proposed solution doesn't fit your needs and we will pick this up again.

@dylan-tock

@cbuescher, in the OP you state "Using a 'script' ingest processor for truncation seems viable, but it is not the easiest option." I believe that is still the case. Looking at your example script, there are a number of questions that I do not know the answers to:

  • How would I apply that script if I'm not doing explicit mapping for all fields in my index?
  • Do I need to know the keyword values ahead of time in order to set up the truncation of its values?
  • How would I apply the truncation to all keywords in the document? I'm an ops guy, not a developer, so having the ability to parameterize a reindex process is much, much more appealing than "write a script to recurse down the document key/value tree testing if the 'value' is actually a list/map holding other key/value pairs and handling those appropriately". But as for how to actually do that? 🤷 And, having spent the past hour trying to find out how to do this, it has not been a "painless" experience. Questions that come up:
    • Do I write the truncated value back to the original index, or does emit replace what was there with the truncated value and I just need to use it as a filter?
    • How do I enable scripting on my es cluster if it isn't already enabled?
    • What's the equivalent of 2> echo "This is a really big value"? Or 2> echo "Processing key '${key}'"?
  • If I want to run this for all indexing/reindexing requests, how much more load would that put on the system? What would the percentage increase be for indexing/reindexing time?

I'm an ops/systems guy, so "learning a new programming language/environment", even if it is "relatively easy" for you, is a much greater hurdle for me... especially when it's compared to "add a parameter to a query".

If I'm the only one for whom that's the case, then this is obviously not a large issue and can remain closed... but if others are similarly inclined, they can re-open the issue.

@cbuescher
Member Author

Hi @dylan-tock, I understand your questions and have to admit that doing keyword truncation in an index-time script isn't as easy as having that operation sit somewhere in the analysis chain, but given the current options it's doable. We are already thinking about how to make this even easier on the scripting side, but index-time scripting, which was added along with runtime fields, is exactly meant for general, flexible pre-processing like this (if you cannot do it somewhere outside in your own application).

To your specific questions, I don't think the answer to any of them would be different or simpler with one of the other options we mentioned (i.e. a tokenizer parameter or allowing a truncation filter in normalizers). Let me try to quickly explain why:

How would I apply that script if I'm not doing explicit mapping for all fields in my index?

Yes, the script would need to be applied on a per-field basis, but the same goes for any mapping parameter the other options would need. The need for a "not doing explicit mapping" configuration is addressed by dynamic templates.
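As a rough, unverified sketch of how a dynamic template might combine with the index-time script approach from the earlier example: the template name and the 256-character cut-off are made up, and in particular the assumption that the {name} template variable is also substituted inside the script source would need checking.

PUT truncating-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_truncated_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "script": {
              "source":
              """
              String v = params._source['{name}'];
              if (v != null) {
                emit(v.length() > 256 ? v.substring(0, 256) : v);
              }
              """
            }
          }
        }
      }
    ]
  }
}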

Do I need to know the keyword values ahead of time in order to set up the truncation of its values?

No

How would I apply the truncation to all keywords in the document?

Same as above, this would be a template. In the reindex scenario you describe, detecting whether a field is currently a "keyword" type doesn't work with the template matching rules though; they operate on the detected datatype of the input value (i.e. string). I understand the problem with reindexing you describe, but it wouldn't be any different with a mapping parameter instead of a script.

If I want to run this for all indexing/reindexing requests, how much more load would that put on the system? What would the percentage increase be for indexing/reindexing time?

This should be relatively lightweight, but yes, it’s not a free operation. Again, doing the same operation inside analysis via a parameter wouldn’t be free either.

I hope these answers give an idea of why an analysis parameter wouldn't be any easier in your case. However, we will keep thinking about this issue going forward, try to better document the scripting examples, and keep truncation of keywords in mind as an important use case while we make index-time scripting easier to use.

@pihai

pihai commented Feb 10, 2023

@dylan-tock, you are not the only one. It has happened to me multiple times that I missed critical error messages in Kibana because of too-long keywords. In my case (logging and monitoring), truncating is much more workable than ignore_above. Setting it as the default behavior would be great.

It seems that issue #91680 is not related to this. So there is still no easier solution for truncation in sight?

@dylan-tock

Hi @dylan-tock, I understand your questions and have to admit that doing keyword truncation in an index-time script isn't as easy as having that operation sit somewhere in the analysis chain, but given the current options it's doable.

The time needed to attain the necessary knowledge and make a workable solution that handles any edge cases and works reliably is beyond what an ops-focused user (such as myself) is likely to be able to allocate. Sadly, the docs and examples right now are, for me, similar to this:
[image]

We are already thinking about how to make this even easier on the scripting side, but index-time scripting, which was added along with runtime fields, is exactly meant for general, flexible pre-processing like this (if you cannot do it somewhere outside in your own application).

Sometimes it's not "[my] own application" but someone else's that's sending me data to ingest and I've got no control over that data (something you acknowledge in the first post as the targeted use case).

And the "general flexibility" you refer to is much like the flexibility of a flat bit of wood and some really sharp engraving tools. I would say "paper and pencil", but to do the scripting you suggest requires I use painless, a language that is not used outside of Elasticsearch and whose only documentation is what is on the elastic website. I'm not saying painless is the wrong choice (it might be the only reliable, performant option), but having it as a requirement has implications. If you don't provide documentation that is broadly targeted at a wide variety of users, you are limiting who can make use of scripts. And if you say "A user can implement this feature via scripting, so we won't implement it as a configuration option", that only means users targeted by your documentation are the only ones who will be able to make use of that feature. For everyone else, it'll still be missing and functionality will be limited/broken.

I hope these answers give an idea of why an analysis parameter wouldn't be any easier in your case. However, we will keep thinking about this issue going forward, try to better document the scripting examples, and keep truncation of keywords in mind as an important use case while we make index-time scripting easier to use.

After I originally read through your response, I spent another few hours trying to get something to work, then decided to find the documents with fields that were too long and delete them semi-manually. I'd like to have kept the documents and the data within them, but without the ability to dedicate the necessary time, that is not a viable option.

I am hopeful that in the future there will be better scripting documentation and/or keyword value truncation will be used as an example, but unless/until that happens this will still be something I or a co-worker will need to handle manually.

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024