
[[aggregations-and-analysis]]
=== Aggregations and Analysis


Some aggregations, such as the terms bucket, operate((("analysis", "aggregations and")))((("aggregations", "and analysis"))) on string fields. And string fields may be either analyzed or not_analyzed, which begs the question: how does analysis affect aggregations?((("strings", "analyzed or not_analyzed string fields")))((("not_analyzed fields")))((("analyzed fields")))

The answer is "a lot," but it is best shown through an example. First, index some documents representing various states in the US:

[source,js]
--------------------------------------------------
POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }
--------------------------------------------------

We want to build a list of unique states in our dataset, complete with counts. Simple--let's use a terms bucket:

[source,js]
--------------------------------------------------
GET /agg_analysis/data/_search?search_type=count
{
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state"
            }
        }
    }
}
--------------------------------------------------

This gives us these results:

[source,js]
--------------------------------------------------
{
...
   "aggregations": {
      "states": {
         "buckets": [
            {
               "key": "new",
               "doc_count": 5
            },
            {
               "key": "york",
               "doc_count": 3
            },
            {
               "key": "jersey",
               "doc_count": 1
            },
            {
               "key": "mexico",
               "doc_count": 1
            }
         ]
      }
   }
}
--------------------------------------------------

Oh dear, that's not at all what we want! Instead of counting states, the aggregation is counting individual words. The underlying reason is simple: aggregations are built from the inverted index, and the inverted index is post-analysis.

When we added those documents to Elasticsearch, the string "New York" was analyzed/tokenized into ["new", "york"]. These individual tokens were then used to populate fielddata, and ultimately we see counts for new instead of New York.
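
You can verify this tokenization yourself with the analyze API. A quick sketch, reusing the +agg_analysis+ index created by the bulk request above:

[source,js]
--------------------------------------------------
GET /agg_analysis/_analyze?field=state
New York
--------------------------------------------------

The response lists exactly two tokens, +new+ and +york+: the same terms that fielddata later loads.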

This is obviously not the behavior that we wanted, but luckily it is easily corrected.

We need to define a multifield for +state+ and set it to not_analyzed. This will prevent New York from being analyzed, which means it will stay a single token in the aggregation. Let's try the whole process over, but this time specify a raw multifield:

[source,js]
--------------------------------------------------
DELETE /agg_analysis/
PUT /agg_analysis
{
  "mappings": {
    "data": {
      "properties": {
        "state" : {
          "type": "string",
          "fields": {
            "raw" : {
              "type":  "string",
              "index": "not_analyzed" <1>
            }
          }
        }
      }
    }
  }
}

POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }

GET /agg_analysis/data/_search?search_type=count
{
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state.raw" <2>
            }
        }
    }
}
--------------------------------------------------

<1> This time we explicitly map the +state+ field and include a not_analyzed sub-field.
<2> The aggregation is run on +state.raw+ instead of +state+.

Now when we run our aggregation, we get results that make sense:

[source,js]
--------------------------------------------------
{
...
   "aggregations": {
      "states": {
         "buckets": [
            {
               "key": "New York",
               "doc_count": 3
            },
            {
               "key": "New Jersey",
               "doc_count": 1
            },
            {
               "key": "New Mexico",
               "doc_count": 1
            }
         ]
      }
   }
}
--------------------------------------------------

In practice, this kind of problem is easy to spot. Your aggregations will simply return strange buckets, and you'll remember the analysis issue. It is a generalization, but there are not many instances where you want to use an analyzed field in an aggregation. When in doubt, add a multifield so you have the option for both.((("analyzed fields", "aggregations and")))
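
With the multifield in place, one request can do both jobs. Here is a sketch (against the same +agg_analysis+ index) that searches the analyzed +state+ field while bucketing on +state.raw+:

[source,js]
--------------------------------------------------
GET /agg_analysis/data/_search
{
    "query" : {
        "match" : {
            "state" : "new" <1>
        }
    },
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state.raw" <2>
            }
        }
    }
}
--------------------------------------------------
<1> The full-text query matches against the analyzed +state+ field.
<2> The aggregation buckets on the not_analyzed +state.raw+ sub-field.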

==== High-Cardinality Memory Implications

There is another reason to avoid aggregating analyzed fields: high-cardinality fields consume a large amount of memory when loaded into fielddata.((("memory usage", "high-cardinality fields")))((("cardinality", "high-cardinality fields, memory use issues"))) The analysis process often (although not always) generates a large number of tokens, many of which are unique. This increases the overall cardinality of the field and contributes to more memory pressure.((("analysis", "high-cardinality fields, memory use issues")))
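
You can gauge how much memory a field's fielddata is consuming with the indices stats API. A quick check against the example index (field names carried over from earlier):

[source,js]
--------------------------------------------------
GET /agg_analysis/_stats/fielddata?fields=state,state.raw
--------------------------------------------------

Comparing the per-field numbers side by side makes the relative cost of an analyzed field easy to see.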

Some types of analysis are extremely unfriendly with regard to memory. Consider an n-gram analysis process.((("n-grams", "memory use issues associated with"))) The term +New York+ might be n-grammed into the following tokens:

- +ne+
- +ew+
- +w{nbsp}+
- +{nbsp}y+
- +yo+
- +or+
- +rk+

You can imagine how the n-gramming process creates a huge number of unique tokens, especially when analyzing paragraphs of text. When these are loaded into memory, you can easily exhaust your heap space.
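
To see this firsthand, here is a minimal sketch of an index (the name +ngram_test+ and the analyzer +bigrams+ are made up for illustration) whose tokenizer emits 2-character n-grams, followed by an analyze request that reproduces the token list above:

[source,js]
--------------------------------------------------
PUT /ngram_test
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "bigram_tokenizer": {
                    "type":     "ngram",
                    "min_gram": 2,
                    "max_gram": 2
                }
            },
            "analyzer": {
                "bigrams": {
                    "type":      "custom",
                    "tokenizer": "bigram_tokenizer",
                    "filter":    [ "lowercase" ]
                }
            }
        }
    }
}

GET /ngram_test/_analyze?analyzer=bigrams
New York
--------------------------------------------------

Eight characters of input already yield seven unique tokens; whole paragraphs of text multiply that number very quickly.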

So, before aggregating across fields, take a second to verify that the fields are not_analyzed. And if you want to aggregate analyzed fields, ensure that the analysis process is not creating an obscene number of tokens.
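
Verifying is cheap: a quick look at the mapping shows whether a field (or sub-field) is not_analyzed. For the example index:

[source,js]
--------------------------------------------------
GET /agg_analysis/_mapping
--------------------------------------------------

Any string field that does not show "index": "not_analyzed" in the response will be tokenized before it reaches fielddata.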

[TIP]
==================================================
At the end of the day, it doesn't matter whether a field is analyzed or not_analyzed. The more unique values in a field--the higher the cardinality of the field--the more memory that is required. This is especially true for string fields, where every unique string must be held in memory--longer strings use more memory.
==================================================