Add field stats api #10523

martijnvg · 2015-04-10T07:20:36Z

The field stats api should return field level statistics such as lowest, highest values and number of documents that have at least one value for a field.

An api like this can be useful to explore a data set you don't know much about. For example you can figure out what the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.

This api doesn't run a search to figure these statistics out, but rather looks them up in the Lucene index (using the Terms class in Lucene). So finding out these stats for fields is cheap and quick.

The min/max values are based on the type of the field. For a numeric field the min/max are numbers, for a date field they are dates, and for other fields they are term based.

The api is quite straightforward: it expects a list of fields, and optionally an index can be provided (run on a Stack Exchange dump):

curl -XGET "http://localhost:9200/_field_stats?fields=comment_count,view_count,answer_count,creation_date,display_name"

The response looks like this:

{
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "indices" : {
      "_all" : {
         "fields": {
            "comment_count": {
               "doc_count": 202940,
               "min_value": 0,
               "max_value": 98
            },
            "creation_date": {
               "doc_count": 564633,
               "min_value": "2008-08-01T16:37:51.513Z",
               "max_value": "2013-06-02T03:23:11.593Z"
            },
            "display_name": {
               "doc_count": 126741,
               "min_value": "0",
               "max_value": "정혜선"
            },
            "answer_count": {
               "doc_count": 139885,
               "min_value": 0,
               "max_value": 160
            },
            "view_count": {
               "doc_count": 152357,
               "min_value": 1,
               "max_value": 506955
            }
         }
      }
   }
}
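
As a rough illustration of the "sane settings" use case mentioned above, the sketch below derives a histogram interval from a field's min_value and max_value. The class name, method name and the idea of a target bucket count are assumptions made for this example; they are not part of the PR.

```java
// Hypothetical helper (not part of the PR): pick a histogram interval
// from the min/max values reported by the field stats api.
final class HistogramIntervalSketch {

    // Returns the smallest interval that covers [min, max] with at most
    // targetBuckets equally sized steps.
    static long intervalFor(long min, long max, int targetBuckets) {
        if (targetBuckets <= 0) {
            throw new IllegalArgumentException("targetBuckets must be positive");
        }
        if (max < min) {
            throw new IllegalArgumentException("max must not be smaller than min");
        }
        // Number of distinct long values in [min, max]; throws on overflow.
        long range = Math.addExact(Math.subtractExact(max, min), 1);
        // Ceiling division without risking overflow.
        return range / targetBuckets + (range % targetBuckets == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        // view_count in the response above: min_value = 1, max_value = 506955
        System.out.println(intervalFor(1, 506955, 50)); // prints 10140
    }
}
```

The resulting number could then be used as the interval of a histogram aggregation on that field.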

This is a work in progress. Things to be done are documentation and tests.

rmuir · 2015-04-10T11:27:12Z

src/main/java/org/elasticsearch/action/fieldstats/FieldStats.java

+        @Override
+        public void append(FieldStats stats) {
+            Lng other = (Lng) stats;
+            this.docCount += other.docCount;


This needs to handle the -1 case.

good catch! I had this logic, refactored and forgot to add it back...

rmuir · 2015-04-10T11:29:22Z

Should we expose the other stats in Terms too? I agree with leaving out size(), but these two are easily summed up just like docCount:

sumDocFreq: number of postings
sumTotalTermFreq: number of tokens

martijnvg · 2015-04-10T11:55:30Z

@rmuir Thanks for taking a look at this! Agreed, it makes sense to expose sumDocFreq and sumTotalTermFreq as well, since it is trivial to add those. Adding size() is far from trivial, since we need to have the actual terms at reduce time...

rmuir · 2015-04-10T12:00:07Z

Should we return max_doc as well (i know it comes from the IndexReader)? The reason is, it's needed for some calculations (like determining the density/sparseness of a field) and it seems screwed up if the user has to use another api just to compute that. e.g. displayname field is populated 54% of the time (doc_count/max_doc)

martijnvg · 2015-04-10T12:03:17Z

Should we return max_doc as well (i know it comes from the IndexReader)?

+1 Let's add that one too, it is very useful. I'll just pass maxDoc down with Terms.

martijnvg · 2015-04-12T23:42:12Z

@rmuir I've updated the PR based on your feedback and added tests and docs.

rmuir · 2015-04-13T03:50:22Z

Thanks for the update @martijnvg , a few questions:

For field_ratio, can we rename it? I think ratio is a little confusing (it makes me think of M:N). I would prefer something like density, and just document it is a percentage.

Along the same note, as field_ratio is a derived statistic, should we include other useful ones in a followup PR? Technically, that one a user could compute themselves, but I think it's nice to have a plan.

Other potentially useful ones similar to that:

avg_length (average number of words per doc):

sum_total_term_freq / doc_count

note: sum_total_term_freq will be -1 if frequencies are omitted, but in that case this one is easy, it's 1 :)

avg_num_terms (average number of unique words per doc):

sum_doc_freq / doc_count

note: both the above don't consider stopwords, they reflect what is indexed.
These can help you know how short/long the field is and so on.

Another interesting one:

multi_valued (more than one indexed token in any doc):

sum_doc_freq != -1 && sum_doc_freq > doc_count

note: this currently is no good for numeric/date fields (will always say true, because of "extra tokens"), but works in the future with autoprefix.
This could maybe help with debugging fielddata or just understanding what is there.
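
The derived statistics listed above can be sketched in a few lines. This is only an illustration of the formulas in this comment: the class and method names are made up, and returning NaN when doc_count is unavailable or zero is an assumption of the sketch, not something decided in this thread.

```java
// Hypothetical sketch of the derived statistics proposed above.
// All inputs are the raw stats; -1 means "not available".
final class DerivedFieldStatsSketch {

    // avg_length: average number of indexed tokens per document.
    static double avgLength(long sumTotalTermFreq, long docCount) {
        if (docCount <= 0) {
            return Double.NaN; // assumption: no meaningful average
        }
        if (sumTotalTermFreq == -1) {
            return 1.0; // frequencies omitted, see the note above
        }
        return (double) sumTotalTermFreq / docCount;
    }

    // avg_num_terms: average number of unique indexed terms per document.
    static double avgNumTerms(long sumDocFreq, long docCount) {
        if (docCount <= 0 || sumDocFreq == -1) {
            return Double.NaN; // assumption: no meaningful average
        }
        return (double) sumDocFreq / docCount;
    }

    // multi_valued: more than one indexed token in any document.
    // The docCount guard is an addition of this sketch.
    static boolean multiValued(long sumDocFreq, long docCount) {
        return sumDocFreq != -1 && docCount != -1 && sumDocFreq > docCount;
    }
}
```

As noted above, all three reflect what is indexed, so stopwords removed at analysis time are not counted.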

I will take a look at the code tomorrow, thank you!

martijnvg · 2015-04-13T06:23:20Z

For field_ratio, can we rename it? I think ratio is a little confusing (it makes me think of M:N). I would prefer something like density, and just document it is a percentage.

+1, it was just the best name I could come up with.

I like the idea of adding more derived statistics. In fact because of these derived statistics and the good use for debugging purposes, I think it makes sense to add a _cat api variant of this api too? This api can nicely be rendered in a tabular format in the console.

note: sum_total_term_freq will be -1 if frequencies are omitted, but in that case this one is easy, it's 1 :)

Cool, that makes sense :)

note: both the above don't consider stopwords, they reflect what is indexed. These can help you know how short/long the field is and so on.

I think if the documentation emphasises the fact that all stats are based on indexed tokens, this is less of a surprise.

note: this currently is no good for numeric/date fields (will always say true, because of "extra tokens"), but works in the future with autoprefix.

I like the multi_valued derived statistic. I'm just worried it will be misinterpreted until autoprefix gets in, so maybe we should add that one later? (or at least not backport it to 1.6)

I also have been thinking about how to expose Terms.size() in a minimal way. I think it makes sense to expose it as a per shard statistic, so it shows the number of terms for the shard that holds the most terms? This way at least there is some insight into the number of terms per field.

rmuir · 2015-04-13T11:48:14Z

I like the multi_valued derived statistic. I'm just worried it will be misinterpreted until autoprefix gets in, so maybe we should add that one later? (or at least not backport it to 1.6)

Well we already know the type of field, so i was thinking to just make it unavailable for numeric fields for now. But we'd still have it for strings.

I also have been thinking about how to expose Terms.size() in a minimal way. I think it makes sense to expose it as a per shard statistic, so it shows the number of terms for the shard that holds the most terms? This way at least there is some insight into the number of terms per field.

I don't think we should do this. It would have to be the number of terms for a segment... Let's stay away from this one because the minute we go down this path, the API becomes slow.

martijnvg · 2015-04-13T12:28:21Z

Well we already know the type of field, so i was thinking to just make it unavailable for numeric fields for now. But we'd still have it for strings.

That makes sense.

I don't think we should do this. It would have to be the number of terms for a segment... Let's stay away from this one because the minute we go down this path, the API becomes slow.

I totally missed that MultiTerms#size() isn't implemented, so yes doing that via the TermsEnum would be slow.

rmuir · 2015-04-13T21:14:44Z

src/main/java/org/elasticsearch/action/fieldstats/FieldStats.java

+     */
+    public void append(FieldStats stats) {
+        this.maxDoc += stats.maxDoc;
+        if (stats.docCount != -1) {


maybe it is a matter of style, but i think it's easier to handle the exceptional case like a guard up front: check stats.docCount == -1 and set to -1, otherwise sum. this is not really important to me.
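
A minimal sketch of the guard-up-front style suggested here, using a simplified stand-in class rather than the actual FieldStats from the PR. Treating an already unavailable docCount on the receiving side as sticky is an assumption of the sketch, chosen to match the documented "-1 if this measurement isn't available on one or more shards".

```java
// Simplified stand-in for FieldStats (not the class from the PR),
// showing the -1 guard handled up front when merging shard results.
final class FieldStatsMergeSketch {
    long maxDoc;
    long docCount; // -1 means the statistic is not available

    FieldStatsMergeSketch(long maxDoc, long docCount) {
        this.maxDoc = maxDoc;
        this.docCount = docCount;
    }

    void append(FieldStatsMergeSketch other) {
        this.maxDoc += other.maxDoc;
        // Guard: once any side is unavailable, the merged value is too.
        if (this.docCount == -1 || other.docCount == -1) {
            this.docCount = -1;
            return;
        }
        this.docCount += other.docCount;
    }
}
```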

rmuir · 2015-04-13T21:19:37Z

changes look good to me. datatypes for all stats/ -1 handling / etc is all good. Maybe we want to rename the field_ratio -> density and push the change? We can open followups for the other stats ideas.

martijnvg · 2015-04-13T21:32:19Z

I updated the PR and renamed field_ratio to density. But yes, let's do the other derived measurements and the _cat api variant in different issues.

rmuir · 2015-04-13T21:50:06Z

docs/reference/search/field-stats.asciidoc

+
+* `max_doc`             - The total number of documents.
+* `doc_count`           - The number of documents that have at least one term for this field, or -1 if this measurement isn't available on one or more shards.
+* `density`             - A ratio as a percentage between all the documents that have at least one value for this field and all documents.


maybe we can reword the description when pushing, like "The percentage of documents that have at least one value for this field"
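
For reference, the density described here boils down to doc_count / max_doc expressed as a percentage. The sketch below is an illustration only: the names are made up, and both the double return type and the -1 result for unavailable inputs are assumptions, not the behaviour of the PR.

```java
// Hypothetical sketch: percentage of documents that have at least one
// value for a field, computed from doc_count and max_doc.
final class DensitySketch {

    static double densityPercent(long docCount, long maxDoc) {
        // Assumption: report -1 when doc_count is unavailable or the
        // index holds no documents.
        if (docCount < 0 || maxDoc <= 0) {
            return -1;
        }
        return 100.0 * docCount / maxDoc;
    }
}
```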

rmuir · 2015-04-13T21:51:56Z

+1 (just some more nitpicks). thanks martijn, this is great.

rjernst · 2015-04-13T21:54:00Z

docs/reference/search/field-stats.asciidoc

+
+experimental[]
+
+The field stats api allows to easily figure certain statistical properties without a field without executing a search, but


Maybe reword to:
...allows one to find statistical properties of a field without...

martijnvg · 2015-04-17T07:42:29Z

@rashidkpc Yes, I think we can add this filtering on the min value and max value in the reduce phase of this api. So we effectively omit shard level results with a min_value and max_value that fall outside of the defined min and max. First I'll make sure that this gets pushed and I'll create a followup issue for this enhancement.

martijnvg · 2015-04-17T12:29:55Z

I updated the PR to the latest master and addressed @rjernst's comments. (latest commit)

rjernst · 2015-04-17T17:04:49Z

src/main/java/org/elasticsearch/action/fieldstats/FieldStats.java

+        out.writeLong(sumTotalTermFreq);
+    }
+
+    public static class Lng extends FieldStats<Long> {


I don't think the short names actually buy anything? Could these have normal names like Long here?

I'd like to rename it to Long, but then it clashes with java.lang.Long. Since these concrete classes are kind of private to ES the name shouldn't be a problem? I can also change the generic type to java.lang.Long.

Can't you just use FieldStats<java.lang.Long>? That is the only place the boxed type appears right?

Weird markdown seemed to silently remove some of my text...I was trying to say FieldStats<java.lang.Long> (which is what I think you meant by your last statement).
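
To make the naming point concrete, here is a small sketch of a nested class called Long whose generic parameter is spelled out as java.lang.Long. The outer class is a simplified stand-in invented for this example, not the FieldStats class from the PR.

```java
// Simplified stand-in showing that a nested class may be named Long
// as long as the boxed type is written fully qualified.
abstract class FieldStatsNamingSketch<T> {

    abstract T minValue();

    // Inside this file, the simple name Long now refers to the nested
    // class, so the boxed type must be written as java.lang.Long.
    static final class Long extends FieldStatsNamingSketch<java.lang.Long> {
        private final long min;

        Long(long min) {
            this.min = min;
        }

        @Override
        java.lang.Long minValue() {
            return min; // autoboxed to java.lang.Long
        }
    }
}
```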

rjernst · 2015-04-17T18:43:15Z

@martijnvg LGTM, I left a couple more minor comments. I also accidentally left most of them on the last commit itself...sorry.

martijnvg · 2015-04-19T22:05:27Z

I updated the PR with the following changes:

Incorporated @rjernst feedback (thank you!)
Added a level option to control whether the field stats of all indices are combined or field stats are only combined on a per index basis.
Added better error handling for the case where fields with the same name are of a different type between indices.
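
The effect of the level option can be sketched as a choice of grouping key when shard results are reduced. Everything below is illustrative: ShardStats is a hypothetical stand-in for a shard response, only doc_count is combined, -1 handling is left out, and the level value "cluster" is an assumption (only "indices" and the "_all" key appear in this PR page).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of combining shard-level doc counts either into a
// single "_all" entry or into one entry per index.
final class LevelGroupingSketch {

    // Stand-in for a per-shard response: index name plus one statistic.
    static final class ShardStats {
        final String index;
        final long docCount;

        ShardStats(String index, long docCount) {
            this.index = index;
            this.docCount = docCount;
        }
    }

    static Map<String, Long> combine(List<ShardStats> shards, String level) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (ShardStats shard : shards) {
            String key;
            if ("cluster".equals(level)) { // assumed name of the combined level
                key = "_all";
            } else if ("indices".equals(level)) {
                key = shard.index;
            } else {
                throw new IllegalArgumentException("unknown level [" + level + "]");
            }
            // Sum the statistic of all shards that share the same key.
            result.merge(key, shard.docCount, Long::sum);
        }
        return result;
    }
}
```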

martijnvg · 2015-04-21T09:43:52Z

If there are no further comments about the additional commits then I'll push this tomorrow.

rjernst · 2015-04-23T06:26:58Z

src/main/java/org/elasticsearch/action/fieldstats/TransportFieldStatsTransportAction.java

+                } else if ("indices".equals(request.level())) {
+                    indexName = shardResponse.getIndex();
+                } else {
+                    // should already have been catched by the FieldStatsRequest#validate(...)


catched -> caught

rjernst · 2015-04-23T06:29:44Z

@martijnvg LGTM, i just left a couple minor suggestions.

The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field. An api like this can be useful to explore a data set you don't know much about. For example you can figure out what the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings. This api doesn't run a search to figure these statistics out, but rather looks them up in the Lucene index (using the Terms class in Lucene). So finding out these stats for fields is cheap and quick. The min/max values are based on the type of the field. For a numeric field the min/max are numbers, for a date field they are dates, and for other fields they are term based. Closes elastic#10523

maharg101 · 2015-04-28T19:15:34Z

Very nice. Will elasticsearch core use this internally to optimise which indices to query for a given field value ?

pjcard · 2015-10-26T09:05:55Z

Am I right in thinking that anyone wishing to replace their time based indices with this mechanism will need to perform two queries in order to do so? (The first field_api query to obtain the indices to pass in the URI for the search query). Is it possible to add this as a filter for a search query, thus doing away with the need to use patterns in the search URI completely?

Or is @maharg101 correct in postulating that elasticsearch will use this internally to remove the need for the query to be explicit about indices at all?

maharg101 · 2015-10-26T09:53:12Z

@pjcard you have extrapolated what I meant slightly - the scenario I had in mind was that you would still be explicit about the time-based indices, and that the query execution would take advantage of the min / max values of fields to reduce the set of indices / shards to look in.

martijnvg · 2015-10-27T02:52:10Z

@pjcard You would still need to send two requests to ES (1 field stats & 1 search request). ES doesn't use the min and max internally to optimise the search (although in theory it can). This is also how Kibana is going to use the field stats api to reduce the number of indices a search needs to be executed on: elastic/kibana#4342
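
The client-side step between the two requests can be sketched as a simple range overlap check over per-index min/max values. This is an illustration under assumptions: the Range type, the method name and the use of plain long values for the timestamps are invented for the example and do not describe how Kibana implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: given per-index min/max values of a field (as a
// field stats request with per-index results would provide), keep only
// the indices whose value range overlaps the range being searched.
final class IndexSelectionSketch {

    // Stand-in for the min_value / max_value of one field in one index.
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    static List<String> indicesOverlapping(Map<String, Range> statsPerIndex, long from, long to) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Range> entry : statsPerIndex.entrySet()) {
            Range range = entry.getValue();
            // Two closed intervals overlap when each starts before the other ends.
            if (range.max >= from && range.min <= to) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

The returned index names would then be passed to the second request, the actual search.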

pjcard · 2015-10-27T10:29:58Z

Hi @martijnvg, thanks for the information. I'll add a request for that feature in case it's of use for anybody else and/or feasible.

martijnvg added >feature v2.0.0-beta1 WIP v1.6.0 labels Apr 10, 2015

rmuir reviewed Apr 10, 2015

martijnvg added review and removed WIP labels Apr 12, 2015

clintongormley added the :Data Management/Stats Statistics tracking and retrieval APIs label Apr 13, 2015

rmuir reviewed Apr 13, 2015

rjernst reviewed Apr 13, 2015

martijnvg force-pushed the field_stats branch from cebac06 to 6d53ac9 Compare April 17, 2015 12:19

rjernst reviewed Apr 17, 2015

rashidkpc mentioned this pull request Apr 20, 2015

Clearer error message for timestamp-based indices elastic/kibana#3635

Closed

rjernst reviewed Apr 23, 2015

martijnvg force-pushed the field_stats branch 2 times, most recently from c07698a to 84bc4ed Compare April 23, 2015 06:49

martijnvg force-pushed the field_stats branch from 84bc4ed to 5705537 Compare April 23, 2015 06:52

martijnvg merged commit 5705537 into elastic:master Apr 23, 2015

kevinkluge removed the review label Apr 23, 2015

mikemccand mentioned this pull request May 1, 2015

Track min/max on numerics in field data per segment #5829

Closed

rashidkpc mentioned this pull request May 15, 2015

Field stats filter #11187

Closed

martijnvg deleted the field_stats branch May 18, 2015 23:25

pjcard mentioned this pull request Oct 27, 2015

Field stats query #14301

Closed

colings86 mentioned this pull request Aug 4, 2016

Should we remove/modify some of the experiment tags in the documentation #19798

Closed

lukas-vlcek mentioned this pull request Aug 30, 2016

[DO_NOT_MERGE] Logging upgrade openshift/origin-aggregated-logging#208

Closed
