Add field stats api #10523

Merged
merged 1 commit into from
Apr 23, 2015

martijnvg
Member

The field stats api should return field level statistics, such as the lowest and highest values and the number of documents that have at least one value for a field.

An api like this can be useful to explore a data set you don't know much about. For example, you can find out what the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.

This api doesn't run a search to work out these statistics; instead it looks them up in the Lucene index (using the Terms class in Lucene). So finding out these stats for fields is cheap and quick.

The min/max values are based on the type of the field. For a numeric field the min/max are numbers, for a date field they are dates, and for other fields they are term based.

The api is quite straightforward: it expects a list of fields, and optionally an index can be provided
(run on a Stack Exchange dump):

curl -XGET "http://localhost:9200/_field_stats?fields=comment_count,view_count,answer_count,creation_date,display_name"

The response looks like this:

{
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "indices" : {
      "_all" : {
         "fields": {
            "comment_count": {
               "doc_count": 202940,
               "min_value": 0,
               "max_value": 98
            },
            "creation_date": {
               "doc_count": 564633,
               "min_value": "2008-08-01T16:37:51.513Z",
               "max_value": "2013-06-02T03:23:11.593Z"
            },
            "display_name": {
               "doc_count": 126741,
               "min_value": "0",
               "max_value": "정혜선"
            },
            "answer_count": {
               "doc_count": 139885,
               "min_value": 0,
               "max_value": 160
            },
            "view_count": {
               "doc_count": 152357,
               "min_value": 1,
               "max_value": 506955
            }
         }
      }
   }
}

This is a work in progress. Things to be done are documentation and tests.

@Override
public void append(FieldStats stats) {
    Lng other = (Lng) stats;
    this.docCount += other.docCount;
Contributor

This needs to handle the -1 case.

Member Author

good catch! I had this logic, refactored and forgot to add it back...

@rmuir
Contributor

rmuir commented Apr 10, 2015

Should we expose the other stats in Terms too? I agree with leaving out size(), but these two are easily summed up just like docCount:

  • sumDocFreq: number of postings
  • sumTotalTermFreq: number of tokens

@martijnvg
Member Author

@rmuir Thanks for taking a look at this! Agreed, it makes sense to expose sumDocFreq and sumTotalTermFreq as well, since it is trivial to add those. Adding size() is far from trivial, since we need to have the actual terms at reduce time...

@rmuir
Contributor

rmuir commented Apr 10, 2015

Should we return max_doc as well (i know it comes from the IndexReader)? The reason is, it's needed for some calculations (like determining the density/sparseness of a field) and it seems screwed up if the user has to use another api just to compute that. e.g. the display_name field is populated 54% of the time (doc_count/max_doc)
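
As a rough illustration of the calculation mentioned above, here is a minimal Java sketch of a doc_count/max_doc percentage. The class and method names are hypothetical and are not taken from this PR; returning -1 when either input is unavailable is an assumption made for the sketch.

```java
// Hypothetical helper for illustration only; not code from this PR.
final class FieldDensity {

    /**
     * Returns the percentage of documents that have at least one value
     * for a field, or -1 if it cannot be computed (assumed convention).
     */
    static int density(long docCount, long maxDoc) {
        // doc_count may be -1 when the statistic is unavailable.
        if (docCount < 0 || maxDoc <= 0) {
            return -1;
        }
        // Integer percentage, rounded down.
        return (int) (docCount * 100 / maxDoc);
    }

    public static void main(String[] args) {
        // Made-up numbers: 54 of 100 documents have a value for the field.
        System.out.println(density(54, 100));
    }
}
```

With max_doc returned alongside doc_count, a client could do this arithmetic itself without calling a second api.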

@martijnvg
Member Author

Should we return max_doc as well (i know it comes from the IndexReader)?

+1 Let's add that one too, it is very useful. I'll just pass maxDoc down with Terms.

@martijnvg
Member Author

@rmuir I've updated the PR based on your feedback and added tests and docs.

@martijnvg martijnvg added review and removed WIP labels Apr 12, 2015
@rmuir
Contributor

rmuir commented Apr 13, 2015

Thanks for the update @martijnvg , a few questions:

For field_ratio, can we rename it? I think ratio is a little confusing (it makes me think of M:N). I would prefer something like density, and just document it is a percentage.

Along the same note, as field_ratio is a derived statistic, should we include other useful ones in a followup PR? Technically, that one a user could compute themselves, but I think it's nice to have a plan.

Other potentially useful ones similar to that:

  • avg_length (average number of words per doc):

    sum_total_term_freq / doc_count

note: sum_total_term_freq will be -1 if frequencies are omitted, but in that case this one is easy, it's 1 :)

  • avg_num_terms (average number of unique words per doc):

    sum_doc_freq / doc_count

note: both the above don't consider stopwords, they reflect what is indexed.
These can help you know how short/long the field is and so on.

Another interesting one:

  • multi_valued (more than one indexed token in any doc):

    sum_doc_freq != -1 && sum_doc_freq > doc_count

note: this currently is no good for numeric/date fields (will always say true, because of "extra tokens"), but works in the future with autoprefix.
This could maybe help with debugging fielddata or just understanding what is there.

I will take a look at the code tomorrow, thank you!
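
The derived statistics proposed above are simple arithmetic over the summed stats, which a short Java sketch can make concrete. The class and method names are hypothetical, and returning -1 when doc_count is missing or zero is an assumption of this sketch rather than something decided in this thread.

```java
// Illustrative only: the derived statistics discussed above as plain arithmetic.
final class DerivedFieldStats {

    /** Average number of indexed tokens per document. */
    static double avgLength(long sumTotalTermFreq, long docCount) {
        if (docCount <= 0) {
            return -1; // assumed convention: statistic unavailable
        }
        if (sumTotalTermFreq == -1) {
            return 1; // frequencies omitted, so the answer is 1, per the note above
        }
        return (double) sumTotalTermFreq / docCount;
    }

    /** Average number of unique indexed terms per document. */
    static double avgNumTerms(long sumDocFreq, long docCount) {
        if (docCount <= 0 || sumDocFreq == -1) {
            return -1; // assumed convention: statistic unavailable
        }
        return (double) sumDocFreq / docCount;
    }

    /** True if some document has more than one indexed token for the field. */
    static boolean multiValued(long sumDocFreq, long docCount) {
        return sumDocFreq != -1 && sumDocFreq > docCount;
    }

    public static void main(String[] args) {
        // Made-up numbers for a string field.
        System.out.println(avgLength(300, 100));
        System.out.println(avgNumTerms(250, 100));
        System.out.println(multiValued(250, 100));
    }
}
```

As noted above, multiValued would not be meaningful for numeric or date fields as they are currently indexed.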

@martijnvg
Member Author

For field_ratio, can we rename it? I think ratio is a little confusing (it makes me think of M:N). I would prefer something like density, and just document it is a percentage.

+1, it was just the best name I could come up with.

I like the idea of adding more derived statistics. In fact, because of these derived statistics and their usefulness for debugging, I think it makes sense to add a _cat api variant of this api too? This api can be rendered nicely in a tabular format in the console.

note: sum_total_term_freq will be -1 if frequencies are omitted, but in that case this one is easy, it's 1 :)

Cool, that makes sense :)

note: both the above don't consider stopwords, they reflect what is indexed. These can help you know how short/long the field is and so on.

I think if in the documentation we emphasise the fact that all stats are based on indexed tokens, this is less of a surprise.

note: this currently is no good for numeric/date fields (will always say true, because of "extra tokens"), but works in the future with autoprefix.

I like the multi_valued derived statistic. I'm just worried it will be misinterpreted until autoprefix gets in, so maybe we should add that one later? (or at least not backport it to 1.6)

I also have been thinking about how to expose Terms.size() in a minimal way. I think it makes sense to expose it as a per shard statistic, so it shows the number of terms for the shard that holds the most terms? This way at least there is some insight into the number of terms per field.

@rmuir
Contributor

rmuir commented Apr 13, 2015

I like the multi_valued derived statistic. I'm just worried it will be misinterpreted until autoprefix gets in, so maybe we should add that one later? (or at least not backport it to 1.6)

Well we already know the type of field, so i was thinking to just make it unavailable for numeric fields for now. But we'd still have it for strings.

I also have been thinking about how to expose Terms.size() in a minimal way. I think it makes sense to expose it as a per shard statistic, so it shows the number of terms for the shard that holds the most terms? This way at least there is some insight into the number of terms per field.

I don't think we should do this. It would have to be the number of terms for a segment... Let's stay away from this one because the minute we go down this path, the API becomes slow.

@clintongormley clintongormley added the :Data Management/Stats Statistics tracking and retrieval APIs label Apr 13, 2015
@martijnvg
Member Author

Well we already know the type of field, so i was thinking to just make it unavailable for numeric fields for now. But we'd still have it for strings.

That makes sense.

I don't think we should do this. It would have to be the number of terms for a segment... Let's stay away from this one because the minute we go down this path, the API becomes slow.

I totally missed that MultiTerms#size() isn't implemented, so yes doing that via the TermsEnum would be slow.

*/
public void append(FieldStats stats) {
    this.maxDoc += stats.maxDoc;
    if (stats.docCount != -1) {
Contributor

maybe it is a matter of style, but i think it's easier to handle the exceptional case like a guard up front: check stats.docCount == -1 and set to -1, otherwise sum. this is not really important to me.
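
For readers following along, the guard-up-front style suggested here could look roughly like the Java sketch below. The ShardFieldStats class is a hypothetical stand-in rather than the PR's FieldStats class, and treating an already unavailable running total as sticky is an assumption of this sketch.

```java
// Hypothetical stand-in for illustration; not the FieldStats class from this PR.
final class ShardFieldStats {
    long maxDoc;
    long docCount; // -1 means the statistic is unavailable

    ShardFieldStats(long maxDoc, long docCount) {
        this.maxDoc = maxDoc;
        this.docCount = docCount;
    }

    /** Merges another shard's stats into this one. */
    void append(ShardFieldStats other) {
        this.maxDoc += other.maxDoc;
        // Guard up front: if either side is unavailable, so is the merged value.
        if (other.docCount == -1 || this.docCount == -1) {
            this.docCount = -1;
            return;
        }
        this.docCount += other.docCount;
    }

    public static void main(String[] args) {
        ShardFieldStats merged = new ShardFieldStats(100, 40);
        merged.append(new ShardFieldStats(50, 10));
        System.out.println(merged.maxDoc + " " + merged.docCount);
    }
}
```

The same guard would apply to the other summed statistics that can be -1.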

@rmuir
Contributor

rmuir commented Apr 13, 2015

changes look good to me. datatypes for all stats/ -1 handling / etc is all good. Maybe we want to rename the field_ratio -> density and push the change? We can open followups for the other stats ideas.

@martijnvg
Member Author

I updated the PR and renamed field_ratio to density. But yes, let's do the other derived measurements and _cat api variant in different issues.


* `max_doc` - The total number of documents.
* `doc_count` - The number of documents that have at least one term for this field, or -1 if this measurement isn't available on one or more shards.
* `density` - A ratio as a percentage between all the documents that have at least one value for this field and all documents.
Contributor

maybe we can reword the description when pushing, like "The percentage of documents that have at least one value for this field"

@rmuir
Contributor

rmuir commented Apr 13, 2015

+1 (just some more nitpicks). thanks martijn, this is great.


experimental[]

The field stats api allows to easily figure certain statistical properties without a field without executing a search, but
Member

Maybe reword to:
...allows one to find statistical properties of a field without...

@martijnvg
Member Author

@rashidkpc Yes, I think we can add this filtering on the min value and max value in the reduce phase of this api. So we effectively omit shard level results with a min_value and max_value that fall outside of the defined min and max. First I'll make sure that this gets pushed and I'll create a followup issue for this enhancement.

@martijnvg
Member Author

I updated the PR to the latest master and addressed @rjernst's comments. (latest commit)

out.writeLong(sumTotalTermFreq);
}

public static class Lng extends FieldStats<Long> {
Member

I don't think the short names actually buy anything? Could these have normal names like Long here?

Member Author

I'd like to rename it to Long, but then it clashes with java.lang.Long. Since these concrete classes are kind of private to ES the name shouldn't be a problem? I can also change the generic type to java.lang.Long.

Member

Can't you just use FieldStats<java.lang.Long>? That is the only place the boxed type appears right?

Member

Weird markdown seemed to silently remove some of my text...I was trying to say FieldStats<java.lang.Long> (which is what I think you meant by your last statement).

@rjernst
Member

rjernst commented Apr 17, 2015

@martijnvg LGTM, I left a couple more minor comments. I also accidentally left most of them on the last commit itself...sorry.

@martijnvg
Member Author

I updated the PR with the following changes:

  • Incorporated @rjernst feedback (thank you!)
  • Added a level option to control whether the field stats of all indices are combined or field stats are only combined on a per index basis.
  • Added better error handling for the case where fields with the same name are of a different type between indices.

@martijnvg
Member Author

If there are no further comments about the additional commits then I'll push this tomorrow.

} else if ("indices".equals(request.level())) {
    indexName = shardResponse.getIndex();
} else {
    // should already have been catched by the FieldStatsRequest#validate(...)
Member

catched -> caught

@rjernst
Member

rjernst commented Apr 23, 2015

@martijnvg LGTM, i just left a couple minor suggestions.

@martijnvg martijnvg force-pushed the field_stats branch 2 times, most recently from c07698a to 84bc4ed Compare April 23, 2015 06:49
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Apr 23, 2015
The field stats api returns field level statistics, such as the lowest and highest values and the number of documents that have at least one value for a field.

An api like this can be useful to explore a data set you don't know much about. For example, you can find out what the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.

This api doesn't run a search to work out these statistics; instead it looks them up in the Lucene index (using the Terms class in Lucene). So finding out these stats for fields is cheap and quick.

The min/max values are based on the type of the field. For a numeric field the min/max are numbers, for a date field they are dates, and for other fields they are term based.

Closes elastic#10523
@martijnvg martijnvg merged commit 5705537 into elastic:master Apr 23, 2015
@kevinkluge kevinkluge removed the review label Apr 23, 2015
martijnvg added a commit that referenced this pull request Apr 23, 2015
@maharg101

Very nice. Will elasticsearch core use this internally to optimise which indices to query for a given field value ?

@pjcard

pjcard commented Oct 26, 2015

Am I right in thinking that anyone wishing to replace their time based indices with this mechanism will need to perform two queries in order to do so? (The first field_api query to obtain the indices to pass in the URI for the search query). Is it possible to add this as a filter for a search query, thus doing away with the need to use patterns in the search URI completely?

Or is @maharg101 correct in postulating that elasticsearch will use this internally to remove the need for the query to be explicit about indices at all?

@maharg101

@pjcard you have extrapolated what I meant slightly - the scenario I had in mind was that you would still be explicit about the time-based indices, and that the query execution would take advantage of the min / max values of fields to reduce the set of indices / shards to look in.

@martijnvg
Member Author

@pjcard You would still need to send two requests to ES (1 field stats & 1 search request). ES doesn't use the min and max internally to optimise the search (although in theory it can). This is also how Kibana is going to use the field stats api to reduce the number of indices a search needs to be executed on: elastic/kibana#4342
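
To make the two-request flow concrete, the client-side step between the field stats request and the search request might look something like the Java sketch below. It assumes the per-index min/max values have already been parsed from a field stats response into a map; the class name, method name, and index names are all hypothetical, and this is not how Kibana or ES implements it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative client-side helper; names and data are hypothetical.
final class IndexPruning {

    /**
     * Returns the names of indices whose [min, max] range for a field
     * overlaps the queried range [from, to] (both inclusive).
     */
    static List<String> overlapping(Map<String, long[]> minMaxByIndex, long from, long to) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : minMaxByIndex.entrySet()) {
            long min = entry.getValue()[0];
            long max = entry.getValue()[1];
            // Two closed ranges overlap when each starts before the other ends.
            if (max >= from && min <= to) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up per-index min/max values for some numeric field.
        Map<String, long[]> stats = new LinkedHashMap<>();
        stats.put("logs-a", new long[] {0L, 10L});
        stats.put("logs-b", new long[] {11L, 20L});
        stats.put("logs-c", new long[] {21L, 30L});
        // Only the indices returned here would be named in the search request.
        System.out.println(overlapping(stats, 15L, 25L));
    }
}
```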

@pjcard

pjcard commented Oct 27, 2015

Hi @martijnvg, thanks for the information. I'll add a request for that feature in case it's of use for anybody else and/or feasible.

9 participants