-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add field stats api #10523
Add field stats api #10523
Conversation
@Override | ||
public void append(FieldStats stats) { | ||
Lng other = (Lng) stats; | ||
this.docCount += other.docCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to handle the -1 case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! I had this logic, refactored and forgot to add it back...
Should we expose the other stats in Terms too? I agree with leaving out size(), but these two are easily summed up just like docCount:
|
@rmuir Thanks for taking a look at this! Agreed, it makes sense to expose |
Should we return max_doc as well (i know it comes from the IndexReader)? The reason is, its needed for some calculations (like determining the density/sparseness of a field) and it seems screwed up if the user has to use another api just to compute that. e.g. |
+1 Lets add that one too, it is very useful. I'll just pass |
@rmuir I've updated the PR based on your feedback and added tests and docs. |
Thanks for the update @martijnvg , a few questions: For Along the same note, as Other potentially useful ones similar to that:
note: sum_total_term_freq will be -1 if frequencies are omitted, but in that case this one is easy, its 1 :)
note: both the above don't consider stopwords, they reflect what is indexed. Another interesting one:
note: this currently is no good for numeric/date fields (will always say true, because of "extra tokens"), but works in the future with autoprefix. I will take a look at the code tomorrow, thank you! |
+1, it was just to best name I could come up with. I like the idea of adding more derived statistics. In fact because of these derived statistics and the good use for debugging purposes, I think it makes sense to add a _cat api variant of this api too? This api can nicely be rendering in a tabular format in the console.
Cool, that makes sense :)
I think if in the documentation we emphasised on the fact that all stats are based on indexed tokens, this is less of an surprise.
I like the I also have been thinking about how to expose |
Well we already know the type of field, so i was thinking to just make it unavailable for numeric fields for now. But we'd still have it for strings.
I don't think we should do this. It would have to be the number of terms for a segment... Lets stay away from this one because the minute we go down this path, the API becomes slow. |
That makes sense.
I totally missed that MultiTerms#size() isn't implemented, so yes doing that via the TermsEnum would be slow. |
*/ | ||
public void append(FieldStats stats) { | ||
this.maxDoc += stats.maxDoc; | ||
if (stats.docCount != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it is a matter of style, but i think its easier to handle the exceptional case like a guard up front: check stats.docCount == -1 and set to -1, otherwise sum. this is not really important to me.
changes look good to me. datatypes for all stats/ -1 handling / etc is all good. Maybe we want to rename the |
I updated the PR and renamed |
|
||
* `max_doc` - The total number of documents. | ||
* `doc_count` - The number of documents that have at least one term for this field, or -1 if this measurement isn't available on one or more shards. | ||
* `density` - A ratio as a percentage between all the documents that have at least one value for this field and all documents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can reword the description when pushing, like "The percentage of documents that have at least one value for this field"
+1 (just some more nitpicks). thanks martijn, this is great. |
|
||
experimental[] | ||
|
||
The field stats api allows to easily figure certain statistical properties without a field without executing a search, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe reword to:
...allows one to find statistical properties of a field without...
@rashidkpc Yes, I think we can add this filtering on the min value and max value in the reduce phase of this api. So we effectively omit shard level results with a |
I updated the PR to the latest master and addressed @rjernst's comments. (latest commit) |
out.writeLong(sumTotalTermFreq); | ||
} | ||
|
||
public static class Lng extends FieldStats<Long> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the short names actually buy anything? Could these have normal names like Long here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to rename it to Long
, but then it clashes with java.lang.Long
. Since these concrete classes are kind of private to ES the name shouldn't be an problem? I can also change the generic type to java.lang.Long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you just use FieldStats<java.lang.Long>? That is the only place the boxed type appears right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird markdown seemed to silently remove some of my text...I was trying to say FieldStats<java.lang.Long>
(which is what I think you meant by your last statement).
@martijnvg LGTM, I left a couple more minor comments. I also accidentally left most of them on the last commit itself...sorry. |
I updated the PR with the follow changes:
|
If there are no further comments about the additional commits then I'll push this tomorrow. |
} else if ("indices".equals(request.level())) { | ||
indexName = shardResponse.getIndex(); | ||
} else { | ||
// should already have been catched by the FieldStatsRequest#validate(...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
catched -> caught
@martijnvg LGTM, i just left a couple minor suggestions. |
c07698a
to
84bc4ed
Compare
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field. An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings. This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick. The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based. Closes elastic#10523
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field. An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings. This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick. The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based. Closes elastic#10523
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field. An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings. This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick. The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based. Closes #10523
Very nice. Will elasticsearch core use this internally to optimise which indices to query for a given field value ? |
Am I right in thinking that anyone wishing to replace their time based indices with this mechanism will need to perform two queries in order to do so? (The first field_api query to obtain the indicies to pass in the URI for the search query). Is it possible to add this as a filter for a search query, thus doing away for the need use patterns in the search URI completely? Or is @maharg101 correct in postulating that elasticsearch will use this internally to remove the need for the query to be explicit about indicies at all? |
@pjcard you have extrapolated what I meant slightly - the scenario I had in mind was that you would still be explicit about the time-based indices, and that the query execution would take advantage of the min / max values of fields to reduce the set of indices / shards to look in. |
@pjcard You would still need to send two requests to ES (1 field stats & 1 search request). ES doesn't use the min and max internally to optimise the search (all though in theory it can). This is also how Kibana is going to use the field stats api to reduce the number of indices a search needs to be executed on: elastic/kibana#4342 |
Hi @martijnvg, thanks for the information. I'll add a request for that feature in case it's of use for anybody else and/or feasible. |
The field stats api should return field level statistics such as lowest, highest values and number of documents that have at least one value for a field.
An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.
This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using
Terms
class in Lucene). So finding out these stats for fields is cheap and quick.The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based.
The api is quite straight forward it expects a list of fields and optionally an index can be provided:
(ran on a stack exchange dump)
The response looks like this:
This is a work in progress. Things to be done are documentation and tests.