Add doc_count field mapper #64503

Merged 6 commits on Nov 3, 2020
9 changes: 8 additions & 1 deletion docs/reference/mapping/fields.asciidoc
@@ -29,6 +29,13 @@ fields can be customized when a mapping is created.
The size of the `_source` field in bytes, provided by the
{plugins}/mapper-size.html[`mapper-size` plugin].

[discrete]
=== Doc count metadata field

<<mapping-doc-count-field,`_doc_count`>>::

A custom field used for storing doc counts when a document represents pre-aggregated data.

[discrete]
=== Indexing metadata fields

@@ -55,6 +62,7 @@ fields can be customized when a mapping is created.

Application specific metadata.

include::fields/doc-count-field.asciidoc[]

include::fields/field-names-field.asciidoc[]

@@ -69,4 +77,3 @@ include::fields/meta-field.asciidoc[]
include::fields/routing-field.asciidoc[]

include::fields/source-field.asciidoc[]

118 changes: 118 additions & 0 deletions docs/reference/mapping/fields/doc-count-field.asciidoc
@@ -0,0 +1,118 @@
[[mapping-doc-count-field]]
=== `_doc_count` field
++++
<titleabbrev>_doc_count</titleabbrev>
++++

Bucket aggregations always return a field named `doc_count` showing the number of documents that were aggregated and partitioned
into each bucket. The value is computed in a simple way: `doc_count` is incremented by 1 for every document collected
in the bucket.

While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent
documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may
represent multiple documents.

To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a
metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents
aggregated in a single summary field.

When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count`
by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default.

[IMPORTANT]
========
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior.
========
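
The counting rule described above can be illustrated with a short Python sketch. This is only a model of the documented behavior, not the Elasticsearch implementation: the function names `effective_doc_count` and `terms_doc_counts` and the sample documents are assumptions made for illustration.

```python
from collections import Counter


def effective_doc_count(doc):
    """Return how much a document adds to a bucket's doc_count.

    Hypothetical helper: models the documented rule only.
    """
    value = doc.get("_doc_count", 1)  # a missing field implies 1
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(
            f"_doc_count must be a single positive integer, got {value!r}"
        )
    return value


def terms_doc_counts(docs, field):
    """Model a terms aggregation: one bucket per distinct value of `field`."""
    buckets = Counter()
    for doc in docs:
        if field in doc:
            # Increment by _doc_count instead of by 1
            buckets[doc[field]] += effective_doc_count(doc)
    return dict(buckets)


# Illustrative documents; the last one has no _doc_count and counts as 1
docs = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},
]

print(terms_doc_counts(docs, "str"))  # {'abc': 11, 'xyz': 5, 'foo': 8}
```

Five documents are indexed, but the bucket counts sum to 24 because each document contributes its `_doc_count` value rather than 1.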

[[mapping-doc-count-field-example]]
==== Example

The following <<indices-create-index, create index>> API request creates a new index with these field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}
--------------------------------------------------

The following <<docs-index_,index>> API requests store pre-aggregated data for
two histograms: `histogram_1` and `histogram_2`.

[source,console]
--------------------------------------------------
PUT my_index/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
      "counts" : [3, 7, 23, 12, 6]
   },
  "_doc_count": 45 <1>
}

PUT my_index/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
      "counts" : [8, 17, 8, 7, 6, 2]
   },
  "_doc_count": 62 <1>
}
--------------------------------------------------
<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram.

If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`:

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs" : {
    "histogram_titles" : {
      "terms" : { "field" : "my_text" }
    }
  }
}
--------------------------------------------------

We will get the following response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations" : {
    "histogram_titles" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "histogram_2",
          "doc_count" : 62
        },
        {
          "key" : "histogram_1",
          "doc_count" : 45
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:test not setup]
@@ -0,0 +1,150 @@
setup:
  - do:
      indices.create:
        index: test_1
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              str:
                type: keyword
              number:
                type: integer

  - do:
      bulk:
        index: test_1
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }'
          - '{"index": {}}'
          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }'
          - '{"index": {}}'
          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"str": "abc", "number" : 500, "unmapped": "abc" }'

---
"Test numeric terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"

  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.num_terms.buckets: 3 }
  - match: { aggregations.num_terms.buckets.0.key: 100 }
  - match: { aggregations.num_terms.buckets.0.doc_count: 12 }
  - match: { aggregations.num_terms.buckets.1.key: 500 }
  - match: { aggregations.num_terms.buckets.1.doc_count: 11 }
  - match: { aggregations.num_terms.buckets.2.key: 200 }
  - match: { aggregations.num_terms.buckets.2.doc_count: 1 }


---
"Test keyword terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.str_terms.buckets: 3 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }
  - match: { aggregations.str_terms.buckets.1.key: "foo" }
  - match: { aggregations.str_terms.buckets.1.doc_count: 8 }
  - match: { aggregations.str_terms.buckets.2.key: "xyz" }
  - match: { aggregations.str_terms.buckets.2.doc_count: 5 }

---
"Test unmapped string terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      bulk:
        index: test_2
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc" }'
          - '{"index": {}}'
          - '{"str": "abc" }'
  - do:
      search:
        index: test_2
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } }

  - match: { hits.total: 2 }
  - length: { aggregations.str_terms.buckets: 1 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }

---
"Test composite str_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
                { "composite_agg" : { "composite" :
                  {
                    "sources": ["str_terms": { "terms": { "field": "str" } }]
                  }
                }
              }
            }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 }
  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 }
  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 }


---
"Test composite num_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
                { "composite_agg" :
                  { "composite" :
                    {
                      "sources": ["num_terms" : { "terms" : { "field" : "number" } }]
                    }
                  }
                }
            }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 }
  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 }
  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 }
