Add Cumulative Cardinality agg (and Data Science plugin) (elastic#43661)
This adds a pipeline aggregation that calculates the cumulative cardinality of a field. It does this by iteratively merging in the HLL sketch from consecutive buckets and emitting the cardinality up to that point. This is useful for things like finding the total "new" users that have visited a website (as opposed to "repeat" visitors). This is a Basic+ aggregation and adds a new Data Science plugin to house it and future advanced analytics/data science aggregations.
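A minimal sketch of the idea described above, with plain Python sets standing in for the HLL sketches (illustrative only, not the plugin's Java implementation):

[source,python]
--------------------------------------------------
# Conceptual sketch: walk the histogram buckets in order, merge each bucket's
# sketch into a running sketch, and emit the cardinality seen so far.
# Plain sets stand in for HLL sketches, so the counts here are exact.
def cumulative_cardinality(buckets):
    seen = set()                  # running "merged sketch"
    totals = []
    for bucket in buckets:
        seen |= set(bucket)       # merge this bucket's distinct values
        totals.append(len(seen))  # cardinality up to and including this bucket
    return totals

# Three days of visitors: per-day distinct counts are 2, 2, 3,
# while the cumulative distinct counts are 2, 3, 4.
print(cumulative_cardinality([["a", "b"], ["b", "c"], ["c", "d", "a"]]))
--------------------------------------------------
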
1 parent 5fbb572, commit 4a280a6
Showing 26 changed files with 1,580 additions and 2 deletions.
docs/reference/aggregations/pipeline/cumulative-cardinality-aggregation.asciidoc (235 additions & 0 deletions)
@@ -0,0 +1,235 @@
[role="xpack"]
[testenv="basic"]
[[search-aggregations-pipeline-cumulative-cardinality-aggregation]]
=== Cumulative Cardinality Aggregation

A parent pipeline aggregation which calculates the cumulative cardinality in a parent `histogram` (or `date_histogram`)
aggregation. The specified metric must be a `cardinality` aggregation and the enclosing histogram
must have `min_doc_count` set to `0` (the default for `histogram` aggregations).

The `cumulative_cardinality` agg is useful for finding "total new items", like the number of new visitors to your
website each day. A regular `cardinality` aggregation will tell you how many unique visitors came each day, but it doesn't
differentiate between "new" and "repeat" visitors. The cumulative cardinality aggregation can be used to determine
how many of each day's unique visitors are "new".
==== Syntax

A `cumulative_cardinality` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
    "cumulative_cardinality": {
        "buckets_path": "my_cardinality_agg"
    }
}
--------------------------------------------------
// NOTCONSOLE
[[cumulative-cardinality-params]]
.`cumulative_cardinality` Parameters
[options="header"]
|===
|Parameter Name |Description |Required |Default Value
|`buckets_path` |The path to the cardinality aggregation we wish to find the cumulative cardinality for (see <<buckets-path-syntax>> for more details) |Required |
|`format` |Format to apply to the output value of this aggregation |Optional |`null`
|===

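The optional `format` parameter accepts a DecimalFormat-style pattern, as with other pipeline aggregations; a minimal sketch of its use (the pattern shown is an illustrative assumption, not taken from this commit):

[source,js]
--------------------------------------------------
{
    "cumulative_cardinality": {
        "buckets_path": "my_cardinality_agg",
        "format": "#,##0"
    }
}
--------------------------------------------------
// NOTCONSOLE

When a format is supplied, the aggregation should also return a `value_as_string` rendering alongside the numeric `value`.
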
The following snippet calculates the cumulative cardinality of the total daily `users`:

[source,js]
--------------------------------------------------
GET /user_hits/_search
{
    "size": 0,
    "aggs" : {
        "users_per_day" : {
            "date_histogram" : {
                "field" : "timestamp",
                "calendar_interval" : "day"
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "user_id"
                    }
                },
                "total_new_users": {
                    "cumulative_cardinality": {
                        "buckets_path": "distinct_users" <1>
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:user_hits]

<1> `buckets_path` instructs this aggregation to use the output of the `distinct_users` aggregation for the cumulative cardinality

And the following may be the response:

[source,js]
--------------------------------------------------
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "users_per_day": {
         "buckets": [
            {
               "key_as_string": "2019-01-01T00:00:00.000Z",
               "key": 1546300800000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 2
               }
            },
            {
               "key_as_string": "2019-01-02T00:00:00.000Z",
               "key": 1546387200000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 3
               }
            },
            {
               "key_as_string": "2019-01-03T00:00:00.000Z",
               "key": 1546473600000,
               "doc_count": 3,
               "distinct_users": {
                  "value": 3
               },
               "total_new_users": {
                  "value": 4
               }
            }
         ]
      }
   }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 11/"took": $body.took/]
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": $body._shards/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": $body.hits/]

Note how the second day, `2019-01-02`, has two distinct users but the `total_new_users` metric generated by the
cumulative pipeline agg only increments to three. This means that only one of the two users that day was
new; the other had already been seen on the previous day. This happens again on the third day, where only
one of the three users is completely new.
==== Incremental cumulative cardinality

The `cumulative_cardinality` agg will show you the total, distinct count since the beginning of the time period
being queried. Sometimes, however, it is useful to see the "incremental" count: that is, how many new users
are added each day, rather than the running cumulative total.

This can be accomplished by adding a `derivative` aggregation to our query:
[source,js]
--------------------------------------------------
GET /user_hits/_search
{
    "size": 0,
    "aggs" : {
        "users_per_day" : {
            "date_histogram" : {
                "field" : "timestamp",
                "calendar_interval" : "day"
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "user_id"
                    }
                },
                "total_new_users": {
                    "cumulative_cardinality": {
                        "buckets_path": "distinct_users"
                    }
                },
                "incremental_new_users": {
                    "derivative": {
                        "buckets_path": "total_new_users"
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:user_hits]

And the following may be the response (note that the first bucket has no `incremental_new_users` value, since the
`derivative` needs a previous bucket to compare against):
[source,js]
--------------------------------------------------
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "users_per_day": {
         "buckets": [
            {
               "key_as_string": "2019-01-01T00:00:00.000Z",
               "key": 1546300800000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 2
               }
            },
            {
               "key_as_string": "2019-01-02T00:00:00.000Z",
               "key": 1546387200000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 3
               },
               "incremental_new_users": {
                  "value": 1.0
               }
            },
            {
               "key_as_string": "2019-01-03T00:00:00.000Z",
               "key": 1546473600000,
               "doc_count": 3,
               "distinct_users": {
                  "value": 3
               },
               "total_new_users": {
                  "value": 4
               },
               "incremental_new_users": {
                  "value": 1.0
               }
            }
         ]
      }
   }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 11/"took": $body.took/]
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": $body._shards/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": $body.hits/]