Apply suggestions from review
- count-distinct.asciidoc
  - Content restructured, moving the section about approximate counts to end of doc.

- count.asciidoc
  - Clarified that omitting the `expression` parameter in `COUNT` is equivalent to `COUNT(*)`, which counts the number of rows.

- percentile.asciidoc
  - Moved the note about `PERCENTILE` being approximate and non-deterministic to end of doc.

- stats.asciidoc
  - Clarified the `STATS` command.
  - Added a note indicating that individual `null` values are skipped during aggregation.
leemthompo committed Jan 30, 2024
1 parent 631d414 commit bdb1ac9
Showing 4 changed files with 44 additions and 39 deletions.
46 changes: 23 additions & 23 deletions docs/reference/esql/functions/count-distinct.asciidoc
@@ -23,29 +23,6 @@ same effect as a threshold of 40000. The default value is 3000.

Returns the approximate number of distinct values.

[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values as the required memory usage and the need to communicate those
per-shard sets between nodes would utilize too many resources of the cluster.

This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The precision_threshold options allows to trade memory
for accuracy, and defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become a bit more fuzzy. The
maximum supported value is 40000, thresholds above this number will have the
same effect as a threshold of 40000. The default value is `3000`.

*Supported types*

Can take any field type as input.
@@ -83,3 +60,26 @@ include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExp
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression-result]
|===

[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values as the required memory usage and the need to communicate those
per-shard sets between nodes would utilize too many resources of the cluster.

This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The precision_threshold options allows to trade memory
for accuracy, and defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become a bit more fuzzy. The
maximum supported value is 40000, thresholds above this number will have the
same effect as a threshold of 40000. The default value is `3000`.
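
Illustration only, not part of the changed file: the scaling problem described in the relocated section can be sketched in a few lines of Python. The sketch assumes a hypothetical `exact_distinct_count` helper and toy per-shard lists. It shows the exact approach the text rules out (one set per shard, all shipped and merged), not the HyperLogLog++ algorithm that `COUNT_DISTINCT` is based on.

```python
from typing import Iterable, List, Set


def exact_distinct_count(shards: Iterable[List[str]]) -> int:
    """Exact distinct count: build one set per shard, then merge them all.

    Hypothetical helper for illustration; memory and transfer cost grow
    with the number of distinct values, which is why this does not scale.
    """
    merged: Set[str] = set()
    for shard_values in shards:
        per_shard = set(shard_values)  # holds every distinct value on the shard
        merged |= per_shard            # the full set is sent and merged
    return len(merged)


# Toy data standing in for three shards.
shards = [["a", "b", "c"], ["b", "c", "d"], ["d", "e"]]
print(exact_distinct_count(shards))  # prints 5
```

An approximate algorithm replaces each per-shard set with a fixed-size summary, which is the memory-for-accuracy trade-off that the precision threshold controls.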
6 changes: 4 additions & 2 deletions docs/reference/esql/functions/count.asciidoc
@@ -12,12 +12,14 @@ COUNT([expression])
*Parameters*

`expression`::
Expression that outputs values to be counted. If omitted, returns a count all
(the number of rows).
Expression that outputs values to be counted.
If omitted, equivalent to `COUNT(*)` (the number of rows).


*Description*

Returns the total number (count) of input values.
See also <<esql-agg-count-distinct-approximate>>.

*Supported types*

24 changes: 12 additions & 12 deletions docs/reference/esql/functions/percentile.asciidoc
@@ -23,18 +23,6 @@ Returns the value at which a certain percentage of observed values occur. For
example, the 95th percentile is the value which is greater than 95% of the
observed values and the 50th percentile is the <<esql-agg-median>>.

[discrete]
[[esql-agg-percentile-approximate]]
==== `PERCENTILE` is (usually) approximate

include::../../aggregations/metrics/percentile-aggregation.asciidoc[tag=approximate]

[WARNING]
====
`PERCENTILE` is also {wikipedia}/Nondeterministic_algorithm[non-deterministic].
This means you can get slightly different results using the same data.
====

*Example*

[source.merge.styled,esql]
@@ -58,3 +46,15 @@ include::{esql-specs}/stats_percentile.csv-spec[tag=docsStatsPercentileNestedExp
|===
include::{esql-specs}/stats_percentile.csv-spec[tag=docsStatsPercentileNestedExpression-result]
|===

[discrete]
[[esql-agg-percentile-approximate]]
==== `PERCENTILE` is (usually) approximate

include::../../aggregations/metrics/percentile-aggregation.asciidoc[tag=approximate]

[WARNING]
====
`PERCENTILE` is also {wikipedia}/Nondeterministic_algorithm[non-deterministic].
This means you can get slightly different results using the same data.
====
7 changes: 5 additions & 2 deletions docs/reference/esql/processing-commands/stats.asciidoc
@@ -6,7 +6,8 @@

[source,esql]
----
STATS [column1 =] expression1[, ..., [columnN =] expressionN] [BY grouping_expression1[, ..., grouping_expressionN]]
STATS [column1 =] expression1[, ..., [columnN =] expressionN]
[BY grouping_expression1[, ..., grouping_expressionN]]
----

*Parameters*
@@ -21,6 +22,8 @@ An expression that computes an aggregated value.
`grouping_expressionX`::
An expression that outputs the values to group by.

NOTE: Individual `null` values are skipped when computing aggregations.

*Description*

The `STATS ... BY` processing command groups rows according to a common value
@@ -86,7 +89,7 @@ include::{esql-specs}/stats.csv-spec[tag=statsGroupByMultipleValues]
----

Both the aggregating functions and the grouping expressions accept other
functions. This can come in useful for using `STATS...BY` on multivalue columns.
functions. This is useful for using `STATS...BY` on multivalue columns.
For example, to calculate the average salary change, you can use `MV_AVG` to
first average the multiple values per employee, and use the result with the
`AVG` function:
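
Illustration only, not part of the changed file: the two-step pattern in the last hunk (average each employee's multiple values first, then aggregate across employees) and the new note about skipped `null` values can be modelled in plain Python. The names `mv_avg` and `avg_skipping_nulls` and the sample rows are hypothetical stand-ins, and the sketch assumes that averaging a `null` row yields `null`; it is an analogy for the documented behavior, not the ES|QL implementation.

```python
from statistics import mean
from typing import List, Optional

# Hypothetical rows: each employee has several salary_change values, or null.
rows: List[Optional[List[float]]] = [
    [1.0, 3.0],
    [4.0],
    None,              # an employee with no recorded salary change
    [-2.0, 0.0, 8.0],
]


def mv_avg(values: Optional[List[float]]) -> Optional[float]:
    """Stand-in for MV_AVG: collapse one row's values to a single average."""
    return None if values is None else mean(values)


def avg_skipping_nulls(values: List[Optional[float]]) -> Optional[float]:
    """Stand-in for AVG: individual null values are skipped, not treated as 0."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


per_employee = [mv_avg(r) for r in rows]   # one value (or None) per employee
result = avg_skipping_nulls(per_employee)  # average over the three non-null rows
print(per_employee, result)
```

Skipping the `null` row rather than counting it as zero keeps the divisor at three, so the result is 8/3 rather than 2.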
