Skip to content

Latest commit

 

History

History
470 lines (256 loc) · 16.4 KB

aggregates.rst

File metadata and controls

470 lines (256 loc) · 16.4 KB

Continuous Aggregates

One of the fundamental goals of PipelineDB is to facilitate high-performance continuous aggregation, so not suprisingly aggregates are a central component of PipelineDB's utility. Continuous aggregates can be very powerful--in the most general sense they make it possible to keep the amount of data persisted in PipelineDB constant relative to the amount of data that has been pushed through it. This can enable sustainable and very high data throughput on modest hardware.

Continuous aggregates are incrementally updated in real time as new events are read by the the continuous view that they're a part of. For simple aggregates such as count and sum, it is easy to see how their results can be incrementally updated--just add the new value to the existing result.

But for more complicated aggregates, such as avg, stddev, percentile_cont, etc., more advanced infrastructure is required to support efficient incremental updates, and PipelineDB handles all of that complexity for you transparently.

Below you'll find a description of all the aggregates that PipelineDB supports. A few of them behave slightly differently than their standard counterparts in order to efficiently operate on infinite streams of data. Such aggregates have been annotated with an explanation of how exactly their behavior differs.

Note

It may be helpful for you to consult the excellent PostgreSQL aggregates documentation.


Bloom Filter Aggregates

bloom_agg ( expression )

Adds all input values to a :ref:`bloom-filter`

bloom_agg ( expression , p , n )

Adds all input values to a Bloom filter and sizes it according to the given parameters. p is the desired false-positive rate, and n is the expected number of unique elements to add.

bloom_union_agg ( bloom filter )

Takes the union of all input Bloom filters, resulting in a single Bloom filter containing all of the input Bloom filters' information.

bloom_intersection_agg ( bloom filter )

Takes the intersection of all input Bloom filters, resulting in a single Bloom filter containing only the information shared by all of the input Bloom filters.

See :ref:`bloom-funcs` for functionality that can be used to manipulate Bloom filters.

Frequency Tracking Aggregates

freq_agg ( expression )

Adds all input values to an internal :ref:`count-min-sketch`, enabling efficient online computation of the frequency of each input expression.

freq_agg ( expression, epsilon, p )

Same as above, but accepts epsilon and p as parameters for the underlying cmsketch. epsilon determines the acceptable error rate of the cmsketch, and defaults to 0.002 (0.2%). p determines the confidence, and defaults to 0.995 (99.5%). Lower epsilon and p will result in smaller cmsketch structures, and vice versa.

freq_merge_agg ( count-min sketch )

Merges all input Count-min sketches into a single one containing all of the information of the input Count-min sketches.

See :ref:`cmsketch-funcs` for functionality that can be used to manipulate Count-Min sketches.

Top-K Aggregates

topk_agg ( expression , k )

Tracks the top k input expressions by adding all input values to a :ref:`topk` data structure sized for the given k, incrementing each value's count by 1 each time it is added.

topk_agg (expression, k, weight )

Same as above, but associates the given weight to the input expression (rather than a default weight of 1).

topk_merge_agg ( topk )

Merges all topk inputs into a single topk data structure.

See :ref:`topk-funcs` for functionality that can be used to manipulate topk objects.

HyperLogLog Aggregates

hll_agg ( expression )

Adds all input values to a :ref:`hll`.

hll_agg ( expression, p )

Adds all input values to a :ref:`hll` with the given p. A larger p reduces the HyperLogLog's error rate, at the expense of a larger size.

hll_union_agg ( hyperloglog )

Takes the union of all input HyperLogLogs, resulting in a single HyperLogLog that contains all of the information of the input HyperLogLogs.

See :ref:`hll-funcs` for functionality that can be used to manipulate HyperLogLog objects.

Distribution Aggregates

dist_agg ( expression )

Adds all input values to a :ref:`t-digest` in order to track the distribution of all input expressions.

dist_agg ( expression, compression )

Same as above, but builds the underyling tdigest using the given compression. compression must be an integer in the range [20, 1000]. A higher value for compression will yield a larger tdigest with but with more precision than a smaller tdigest with a lower compression value.

See :ref:`tdigest-funcs` for functionality that can be used to manipulate tdigest objects.

Miscellaneous Aggregates

bucket_agg ( expression , bucket_id )

Adds 4-byte hashes of each input expression to the bucket with the given bucket_id. Each hash may only be present precisely once in one bucket at any given time. Buckets can therefore be thought of as exclusive sets of hashes of the input expressions.

bucket_agg ( expression , bucket_id , timestamp )

Same as above, but allows a timestamp expression to determine bucket entry order. That is, only a value's latest entry will cause it to change buckets.

See :ref:`misc-funcs` for functionality that can be used to manipulate bucket_agg objects.

exact_count_distinct ( expression )

Counts the exact number of distinct values for the given expression. Since count distinct used in continuous views implicitly uses HyperLogLog for efficiency, exact_count_distinct can be used when the small margin of error inherent to using HyperLogLog is not acceptable.

Important

exact_count_distinct must store all unique values observed in order to determine uniqueness, so it is not recommended for use when many unique values are expected.

first_values ( n ) WITHIN GROUP (ORDER BY sort_expression)

An ordered-set aggregate that stores the first n values ordered by the provided sort expression.

Note

See also: :ref:`pipeline-funcs`, which explains some of the PipelineDB's non-aggregate functionality for manipulating Bloom filters, Count-min sketches, HyperLogLogs and T-Digests. Also, check out :ref:`probabilistic` for more information about what they are and how you can leverage them.

keyed_max ( key, value )

Returns the value associated with the "highest" key.

keyed_min ( key, value )

Returns the value associated with the "lowest" key.

set_agg ( expression )

Adds all input values to a set.

See :ref:`misc-funcs` for functionality that can be used to manipulate sets.


Combine

Since PipelineDB can incrementally update aggregate values, it has the capability to combine existing aggregates using more information than simply their current raw values. For example, combining multiple averages isn't simply a matter of taking the average of the averages. Their weights must be taken into account.

For this type of operation, PipelineDB exposes the special combine aggregate. Its description is as follows:

combine ( aggregate column )

Given an aggregate column, combines all values into a single value as if all of the individual aggregates' inputs were aggregated a single time.

Note

combine only works on aggregate columns that belong to continuous views.

Let's look at an example:

postgres=# CREATE VIEW v AS SELECT g::integer, AVG(x::integer) FROM stream GROUP BY g;
CREATE VIEW
postgres=# INSERT INTO stream (g, x) VALUES (0, 10), (0, 10), (0, 10), (0, 10), (0, 10);
INSERT 0 5
postgres=# INSERT INTO stream (g, x) VALUES (1, 20);
INSERT 0 1
postgres=# SELECT * FROM v;
 g |         avg
---+---------------------
 0 | 10.0000000000000000
 1 | 20.0000000000000000
(2 rows)

postgres=# SELECT avg(avg) FROM v;
         avg
---------------------
 15.0000000000000000
(1 row)

postgres=# -- But that didn't take into account that the value of 10 weighs much more,
postgres=# -- because it was inserted 5 times, whereas 20 was only inserted once.
postgres=# -- combine() will take this weight into account
postgres=#
postgres=# SELECT combine(avg) FROM v;
       combine
---------------------
 11.6666666666666667
(1 row)

postgres=# -- There we go! This is the same average we would have gotten if we ran
postgres=# -- a single average on all 6 of the above inserted values, yet we only
postgres=# -- needed two rows to do it.

General Aggregates

array_agg ( expression )

Input values, including nulls, concatenated into an array

avg ( expression )

The average of all input values

bit_and ( expression )

The bitwise AND of all non-null input values, or null if none

bit_or ( expression )

The bitwise OR of all non-null input values, or null if none

bool_and ( expression )

True if all input values are true, otherwise false

bool_or ( expression )

True if at least one input value is true, otherwise false

count ( * )

Number of input rows

count ( DISTINCT expression)

Number of rows for which expression is distinct.

Note

Counting the distinct number of expressions on an infinite stream would require infinite memory, so continuous views use :ref:`hll` to accomplish distinct counting in constant space and time, at the expense of a small margin of error. Empirically, PipelineDB's implementation of :ref:`hll` has an error rate of ~0.81%. For example, count distinct might show 1008 when the actual number of unique expressions was 1000.

count ( expression )

Number of rows for which expression is non-null.

every ( expression )

Equivalent to bool_and

json_agg ( expression )

Aggregates values as a JSON array

json_object_agg ( key, value )

Aggregates key-value pairs as a JSON object

jsonb_agg ( expression )

Aggregates values as a JSONB array

jsonb_object_agg ( key, value )

Aggregates key-value pairs as a JSONB object

max ( expression )

Maximum value of expression across all input values

min ( expression )

Minimum value of expression across all input values

string_agg ( expression, delimiter )

Input values concatenated into a string, separated by delimiter

sum ( expression )

Sum of expression across all input values

Statistical Aggregates

corr ( y, x )

Correlation coefficient

covar_pop ( y, x )

Population covariance

covar_samp ( y, x )

Sample covariance

regr_avgx ( y, x )

Average of the independent variable (sum(x)/N)

regr_avgy ( y, x )

Average of the independent variable (sum(y)/N)

regr_count ( y, x )

Number of input rows in which both expressions are non-null

regr_intercept ( y, x )

y-intercept of the least-squares-fit linear equation determined by the (x, y) pairs

regr_r2 ( y, x )

Square of the correlation coefficient

regr_slope ( y, x )

Slope of the least-squares-fit linear equation determined by the (x, y) pairs

regr_sxx ( y, x )

sum(X^2) - sum(X)^2/N -- sum of squares of the independent variable

regr_sxy ( y, x )

sum(X*Y) - sum(X) * sum(Y)/N -- sum of products of independent times dependent variable

regr_syy ( y, x )

sum(Y^2) - sum(Y)^2/N -- sum of squares of the independent variable

stddev ( expression )

Sample standard deviation of the input values

stddev_pop ( expression )

Population standard deviation of the input values

variance ( expression )

Sample variance of the input values (square of the sample standard deviation)

var_pop ( expression )

Population variance of the input values (square of the population standard deviation)

Ordered-set Aggregates

ordered-set aggregates apply ordering to their input in order to obtain their results, so they use the WITHIN GROUP clause. Its syntax is as follows:

aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause )

Let's look at a couple examples.

Compute the 99th percentile of value:

SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY value) FROM some_table;

Or with a continuous view:

CREATE VIEW percentile AS
SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY value::float8)
FROM some_stream;

percentile_cont ( fraction )

Continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed

percentile_cont ( array of fractions )

Multiple continuous percentile: returns an array of results matching the shape of the fractions parameter, with each non-null element replaced by the value corresponding to that percentile

Note

Computing percentiles on infinite streams would require infinite memory, so both forms of percentile_cont, when used by continuous views, use :ref:`t-digest` as a way to estimate percentiles with a very high degree of accuracy. In general, percentiles in continuous views are more accurate the closer they are to the upper or lower bounds of [0, 1).


Hypothetical-set Aggregates

hypothetical-set aggregates take an expression and compute something about it within the context of a set of input rows. For example, rank(2) computes the rank of 2 within the context of whatever the input rows end up being.

The hypothetical-set aggregates use the WITHIN GROUP clause to define the input rows. Its syntax is as follows:

aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause )

Here is an example of of a hypothetical-set aggregate being used by a continuous view:

CREATE VIEW continuous_rank AS
SELECT rank(42) WITHIN GROUP (ORDER BY value::float8)
FROM some_stream;

This continuous view will continuously update the rank of 42 given all of the events it has read.

rank ( arguments )

Rank of the hypothetical row, with gaps for duplicate rows

dense_rank ( arguments )

Rank of the hypothetical row, without gaps

Note

Computing the hypothetical dense_rank of a value given an infinite stream of values would require infinite memory, so continuous views use :ref:`hll` to do it in constant time and space, at the expense of a small margin of error. Empirically, PipelineDB's implementation of :ref:`hll` has an error rate of ~0.2%. In other words, dense_rank (1000) in a continuous view might show 998 when the actual number of unique lower-ranking values seen was 1000.

percent_rank ( arguments )

Relative rank of the hypothetical row, ranging from 0 to 1

cume_dist ( arguments )

Relative rank of the hypothetical row, ranging from 1/N to 1

Unsupported Aggregates

mode ( )

Future releases of PipelineDB will include an implementation of an online mode estimation algorithm, but for now it's not supported

percentile_disc ( arguments )

Given an input percentile (such as 0.99), percentile_disc returns the very first value in the input set that is within that percentile. This requires actually sorting the input set, which is obviously impractical on an infinite stream, and doesn't even allow for a highly accurate estimation algorithm such as the one we use for percentile_cont.

xmlagg ( xml )

:(

<aggregate_name> (DISTINCT expression)

Only the count aggregate function is supported with a DISTINCT expression as noted above in the General Aggregates section. In future releases, we might leverage :ref:`bloom-filter` to allow DISTINCT expressions for all aggregate functions.