Skip to content

Latest commit

 

History

History
483 lines (390 loc) · 14.9 KB

tdigest.md

File metadata and controls

483 lines (390 loc) · 14.9 KB

T-Digest experimental

Description
Details
Example
Continuous Aggregate Example
API

Description

TimescaleDB Toolkit provides an implementation of the t-digest data structure for quantile approximations. A t-digest is a space efficient aggregation which provides increased resolution at the edges of the distribution. This allows for more accurate estimates of extreme quantiles than traditional methods.

Details

Timescale's t-digest is implemented as an aggregate function in PostgreSQL. They do not support moving-aggregate mode, and are not ordered-set aggregates. Presently they are restricted to float values, but the goal is to make them polymorphic. They are partializable and are good candidates for continuous aggregation.

One additional thing to note about TDigests is that they are somewhat dependant on the order of inputs. The percentile approximations should be nearly equal for the same underlying data, especially at the extremes of the quantile range where the TDigest is inherently more accurate, they are unlikely to be identical if built in a different order. While this should have little effect on the accuracy of the estimates, it is worth noting that repeating the creation of the TDigest might have subtle differences if the call is being parallelized by Postgres. Similarly, building a TDigest by combining several subdigests using the summary aggregate is likely to produce a subtley different result than combining all of the underlying data using a single point aggregate.

Usage Example

For this example we're going to start with a table containing some NOAA weather data for a few weather stations across the US over the past 20 years.

\d weather;
                         Table "public.weather"
 Column  |            Type             | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+---------
 station | text                        |           |          |
 name    | text                        |           |          |
 date    | timestamp without time zone |           |          |
 prcp    | double precision            |           |          |
 snow    | double precision            |           |          |
 tavg    | double precision            |           |          |
 tmax    | double precision            |           |          |
 tmin    | double precision            |           |          |

Now let's create some t-digests for our different stations and verify that they're receiving data.

CREATE VIEW high_temp AS
    SELECT name, tdigest(100, tmax)
    FROM weather
    GROUP BY name;

SELECT
    name,
    num_vals(tdigest)
FROM high_temp;
                 name                  | num_vals
---------------------------------------+-----------
 PORTLAND INTERNATIONAL AIRPORT, OR US |      7671
 LITCHFIELD PARK, AZ US                |      5881
 NY CITY CENTRAL PARK, NY US           |      7671
 MIAMI INTERNATIONAL AIRPORT, FL US    |      7671
(4 rows)

We can then check to see the 99.5 percentile high temperature for each location.

SELECT
    name,
    approx_percentile(0.995, tdigest)
FROM high_temp;
                 name                  |           quantile
---------------------------------------+--------------------
 PORTLAND INTERNATIONAL AIRPORT, OR US |   98.4390837104072
 LITCHFIELD PARK, AZ US                | 114.97809722222223
 NY CITY CENTRAL PARK, NY US           |  95.86391321044545
 MIAMI INTERNATIONAL AIRPORT, FL US    |  95.04283854166665
(4 rows)

Or even check to see what quantile 90F would fall at in each city.

SELECT
    name,
    approx_percentile_rank(90.0, tdigest)
FROM high_temp;
                 name                  |  approx_percentile_rank
---------------------------------------+--------------------
 PORTLAND INTERNATIONAL AIRPORT, OR US | 0.9609990016734108
 LITCHFIELD PARK, AZ US                | 0.5531621580122781
 NY CITY CENTRAL PARK, NY US           | 0.9657150306348585
 MIAMI INTERNATIONAL AIRPORT, FL US    | 0.8093468908877591
(4 rows)

Example Using TimeScale Continuous Aggregates (tdigest-cagg-example)

Timescale continuous aggregates provide an easy way to keep a tdigest up to date as more data is added to a table. The following example shows how this might look in practice. The first step is to create a Timescale hypertable to store our data.

SET TIME ZONE 'UTC';
CREATE TABLE test(time TIMESTAMPTZ, value DOUBLE PRECISION);
SELECT create_hypertable('test', 'time');

Next a materialized view with the timescaledb.continuous property is added. This will automatically keep itself, including the tdigest in this case, up to date as data is added to the table.

CREATE MATERIALIZED VIEW weekly_sketch
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('7 day'::interval, time) as week,
    tdigest(100, value) as digest
FROM test
GROUP BY time_bucket('7 day'::interval, time);

Next a utility function, generate_periodic_normal_series, is called to generate some data. When called in this manner the function will return 28 days worth of data points spaced 10 minutes apart. These points are generate by adding a random point (with a normal distribution and standard deviation of 100) to a sine wave which oscilates between 900 and 1100 over the period of a day.

INSERT INTO test
    SELECT time, value
    FROM toolkit_experimental.generate_periodic_normal_series('2020-01-01 UTC'::timestamptz, rng_seed => 543643);
INSERT 0 4032
Refresh to make sure we're exercising our serialization path. ```SQL ,non-tranactional,ignore-output CALL refresh_continuous_aggregate('weekly_sketch', NULL, NULL); ```

Finally, a query is run over the aggregate to see various approximate percentiles from different weeks.

SELECT
    week,
    approx_percentile(0.01, digest) AS low,
    approx_percentile(0.5, digest) AS mid,
    approx_percentile(0.99, digest) AS high
FROM weekly_sketch
ORDER BY week;
         week          |        low        |        mid         |        high
-----------------------+-------------------+--------------------+--------------------
2019-12-30 00:00:00+00 | 783.2075197029583 | 1030.4505832620227 | 1276.7865808567146
2020-01-06 00:00:00+00 | 865.2941219994462 | 1096.0356855737048 |  1331.649176312383
2020-01-13 00:00:00+00 | 834.6747915021757 |  1060.024660266383 |    1286.1810386717
2020-01-20 00:00:00+00 | 728.2421431793433 |  955.3913494459423 |  1203.730690023456
2020-01-27 00:00:00+00 | 655.1143367116582 |  903.4836014674186 | 1167.7058289748031

It is also possible to combine the weekly aggregates to run queries on the entire data:

SELECT
    approx_percentile(0.01, combined.digest) AS low,
    approx_percentile(0.5, combined.digest) AS mid,
    approx_percentile(0.99, combined.digest) AS high
FROM (SELECT rollup(digest) AS digest FROM weekly_sketch) AS combined;
       low        |        mid         |        high
------------------+--------------------+--------------------
746.7844638729881 | 1026.6100299252928 | 1294.5391132795592

Command List (A-Z)

Aggregate Functions

Accessor Functions


tdigest (point form)

tdigest(
    buckets INTEGER,
    value DOUBLE PRECISION
) RETURNS TDigest

This will construct and return a TDigest with the specified number of buckets over the given values.

Required Arguments

Name Type Description
buckets INTEGER Number of buckets in the digest. Increasing this will provide more accurate quantile estimates, but will require more memory.
value DOUBLE PRECISION Column to aggregate.

Returns

Column Type Description
tdigest TDigest A t-digest object which may be passed to other t-digest APIs.

Sample Usages

For this example, assume we have a table 'samples' with a column 'weights' holding DOUBLE PRECISION values. The following will simply return a digest over that column

SELECT tdigest(100, data) FROM samples;

It may be more useful to build a view from the aggregate that can later be passed to other tdigest functions.

CREATE VIEW digest AS
    SELECT tdigest(100, data)
    FROM samples;

rollup (summary form)

rollup(
    digest TDigest
) RETURNS TDigest

This will combine multiple already constructed TDigests, if they were created with the same size. This is very useful for re-aggregating digests already constructed using the point form. Note that the resulting digest may be subtly different from a digest constructed directly from the underlying points, as noted in the details section above.

Required Arguments

Name Type Description
digest TDigest Previously constructed TDigest objects.

Returns

Column Type Description
tdigest TDigest A TDigest representing all of the underlying data from all the subaggregates.

Sample Usages

This example assumes a table 'samples' with a column 'data' holding DOUBLE PRECISION values and an 'id' column that holds the what series the data belongs to. A view to get the TDigests for each id using the point form can be created like so:

CREATE VIEW digests AS
    SELECT
        id,
        rollup(100, data) as digest
    FROM samples
    GROUP BY id;

That view can then be used to get the full aggregate like so:

SELECT rollup(digest)
FROM digests;

approx_percentile

approx_percentile(
    quantile DOUBLE PRECISION,
    digest TDigest
) RETURNS TDigest

Get the approximate value at a quantile from a t-digest

Required Arguments

Name Type Description
quantile DOUBLE PRECISION The desired quantile (0.0-1.0) to approximate.
digest TDigest The digest to compute the quantile on.

Returns

Column Type Description
approx_percentile DOUBLE PRECISION The estimated value at the requested quantile.

Sample Usage

SELECT approx_percentile(0.90, tdigest(100, data))
FROM generate_series(1, 100) data;
 approx_percentile
----------
     90.5

approx_percentile_rank

approx_percentile_rank(
    value DOUBLE PRECISION,
    digest TDigest
) RETURNS TDigest

Estimate what quantile a given value would be located at in a t-digest.

Required Arguments

Name Type Description
value DOUBLE PRECISION The value to estimate the quantile of.
digest TDigest The digest to compute the quantile on.

Returns

Column Type Description
approx_percentile_rank DOUBLE PRECISION The estimated quantile associated with the provided value.

Sample Usage

SELECT approx_percentile_rank(90, tdigest(100, data))
FROM generate_series(1, 100) data;
 approx_percentile_rank
-------------------
             0.895

max_val

max_val(digest TDigest) RETURNS DOUBLE PRECISION

Get the maximum value from a t-digest.

Required Arguments

Name Type Description
digest TDigest The digest to extract the max value from.

Returns

Column Type Description
max_val DOUBLE PRECISION The maximum value entered into the t-digest.

Sample Usage

SELECT max_val(tdigest(100, data))
FROM generate_series(1, 100) data;
 max_val
---------
     100

mean

mean(digest TDigest) RETURNS DOUBLE PRECISION

Get the average of all the values contained in a t-digest.

Required Arguments

Name Type Description
digest TDigest The digest to extract the mean value from.

Returns

Column Type Description
mean DOUBLE PRECISION The average of the values entered into the t-digest.

Sample Usage

SELECT mean(tdigest(100, data))
FROM generate_series(1, 100) data;
 mean
------
 50.5

min_val

min_val(digest TDigest) RETURNS DOUBLE PRECISION

Get the minimum value from a t-digest.

Required Arguments

Name Type Description
digest TDigest The digest to extract the min value from.

Returns

Column Type Description
min_val DOUBLE PRECISION The minimum value entered into the t-digest.

Sample Usages

SELECT min_val(tdigest(100, data))
FROM generate_series(1, 100) data;
 min_val
-----------
         1

num_vals

num_vals(digest TDigest) RETURNS DOUBLE PRECISION

Get the number of values contained in a t-digest.

Required Arguments

Name Type Description
digest TDigest The digest to extract the number of values from.

Returns

Column Type Description
num_vals DOUBLE PRECISION The number of values entered into the t-digest.

Sample Usage

SELECT num_vals(tdigest(100, data))
FROM generate_series(1, 100) data;
 num_vals
-----------
       100