From fafba308be7af75d7bac463f581780e753ea92c8 Mon Sep 17 00:00:00 2001
From: Andrei Matei
Date: Wed, 13 Apr 2016 17:06:16 -0400
Subject: [PATCH 1/2] list the types of aggregators

---
 docs/RFCS/distributed_sql.md | 46 ++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/docs/RFCS/distributed_sql.md b/docs/RFCS/distributed_sql.md
index 5a7545193e93..d4a272b18e71 100644
--- a/docs/RFCS/distributed_sql.md
+++ b/docs/RFCS/distributed_sql.md
@@ -17,6 +17,7 @@
   * [Example 1](#example-1)
   * [Example 2](#example-2)
   * [Example 3](#example-3)
+  * [Types of aggregators](#types-of-aggregators)
 * [From logical to physical](#from-logical-to-physical)
   * [Processors](#processors)
   * [Joins](#joins)
@@ -431,6 +432,51 @@ AGGREGATOR final:
 Composition: src -> countdistinctmin -> final
 ```
 
+### Types of aggregators
+
+- `TABLE READER` is a special aggregator, with no input stream. It's configured
+  with spans of a table or index and the schema that it needs to read.
+  Like every other aggregator, it can be configured with a programmable output
+  filter.
+- `PROGRAM` is a fully programmable no-grouping aggregator. It runs a "program"
+  on each individual row. The program can drop the row, or modify it
+  arbitrarily.
+- `JOIN` performs a join on two streams, with equality constraints between
+  certain columns. The aggregator is grouped on the columns that are
+  constrained to be equal. See [Stream joins](#stream-joins).
+- `JOIN READER` performs point-lookups for rows with the keys indicated by the
+  input stream. It can do so by performing (potentially remote) KV reads, or by
+  setting up remote flows. See [Join-by-lookup](#join-by-lookup) and
+  [On-the-fly flows setup](#on-the-fly-flows-setup).
+- `MUTATE` performs insertions/deletions/updates to KV. See section TODO.
+- `SET OPERATION` takes several inputs and performs set arithmetic on them
+  (union, difference).
+- `AGGREGATOR` is the one that does "aggregation" in the SQL sense. It groups
+  rows and computes an aggregate for each group. The grouping is configured
+  using the group key. `AGGREGATOR` can be configured with one or more
+  aggregation functions:
+  - `SUM`
+  - `COUNT`
+  - `COUNT DISTINCT`
+  `AGGREGATOR`'s output schema consists of the group key, plus a configurable
+  subset of the generated aggregated values. The optional output filter has
+  access to the group key and all the aggregated values (i.e. it can use even
+  values that are not ultimately outputted).
+- `SORT` sorts the input according to a configurable set of columns. Note that
+  this is a no-grouping aggregator, hence it can be distributed arbitrarily to
+  the data producers. This means, of course, that it doesn't produce a global
+  ordering; instead, it just guarantees an intra-stream ordering on each
+  physical output stream. The global ordering, when needed, is achieved by an
+  input synchronizer of a grouped processor (such as `LIMIT` or `FINAL`).
+- `LIMIT` is a single-group aggregator that stops after reading a configured
+  number of input rows.
+- `INTENT-COLLECTOR` is a single-group aggregator, scheduled on the gateway,
+  that receives all the intents generated by a `MUTATE` and keeps track of them
+  in memory until the transaction is committed.
+- `FINAL` is a single-group aggregator, scheduled on the gateway, that collects
+  the results of the query. This aggregator will be hooked up to the pgwire
+  connection to the client.
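+
+To make this concrete, here is a minimal Go sketch of an aggregator's core
+loop: rows are consumed from an input stream, optionally grouped, and result
+rows are pushed to an output stream. The `Row`, `RowSource` and `RowReceiver`
+names and the toy `COUNT` aggregator are illustrative placeholders only, not
+part of the proposal.
+
+```go
+package main
+
+import "fmt"
+
+// Row is one tuple of column values flowing between aggregators.
+type Row []interface{}
+
+// RowSource is the consuming end of a stream feeding an aggregator.
+type RowSource interface {
+    // NextRow returns the next row, or nil once the stream is exhausted.
+    NextRow() (Row, error)
+}
+
+// RowReceiver is the producing end of an aggregator's output stream.
+type RowReceiver interface {
+    PushRow(Row) error
+    Close(err error)
+}
+
+// sliceSource and sliceSink are trivial in-memory streams, used only to
+// exercise the sketch.
+type sliceSource struct{ rows []Row }
+
+func (s *sliceSource) NextRow() (Row, error) {
+    if len(s.rows) == 0 {
+        return nil, nil
+    }
+    r := s.rows[0]
+    s.rows = s.rows[1:]
+    return r, nil
+}
+
+type sliceSink struct{ rows []Row }
+
+func (s *sliceSink) PushRow(r Row) error { s.rows = append(s.rows, r); return nil }
+func (s *sliceSink) Close(err error)     {}
+
+// countAggregator is a toy AGGREGATOR configured with a single COUNT function:
+// it groups its input by one key column and emits a (key, count) row per group.
+func countAggregator(in RowSource, out RowReceiver, keyCol int) {
+    counts := map[interface{}]int{}
+    for {
+        row, err := in.NextRow()
+        if err != nil {
+            out.Close(err)
+            return
+        }
+        if row == nil {
+            break // end of the input stream
+        }
+        counts[row[keyCol]]++
+    }
+    for key, c := range counts {
+        if err := out.PushRow(Row{key, c}); err != nil {
+            out.Close(err)
+            return
+        }
+    }
+    out.Close(nil)
+}
+
+func main() {
+    src := &sliceSource{rows: []Row{{"a", 1}, {"b", 2}, {"a", 3}}}
+    sink := &sliceSink{}
+    countAggregator(src, sink, 0)
+    fmt.Println(sink.rows) // e.g. [[a 2] [b 1]]; group order is unspecified
+}
+```
+
+Note how the grouped aggregator buffers per-group state before it can emit
+anything; this is part of what constrains how grouped aggregators can be
+distributed, in contrast to no-grouping aggregators like `PROGRAM` or `SORT`.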
+
 ## From logical to physical
 
 To distribute the computation that was described in terms of aggregators and

From 1468c415982428cc6b62180999c8aebd268951df Mon Sep 17 00:00:00 2001
From: Andrei Matei
Date: Wed, 13 Apr 2016 17:32:07 -0400
Subject: [PATCH 2/2] add DISTINCT as an aggregation function

---
 docs/RFCS/distributed_sql.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/RFCS/distributed_sql.md b/docs/RFCS/distributed_sql.md
index d4a272b18e71..dbfd8ca99734 100644
--- a/docs/RFCS/distributed_sql.md
+++ b/docs/RFCS/distributed_sql.md
@@ -458,6 +458,7 @@ Composition: src -> countdistinctmin -> final
   - `SUM`
   - `COUNT`
   - `COUNT DISTINCT`
+  - `DISTINCT`
   `AGGREGATOR`'s output schema consists of the group key, plus a configurable
   subset of the generated aggregated values. The optional output filter has
   access to the group key and all the aggregated values (i.e. it can use even