Skip to content

Latest commit

 

History

History
796 lines (625 loc) · 23.5 KB

AGGREGATIONS.md

File metadata and controls

796 lines (625 loc) · 23.5 KB

Aggregate Functions

ZomboDB exposes nearly all of Elasticsearch's aggregates as type-checked SQL functions that return tables and discreet values, as opposed to json blobs.

In all cases, unless explicitly otherwise noted, the results returned from all of the below aggregate functions are MVCC-correct. This means that the functions only operate against records that are considered visible to the current transaction.

Arbitrary Aggregate Support

FUNCTION zdb.arbitrary_agg(
	index regclass,
	query zdbquery,
	agg_json json) 
RETURNS json

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

This function is a direct-path for executing any arbitrary aggregate search request that Elasticsearch supports.

The result is a json blob that can be processed in your application code or otherwise manipulated using Postgres json support functions.

Single-Value Aggregates

FUNCTION zdb.avg(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-avg-aggregation.html

A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.cardinality(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document.


FUNCTION zdb.count(
	index regclass,
	query zdbquery) 
RETURNS bigint

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html

Not an aggregate per se, this function is mapped to Elasticsearch's _count endpoint and simply returns the number of documents that match the provided query. The result is MVCC-correct.


FUNCTION zdb.raw_count(
	index regclass,
	query zdbquery) 
RETURNS bigint SET zdb.ignore_visibility = true

Similar to zdb.count() above, but it ignores MVCC visibility rules, and the result is the actual count of documents matching the query, including deleted documents, documents from aborted transactions, old versions of documents from an UPDATE statement, and new/updated docs from in-flight transactions.

Generally you'll want to use zdb.count() instead.


FUNCTION zdb.max(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-max-aggregation.html

A single-value metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.min(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-min-aggregation.html

A single-value metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.missing(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-missing-aggregation.html

A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set).


FUNCTION zdb.sum(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-sum-aggregation.html

A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.value_count(
	index regclass,
	field text,
	query zdbquery) 
RETURNS numeric

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-valuecount-aggregation.html

A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents.


Multi-Row/Column Aggregates

The following aggregates transform the results from Elasticsearch into "tables", and should all be queried as such. For example:

SELECT * FROM zdb.terms('idxproducts', 'tags', dsl.match_all());

FUNCTION zdb.adjacency_matrix(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	key text,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-adjacency-matrix-aggregation.html

A bucket aggregation returning a form of adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.adjacency_matrix_2x2(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	"-" text,
	"1" text,
	"2" text)

This is a table-based version of zdb.adjacency_matrix() that outputs a 2x2 matrix.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.adjacency_matrix_3x3(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	"-" text,
	"1" text,
	"2" text,
	"3" text)

This is a table-based version of zdb.adjacency_matrix() that outputs a 3x3 matrix.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.adjacency_matrix_4x4(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	"-" text,
	"1" text,
	"2" text,
	"3" text,
	"4" text)

This is a table-based version of zdb.adjacency_matrix() that outputs a 4x4 matrix.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.adjacency_matrix_5x5(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	"-" text,
	"1" text,
	"2" text,
	"3" text,
	"4" text,
	"5" text)

This is a table-based version of zdb.adjacency_matrix() that outputs a 5x5 matrix.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.date_histogram(
	index regclass,
	field text,
	query zdbquery,
	"interval" text,
	format text DEFAULT 'yyyy-MM-dd') 
RETURNS TABLE (
	key numeric,
	key_as_string text,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

A multi-bucket aggregation similar to the histogram except it can only be applied on date values. Since dates are represented in Elasticsearch internally as long values, it is possible to use the normal histogram on dates as well, though accuracy will be compromised. The reason for this is in the fact that time based intervals are not fixed (think of leap years and on the number of days in a month). For this reason, we need special support for time based data. From a functionality perspective, this histogram supports the same features as the normal histogram. The main difference is that the interval can be specified by date/time expressions.


FUNCTION zdb.date_range(
	index regclass,
	field text,
	query zdbquery,
	date_ranges_array json) 
RETURNS TABLE (
	key text,
	"from" numeric,
	from_as_string timestamp with time zone,
	"to" numeric,
	to_as_string timestamp with time zone,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html

A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned. Note that this aggregation includes the from value and excludes the to value for each range.


FUNCTION zdb.extended_stats(
	index regclass,
	field text,
	query zdbquery,
	sigma int DEFAULT 0) 
RETURNS TABLE (
	count bigint,
	min numeric,
	max numeric,
	avg numeric,
	sum numeric,
	sum_of_squares numeric,
	variance numeric,
	stddev numeric,
	stddev_upper numeric,
	stddev_lower numeric)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.filters(
	index regclass,
	labels text[],
	filters zdbquery[]) 
RETURNS TABLE (
	label text,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html

Similar to zdb.count() except multiple queries (filters) are supported.

The labels and filters arguments must have the same bounds.


FUNCTION zdb.histogram(
	index regclass,
	field text,
	query zdbquery,
	"interval" float8) 
RETURNS TABLE (
	key numeric,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html

A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5).


FUNCTION zdb.ip_range(
	index regclass,
	field text,
	query zdbquery,
	ip_ranges_array json) 
RETURNS TABLE (
	key text,
	"from" inet,
	"to" inet,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-iprange-aggregation.html

Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields.


FUNCTION zdb.matrix_stats(
	index regclass,
	fields text[],
	query zdbquery) 
RETURNS TABLE (
	name text,
	count bigint,
	mean numeric,
	variance numeric,
	skewness numeric,
	kurtosis numeric,
	covariance json,
	correlation json)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-matrix-stats-aggregation.html

The matrix_stats aggregation is a numeric aggregation that computes various statistics over a set of document fields.


FUNCTION zdb.percentile_ranks(
	index regclass,
	field text,
	query zdbquery,
	"values" text DEFAULT '') 
RETURNS TABLE (
	percentile numeric,
	value numeric)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-rank-aggregation.html

A multi-value metrics aggregation that calculates one or more percentile ranks over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.percentiles(
	index regclass,
	field text,
	query zdbquery,
	percents text DEFAULT '') 
RETURNS TABLE (
	percentile numeric,
	value numeric)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html

A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.range(
	index regclass,
	field text,
	query zdbquery,
	ranges_array json) 
RETURNS TABLE (
	key text,
	"from" numeric,
	"to" numeric,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-range-aggregation.html

A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and "bucket" the relevant/matching document. Note that this aggregation includes the from value and excludes the to value for each range.


FUNCTION zdb.significant_terms(
    index regclass, 
    field text, 
    query zdbquery, 
    include text DEFAULT '.*'::text, 
    size_limit integer DEFAULT 2147483647, 
    min_doc_count integer DEFAULT 3)
RETURNS TABLE (
	term text,
	doc_count bigint,
	score numeric,
	bg_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

An aggregation that returns interesting or unusual occurrences of terms in a set.


FUNCTION zdb.significant_terms_two_level(
	index regclass,
	first_field text,
	second_field text,
	query zdbquery,
	size bigint DEFAULT 0) 
RETURNS TABLE (
	first_term text,
	second_term text,
	doc_count bigint,
	score numeric,
	bg_count bigint,
	doc_count_error_upper_bound bigint,
	sum_other_doc_count bigint)

An adaption of zdb.significant_terms() where it uses zdb.terms() for the terms from first_field and zdb.significant_terms() for the terms for second_field.


FUNCTION zdb.significant_text(
	index regclass,
	field text,
	query zdbquery,
	sample_size int DEFAULT 0,
	filter_duplicate_text boolean DEFAULT true) 
RETURNS TABLE (
	term text,
	doc_count bigint,
	score numeric,
	bg_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significanttext-aggregation.html

An aggregation that returns interesting or unusual occurrences of free-text terms in a set. It is like the significant terms aggregation but differs in that:

  • It is specifically designed for use on type text fields
  • It does not require field data or doc-values
  • It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

This aggregate is only supported by Elasticsearch 6+ clusters.


FUNCTION zdb.suggest_terms(
    index regclass,
    field_name text,
    suggest test,
    query zdbquery,
) RETURNS TABLE (
        term text,
        offset bigint,
        length bigint,
        suggestion text,
        score double precision,
        frequency bigint,

)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html

While not necessarily an aggregate, zdb.terms_suggester will tokenize an input textual suggestion string and provide suggestions for each token that contains suggestions.

Useful for correcting misspellings -- ie, "Did you mean?"-style queries


FUNCTION zdb.stats(
	index regclass,
	field text,
	query zdbquery) 
RETURNS TABLE (
	count bigint,
	min numeric,
	max numeric,
	avg numeric,
	sum numeric)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-stats-aggregation.html

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.


FUNCTION zdb.tally(
    index regclass, 
    field_name text,
    [ is_nested bool],
    stem text, 
    query ZDBQuery, 
    size_limit integer DEFAULT '2147483647', 
    order_by TermsOrderBy DEFAULT 'count', 
    shard_size integer DEFAULT '2147483647', 
    count_nulls bool DEFAULT 'true'
) RETURNS TABLE (term text, count bigint)

index: The name of the a ZomboDB index to query
field)_name: The name of a field from which to derive terms
is_nested: Optional argument to indicate that the terms should only come from matching nested object sub-elements. Default is false
stem: a Regular expression by which to filter returned terms, or a date interval if the specified fieldname is a date or timestamp
query: a ZomboDB query
size_limit: maximum number of terms to return. A NULL value means "all terms". order_by: how to sort the results. one of 'count' (descending), 'term', 'reverse_count' (ascending), 'reverse_term'
shard_size: optional parameter that tells Elasticsearch how many terms to return from each shard. Default is zero, which means all terms
count_nulls: should a row containing the count of NULL (ie, missing) values be included in the results?

This function provides direct access to Elasticsearch's terms aggregate and cannot be used with fields of type fulltext. The results are MVCC-safe. Returned terms are forced to upper-case.

If a stem is not specified, no results will be returned. To match all terms, use a stem of ^.*

The order_by parameter defaults to count, which sorts documents by the occurrence count, largest to smallest. A value of reverse_count will sort them smallest to largest.

If the specified fieldname is a date/timestamp, then one of the following values are allowed for aggregating values into histogram buckets of the specified interval: year, quarter, month, week, day, hour, minute, second. In all cases, an optional offset value can be specified. For example: week:-1d will offset the dates by one day so that the first day of the week will be considered to be Sunday (instead of the default of Monday).

Example:

SELECT * FROM zdb.tally('products', 'keywords', '^.*', 'base* or distance', 5000, 'term');

    term      | count 
>---------------+-------
BASEBALL      |     1
COMMUNICATION |     1
PRIMITIVE     |     1
SPORTS        |     1
THOMAS EDISON |     1

Regarding the is_nested argument, consider data like this:

row #1: contributor_data=[ 
  { "name": "John Doe", "age": 42, "location": "TX", "tags": ["active"] },
  { "name": "Jane Doe", "age": 36, "location": "TX", "tags": ["nice"] }
>]
>
>row #2: contributor_data=[ 
  { "name": "Bob Dole", "age": 92, "location": "KS", "tags": ["nice", "politician"] },
  { "name": "Elizabth Dole", "age": 79, "location": "KS", "tags": ["nice"] }
>]

And a query where is_nested is false:

SELECT * FROM zdb.tally('idxproducts', 'contributor_data.name', false, '^.*', 'contributor_data.location:TX AND contributor_data.tags:nice', 5000, 'term');

returns:

    term   | count 
----------+-------
  JANE DOE |     1
  JOHN DOE |     1
(2 rows)

Whereas, if is_nested is true, only "JANE DOE" is returned because it's the only subelement of contributor_data that matched the query:

SELECT * FROM zdb.tally('idxproducts', 'contributor_data.name', true, '^.*', 'contributor_data.location:TX WITH contributor_data.tags:nice', 5000, 'term');

returns:

    term   | count 
----------+-------
  JANE DOE |     1
(1 row)
CREATE TYPE TermsOrderBy AS ENUM (
	'count',
	'term',
	'reverse_count',
	'reverse_term');

FUNCTION zdb.terms(
	index regclass,
	field text,
	query zdbquery,
	size_limit bigint DEFAULT 0,
	order_by TermsOrderBy DEFAULT 'count') 
RETURNS TABLE (
	term text,
	doc_count bigint)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.

Note that the order_by argument defines how to sort the results:

  • 'count' (ascending),
  • 'reverse_count' (ascending),
  • 'term' (ascending),
  • 'reverse_term' (descending)

FUNCTION zdb.terms_array(
	index regclass,
	field text,
	query zdbquery,
	size_limit bigint DEFAULT 0,
	order_by TermsOrderBy DEFAULT 'count') 
RETURNS text[]

A version of zdb.terms() that instead returns only the terms as a text[].


FUNCTION zdb.terms_two_level(
	index regclass,
	first_field text,
	second_field text,
	query zdbquery,
	order_by TwoLevelTermsOrderBy DEFAULT 'count',
	size bigint DEFAULT 0) 
RETURNS TABLE (
	first_term text,
	second_term text,
	doc_count bigint)

Similar to zdb.significant_terms_two_level(), this is an adaption of zdb.terms() to provide a two-level nested hierarchy of terms from two different fields.


FUNCTION zdb.top_hits(
	index regclass,
	fields text[],
	query zdbquery,
	size int) 
RETURNS TABLE (
	ctid tid,
	score float4,
	source json)

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.


FUNCTION zdb.top_hits_with_id(
	index regclass,
	fields text[],
	query zdbquery,
	size int) 
RETURNS TABLE (
	_id text,
	score float4,
	source json)

Similar to zdb.top_hits() above, but returns the Elasticsearch document _id value for each hit rather than the corresponding Postgres ctid value.