Support arbitrary aggregation functions during ANALYZE (v2) #14233

findepi · 2022-09-21T12:16:20Z

(an alternative to #14220, maintaining backwards compatibility)

A connector may ask engine to collect anything defined by ColumnStatisticType SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.

For example, Iceberg statistics include

apache-datasketches-theta-v1 blob type

A serialized form of a "compact" Theta sketch produced by the Apache
DataSketches library. The sketch is obtained by
constructing Alpha family sketch with default seed, and feeding it with individual
distinct values converted to bytes using Iceberg's single-value serialization.

This has two components which are not supported today

a new data sketch with a specific configuration (so that results can be shared with different query engines)
a well-defined input pre-processing, which relies on existing Iceberg concepts, which are alien to Trino engine

This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.

`ColumnStatisticMetadata` is used in `StatisticAggregationsDescriptor` as a map key. Before the change, a hand-written serialization was used for that. After the change, the map is replaced with a list of key/value pairs for the purpose of the serialization.

The aggregation function result type is known, it doesn't need to be given.

losipiuk

This looks nice. Should I look at the other one?

findepi · 2022-09-21T15:16:52Z

I think you don't need to, currently.

findepi · 2022-09-21T15:18:32Z

CI #14239

alexjo2144 · 2022-09-21T19:01:32Z

Can all the existing StatsiticsTypes be defined as aggregations? I'm just thinking if the statisticsType field of ColumnStatisticMetadata can be broadened to an aggregation instead of needing to support both separately.

findepi · 2022-09-21T19:42:56Z

@alexjo2144 this is exactly what i did initially, i.e. in #14220

It's nice, because it allows to decomission the old statistics definition mechanism quickly.
it also makes sure the new mechanism is very well exercised

however

it is non-backwards compatible change
it inflates PR scope (Support arbitrary aggregation functions during ANALYZE (v1) #14220 modifies 42 files while this modifies only 10)
it requires "publishing" internal aggregations like $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats
- and these should eventually be removed, but this requires
  - being able to compose aggregation like sum + datasize, requires eg Support ANALYZE stats composed of expressions #14222
  - implementing something like estimated_data_size scalar function, equivalent of io.trino.spi.block.Block#getEstimatedDataSizeForStats
    - i started doing this but this is a piece of work; a scalar cannot just call Block#getEstimatedDataSizeForStats method

Thus

i think we should eventually remove `ColumnStatisticType
i think we should eventually remove/decompose $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats as described above
I think these two things should be follow-ups, not to block the main objective -- being able to calculate Theta sketches for Iceberg NDV stats

findinpath · 2022-09-22T12:49:28Z

core/trino-main/src/main/java/io/trino/sql/planner/planprinter/PlanPrinter.java

+                else {
+                    FunctionName aggregation = columnStatistic.getKey().getAggregation();
+                    if (aggregation.getCatalogSchema().isPresent()) {
+                        aggregationName = aggregation.getCatalogSchema() + "." + aggregation.getName();


This seems to be a bug:

Optional<String> s1 = Optional.of("hello "); String s2 = "world!"; System.out.println(s1 + s2);

outputs
Optional[hello ]world!

good catch, thanks

findinpath · 2022-09-22T12:52:24Z

core/trino-main/src/main/java/io/trino/sql/planner/planprinter/PlanPrinter.java

@@ -1458,9 +1459,22 @@ private void printStatisticAggregationsInfo(
            }

            for (Map.Entry<ColumnStatisticMetadata, Symbol> columnStatistic : columnStatistics.entrySet()) {
+                String aggregationName;


Is this functionality covered by any tests?

No, there are no fine-grained tests for EXPLAIN.

findinpath · 2022-09-22T13:01:03Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

@@ -256,6 +256,8 @@
    public static final String ORC_BLOOM_FILTER_COLUMNS_KEY = "orc.bloom.filter.columns";
    public static final String ORC_BLOOM_FILTER_FPP_KEY = "orc.bloom.filter.fpp";

+    private static final FunctionName NUMBER_OF_DISTINCT_VALUES = new FunctionName("approx_distinct");


nit: change the name of the constant so that it doesn't collide with ColumnStatisticType.NUMBER_OF_DISTINCT_VALUES

it's intentionally

The `ColumnStatisticType` enum was defining what is possible to collect during statistics collection. While looking generic, the chosen options matched exactly what stats Hive metastore collects. Different metadata storages may require different statistics to be collected, for example data sketches with some specific configuration. This change allows a connector to pick any existing aggregation function.

findepi added the enhancement New feature or request label Sep 21, 2022

findepi requested review from martint, losipiuk, ebyhr, alexjo2144 and findinpath September 21, 2022 12:16

cla-bot bot added the cla-signed label Sep 21, 2022

findepi mentioned this pull request Sep 21, 2022

Support arbitrary aggregation functions during ANALYZE (v1) #14220

Closed

findepi changed the title ~~Support arbitrary aggregation functions during ANALYZE (2)~~ Support arbitrary aggregation functions during ANALYZE (v2) Sep 21, 2022

Simplify error message construction

14fd5ef

findepi force-pushed the findepi/arbitrary-stats-compatibly branch 2 times, most recently from b8908f1 to 40b246d Compare September 21, 2022 12:25

github-actions bot added the tests:hive label Sep 21, 2022

findepi mentioned this pull request Sep 21, 2022

Remove supported statistics from metastore recording #14234

Merged

findepi added 2 commits September 21, 2022 14:30

Remove redundant method parameter

52f6e01

The aggregation function result type is known, it doesn't need to be given.

findepi force-pushed the findepi/arbitrary-stats-compatibly branch from 40b246d to 63a179f Compare September 21, 2022 12:30

losipiuk reviewed Sep 21, 2022

View reviewed changes

findepi requested a review from losipiuk September 21, 2022 19:44

losipiuk approved these changes Sep 22, 2022

View reviewed changes

findinpath reviewed Sep 22, 2022

View reviewed changes

findepi force-pushed the findepi/arbitrary-stats-compatibly branch from bf4d570 to 30b39e1 Compare September 22, 2022 14:30

empty

f53f92f

findepi merged commit dfd3e9a into trinodb:master Sep 23, 2022

findepi deleted the findepi/arbitrary-stats-compatibly branch September 23, 2022 15:28

findepi mentioned this pull request Sep 23, 2022

Release notes for 398 #14245

Closed

github-actions bot added this to the 398 milestone Sep 23, 2022

findepi mentioned this pull request Sep 26, 2022

Migrate Hive stats definition to use standard aggregations #14294

Merged

colebow mentioned this pull request Sep 28, 2022

Add Trino 398 release notes #14319

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support arbitrary aggregation functions during ANALYZE (v2) #14233

Support arbitrary aggregation functions during ANALYZE (v2) #14233

findepi commented Sep 21, 2022

losipiuk left a comment

findepi commented Sep 21, 2022

findepi commented Sep 21, 2022

alexjo2144 commented Sep 21, 2022

findepi commented Sep 21, 2022

findinpath Sep 22, 2022

findepi Sep 22, 2022

findinpath Sep 22, 2022

findepi Sep 22, 2022

findinpath Sep 22, 2022

findepi Sep 22, 2022

Support arbitrary aggregation functions during ANALYZE (v2) #14233

Support arbitrary aggregation functions during ANALYZE (v2) #14233

Conversation

findepi commented Sep 21, 2022

apache-datasketches-theta-v1 blob type

losipiuk left a comment

Choose a reason for hiding this comment

findepi commented Sep 21, 2022

findepi commented Sep 21, 2022

alexjo2144 commented Sep 21, 2022

findepi commented Sep 21, 2022

findinpath Sep 22, 2022

Choose a reason for hiding this comment

findepi Sep 22, 2022

Choose a reason for hiding this comment

findinpath Sep 22, 2022

Choose a reason for hiding this comment

findepi Sep 22, 2022

Choose a reason for hiding this comment

findinpath Sep 22, 2022

Choose a reason for hiding this comment

findepi Sep 22, 2022

Choose a reason for hiding this comment

`apache-datasketches-theta-v1` blob type