Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support arbitrary aggregation functions during ANALYZE (v2) #14233

Merged
merged 5 commits into from
Sep 23, 2022

Conversation

findepi
Copy link
Member

@findepi findepi commented Sep 21, 2022

(an alternative to #14220, maintaining backwards compatibility)

A connector may ask engine to collect anything defined by ColumnStatisticType SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.

For example, Iceberg statistics include

apache-datasketches-theta-v1 blob type

A serialized form of a "compact" Theta sketch produced by the Apache
DataSketches
library. The sketch is obtained by
constructing Alpha family sketch with default seed, and feeding it with individual
distinct values converted to bytes using Iceberg's single-value serialization.

This has two components which are not supported today

  • a new data sketch with a specific configuration (so that results can be shared with different query engines)
  • a well-defined input pre-processing, which relies on existing Iceberg concepts, which are alien to Trino engine

This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.

@findepi findepi added the enhancement New feature or request label Sep 21, 2022
@cla-bot cla-bot bot added the cla-signed label Sep 21, 2022
@findepi findepi changed the title Support arbitrary aggregation functions during ANALYZE (2) Support arbitrary aggregation functions during ANALYZE (v2) Sep 21, 2022
@findepi findepi force-pushed the findepi/arbitrary-stats-compatibly branch 2 times, most recently from b8908f1 to 40b246d Compare September 21, 2022 12:25
`ColumnStatisticMetadata` is used in `StatisticAggregationsDescriptor`
as a map key. Before the change, a hand-written serialization was used
for that. After the change, the map is replaced with a list of key/value
pairs for the purpose of the serialization.
The aggregation function result type is known, it doesn't need to be
given.
@findepi findepi force-pushed the findepi/arbitrary-stats-compatibly branch from 40b246d to 63a179f Compare September 21, 2022 12:30
Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice. Should I look at the other one?

@findepi
Copy link
Member Author

findepi commented Sep 21, 2022

I think you don't need to, currently.

@findepi
Copy link
Member Author

findepi commented Sep 21, 2022

CI #14239

@alexjo2144
Copy link
Member

Can all the existing StatsiticsTypes be defined as aggregations? I'm just thinking if the statisticsType field of ColumnStatisticMetadata can be broadened to an aggregation instead of needing to support both separately.

@findepi
Copy link
Member Author

findepi commented Sep 21, 2022

@alexjo2144 this is exactly what i did initially, i.e. in #14220

  • It's nice, because it allows to decomission the old statistics definition mechanism quickly.
  • it also makes sure the new mechanism is very well exercised

however

  • it is non-backwards compatible change
  • it inflates PR scope (Support arbitrary aggregation functions during ANALYZE (v1) #14220 modifies 42 files while this modifies only 10)
  • it requires "publishing" internal aggregations like $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats
    • and these should eventually be removed, but this requires
      • being able to compose aggregation like sum + datasize, requires eg Support ANALYZE stats composed of expressions #14222
      • implementing something like estimated_data_size scalar function, equivalent of io.trino.spi.block.Block#getEstimatedDataSizeForStats
        • i started doing this but this is a piece of work; a scalar cannot just call Block#getEstimatedDataSizeForStats method

Thus

  • i think we should eventually remove `ColumnStatisticType
  • i think we should eventually remove/decompose $internal$max_data_size_for_stats and $internal$sum_data_size_for_stats as described above
  • I think these two things should be follow-ups, not to block the main objective -- being able to calculate Theta sketches for Iceberg NDV stats

@findepi findepi requested a review from losipiuk September 21, 2022 19:44
else {
FunctionName aggregation = columnStatistic.getKey().getAggregation();
if (aggregation.getCatalogSchema().isPresent()) {
aggregationName = aggregation.getCatalogSchema() + "." + aggregation.getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a bug:

        Optional<String> s1 = Optional.of("hello ");
        String s2 = "world!";
        System.out.println(s1 + s2);

outputs
Optional[hello ]world!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thanks

@@ -1458,9 +1459,22 @@ private void printStatisticAggregationsInfo(
}

for (Map.Entry<ColumnStatisticMetadata, Symbol> columnStatistic : columnStatistics.entrySet()) {
String aggregationName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this functionality covered by any tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there are no fine-grained tests for EXPLAIN.

@@ -256,6 +256,8 @@
public static final String ORC_BLOOM_FILTER_COLUMNS_KEY = "orc.bloom.filter.columns";
public static final String ORC_BLOOM_FILTER_FPP_KEY = "orc.bloom.filter.fpp";

private static final FunctionName NUMBER_OF_DISTINCT_VALUES = new FunctionName("approx_distinct");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change the name of the constant so that it doesn't collide with ColumnStatisticType.NUMBER_OF_DISTINCT_VALUES

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's intentionally

The `ColumnStatisticType` enum was defining what is possible to collect
during statistics collection. While looking generic, the chosen options
matched exactly what stats Hive metastore collects. Different metadata
storages may require different statistics to be collected, for example
data sketches with some specific configuration.

This change allows a connector to pick any existing aggregation
function.
@findepi findepi force-pushed the findepi/arbitrary-stats-compatibly branch from bf4d570 to 30b39e1 Compare September 22, 2022 14:30
@findepi findepi merged commit dfd3e9a into trinodb:master Sep 23, 2022
@findepi findepi deleted the findepi/arbitrary-stats-compatibly branch September 23, 2022 15:28
@findepi findepi mentioned this pull request Sep 23, 2022
@github-actions github-actions bot added this to the 398 milestone Sep 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

4 participants