Support arbitrary aggregation functions during ANALYZE (v2) #14233
`ColumnStatisticMetadata` is used in `StatisticAggregationsDescriptor` as a map key. Before the change, a hand-written serialization was used for that. After the change, the map is replaced with a list of key/value pairs for the purpose of the serialization.
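The map-to-list idea in this commit message can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's code: the `Key` and `Entry` records and the `toEntries`/`fromEntries` helpers are hypothetical stand-ins, and no particular JSON library is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapAsEntryList {
    // Hypothetical stand-in for a structured map key such as ColumnStatisticMetadata.
    record Key(String columnName, String aggregation) {}

    // One key/value pair; a list of these can be serialized as a plain array of objects,
    // so the key no longer has to be squeezed into a single string.
    record Entry(Key key, String value) {}

    // Flatten the map into a list of pairs before serialization.
    static List<Entry> toEntries(Map<Key, String> map) {
        List<Entry> entries = new ArrayList<>();
        map.forEach((key, value) -> entries.add(new Entry(key, value)));
        return entries;
    }

    // Rebuild the map after deserialization, rejecting duplicate keys.
    static Map<Key, String> fromEntries(List<Entry> entries) {
        Map<Key, String> map = new LinkedHashMap<>();
        for (Entry entry : entries) {
            if (map.put(entry.key(), entry.value()) != null) {
                throw new IllegalArgumentException("Duplicate key: " + entry.key());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Key, String> original = new LinkedHashMap<>();
        original.put(new Key("orderkey", "approx_distinct"), "symbol_1");
        original.put(new Key("orderkey", "max"), "symbol_2");

        Map<Key, String> restored = fromEntries(toEntries(original));
        System.out.println(restored.equals(original));
    }
}
```

The point of the pair list is that each key is serialized as an ordinary object with its own fields, which avoids a hand-written string encoding for the key.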
The aggregation function's result type is already known, so it doesn't need to be given.
This looks nice. Should I look at the other one?
I think you don't need to, currently.
CI #14239
Can all the existing
@alexjo2144 this is exactly what i did initially, i.e. in #14220
however
Thus
else {
    FunctionName aggregation = columnStatistic.getKey().getAggregation();
    if (aggregation.getCatalogSchema().isPresent()) {
        aggregationName = aggregation.getCatalogSchema() + "." + aggregation.getName();
This seems to be a bug:
Optional<String> s1 = Optional.of("hello ");
String s2 = "world!";
System.out.println(s1 + s2);
outputs
Optional[hello ]world!
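One possible way to avoid this is to unwrap the `Optional` before concatenating. The sketch below is only an illustration: `qualifiedName` is a hypothetical helper, and the `Optional<String>` parameter stands in for whatever `getCatalogSchema()` actually returns.

```java
import java.util.Optional;

public class QualifiedAggregationName {
    // Hypothetical helper: catalogSchema stands in for the Optional returned by getCatalogSchema().
    // map() builds the prefixed name only when a value is present; orElse() falls back to the bare name.
    static String qualifiedName(Optional<String> catalogSchema, String name) {
        return catalogSchema.map(prefix -> prefix + "." + name).orElse(name);
    }

    public static void main(String[] args) {
        // prints "hive.default.approx_distinct"
        System.out.println(qualifiedName(Optional.of("hive.default"), "approx_distinct"));
        // prints "approx_distinct"
        System.out.println(qualifiedName(Optional.empty(), "approx_distinct"));
    }
}
```

Calling `.get()` inside the existing `isPresent()` branch would work as well; the `map`/`orElse` form just folds both branches into one expression.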
good catch, thanks
@@ -1458,9 +1459,22 @@ private void printStatisticAggregationsInfo(
}

for (Map.Entry<ColumnStatisticMetadata, Symbol> columnStatistic : columnStatistics.entrySet()) {
    String aggregationName;
Is this functionality covered by any tests?
No, there are no fine-grained tests for EXPLAIN.
@@ -256,6 +256,8 @@
public static final String ORC_BLOOM_FILTER_COLUMNS_KEY = "orc.bloom.filter.columns";
public static final String ORC_BLOOM_FILTER_FPP_KEY = "orc.bloom.filter.fpp";

private static final FunctionName NUMBER_OF_DISTINCT_VALUES = new FunctionName("approx_distinct");
nit: change the name of the constant so that it doesn't collide with ColumnStatisticType.NUMBER_OF_DISTINCT_VALUES
it's intentional
The `ColumnStatisticType` enum defined which statistics could be requested during statistics collection. Although it looked generic, the available options matched exactly the stats that the Hive metastore collects. Different metadata stores may require different statistics to be collected, for example data sketches with a specific configuration. This change allows a connector to pick any existing aggregation function.
(an alternative to #14220, maintaining backwards compatibility)
A connector may ask the engine to collect anything defined by the `ColumnStatisticType` SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics. For example, Iceberg statistics include
This has two components which are not supported today
This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.