Analyze Iceberg tables #13636
Force-pushed from 427d7f5 to 92d8312.
Force-pushed from 92d8312 to e1d1c7d.
    public ConnectorAnalyzeMetadata getStatisticsCollectionMetadata(ConnectorSession session, ConnectorTableHandle tableHandle, Map<String, Object> analyzeProperties)
    {
        IcebergTableHandle handle = (IcebergTableHandle) tableHandle;
        checkArgument(handle.getTableType() == DATA, "Cannot analyze non-DATA table: %s", handle.getTableType());
Should `getTableHandleForExecute` have a similar check?
Also, do we verify that the table's snapshot is the most recent one, i.e. that we are not time traveling?
Added both
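The two checks discussed above (DATA tables only, and no time travel) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code added in this PR: `Handle`, `TableType`, and `checkAnalyzable` are hypothetical stand-ins for the connector types, and comparing the handle's snapshot id with the table's current snapshot id is only one possible way to detect time travel.

```java
import java.util.Optional;

// Hypothetical stand-ins for the connector types; only the shape of the checks matters here.
final class AnalyzePreconditionsSketch
{
    enum TableType { DATA, HISTORY, SNAPSHOTS }

    static final class Handle
    {
        final TableType tableType;
        final Optional<Long> snapshotId;

        Handle(TableType tableType, Optional<Long> snapshotId)
        {
            this.tableType = tableType;
            this.snapshotId = snapshotId;
        }
    }

    // Rejects metadata tables and handles pinned to anything other than the current snapshot.
    static void checkAnalyzable(Handle handle, Optional<Long> currentSnapshotId)
    {
        if (handle.tableType != TableType.DATA) {
            throw new IllegalArgumentException("Cannot analyze non-DATA table: " + handle.tableType);
        }
        if (!handle.snapshotId.equals(currentSnapshotId)) {
            throw new IllegalArgumentException("Cannot analyze a table at a non-current snapshot: " + handle.snapshotId);
        }
    }

    public static void main(String[] args)
    {
        // Handle at the current snapshot: accepted
        checkAnalyzable(new Handle(TableType.DATA, Optional.of(2L)), Optional.of(2L));
        try {
            // Handle pinned to an older snapshot: rejected
            checkAnalyzable(new Handle(TableType.DATA, Optional.of(1L)), Optional.of(2L));
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Sharing one helper like this between the ANALYZE path and `getTableHandleForExecute` would keep the two entry points consistent, which is what the first review question is asking for.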
        Set<ColumnStatisticMetadata> columnStatistics = tableMetadata.getColumns().stream()
                .filter(column -> !column.isHidden())
                .filter(column -> analyzeColumnNames.contains(column.getName()))
Do you need to lowercase `analyzeColumnNames`?
The user needs to provide them in lowercase. This makes it easier to support non-lowercase column names in the future.
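To make the "exact match, no lowercasing" behaviour concrete, here is a minimal sketch. It assumes a hypothetical helper `resolveAnalyzeColumns` and a plain set of data column names; the error message is also made up for illustration and is not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

final class AnalyzeColumnsSketch
{
    // Hypothetical helper: returns the requested columns, or all data columns when none are requested.
    // Names are matched exactly, so "Name" is rejected instead of being silently lowercased to "name".
    static Set<String> resolveAnalyzeColumns(Set<String> allDataColumnNames, Optional<Set<String>> requested)
    {
        if (!requested.isPresent()) {
            return allDataColumnNames;
        }
        for (String name : requested.get()) {
            if (!allDataColumnNames.contains(name)) {
                throw new IllegalArgumentException("Invalid column specified for analysis: " + name);
            }
        }
        return requested.get();
    }

    public static void main(String[] args)
    {
        Set<String> all = new LinkedHashSet<>(Arrays.asList("regionkey", "name", "comment"));
        System.out.println(resolveAnalyzeColumns(all, Optional.empty()));
        System.out.println(resolveAnalyzeColumns(all, Optional.of(Collections.singleton("name"))));
    }
}
```

Because the lookup never normalizes case, a future version could admit mixed-case column names without changing what existing lowercase inputs mean.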
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsMaker.java
Force-pushed from e1d1c7d to b47566d.
@alexjo2144 thanks for your review.
Force-pushed from b47566d to b3efc19.
Prefix extracted: #13823
Force-pushed from b7bb888 to 60f88fb.
rebasing after that one merged. Was green before (https://github.com/trinodb/trino/runs/8002212012?check_suite_focus=true)
Force-pushed from 60f88fb to 5e9f939.
        if (shouldDenyPrivilege(context.getIdentity().getUser(), table + "." + procedure, EXECUTE_TABLE_PROCEDURE)) {
            denyExecuteTableProcedure(table.toString(), procedure);
        }
        if (denyPrivileges.isEmpty()) {
why this if?
I am dumbly following the pattern in this class.
@@ -917,6 +933,18 @@ private Optional<ConnectorTableExecuteHandle> getTableHandleForOptimize(Connecto
                icebergTable.location()));
    }

    private Optional<ConnectorTableExecuteHandle> getTableHandleForDropExtendedStats(ConnectorSession session, IcebergTableHandle tableHandle, Map<String, Object> executeProperties)
    {
        checkProcedureArgument(executeProperties.isEmpty(), "Unexpected properties");
nit: quote properties in error message?
Will do. But that's just a sanity check; the proper checks are done in the engine.
In fact, see `getTableHandleForOptimize`: it gets a property value from the Map, but doesn't check that the map has no extra entries.
So, to keep the code simpler, I will just drop the check. That way nobody is forced to add solid validation here in the future.
                .orElse(allColumnNames);

        Set<ColumnStatisticMetadata> columnStatistics = tableMetadata.getColumns().stream()
                .filter(column -> !column.isHidden())
nit: it looks like filtering out hidden columns is not needed here, since you enforce above that `analyzeColumnNames` does not contain hidden columns.
Removed, and renamed `allColumnNames` to `allDataColumnNames`.
@@ -2847,6 +2847,47 @@ public void testBasicTableStatistics()
        dropTable(tableName);
    }

    @Test
    public void testBasicAnalyze()
nit: you can also test if stats go away if you explicitly drop them.
This is covered by io.trino.plugin.iceberg.TestIcebergAnalyze#testDropExtendedStats
Force-pushed from 60598bb to d5870f7.
We shouldn’t add this temporary feature. We’ll have to support it forever on the read side since users will rely on it, as the stats will be stored permanently in their tables and dropping it will break their queries.
What is the migration plan for the new implementation?
Force-pushed from d5870f7 to 5c5c16c.
discussing over slack
(just rebased)
Force-pushed from 5c5c16c to 474b632.
CI: #12385 and something else
        String tableName = "test_analyze";
        assertUpdate("CREATE TABLE " + tableName + " AS SELECT * FROM tpch.tiny.region", 5);

        // no NDV information
nit: The indentation seems wrong. Please ignore if it's intentional.
thanks, fixed
Force-pushed from d9af33c to f4b0fd4.
Discussed offline and there were no objections overall. I will still add a toggle to make this an opt-in, experimental feature so that we can accelerate the deprecation path.
Force-pushed from f4b0fd4 to cb74fab.
(just rebased)
Per code style rule
Force-pushed from 3655c82 to 95ccfb6.
Support `ANALYZE` in Iceberg connector. This collects the number of distinct values (NDV) of selected columns and stores it in table properties. This is an interim solution until the Iceberg library has first-class statistics files support.
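As a rough illustration of the idea in the commit message, the sketch below counts distinct non-null values per selected column and renders the counts as string-valued table properties. Everything in it is an assumption made for the example: the `example.stats.ndv.` key prefix is hypothetical and not the one used by the PR, rows are modelled as plain maps, and exact counting stands in for whatever distinct-count estimator the engine actually uses.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class NdvPropertiesSketch
{
    // Hypothetical property key prefix, chosen for this example only
    static final String NDV_PROPERTY_PREFIX = "example.stats.ndv.";

    // Counts distinct non-null values per selected column and renders them as table properties.
    static Map<String, String> ndvProperties(List<Map<String, Object>> rows, Set<String> columns)
    {
        Map<String, Set<Object>> distinct = new TreeMap<>();
        for (String column : columns) {
            distinct.put(column, new HashSet<>());
        }
        for (Map<String, Object> row : rows) {
            for (String column : columns) {
                Object value = row.get(column);
                if (value != null) {
                    distinct.get(column).add(value);
                }
            }
        }
        // Table properties are string-valued, so each count is stored as text
        Map<String, String> properties = new TreeMap<>();
        for (Map.Entry<String, Set<Object>> entry : distinct.entrySet()) {
            properties.put(NDV_PROPERTY_PREFIX + entry.getKey(), Integer.toString(entry.getValue().size()));
        }
        return properties;
    }
}
```

A reader of the table would then look up the property for a column and parse it back into a number; columns that were not analyzed simply have no entry.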
Force-pushed from 95ccfb6 to 0901525.
No release notes because this is experimental.
@findepi where is the NDV generated? From this PR it seems it's already in the table properties. How does it work currently?