Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Analyze Iceberg tables #13636

Merged

Conversation

findepi
Copy link
Member

@findepi findepi commented Aug 12, 2022

Support ANALYZE in Iceberg connector. This collects number distinct
values (NDV) of selected columns and stores that in table properties.
This is interim solution until Iceberg library has first-class
statistics files support.

@findepi findepi added enhancement New feature or request performance labels Aug 12, 2022
@cla-bot cla-bot bot added the cla-signed label Aug 12, 2022
@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch 3 times, most recently from 427d7f5 to 92d8312 Compare August 12, 2022 09:41
@findepi findepi mentioned this pull request Aug 12, 2022
@findepi findepi self-assigned this Aug 12, 2022
@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from 92d8312 to e1d1c7d Compare August 12, 2022 16:22
public ConnectorAnalyzeMetadata getStatisticsCollectionMetadata(ConnectorSession session, ConnectorTableHandle tableHandle, Map<String, Object> analyzeProperties)
{
IcebergTableHandle handle = (IcebergTableHandle) tableHandle;
checkArgument(handle.getTableType() == DATA, "Cannot analyze non-DATA table: %s", handle.getTableType());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should getTableHandleForExecute have a similar check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we verify that the table's snapshot is the recent one? Not time traveling

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added both


Set<ColumnStatisticMetadata> columnStatistics = tableMetadata.getColumns().stream()
.filter(column -> !column.isHidden())
.filter(column -> analyzeColumnNames.contains(column.getName()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to lowercase analyzeColumnNames ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user needs to provide them in lowercase. This makes it easier to support non-lowercase column names in the future.

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from e1d1c7d to b47566d Compare August 24, 2022 09:48
@findepi
Copy link
Member Author

findepi commented Aug 24, 2022

@alexjo2144 thanks for your review.
AC. Had to rebase due to a conflict.

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from b47566d to b3efc19 Compare August 24, 2022 09:52
@findepi findepi requested review from losipiuk and alexjo2144 and removed request for alexjo2144 August 24, 2022 09:52
@findepi
Copy link
Member Author

findepi commented Aug 24, 2022

Prefix extracted: #13823

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch 2 times, most recently from b7bb888 to 60f88fb Compare August 24, 2022 18:59
@findepi
Copy link
Member Author

findepi commented Aug 25, 2022

Prefix extracted: #13823

rebasing after that one merged. Was green before (https://github.com/trinodb/trino/runs/8002212012?check_suite_focus=true)

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from 60f88fb to 5e9f939 Compare August 25, 2022 07:46
if (shouldDenyPrivilege(context.getIdentity().getUser(), table + "." + procedure, EXECUTE_TABLE_PROCEDURE)) {
denyExecuteTableProcedure(table.toString(), procedure);
}
if (denyPrivileges.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this if?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am dumbly following the pattern in this class.

@@ -917,6 +933,18 @@ private Optional<ConnectorTableExecuteHandle> getTableHandleForOptimize(Connecto
icebergTable.location()));
}

private Optional<ConnectorTableExecuteHandle> getTableHandleForDropExtendedStats(ConnectorSession session, IcebergTableHandle tableHandle, Map<String, Object> executeProperties)
{
checkProcedureArgument(executeProperties.isEmpty(), "Unexpected properties");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: quote properties in error message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. but that's just some check and the proper checks are done in the engine.
in fact, see getTableHandleForOptimize. it gets a propery value from the Map, but doesn't check that the map doesn't have extra entries.

so, to keep code simpler, will just drop the check. this will prevent someone in the future to be forced to do solid validation here

.orElse(allColumnNames);

Set<ColumnStatisticMetadata> columnStatistics = tableMetadata.getColumns().stream()
.filter(column -> !column.isHidden())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like filtering out hidden is not needed here as you enforce that analyzeColumnNames not contains hidden columns above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed and renamed allColumnNames to allDataColumnNames.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed and renamed allColumnNames to allDataColumnNames.

@@ -2847,6 +2847,47 @@ public void testBasicTableStatistics()
dropTable(tableName);
}

@Test
public void testBasicAnalyze()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can also test if stats go away if you explicitly drop them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is covered by io.trino.plugin.iceberg.TestIcebergAnalyze#testDropExtendedStats

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch 2 times, most recently from 60598bb to d5870f7 Compare August 25, 2022 11:39
Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn’t add this temporary feature. We’ll have to support it forever on the read side since users will rely on it, as the stats will be stored permanently in their tables and dropping it will break their queries.

@electrum electrum dismissed their stale review August 25, 2022 16:09

What is the migration plan for the new implementation?

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from d5870f7 to 5c5c16c Compare August 26, 2022 10:23
@findepi
Copy link
Member Author

findepi commented Aug 26, 2022

What is the migration plan for the new implementation?

discussing over slack

@findepi
Copy link
Member Author

findepi commented Aug 30, 2022

(just rebased)

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from 5c5c16c to 474b632 Compare August 30, 2022 13:52
@findepi
Copy link
Member Author

findepi commented Aug 30, 2022

CI #12385 & sth other

String tableName = "test_analyze";
assertUpdate("CREATE TABLE " + tableName + " AS SELECT * FROM tpch.tiny.region", 5);

// no NDV information
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The indentation seems wrong. Please ignore if it's intentional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from d9af33c to f4b0fd4 Compare August 31, 2022 10:40
@findepi
Copy link
Member Author

findepi commented Sep 6, 2022

We shouldn’t add this temporary feature. We’ll have to support it forever on the read side since users will rely on it, as the stats will be stored permanently in their tables and dropping it will break their queries.

Discussed offline and there were no objections overall.

I will still add a toggle to make this an opt-in, experimental feature so that we can accelerate deprecation path.

@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from f4b0fd4 to cb74fab Compare September 6, 2022 09:14
@findepi
Copy link
Member Author

findepi commented Sep 6, 2022

(just rebased)

Per code style rule
@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch 2 times, most recently from 3655c82 to 95ccfb6 Compare September 6, 2022 10:51
Support `ANALYZE` in Iceberg connector. This collects number distinct
values (NDV) of selected columns and stores that in table properties.
This is interim solution until Iceberg library has first-class
statistics files support.
@findepi findepi force-pushed the findepi/iceberg-analyze-ndv-sans-stats-file branch from 95ccfb6 to 0901525 Compare September 6, 2022 14:03
@findepi findepi merged commit da150ae into trinodb:master Sep 7, 2022
@findepi findepi deleted the findepi/iceberg-analyze-ndv-sans-stats-file branch September 7, 2022 11:36
@findepi findepi added the no-release-notes This pull request does not require release notes entry label Sep 7, 2022
@findepi
Copy link
Member Author

findepi commented Sep 7, 2022

No release notes because this is experimental.

@github-actions github-actions bot added this to the 395 milestone Sep 7, 2022
@zinking
Copy link

zinking commented Oct 27, 2022

@findepi where is the ndv generated ? from this PR it seems it's already in the table properties. how does it work currently?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed enhancement New feature or request no-release-notes This pull request does not require release notes entry performance
Development

Successfully merging this pull request may close these issues.

6 participants