Update Iceberg table statistics on inserts #15441

Merged
findepi merged 9 commits into trinodb:master from findepi/iceberg-auto-update-ndv
Mar 20, 2023

Conversation

Member

@findepi findepi commented Dec 16, 2022

No description provided.

transaction = null;

// TODO (https://github.com/trinodb/trino/issues/15439): it would be good to publish data and stats atomically
Contributor

If we move stats collection into Iceberg, we could do this automatically.

Member Author

How can we move stats collection to Iceberg?

@findepi findepi changed the title Update Iceberg table statistics on writes Update Iceberg table statistics on inserts Dec 21, 2022
@findepi findepi force-pushed the findepi/iceberg-auto-update-ndv branch from 5121372 to 8ea0548 Compare March 13, 2023 16:25
@findepi findepi marked this pull request as ready for review March 13, 2023 16:26

@Config(COLLECT_EXTENDED_STATISTICS_ON_WRITE_CONFIG)
@ConfigDescription(COLLECT_EXTENDED_STATISTICS_ON_WRITE_DESCRIPTION)
public IcebergConfig setCollectExtendedStatisticsOnWrite(boolean collectExtendedStatisticsOnWrite)
Member

Could you also add information to the documentation?

Member Author

We also need to document iceberg.extended-statistics.enabled before we document iceberg.extended-statistics.collect-on-write. Let's follow up.

Note that Delta's delta.extended-statistics.collect-on-write isn't documented either, so you may want to document it.
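
For readers unfamiliar with the two properties named above, a catalog configuration fragment might look like the following. This is only a sketch: the file name is hypothetical, the values are illustrative rather than documented defaults, and the other settings a real Iceberg catalog needs are omitted.

```properties
# etc/catalog/example.properties (hypothetical file name; fragment only)
connector.name=iceberg
# Illustrative values, not necessarily the defaults
iceberg.extended-statistics.enabled=true
iceberg.extended-statistics.collect-on-write=true
```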

.collect(toImmutableList());
for (Pair<BlobMetadata, ByteBuffer> read : reader.readAll(toRead)) {
Integer fieldId = getOnlyElement(read.first().inputFields());
checkState(pendingPreviousNdvSketches.remove(fieldId), "Unwanted read of stats for field %s", fieldId);
Member

This doesn't seem like a safe assumption, unless I'm missing something.

You're walking the snapshot history back until you find all the columns you're looking for, so you may find duplicates of the columns you have already found.

Member Author

Isn't toRead pre-filtered?

Member Author

I think it is, so I will ignore this for now.

Member

You are right, I missed that filter line
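
The point settled in this thread can be illustrated with a simplified sketch. It is not the connector's code: `Blob`, `Snapshot`, and `findLatestSketches` are hypothetical stand-ins, and sketches are plain strings. It shows why the `remove` check holds when `toRead` is filtered against the still-pending fields before reading: older snapshots may contain blobs for columns that were already found, but those never reach the read loop.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class NdvSketchLookup
{
    private NdvSketchLookup() {}

    // Hypothetical stand-in for a statistics blob: the NDV sketch of exactly one field.
    static final class Blob
    {
        final int fieldId;
        final String sketch;

        Blob(int fieldId, String sketch)
        {
            this.fieldId = fieldId;
            this.sketch = sketch;
        }
    }

    // Hypothetical stand-in for a snapshot together with the blobs of its statistics file.
    static final class Snapshot
    {
        final List<Blob> blobs;

        Snapshot(List<Blob> blobs)
        {
            this.blobs = blobs;
        }
    }

    // Walks snapshots from newest to oldest and returns the most recent sketch per wanted field.
    static Map<Integer, String> findLatestSketches(List<Snapshot> newestFirst, Set<Integer> wantedFieldIds)
    {
        Set<Integer> pending = new LinkedHashSet<>(wantedFieldIds);
        Map<Integer, String> found = new TreeMap<>();
        for (Snapshot snapshot : newestFirst) {
            if (pending.isEmpty()) {
                break;
            }
            // Pre-filter: only blobs for fields that are still pending get read.
            List<Blob> toRead = snapshot.blobs.stream()
                    .filter(blob -> pending.contains(blob.fieldId))
                    .collect(Collectors.toList());
            for (Blob blob : toRead) {
                // Fails only if one snapshot lists two blobs for the same field,
                // which this sketch assumes does not happen.
                if (!pending.remove(blob.fieldId)) {
                    throw new IllegalStateException("Unwanted read of stats for field " + blob.fieldId);
                }
                found.put(blob.fieldId, blob.sketch);
            }
        }
        return found;
    }
}
```

With this structure, a blob for an already-resolved field in an older snapshot is skipped by the filter rather than tripping the state check.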

findepi added 8 commits March 16, 2023 22:40
`ByteBuffer.array()` is a convenient and efficient way to get bytes from
a `ByteBuffer`. It has, however, numerous preconditions that should be
checked before using the returned array. The commit replaces all
`ByteBuffer.array()` usages where these preconditions are assumed to be
true.
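
The commit message does not show what replaced the `ByteBuffer.array()` calls. As a minimal sketch of the preconditions it refers to, a helper along these lines (the name `getBytes` is hypothetical) checks `hasArray()` and honors `arrayOffset()`, `position()`, and `remaining()`, falling back to a copy for direct or read-only buffers:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ByteBufferBytes
{
    private ByteBufferBytes() {}

    // Copies the remaining bytes (position..limit) without moving the buffer's position.
    static byte[] getBytes(ByteBuffer buffer)
    {
        if (buffer.hasArray()) {
            // Writable heap buffer: array() exposes the whole backing array,
            // so the offset and the current position must be applied explicitly.
            int start = buffer.arrayOffset() + buffer.position();
            return Arrays.copyOfRange(buffer.array(), start, start + buffer.remaining());
        }
        // Direct or read-only buffer: array() would throw, so copy through a duplicate.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return bytes;
    }

    public static void main(String[] args)
    {
        ByteBuffer slice = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5}, 1, 3).slice();
        System.out.println(Arrays.toString(slice.array()));   // the whole backing array
        System.out.println(Arrays.toString(getBytes(slice))); // only the slice's content
    }
}
```

The trade-off is an extra copy even for heap buffers, which gives up the efficiency of `array()` in exchange for not depending on how the buffer was created.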

- prefer `assertThat(List).hasSize` to `assertThat(list.size())` as the
  former includes list's elements upon failure,
- drop table after test.

…tion

Before the change, the test used the `getAllMetadataFilesFromTableDirectory`
utility, but it was listing all (metadata and data) files. A similar
method, `getAllMetadataFilesFromTableDirectoryForTable`, which lists only
metadata files, existed but was not used in the test.

The change merges the utility methods: the correct logic comes from
`getAllMetadataFilesFromTableDirectoryForTable` (so no behavior change
for tests other than `testCleaningUpWithTableWithSpecifiedLocation`),
and the name comes from `getAllMetadataFilesFromTableDirectory`.

Iceberg connector will collect stats during writes, and it would be good
to test this together with ANALYZE. Rename class to contain all related
functionalities.

All Iceberg connector test classes match `TestIceberg*ConnectorTest`
pattern.
@findepi findepi force-pushed the findepi/iceberg-auto-update-ndv branch from f012672 to a2305fd Compare March 16, 2023 21:41
Member Author

findepi commented Mar 16, 2023

Rebased to resolve conflicts and rerun the flaky(?) Cassandra test.

@github-actions

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4440329605

@findepi findepi force-pushed the findepi/iceberg-auto-update-ndv branch from a2305fd to 6d192be Compare March 20, 2023 12:00
Member Author

findepi commented Mar 20, 2023

/test-with-secrets sha=6d192be5d21b16f717b144ae8edc78ac08c113c2

@github-actions

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4467931974

Member Author

findepi commented Mar 20, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4467931974

test (plugin/trino-bigquery) -> #16636

test (plugin/trino-bigquery, cloud-tests-arrow) - unrelated?

Error:  Failures: 
Error:    TestBigQueryArrowConnectorTest>BaseConnectorTest.testColumnName:4686->BaseConnectorTest.testColumnName:4696->AbstractTestQueryFramework.assertUpdate:396->AbstractTestQueryFramework.assertUpdate:401 » QueryFailed
Error:    TestBigQueryArrowConnectorTest>BaseConnectorTest.testCommentColumnName:3604->BaseConnectorTest.testCommentColumnName:3611 » QueryFailed
Error:  io.trino.plugin.bigquery.TestBigQueryArrowConnectorTest.testInsertRowConcurrently
Error:    Run 1: TestBigQueryArrowConnectorTest>BaseConnectorTest.testInsertRowConcurrently:4357->BaseConnectorTest.lambda$testInsertRowConcurrently$55:4354->BaseConnectorTest.lambda$testInsertRowConcurrently$54:4354 » Runtime

cc @hashhar

test (plugin/trino-iceberg, cloud-tests) - related

test (plugin/trino-pinot) unrelated (#15429)
@elonazoulay can you see https://github.com/trinodb/trino/actions/runs/4467931974/jobs/7848098340?

@findepi findepi force-pushed the findepi/iceberg-auto-update-ndv branch from 6d192be to fb405ea Compare March 20, 2023 14:06
Member Author

findepi commented Mar 20, 2023

/test-with-secrets sha=fb405ea87bdffff9e66a8d7717cd86b5a99bc74d

@github-actions

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4469322142

Member Author

findepi commented Mar 20, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4469322142

BigQuery failures, unrelated

The ci / test (plugin/trino-iceberg, cloud-tests) failure at io.trino.plugin.iceberg.TestIcebergGcsConnectorSmokeTest.testDeleteRowsConcurrently is also happening on master (e.g. https://github.com/trinodb/trino/actions/runs/4466763583/jobs/7845658911)
(#13995)

@findepi findepi merged commit bf04a72 into trinodb:master Mar 20, 2023
@findepi findepi deleted the findepi/iceberg-auto-update-ndv branch March 20, 2023 21:25
@findepi findepi mentioned this pull request Mar 20, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 20, 2023
"('mommy', 4)," +
"('moscow', 5)," +
"('Kielce', 4)," +
"('Kiev', 5)," +
Contributor

This is much nicer :) But just wanted to point out that the preferred romanization is "Kyiv". Sadly, there's no timezone "Europe/Kyiv".

Labels
cla-signed docs enhancement New feature or request iceberg Iceberg connector
8 participants