Update Iceberg table statistics on inserts #15441
Conversation
Force-pushed from a7e45da to 5121372.
```java
transaction = null;

// TODO (https://github.com/trinodb/trino/issues/15439): it would be good to publish data and stats atomically
```
If we move stats collection into Iceberg, we could do this automatically.
How can we move stats collection to Iceberg?
Force-pushed from 5121372 to 8ea0548.
plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/encoder/json/JsonRowEncoder.java (resolved)
plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/encoder/raw/RawRowEncoder.java (resolved)
```java
@Config(COLLECT_EXTENDED_STATISTICS_ON_WRITE_CONFIG)
@ConfigDescription(COLLECT_EXTENDED_STATISTICS_ON_WRITE_DESCRIPTION)
public IcebergConfig setCollectExtendedStatisticsOnWrite(boolean collectExtendedStatisticsOnWrite)
```
Could you also add information to the documentation?
We also need to document `iceberg.extended-statistics.enabled` before we document `iceberg.extended-statistics.collect-on-write`. Let's follow up. Note that Delta's `delta.extended-statistics.collect-on-write` isn't documented either, so you may want to document it.
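For reference, a minimal sketch of the airlift-style config pattern the setter above follows, assuming the property name from this discussion; the field default and description text are illustrative, not the exact Trino code:

```java
// Hypothetical excerpt from IcebergConfig; the default value is an assumption.
private boolean collectExtendedStatisticsOnWrite = true;

public boolean isCollectExtendedStatisticsOnWrite()
{
    return collectExtendedStatisticsOnWrite;
}

@Config("iceberg.extended-statistics.collect-on-write")
@ConfigDescription("Collect extended statistics during write operations")
public IcebergConfig setCollectExtendedStatisticsOnWrite(boolean collectExtendedStatisticsOnWrite)
{
    this.collectExtendedStatisticsOnWrite = collectExtendedStatisticsOnWrite;
    return this; // airlift config setters return the config object for chaining
}
```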
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java (resolved)
```java
        .collect(toImmutableList());
for (Pair<BlobMetadata, ByteBuffer> read : reader.readAll(toRead)) {
    Integer fieldId = getOnlyElement(read.first().inputFields());
    checkState(pendingPreviousNdvSketches.remove(fieldId), "Unwanted read of stats for field %s", fieldId);
```
This doesn't seem like a safe assumption, unless I'm missing something. You're walking the snapshot history back until you find all the columns you're looking for, so you may find duplicates for the columns you already found.
Isn't `toRead` pre-filtered?
I think it is, so I'll ignore this for now.
You are right, I missed that filter line.
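To make the resolution concrete, here is a sketch of the kind of pre-filtering being referred to, assuming a `Set<Integer> pendingPreviousNdvSketches` and Puffin's `fileMetadata.blobs()` as in the snippet above; the exact filter conditions in the PR may differ:

```java
// Only read single-field blobs whose field is still pending, so readAll()
// can never hand back a duplicate and the checkState above holds.
List<BlobMetadata> toRead = fileMetadata.blobs().stream()
        .filter(blob -> blob.inputFields().size() == 1)
        .filter(blob -> pendingPreviousNdvSketches.contains(getOnlyElement(blob.inputFields())))
        .collect(toImmutableList());
```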
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java (resolved)
`ByteBuffer.array()` is a convenient and efficient way to get bytes from a `ByteBuffer`. It has, however, numerous preconditions that should be checked before using the returned array. The commit replaces all `ByteBuffer.array()` usages where these preconditions are assumed to be true.
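A minimal sketch of a safe extraction helper in the spirit of that commit; the class and method names are hypothetical, not taken from the PR:

```java
import java.nio.ByteBuffer;

final class ByteBuffers
{
    private ByteBuffers() {}

    // array() is only safe when the buffer is heap-backed and not read-only
    // (hasArray() covers both) and the backing array exactly matches the
    // buffer's remaining content; otherwise fall back to copying.
    static byte[] getBytes(ByteBuffer buffer)
    {
        if (buffer.hasArray() && buffer.arrayOffset() == 0 && buffer.position() == 0 && buffer.array().length == buffer.remaining()) {
            return buffer.array();
        }
        byte[] copy = new byte[buffer.remaining()];
        buffer.duplicate().get(copy); // duplicate() leaves the caller's position untouched
        return copy;
    }
}
```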
- prefer `assertThat(list).hasSize(...)` to `assertThat(list.size())`, as the former includes the list's elements upon failure
- drop table after test
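A small AssertJ illustration of the difference (the list contents are made up):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

List<String> rows = List.of("a", "b");

// On failure, reports the elements: expected size 3 but was 2 in ["a", "b"]
assertThat(rows).hasSize(3);

// On failure, reports only the numbers: expected 3 but was 2
assertThat(rows.size()).isEqualTo(3);
```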
…tion

Before the change, the test used the `getAllMetadataFilesFromTableDirectory` utility, but it was listing all (metadata and data) files. A similar method, `getAllMetadataFilesFromTableDirectoryForTable`, listing only metadata files, existed but was not used in the test. The change merges the utility methods: the correct logic comes from `getAllMetadataFilesFromTableDirectoryForTable` (so there is no behavior change for tests other than `testCleaningUpWithTableWithSpecifiedLocation`), and the name comes from `getAllMetadataFilesFromTableDirectory`.
The Iceberg connector will collect stats during writes, and it would be good to test this together with ANALYZE. Rename the class to cover all related functionality.
All Iceberg connector test classes match the `TestIceberg*ConnectorTest` pattern.
Force-pushed from f012672 to a2305fd.
Rebased to resolve conflicts & reran the flaky(?) Cassandra test.

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4440329605
Force-pushed from a2305fd to 6d192be.
/test-with-secrets sha=6d192be5d21b16f717b144ae8edc78ac08c113c2

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4467931974
cc @hashhar
Force-pushed from 6d192be to fb405ea.
/test-with-secrets sha=fb405ea87bdffff9e66a8d7717cd86b5a99bc74d

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4469322142
BigQuery failures, unrelated.
"('mommy', 4)," + | ||
"('moscow', 5)," + | ||
"('Kielce', 4)," + | ||
"('Kiev', 5)," + |
This is much nicer :) But just wanted to point out that the preferred romanization is "Kyiv". Sadly, there's no timezone "Europe/Kyiv".