Update Iceberg table statistics on inserts #15441
Conversation
Force-pushed from a7e45da to 5121372.
```java
transaction = null;

// TODO (https://github.com/trinodb/trino/issues/15439): it would be good to publish data and stats atomically
```
If we move stats collection into Iceberg, we could do this automatically.
How can we move stats collection to Iceberg?
Force-pushed from 5121372 to 8ea0548.
plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/encoder/json/JsonRowEncoder.java (resolved)
plugin/trino-kafka/src/main/java/io/trino/plugin/kafka/encoder/raw/RawRowEncoder.java (resolved)
```java
@Config(COLLECT_EXTENDED_STATISTICS_ON_WRITE_CONFIG)
@ConfigDescription(COLLECT_EXTENDED_STATISTICS_ON_WRITE_DESCRIPTION)
public IcebergConfig setCollectExtendedStatisticsOnWrite(boolean collectExtendedStatisticsOnWrite)
```
Could you also add information to the documentation?
We also need to document `iceberg.extended-statistics.enabled` before we document `iceberg.extended-statistics.collect-on-write`. Let's follow up. Note that Delta's `delta.extended-statistics.collect-on-write` isn't documented either, so you may want to document it.
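For reference, a minimal sketch of the airlift-style config pattern the setter above follows, assuming the property name from this discussion; the field default and description text are illustrative, not the exact Trino code:

```java
// Hypothetical excerpt from IcebergConfig; the default value is an assumption.
private boolean collectExtendedStatisticsOnWrite = true;

public boolean isCollectExtendedStatisticsOnWrite()
{
    return collectExtendedStatisticsOnWrite;
}

@Config("iceberg.extended-statistics.collect-on-write")
@ConfigDescription("Collect extended statistics during write operations")
public IcebergConfig setCollectExtendedStatisticsOnWrite(boolean collectExtendedStatisticsOnWrite)
{
    this.collectExtendedStatisticsOnWrite = collectExtendedStatisticsOnWrite;
    return this; // airlift config setters return the config object for chaining
}
```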
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java (resolved)
```java
        .collect(toImmutableList());
for (Pair<BlobMetadata, ByteBuffer> read : reader.readAll(toRead)) {
    Integer fieldId = getOnlyElement(read.first().inputFields());
    checkState(pendingPreviousNdvSketches.remove(fieldId), "Unwanted read of stats for field %s", fieldId);
```
This doesn't seem like a safe assumption, unless I'm missing something. You're walking the snapshot history back until you find all the columns you're looking for, so you may find duplicates for the columns you already found.
Isn't `toRead` pre-filtered?
I think it is, so I'll ignore this for now.
You are right, I missed that filter line.
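To make the resolution concrete, here is a sketch of the kind of pre-filtering being referred to, assuming a `Set<Integer> pendingPreviousNdvSketches` and Puffin's `fileMetadata.blobs()` as in the snippet above; the exact filter conditions in the PR may differ:

```java
// Only read single-field blobs whose field is still pending, so readAll()
// can never hand back a duplicate and the checkState above holds.
List<BlobMetadata> toRead = fileMetadata.blobs().stream()
        .filter(blob -> blob.inputFields().size() == 1)
        .filter(blob -> pendingPreviousNdvSketches.contains(getOnlyElement(blob.inputFields())))
        .collect(toImmutableList());
```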
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java (resolved)
`ByteBuffer.array()` is a convenient and efficient way to get bytes from a `ByteBuffer`. It has, however, numerous preconditions that should be checked before using the returned array. The commit replaces all `ByteBuffer.array()` usages where these preconditions are assumed to be true.
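A minimal sketch of a safe extraction helper in the spirit of that commit; the class and method names are hypothetical, not taken from the PR:

```java
import java.nio.ByteBuffer;

final class ByteBuffers
{
    private ByteBuffers() {}

    // array() is only safe when the buffer is heap-backed and not read-only
    // (hasArray() covers both) and the backing array exactly matches the
    // buffer's remaining content; otherwise fall back to copying.
    static byte[] getBytes(ByteBuffer buffer)
    {
        if (buffer.hasArray() && buffer.arrayOffset() == 0 && buffer.position() == 0 && buffer.array().length == buffer.remaining()) {
            return buffer.array();
        }
        byte[] copy = new byte[buffer.remaining()];
        buffer.duplicate().get(copy); // duplicate() leaves the caller's position untouched
        return copy;
    }
}
```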
- prefer `assertThat(list).hasSize(...)` to `assertThat(list.size())`, as the former includes the list's elements upon failure
- drop table after test
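A small AssertJ illustration of the difference (the list contents are made up):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

List<String> rows = List.of("a", "b");

// On failure, reports the elements: expected size 3 but was 2 in ["a", "b"]
assertThat(rows).hasSize(3);

// On failure, reports only the numbers: expected 3 but was 2
assertThat(rows.size()).isEqualTo(3);
```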
…tion

Before the change, the test used the `getAllMetadataFilesFromTableDirectory` utility, but it was listing all (metadata and data) files. A similar method, `getAllMetadataFilesFromTableDirectoryForTable`, listing only metadata files, existed but was not used in the test. The change merges the utility methods: the correct logic comes from `getAllMetadataFilesFromTableDirectoryForTable` (so there is no behavior change for tests other than `testCleaningUpWithTableWithSpecifiedLocation`), and the name comes from `getAllMetadataFilesFromTableDirectory`.
The Iceberg connector will collect stats during writes, and it would be good to test this together with ANALYZE. Rename the class to cover all related functionality.
All Iceberg connector test classes match the `TestIceberg*ConnectorTest` pattern.
Force-pushed from f012672 to a2305fd.
Rebased to resolve conflicts & reran the flaky(?) Cassandra test.

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4440329605
Force-pushed from a2305fd to 6d192be.
/test-with-secrets sha=6d192be5d21b16f717b144ae8edc78ac08c113c2

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4467931974
cc @hashhar
Force-pushed from 6d192be to fb405ea.
/test-with-secrets sha=fb405ea87bdffff9e66a8d7717cd86b5a99bc74d

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4469322142
BigQuery failures, unrelated.
"('mommy', 4)," + | ||
"('moscow', 5)," + | ||
"('Kielce', 4)," + | ||
"('Kiev', 5)," + |
This is much nicer :) But just wanted to point out that the preferred romanization is "Kyiv". Sadly, there's no timezone "Europe/Kyiv".