-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip single file in a partition from OPTIMIZE #23864
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only cosmetics spotted.
I appreciate seeing the testing coverage for the newly added functionality.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
Outdated
Show resolved
Hide resolved
List<Object> splitsInfo = ImmutableList.copyOf(scannedFiles.build()); | ||
log.info("Generated %d splits, skipped %d files for OPTIMIZE", splitsInfo.size(), filesSkipped); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's useful enough to be at info, until we enhance optimize to print some useful summary at the end of execution. We get a bunch of useless logs out of iceberg library anyway in IcebergSplitSource, this isn't adding much to the noise.
private ImmutableSet.Builder<DataFileWithDeleteFiles> scannedFiles = ImmutableSet.builder(); | ||
@Nullable | ||
private Map<StructLikeWrapperWithFieldIdToIndex, Optional<FileScanTaskWithDomain>> scannedFilesByPartition = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that there is no need to log how many files were skipped.
Consider using a local variable in the method instead of adding more state to the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This state is required regardless of the log line. We need to maintain scannedFilesByPartition
for the entire lifetime of the split source as file tasks iteration is not guaranteed to be partition at a time.
Having the log line is useful for knowing what happened. We should probably extend OPTIMIZE to print some useful stats like this at the end of the execution.
04ebc29
to
f8684a0
Compare
f8684a0
to
6549eb1
Compare
@@ -304,6 +277,54 @@ private CompletableFuture<ConnectorSplitBatch> getNextBatchInternal(int maxSize) | |||
return completedFuture(new ConnectorSplitBatch(splits, isFinished())); | |||
} | |||
|
|||
private Iterator<FileScanTaskWithDomain> prepareFileTasksIterator(List<FileScanTaskWithDomain> fileScanTasks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are materializing FileScanTaskWithDomain
so just return LIst< FileScanTaskWithDomain>
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more convenient to return iterator as the rest of the existing code is written to work on an iterator
import java.util.Objects; | ||
import java.util.stream.IntStream; | ||
|
||
public class StructLikeWrapperWithFieldIdToIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice class name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
credit to @homar ;)
...trino-iceberg/src/main/java/io/trino/plugin/iceberg/StructLikeWrapperWithFieldIdToIndex.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit hard to read - but logic seems fine
Keeping domain attached to the relevant FileScanTask is safer than handling it as member variable of the class
This will be used in subsequent commit to skip unncessary files from optimize
Single file without a delete in a partition can't be optimized any further
6549eb1
to
84ac714
Compare
Description
Single file without a delete in a partition can't be optimized any further.
Avoiding rewriting such files improves the performance of OPTIMIZE on
large partitioned tables
Additional context and related issues
Fixes #10785
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: