-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add table metadata changes for statistics information in table metadata #5450
Conversation
2fcf034
to
f439151
Compare
api/src/main/java/org/apache/iceberg/UpdateTableStatistics.java
Outdated
Show resolved
Hide resolved
30f6308
to
6b6ca9e
Compare
Thanks @rdblue for a thorough review. Comments applied, please take another look if you want. |
76d0743
to
c00b965
Compare
Per #5450 (comment), i have rebased this off of #5021. @rdblue please take another look when you want |
} | ||
} | ||
|
||
class RemoveStatistics implements MetadataUpdate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this going to be used? Should we have a single ReplaceStatistics
operation that is idempotent instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When working on Trino stats support in Hive and Delta we found out that ability to remove stats is useful. We need a way to remove statistics from a table, for example if we detect that application that wrote stats did it incorrectly, or when stats are correct, but a query engine makes wrong conclusions using those stats and plans are bad in practice.
@rdblue how would you want to model this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me. Is SetStatistics
intended to be idempotent? That would make sense so that we don't need both remove and set if we want to replace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, SetStatistics
should be idempotent and also should allow replacing an existing stat file, so we don't need remove when we want to set.
core/src/test/java/org/apache/iceberg/TestMetadataUpdateParser.java
Outdated
Show resolved
Hide resolved
c00b965
to
c5ad62f
Compare
Thanks @rdblue for your review. Updated the PR thanks to your suggestions. |
@@ -817,6 +824,7 @@ public static class Builder { | |||
private long currentSnapshotId; | |||
private List<Snapshot> snapshots; | |||
private final Map<String, SnapshotRef> refs; | |||
private final Map<Long, List<StatisticsFile>> statisticsFiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a multimap? Shouldn't setStatistics
replace the previous file? I think that's the behavior that is currently implemented, so I think this should just be a regular map to ensure there aren't somehow multiple files for a snapshot ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a multimap?
This is so per #5021 (comment)
Please clarify where I can use a Map and where I should not, the lifecycle of these metadata objects is still not crystal clear to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so the idea is to simply preserve cases where there are multiple stats files? I guess I'm okay with that. I'd also be fine with discarding one of them.
c5ad62f
to
ad6c355
Compare
Looks good. I'll merge when tests are passing. |
thanks for all the review comments! |
Thanks, @findepi! |
Extracted from #5021
#5021 (comment)