-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support setting table statistics #5794
Conversation
f0a5a23
to
931d3b0
Compare
85f8530
to
645b815
Compare
645b815
to
e4750b5
Compare
Applied comments, @rdblue please take another look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one comment
|
||
@Override | ||
public void commit() { | ||
TableMetadata base = ops.current(); // or ops.refresh() ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably better to use ops.refresh()
here as it seems other places that implement PendingUpdate
use the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i am reading implementations of org.apache.iceberg.PendingUpdate#commit
correctly
BaseUpdatePartitionSpec
,PropertiesUpdate
,SchemaUpdate
,UpdateSnapshotReferencesOperation
,SnapshotProducer
usecurrent
(acquired at the constructor invocation time?)SetLocation
,RemoveSnapshots
,BaseReplaceSortOrder
userefresh
Now, it's maybe apply
should call the refresh
?
SnapshotProducer
,SetSnapshotOperation
do soUpdateSnapshotReferencesOperation
,BaseUpdatePartitionSpec
do not seem to do that (and doesn't refresh at commit time)RemoveSnapshots
,SchemaUpdate
do not do that (maybe they should?), but then callrefresh
at commit time)BaseReplaceSortOrder
,SetLocation
do not do that (do not need to), but then callrefresh
at commit time)
That's as far as the code lecture goes. I am not expert on the lifecycle and concurrency of these entities, so it's hard for me to judge which one to use (hence the code comment). The code frequency stats don't seem to determine that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple questions:
- Should a commit call
refresh
or attempt with the current metadata? - Should refresh be called in
apply
or incommit
?
For 1, what matters is whether the operation can retry and how expensive that retry is.
SchemaUpdate
can't retry because it tracks changes based on the current schema. For example, updateColumn("id", LongType.get())
will track a replacement id
field with its comment unchanged. If a concurrent update modifies the comment, then committing the changes would inadvertently revert the concurrent change. That's why SchemaUpdate
doesn't retry. As a result, current
must be used to swap for the state when the update was started.
Operations that support retry typically use refresh
instead of current
so that changes are based on the latest state of the table. That is basically to prevent wasted work when Iceberg can detect that a retry would be necessary. This is why SnapshotProducer
calls refresh
.
Many changes that can be retried don't have expensive retries and it's better to avoid calling refresh
. PropertiesUpdate
, for example, is a metadata only change so it's better to attempt with the current metadata to avoid an extra call to the catalog service.
For 2, it depends on how apply
is used. If apply
is called by commit
then you'll probably see the refresh
call there. But if apply
and commit
depend on common logic, apply
will generally not call refresh
(because it is a preview) and commit
will call it.
Hopefully that clarifies why there are differences! Here, the changes are idempotent and only affect metadata so I think we should use current
instead of refresh
, like the PropertiesUpdate
operation.
e4750b5
to
899e306
Compare
} | ||
|
||
@Override | ||
public List<StatisticsFile> apply() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent for this method is to give a preview of what will be committed for validation, like a "dry run" option. We use it for that purpose in tests. That's why apply
is typically called from commit
and the result is used to update the table metadata.
I think it's a good idea to have the output of apply
be the final list of statistics files. But that leaves a strange case where you wouldn't want to call apply
from commit
because the TableMetadata.Builder
methods are responsible for applying changes. The logic in apply
should be the same as the logic in commit
, though.
I'd solve by adding a common method, internalApply
that returns the TableMetadata
and is called here and by commit
to ensure consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
|
||
Assert.assertSame( | ||
"Base metadata should not change when commit is created", base, table.ops().current()); | ||
Assert.assertEquals("Table should be on version 1", 1, (int) version()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the table version change? It looks correct as long as the assertSame
passes, but it seems strange that the table version was incremented here, but not in the transaction case below. Maybe it's a side effect of the test table operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why. I copied the version-related assertions from TestTransaction
, but the other tests don't do that. I can remove it from here.
BTW if i add the following to TestSortOrder
, it passes for me locally, indicating that no-op change results in a table version bump there too.
@Test
public void testNoopUpdateBumpsVersion() {
PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).withSpecId(5).identity("data").build();
SortOrder order =
SortOrder.builderFor(SCHEMA)
.withOrderId(10)
.asc("s.id", NULLS_LAST)
.desc(truncate("data", 10), NULLS_FIRST)
.build();
TestTables.TestTable table =
TestTables.create(tableDir, "test", SCHEMA, spec, order, formatVersion);
Assert.assertEquals("Table should be on version 0", 0, (int) TestTables.metadataVersion("test"));
table.replaceSortOrder().commit();
Assert.assertEquals("Table should be on version 1", 1, (int) TestTables.metadataVersion("test"));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good other than the separate logic between apply
and commit
. I think refactoring to use a common internalApply
is the only thing blocking this.
899e306
to
d633dc2
Compare
Thank you @rdblue for your review!
done! |
I pushed empty commit to diagnose the build failure that occurred in the previous run https://github.com/apache/iceberg/actions/runs/3126381931/jobs/5071814830 The build is green now (https://github.com/apache/iceberg/actions/runs/3127126451/jobs/5073407795) Empty commits get removed on rebase-and-merge (should someone want to merge this PR), but to avoid confusion I will still remove it from here. |
Implement `Transaction.updateStatistics` API.
67b52d5
to
5211c2f
Compare
Thanks, @findepi! |
Thank you @rdblue for the merge, and all your review comments! |
Implement
Transaction.updateStatistics
API.Extracted from #4741 per #4741 (comment). Probably not testable before that PR is merged.