Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support setting table statistics #5794

Merged
merged 1 commit into from
Sep 26, 2022
Merged

Conversation

findepi
Copy link
Member

@findepi findepi commented Sep 19, 2022

Implement Transaction.updateStatistics API.

Extracted from #4741 per #4741 (comment). Probably not testable before that PR is merged.

@findepi findepi force-pushed the findepi/stats-updates branch from 645b815 to e4750b5 Compare September 20, 2022 07:25
@findepi
Copy link
Member Author

findepi commented Sep 20, 2022

Applied comments, @rdblue please take another look

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment


@Override
public void commit() {
TableMetadata base = ops.current(); // or ops.refresh() ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to use ops.refresh() here as it seems other places that implement PendingUpdate use the same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i am reading implementations of org.apache.iceberg.PendingUpdate#commit correctly

  • BaseUpdatePartitionSpec, PropertiesUpdate, SchemaUpdate, UpdateSnapshotReferencesOperation, SnapshotProducer use current (acquired at the constructor invocation time?)
  • SetLocation, RemoveSnapshots, BaseReplaceSortOrder use refresh

Now, it's maybe apply should call the refresh?

  • SnapshotProducer, SetSnapshotOperation do so
  • UpdateSnapshotReferencesOperation, BaseUpdatePartitionSpec do not seem to do that (and doesn't refresh at commit time)
  • RemoveSnapshots, SchemaUpdate do not do that (maybe they should?), but then call refresh at commit time)
  • BaseReplaceSortOrder, SetLocation do not do that (do not need to), but then call refresh at commit time)

That's as far as the code lecture goes. I am not expert on the lifecycle and concurrency of these entities, so it's hard for me to judge which one to use (hence the code comment). The code frequency stats don't seem to determine that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple questions:

  1. Should a commit call refresh or attempt with the current metadata?
  2. Should refresh be called in apply or in commit?

For 1, what matters is whether the operation can retry and how expensive that retry is.

SchemaUpdate can't retry because it tracks changes based on the current schema. For example, updateColumn("id", LongType.get()) will track a replacement id field with its comment unchanged. If a concurrent update modifies the comment, then committing the changes would inadvertently revert the concurrent change. That's why SchemaUpdate doesn't retry. As a result, current must be used to swap for the state when the update was started.

Operations that support retry typically use refresh instead of current so that changes are based on the latest state of the table. That is basically to prevent wasted work when Iceberg can detect that a retry would be necessary. This is why SnapshotProducer calls refresh.

Many changes that can be retried don't have expensive retries and it's better to avoid calling refresh. PropertiesUpdate, for example, is a metadata only change so it's better to attempt with the current metadata to avoid an extra call to the catalog service.

For 2, it depends on how apply is used. If apply is called by commit then you'll probably see the refresh call there. But if apply and commit depend on common logic, apply will generally not call refresh (because it is a preview) and commit will call it.

Hopefully that clarifies why there are differences! Here, the changes are idempotent and only affect metadata so I think we should use current instead of refresh, like the PropertiesUpdate operation.

@findepi findepi force-pushed the findepi/stats-updates branch from e4750b5 to 899e306 Compare September 21, 2022 10:32
@findepi
Copy link
Member Author

findepi commented Sep 21, 2022

rebased after #5799 merged, no other changes

@rdblue please take a look

}

@Override
public List<StatisticsFile> apply() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent for this method is to give a preview of what will be committed for validation, like a "dry run" option. We use it for that purpose in tests. That's why apply is typically called from commit and the result is used to update the table metadata.

I think it's a good idea to have the output of apply be the final list of statistics files. But that leaves a strange case where you wouldn't want to call apply from commit because the TableMetadata.Builder methods are responsible for applying changes. The logic in apply should be the same as the logic in commit, though.

I'd solve by adding a common method, internalApply that returns the TableMetadata and is called here and by commit to ensure consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!


Assert.assertSame(
"Base metadata should not change when commit is created", base, table.ops().current());
Assert.assertEquals("Table should be on version 1", 1, (int) version());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the table version change? It looks correct as long as the assertSame passes, but it seems strange that the table version was incremented here, but not in the transaction case below. Maybe it's a side effect of the test table operations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why. I copied the version-related assertions from TestTransaction, but the other tests don't do that. I can remove it from here.

BTW if i add the following to TestSortOrder, it passes for me locally, indicating that no-op change results in a table version bump there too.

@Test
public void testNoopUpdateBumpsVersion() {
  PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).withSpecId(5).identity("data").build();
  SortOrder order =
      SortOrder.builderFor(SCHEMA)
          .withOrderId(10)
          .asc("s.id", NULLS_LAST)
          .desc(truncate("data", 10), NULLS_FIRST)
          .build();
  TestTables.TestTable table =
      TestTables.create(tableDir, "test", SCHEMA, spec, order, formatVersion);
  Assert.assertEquals("Table should be on version 0", 0, (int) TestTables.metadataVersion("test"));

  table.replaceSortOrder().commit();

  Assert.assertEquals("Table should be on version 1", 1, (int) TestTables.metadataVersion("test"));
}

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good other than the separate logic between apply and commit. I think refactoring to use a common internalApply is the only thing blocking this.

@findepi findepi force-pushed the findepi/stats-updates branch from 899e306 to d633dc2 Compare September 26, 2022 08:55
@findepi
Copy link
Member Author

findepi commented Sep 26, 2022

Thank you @rdblue for your review!

This looks good other than the separate logic between apply and commit. I think refactoring to use a common internalApply is the only thing blocking this.

done!
please take a look

@findepi
Copy link
Member Author

findepi commented Sep 26, 2022

I pushed empty commit to diagnose the build failure that occurred in the previous run https://github.com/apache/iceberg/actions/runs/3126381931/jobs/5071814830

The build is green now (https://github.com/apache/iceberg/actions/runs/3127126451/jobs/5073407795)

Empty commits get removed on rebase-and-merge (should someone want to merge this PR), but to avoid confusion I will still remove it from here.

Implement `Transaction.updateStatistics` API.
@findepi findepi force-pushed the findepi/stats-updates branch from 67b52d5 to 5211c2f Compare September 26, 2022 14:29
@rdblue rdblue merged commit b9bbcfb into apache:master Sep 26, 2022
@rdblue
Copy link
Contributor

rdblue commented Sep 26, 2022

Thanks, @findepi!

@findepi findepi deleted the findepi/stats-updates branch September 27, 2022 07:25
@findepi
Copy link
Member Author

findepi commented Sep 27, 2022

Thank you @rdblue for the merge, and all your review comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants