Retain table statistics during orphan files removal #5795

Merged (1 commit) on Sep 30, 2022

findepi (Member) commented Sep 19, 2022

Do not delete table statistics files when running remove_orphan_files.

Extracted from #4741 and based on that PR, and also on #5794 and #5799.
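
For readers unfamiliar with the procedure, here is a minimal sketch of the idea, assuming a simplified model in which orphan detection is a set difference between files found in storage and files the table references. The class `OrphanFilesSketch`, the method `findOrphans`, and the file names are hypothetical and are not the code in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of orphan-file detection; not Iceberg's implementation. */
public class OrphanFilesSketch {

  /**
   * Returns the files found in storage that nothing in the table references.
   * Statistics file locations are added to the reachable set, so they are never reported.
   */
  public static Set<String> findOrphans(
      Set<String> filesInStorage,
      List<String> dataAndMetadataFiles,
      List<String> statisticsFiles) {
    Set<String> reachable = new HashSet<>(dataAndMetadataFiles);
    // The behavior this PR describes: statistics files count as reachable.
    reachable.addAll(statisticsFiles);

    // TreeSet is used only to make the result order deterministic.
    Set<String> orphans = new TreeSet<>(filesInStorage);
    orphans.removeAll(reachable);
    return orphans;
  }

  public static void main(String[] args) {
    Set<String> storage = new HashSet<>(Arrays.asList(
        "data/a.parquet", "metadata/v1.metadata.json",
        "metadata/snap-1.stats", "data/stray.parquet"));
    Set<String> orphans = findOrphans(
        storage,
        Arrays.asList("data/a.parquet", "metadata/v1.metadata.json"),
        Arrays.asList("metadata/snap-1.stats"));
    System.out.println(orphans);
  }
}
```

In this model, leaving the statistics locations out of the reachable set would cause the statistics file to be reported as an orphan, which is the failure mode the change avoids.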

findepi force-pushed the findepi/stats-orphans branch from fb784e8 to 658bd27 (September 19, 2022 18:30)
findepi requested a review from rdblue (September 19, 2022 18:31)
findepi force-pushed the findepi/stats-orphans branch 3 times, most recently from 6d4dfc2 to 30a48af (September 21, 2022 10:33)

findepi (Member, Author) commented Sep 21, 2022

rebased after #5799 merged, no other changes

currently, depends on #5794

findepi force-pushed the findepi/stats-orphans branch from 30a48af to 379b9ff (September 27, 2022 11:42)

findepi (Member, Author) commented Sep 27, 2022

Rebased after #5794 was merged.

@rdblue please take a look

findepi requested review from rdblue and removed request for rdblue (September 27, 2022 11:42)
github-actions bot removed the API label (Sep 27, 2022)
ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
puffinWriter.finish();
statisticsFile =
new GenericStatisticsFile(
Contributor

Can PuffinWriter expose toStatisticsFile instead of relying on the caller to do this work?

Member Author

It currently does not have the information needed for the org.apache.iceberg.StatisticsFile#snapshotId field.
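
To illustrate the point, here is a rough sketch under the assumption that a writer can report only a file's location and sizes after it finishes, while the snapshot id must come from the caller. `WrittenFileInfo`, `StatsFileRecord`, and `toStatisticsFile` are hypothetical stand-ins and do not reflect the actual PuffinWriter or GenericStatisticsFile signatures.

```java
/** Hypothetical sketch: the caller supplies the snapshot id that the writer does not know. */
public class StatisticsFileSketch {

  /** What a writer could report once it is finished (assumed fields). */
  public static final class WrittenFileInfo {
    public final String path;
    public final long fileSizeInBytes;
    public final long footerSizeInBytes;

    public WrittenFileInfo(String path, long fileSizeInBytes, long footerSizeInBytes) {
      this.path = path;
      this.fileSizeInBytes = fileSizeInBytes;
      this.footerSizeInBytes = footerSizeInBytes;
    }
  }

  /** Stand-in for a statistics file record that is tied to one snapshot. */
  public static final class StatsFileRecord {
    public final long snapshotId;
    public final String path;
    public final long fileSizeInBytes;
    public final long footerSizeInBytes;

    public StatsFileRecord(
        long snapshotId, String path, long fileSizeInBytes, long footerSizeInBytes) {
      this.snapshotId = snapshotId;
      this.path = path;
      this.fileSizeInBytes = fileSizeInBytes;
      this.footerSizeInBytes = footerSizeInBytes;
    }
  }

  /** Combines writer-side facts with the snapshot id, which only the caller has. */
  public static StatsFileRecord toStatisticsFile(WrittenFileInfo written, long snapshotId) {
    return new StatsFileRecord(
        snapshotId, written.path, written.fileSizeInBytes, written.footerSizeInBytes);
  }

  public static void main(String[] args) {
    WrittenFileInfo written = new WrittenFileInfo("metadata/snap-1.stats", 128L, 32L);
    StatsFileRecord record = toStatisticsFile(written, 42L);
    System.out.println(record.snapshotId + " " + record.path);
  }
}
```

A conversion method on the writer itself would need the snapshot id passed in as an argument, as in this sketch, or supplied when the writer is created.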

*/
public static List<String> statisticsFilesLocations(Table table) {
List<String> statisticsFilesLocations = Lists.newArrayList();
TableOperations ops = ((HasTableOperations) table).operations();
Contributor

Looks like this depends on the public API. Let's get that one in first so we don't need to go around the API here.

Member Author

@rdblue here I am following existing code in the ReachableFileUtil class

TableOperations ops = ((HasTableOperations) table).operations();
TableMetadata tableMetadata = ops.current();
metadataFileLocations.add(tableMetadata.metadataFileLocation());

Since this class already casts to HasTableOperations, I assumed it is OK to do that here, and therefore removed the public API changes from this PR.
If you would rather merge the Table API changes first, #4741 is ready for your review.
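
The two options under discussion can be contrasted in a small sketch. It assumes simplified stand-in interfaces: every type below is hypothetical and only loosely mirrors the names used in this thread, and the `path()` accessor is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins contrasting a cast to operations with a public table accessor. */
public class StatsLocationsSketch {

  public interface StatsFile { String path(); }
  public interface Metadata { List<StatsFile> statisticsFiles(); }
  public interface Operations { Metadata current(); }
  public interface HasOperations { Operations operations(); }
  public interface Table { List<StatsFile> statisticsFiles(); }

  /** Variant A: reach the metadata by casting, which works around the public API. */
  public static List<String> locationsViaOperations(Table table) {
    Metadata metadata = ((HasOperations) table).operations().current();
    return paths(metadata.statisticsFiles());
  }

  /** Variant B: rely only on an accessor exposed by the table itself. */
  public static List<String> locationsViaTable(Table table) {
    return paths(table.statisticsFiles());
  }

  private static List<String> paths(List<StatsFile> files) {
    List<String> locations = new ArrayList<>();
    for (StatsFile file : files) {
      locations.add(file.path());
    }
    return locations;
  }

  /** In-memory table that supports both access paths, used only for the demo. */
  public static final class InMemoryTable implements Table, HasOperations {
    private final List<StatsFile> files = new ArrayList<>();

    public InMemoryTable(String... paths) {
      for (String p : paths) {
        files.add(() -> p); // StatsFile whose path() returns p
      }
    }

    @Override
    public List<StatsFile> statisticsFiles() {
      return files;
    }

    @Override
    public Operations operations() {
      // Operations whose current() yields a Metadata view backed by the same list.
      return () -> () -> files;
    }
  }

  public static void main(String[] args) {
    Table table = new InMemoryTable("metadata/s1.stats", "metadata/s2.stats");
    System.out.println(locationsViaOperations(table));
    System.out.println(locationsViaTable(table));
  }
}
```

Variant A fails with a ClassCastException for any table that does not also expose operations, which is one reason to prefer the accessor once it is available.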

rdblue (Contributor) left a comment

This looks fine, but it requires the public API to return statisticsFiles from Table so we should get that one in first.

findepi (Member, Author) commented Sep 29, 2022

> This looks fine, but it requires the public API to return statisticsFiles from Table so we should get that one in first.

please see my response: #5795 (comment)

findepi force-pushed the findepi/stats-orphans branch from 379b9ff to 8dffa41 (September 29, 2022 09:39)

findepi (Member, Author) commented Sep 29, 2022

Applied or responded to comments.
I didn't do anything with the public Table API yet.
I think, however, it shouldn't be a blocker, since the affected ReachableFileUtil class already depends on non-API information in other methods. We can improve ReachableFileUtil after #4741, or we can land #4741 first and improve here.

@rdblue please take another look.

findepi force-pushed the findepi/stats-orphans branch from 8dffa41 to 623b11a (September 30, 2022 09:49)

findepi (Member, Author) commented Sep 30, 2022

(just rebased after #4741 merged, no changes yet)

findepi force-pushed the findepi/stats-orphans branch from 623b11a to 2c09948 (September 30, 2022 09:50)

findepi (Member, Author) commented Sep 30, 2022

> This looks fine, but it requires the public API to return statisticsFiles from Table so we should get that one in first.

Done now.

@rdblue please take another look

rdblue merged commit b18de17 into apache:master (Sep 30, 2022)

rdblue (Contributor) commented Sep 30, 2022

Thanks, @findepi! I merged this.

findepi deleted the findepi/stats-orphans branch (September 30, 2022 18:00)

findepi (Member, Author) commented Sep 30, 2022

thank you @rdblue for the merge!
