Core: Support committing delete files with multiple specs #2985

Merged
merged 2 commits into from
Sep 21, 2021

aokolnychyi
Contributor

This PR enables committing delete files that belong to different specs in a single operation. Previously, we only supported row deltas where all delete and data files were part of the same spec.

@github-actions github-actions bot added the core label Aug 17, 2021

    protected PartitionSpec writeSpec() {
    Preconditions.checkState(spec != null,
    "Cannot determine partition spec: no data or delete files have been added");
    protected PartitionSpec dataSpec() {

Contributor Author

Renaming this does require touching more places, but I think keeping it as writeSpec would be confusing.

    PartitionSpec fileSpec = ops.current().spec(file.specId());
    List<DeleteFile> deleteFiles = newDeleteFiles.computeIfAbsent(file.specId(), specId -> Lists.newArrayList());
    deleteFiles.add(file);
    addedFilesSummary.addedFile(fileSpec, file);

Contributor Author

The file spec is only used for partition summaries. I added a test that shows it works as expected.
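
For readers following the thread, here is a minimal sketch of the grouping idea in the snippet above: delete files are bucketed by spec ID with `computeIfAbsent`, so one operation can carry files that belong to several specs. `DeleteFilesBySpecSketch` and `SketchDeleteFile` are hypothetical stand-ins invented for this illustration, not Iceberg classes, and plain `java.util` collections replace the `Lists.newArrayList()` helper shown in the diff.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch class: groups delete files by the ID of the spec they were written with.
class DeleteFilesBySpecSketch {
  private final Map<Integer, List<SketchDeleteFile>> deleteFilesBySpec = new HashMap<>();

  void add(SketchDeleteFile file) {
    Objects.requireNonNull(file, "Invalid delete file: null");
    // Create the per-spec list lazily the first time a spec ID is seen.
    deleteFilesBySpec.computeIfAbsent(file.specId, specId -> new ArrayList<>()).add(file);
  }

  Map<Integer, List<SketchDeleteFile>> filesBySpec() {
    return deleteFilesBySpec;
  }

  public static void main(String[] args) {
    DeleteFilesBySpecSketch sketch = new DeleteFilesBySpecSketch();
    sketch.add(new SketchDeleteFile("deletes-a.parquet", 0));
    sketch.add(new SketchDeleteFile("deletes-b.parquet", 1));
    sketch.add(new SketchDeleteFile("deletes-c.parquet", 0));
    // Print how many delete files were collected for each spec ID.
    sketch.filesBySpec().forEach((specId, files) ->
        System.out.println("spec " + specId + ": " + files.size() + " delete file(s)"));
  }
}

// Hypothetical stand-in for a delete file; it carries only the fields this sketch needs.
final class SketchDeleteFile {
  final String path;
  final int specId;

  SketchDeleteFile(String path, int specId) {
    this.path = path;
    this.specId = specId;
  }
}
```

Keying the map by spec ID means each group can later be written out with its own spec, which is the property the PR description relies on.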

Collaborator

@szehon-ho szehon-ho left a comment

Took a look, and the change looks good to me.

    addedFilesSummary.addedFile(writeSpec(), file);
    Preconditions.checkNotNull(file, "Invalid delete file: null");
    PartitionSpec fileSpec = ops.current().spec(file.specId());
    List<DeleteFile> deleteFiles = newDeleteFiles.computeIfAbsent(file.specId(), specId -> Lists.newArrayList());

Member

Not a big deal, but for me this would be easier to understand if it was deleteFilesForSpec

Contributor Author

Sure, I'll update that. You refer to the map name, right?

Contributor Author

Updated the map name.

    }
    this.cachedNewDeleteManifests.clear();

Member

Do we need the explicit clear here? Are we just trying to free it up for GC early?

Member

@RussellSpitzer RussellSpitzer Aug 20, 2021

This logic seems a little different than it was previously?

Before

if committed doesn't contain cachedNewDeleteManifest
  deleteFile()
  clear cachedNewDeleteManifest
for any cachedNewDeleteManifest
  if committed doesn't contain cachedNewDeleteManifest
     deleteFile
clear all cachedDeleteManifests

I'm still trying to understand the check here but it seems like we will clear out all manifests even if some of them are committed?

Seems like the equivalent would be something like

    for (ManifestFile cachedNewDeleteManifest : cachedNewDeleteManifests) {
      if (!committed.contains(cachedNewDeleteManifest)) {
        deleteFile(cachedNewDeleteManifest.path());
        this.cachedNewDeleteManifests.remove(cachedNewDeleteManifest); // This modifies the list while iterating over it, but you get the idea
      }
    }

Contributor Author

You are right, I'll update this place.

Contributor Author

Updated to use LinkedList and listIterator.
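
As an illustration of that approach, the sketch below removes uncommitted entries while iterating, using `ListIterator.remove()` so the list is never modified behind the iterator's back. It is a simplified stand-in and not the code merged in this PR: manifests are represented by their paths as plain strings, and `UncommittedCleanupSketch`, `cleanUncommitted`, and the `deleteFile` callback are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch class: drops cached manifests that did not make it into the commit.
class UncommittedCleanupSketch {

  static void cleanUncommitted(
      List<String> cachedManifestPaths, Set<String> committed, Consumer<String> deleteFile) {
    ListIterator<String> iter = cachedManifestPaths.listIterator();
    while (iter.hasNext()) {
      String path = iter.next();
      if (!committed.contains(path)) {
        deleteFile.accept(path); // delete the orphaned manifest file
        iter.remove();           // safe removal of the element just returned by next()
      }
    }
  }

  public static void main(String[] args) {
    List<String> cached = new LinkedList<>(Arrays.asList("m1.avro", "m2.avro", "m3.avro"));
    Set<String> committed = new HashSet<>(Arrays.asList("m2.avro"));
    List<String> deleted = new ArrayList<>();

    cleanUncommitted(cached, committed, deleted::add);

    System.out.println("still cached: " + cached);
    System.out.println("deleted: " + deleted);
  }
}
```

Committed manifests stay in the cache, which addresses the concern that the earlier version cleared every cached manifest regardless of commit status.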

@aokolnychyi
Contributor Author

This one is ready for another review round.

    for (ManifestFile cachedNewDeleteManifest : cachedNewDeleteManifests) {
      deleteFile(cachedNewDeleteManifest.path());
    }
    cachedNewDeleteManifests.clear();

Contributor

Minor: this will rewrite all delete manifests even if there is only one new delete file. I think it's fine to simplify it right now since we don't expect this case very often. But it would be good to note that this is something we can improve in a comment.

Contributor Author

Added a comment. I think this will be rare enough in the real world, so it should be fine to optimize later.
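
To make the trade-off in this exchange concrete, here is a rough sketch of a cache that is invalidated as a whole: once any new delete file arrives, every cached manifest is discarded and one manifest per spec is written again, even for specs that did not change. This is an assumption-laden simplification and not the PR's implementation. `ManifestCacheSketch`, its methods, and the generated manifest names are all hypothetical, and manifests are modeled as strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch class: a manifest cache that is rebuilt in full on any change.
class ManifestCacheSketch {
  private final Map<Integer, List<String>> deleteFilesBySpec = new TreeMap<>();
  private final List<String> cachedManifests = new ArrayList<>();
  private boolean hasNewDeleteFiles = false;
  private int manifestsWritten = 0;

  void addDeleteFile(int specId, String path) {
    deleteFilesBySpec.computeIfAbsent(specId, id -> new ArrayList<>()).add(path);
    hasNewDeleteFiles = true;
  }

  List<String> manifests() {
    if (hasNewDeleteFiles) {
      // Simplification: drop everything, even manifests for specs that did not change.
      // A real implementation would also delete the discarded manifest files.
      cachedManifests.clear();
      hasNewDeleteFiles = false;
    }
    if (cachedManifests.isEmpty()) {
      // Write one manifest per spec; the counter records how many writes happened in total.
      for (Integer specId : deleteFilesBySpec.keySet()) {
        cachedManifests.add("manifest-spec-" + specId + "-" + manifestsWritten);
        manifestsWritten++;
      }
    }
    return new ArrayList<>(cachedManifests);
  }

  int manifestsWritten() {
    return manifestsWritten;
  }

  public static void main(String[] args) {
    ManifestCacheSketch cache = new ManifestCacheSketch();
    cache.addDeleteFile(0, "deletes-a.parquet");
    cache.addDeleteFile(1, "deletes-b.parquet");
    cache.manifests();
    cache.addDeleteFile(1, "deletes-c.parquet"); // only spec 1 changes
    cache.manifests();
    System.out.println("manifests written in total: " + cache.manifestsWritten());
  }
}
```

A finer-grained version could track which spec IDs changed and rewrite only those manifests, which is the kind of later optimization the comment refers to.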

@rdblue
Contributor

rdblue commented Sep 19, 2021

@aokolnychyi, this looks good to me. I had a couple of minor comments, but merge when you're ready.

@aokolnychyi aokolnychyi merged commit e743063 into apache:master Sep 21, 2021
@aokolnychyi
Contributor Author

Thanks for reviewing, @szehon-ho @rdblue @RussellSpitzer!
