Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Use DH S3Instructions to build Iceberg AWS clients #6113
feat: Use DH S3Instructions to build Iceberg AWS clients #6113
Changes from 15 commits
8066110
4dc52bd
9c13e93
de78d8a
28f7643
c8ecd2c
3804ef9
385ac25
b909ff3
a4d2b5a
7da2369
a6f51a4
14aa4fb
b62debc
11ba983
6d0bbad
1340833
790f086
57b98a4
e3a9c65
3e78ce8
bb7e239
098388f
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
hadoopConfig
used in an S3-backed adapter with S3 file IO? Can we drop this as a parameter and create an empty config internal to this function?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question;
org.apache.iceberg.aws.s3.S3FileIO
does not use hadoop conf. That said, this method is not imposingS3FileIO
. It looks like GlueCatalog is technically written in a way where hadoopConf can be passed along if something besides S3FileIO is used... maybe it's possible to use GlueCatalog and not use S3 as the warehouse... for example, maybe some sort of other AWS NFS storage, I'm not sure. I'm going to add more explicit documentation about this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I had to do something similar for the
S3Request
objects, Ryan suggested usingCleanupReferenceProcessor
. You can check that too, if that has any advantages.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "register" method on Cleaner is very nice. I tried to add a similarly helpful register method to CleanupReferenceProcessor (essentially, creating a reference behind the scenes that ties an object and a cleanup action), and it almost worked... for some reason though, the caller needs to explicitly retain the returned reference, whereas the same limitation does not apply to Cleaner. It could be I was missing some subtle aspect of the Reference stuff - will have convo w/ Ryan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a separate PR that adds similar functionality to CleanupReferenceProcessor; #6213
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we had to deprecate these?
They might still give us coverage on a lot of smaller cases that Larry tested.
Is it deprecated in a sense that new tests should not be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mostly that I think we should not add new tests here, and work to migrating them as mentioned in
IcebergToolsTest
. @lbooker42 has so far owned this layer, but it should be relatively easy to migrate to db_resource, and then we get the benefit of:I see db_resource as mainly a way to test out how well we can interoperate with Iceberg that has been written via different processes (pyiceberg, spark, etc).
For more thorough testing (once we have our own writing support) we should be able to extend
SqliteCatalogBase
(or create further specialized tests that look similar to it), which can work with any warehouse - currently the same logic is tested out via local disk, minio, and localstack.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, these new catalog testing code is the more comprehensive way for all future tests. Once we can migrate these tests so we don't lose any coverage, we should remove these as well as the IcebergTestCatalog class.
cc: @lbooker42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur.