feat: Use DH S3Instructions to build Iceberg AWS clients #6113
@@ -15,9 +16,11 @@
import static org.apache.iceberg.aws.s3.S3FileIOProperties.ENDPOINT;
import static org.apache.iceberg.aws.s3.S3FileIOProperties.SECRET_ACCESS_KEY;

@Tag("testcontainers")
@Deprecated
I don't understand why we had to deprecate these?
They might still give us coverage on a lot of smaller cases that Larry tested.
Is it deprecated in a sense that new tests should not be added?
It's mostly that I think we should not add new tests here, and should work toward migrating them as mentioned in IcebergToolsTest. @lbooker42 has so far owned this layer, but it should be relatively easy to migrate to db_resource, and then we get the benefit of:
- No separate container (s3) needed (+ no need to upload files)
- No need to have the custom catalog IcebergTestCatalog
- On disk JDBC catalog + on disk warehouse
I see db_resource as mainly a way to test out how well we can interoperate with Iceberg that has been written via different processes (pyiceberg, spark, etc).
For more thorough testing (once we have our own writing support) we should be able to extend SqliteCatalogBase (or create further specialized tests that look similar to it), which can work with any warehouse - currently the same logic is tested out via local disk, minio, and localstack.
I agree, this new catalog testing code is the more comprehensive way for all future tests. Once we can migrate these tests so we don't lose any coverage, we should remove these as well as the IcebergTestCatalog class.
cc: @lbooker42
I concur.
CLEANER.register(adapter.catalog(), cleanup);
return adapter;
When I had to do something similar for the S3Request objects, Ryan suggested using CleanupReferenceProcessor. You can check that too, if that has any advantages.
The "register" method on Cleaner is very nice. I tried to add a similarly helpful register method to CleanupReferenceProcessor (essentially, creating a reference behind the scenes that ties an object and a cleanup action), and it almost worked... for some reason though, the caller needs to explicitly retain the returned reference, whereas the same limitation does not apply to Cleaner. It could be I was missing some subtle aspect of the Reference stuff - will have convo w/ Ryan.
Cool.
I've got a separate PR that adds similar functionality to CleanupReferenceProcessor; #6213
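For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of `java.lang.ref.Cleaner.register`. It is not the code from this PR: `Resource` and `Owner` are hypothetical stand-ins for the AWS clients and the catalog. One detail that may be relevant to the difference noted above is that the JDK's `Cleaner` keeps its registered `Cleanable` reachable internally, so the caller does not have to retain the returned handle.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

final class CleanerSketch {
    // A single shared Cleaner; it runs cleanup actions on its own daemon thread.
    private static final Cleaner CLEANER = Cleaner.create();

    /** Hypothetical stand-in for a resource that must be released (e.g. a client). */
    static final class Resource implements AutoCloseable {
        final AtomicInteger closeCount = new AtomicInteger();

        @Override
        public void close() {
            closeCount.incrementAndGet();
        }
    }

    /** Hypothetical stand-in for the object handed to the caller (e.g. a catalog). */
    static final class Owner {
        final Resource resource;

        Owner(Resource resource) {
            this.resource = resource;
        }
    }

    /**
     * Ties the resource's cleanup to the owner's reachability. The cleanup action
     * captures only the resource, never the owner; capturing the owner would keep
     * it strongly reachable and the action would never run via garbage collection.
     */
    static Cleaner.Cleanable track(Owner owner) {
        final Resource resource = owner.resource;
        return CLEANER.register(owner, resource::close);
    }
}
```

The returned `Cleanable` can also be invoked explicitly via `clean()`, which runs the action at most once; otherwise the action runs some time after the owner becomes phantom reachable.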
public static IcebergCatalogAdapter createAdapter(
@Nullable final String name,
@NotNull final Map<String, String> properties,
@NotNull final Map<String, String> hadoopConfig,
Is hadoopConfig used in an S3-backed adapter with S3 file IO? Can we drop this as a parameter and create an empty config internal to this function?
It's a good question; org.apache.iceberg.aws.s3.S3FileIO does not use hadoop conf. That said, this method is not imposing S3FileIO. It looks like GlueCatalog is technically written in a way where hadoopConf can be passed along if something besides S3FileIO is used... maybe it's possible to use GlueCatalog and not use S3 as the warehouse... for example, maybe some sort of other AWS NFS storage, I'm not sure. I'm going to add more explicit documentation about this.
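One way to reconcile the two positions in this thread would be to keep the three-argument form and add a convenience overload that supplies an empty Hadoop config. The sketch below only illustrates that shape and is not what the PR does: `Adapter` is a hypothetical stand-in for IcebergCatalogAdapter, and the two-argument overload is an assumption.

```java
import java.util.Map;
import java.util.Objects;

final class AdapterFactorySketch {
    /** Hypothetical stand-in for IcebergCatalogAdapter; records its inputs. */
    static final class Adapter {
        final String name;
        final Map<String, String> properties;
        final Map<String, String> hadoopConfig;

        Adapter(String name, Map<String, String> properties, Map<String, String> hadoopConfig) {
            this.name = name;
            this.properties = properties;
            this.hadoopConfig = hadoopConfig;
        }
    }

    /** Full form: the caller may pass a Hadoop config for FileIO implementations that use one. */
    static Adapter createAdapter(
            String name, // may be null, as in the hunk above
            Map<String, String> properties,
            Map<String, String> hadoopConfig) {
        Objects.requireNonNull(properties, "properties");
        Objects.requireNonNull(hadoopConfig, "hadoopConfig");
        return new Adapter(name, Map.copyOf(properties), Map.copyOf(hadoopConfig));
    }

    /** Hypothetical overload for the common case where no Hadoop config is needed. */
    static Adapter createAdapter(String name, Map<String, String> properties) {
        return createAdapter(name, properties, Map.of());
    }
}
```

Callers using S3FileIO would use the shorter form, while the longer form stays available for catalogs configured with a different FileIO.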
In cases where the caller prefers to use Iceberg's AWS properties, the parity of construction logic will be limited
to what Deephaven is able to infer; in advanced cases, it's possible that there will be a difference in construction
logic between the Iceberg-managed and Deephaven-managed AWS clients which manifests itself as being able to browse
Catalog metadata, but not retrieve Table data.
A little unclear without an example. Also it might be nice to point out a possible solution for the problem of not being able to read table data.
The problem is that it's very open-ended, and it's tough to provide a generalizable example.
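As a narrow illustration of what "using Iceberg's AWS properties" looks like on the caller's side, the sketch below builds a catalog properties map using Iceberg's property keys. The key strings are written out so the snippet stands alone; the comments name the Iceberg constants they correspond to (two of which appear in the imports earlier in this conversation). The helper name and all values are placeholders, and the sketch says nothing about which settings Deephaven can or cannot infer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IcebergAwsPropertiesSketch {
    /**
     * Builds Iceberg catalog properties that configure the Iceberg-managed AWS
     * clients directly. A Deephaven-managed client would have to be configured
     * equivalently to reach the same table data.
     */
    static Map<String, String> icebergAwsProperties(
            String endpoint, String region, String accessKeyId, String secretAccessKey) {
        final Map<String, String> props = new LinkedHashMap<>();
        props.put("s3.endpoint", endpoint);                 // S3FileIOProperties.ENDPOINT
        props.put("s3.access-key-id", accessKeyId);         // S3FileIOProperties.ACCESS_KEY_ID
        props.put("s3.secret-access-key", secretAccessKey); // S3FileIOProperties.SECRET_ACCESS_KEY
        props.put("client.region", region);                 // AwsClientProperties.CLIENT_REGION
        return props;
    }
}
```

If catalog browsing works but table reads fail, comparing a map like this against the settings given to the data access layer is one place to start looking for a mismatch.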
I'm approving, but since auto-merge is enabled leaving this for a subsequent approval.
Python LGTM
Labels indicate documentation is required. Issues for documentation have been opened: Community: deephaven/deephaven-docs-community#349
This provides a way for users who are responsible for providing AWS / S3 credentials to specify them in a way where Deephaven can own the S3 client building logic for the Iceberg Catalog in addition to our own data access layer.
Note, this does not deprecate DataInstructionsProviderPlugin, as there may be cases where the user is not responsible for providing these credentials, and they are instead provided via the catalog after catalog authorization. See #6191