
feat: Use DH S3Instructions to build Iceberg AWS clients #6113

Merged
merged 23 commits into deephaven:main from dh-iceberg-s3-client
Oct 29, 2024

Conversation

devinrsmith
Member

@devinrsmith devinrsmith commented Sep 23, 2024

This provides a way for users who are responsible for providing AWS / S3 credentials to specify them such that Deephaven can own the S3 client building logic for the Iceberg Catalog, in addition to our own data access layer.

Note, this does not deprecate DataInstructionsProviderPlugin, as there may be cases where the user is not responsible for providing these credentials, and they are instead provided via the catalog after catalog authorization. See #6191

@devinrsmith devinrsmith added this to the 0.37.0 milestone Sep 23, 2024
@devinrsmith devinrsmith self-assigned this Sep 23, 2024
@@ -15,9 +16,11 @@
import static org.apache.iceberg.aws.s3.S3FileIOProperties.ENDPOINT;
import static org.apache.iceberg.aws.s3.S3FileIOProperties.SECRET_ACCESS_KEY;

@Tag("testcontainers")
@Deprecated
Contributor

I don't understand why we had to deprecate these?
They might still give us coverage on a lot of smaller cases that Larry tested.
Is it deprecated in a sense that new tests should not be added?

Member Author

It's mostly that I think we should not add new tests here, and should work toward migrating them as mentioned in IcebergToolsTest. @lbooker42 has so far owned this layer, but it should be relatively easy to migrate to db_resource, and then we get the benefits of:

  1. No separate container (s3) needed (+ no need to upload files)
  2. No need to have a custom catalog (IcebergTestCatalog)
  3. On disk JDBC catalog + on disk warehouse

I see db_resource as mainly a way to test out how well we can interoperate with Iceberg that has been written via different processes (pyiceberg, spark, etc).

For more thorough testing (once we have our own writing support) we should be able to extend SqliteCatalogBase (or create further specialized tests that look similar to it), which can work with any warehouse - currently the same logic is tested out via local disk, minio, and localstack.

Contributor

I agree, this new catalog testing code is the more comprehensive way to write all future tests. Once we migrate these tests so we don't lose any coverage, we should remove them as well as the IcebergTestCatalog class.
cc: @lbooker42

Contributor

I concur.

Comment on lines 146 to 147
CLEANER.register(adapter.catalog(), cleanup);
return adapter;
Contributor

When I had to do something similar for the S3Request objects, Ryan suggested using CleanupReferenceProcessor. You could check that too, to see if it has any advantages.

Member Author

The "register" method on Cleaner is very nice. I tried to add a similarly helpful register method to CleanupReferenceProcessor (essentially, creating a reference behind the scenes that ties an object to a cleanup action), and it almost worked... for some reason, though, the caller needs to explicitly retain the returned reference, whereas the same limitation does not apply to Cleaner. It could be that I'm missing some subtle aspect of the Reference machinery; I'll have a conversation with Ryan.
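As background on the difference being discussed: java.lang.ref.Cleaner internally retains each Cleanable it hands out, which is why the caller does not have to keep a reference alive for the cleanup to eventually run; a raw PhantomReference, by contrast, must itself stay strongly reachable. A minimal sketch of the Cleaner registration pattern (the class and variable names here are illustrative, not code from this PR):

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanerDemo {
    // A Cleaner owns a daemon thread and retains registered Cleanables itself.
    private static final Cleaner CLEANER = Cleaner.create();

    public static void main(String[] args) {
        AtomicBoolean cleaned = new AtomicBoolean(false);
        Object resource = new Object();
        // Tie a cleanup action to the reachability of `resource`. Note the
        // action must not capture `resource`, or the object would never
        // become phantom reachable.
        Cleaner.Cleanable cleanable = CLEANER.register(resource, () -> cleaned.set(true));
        // clean() runs the action at most once, explicitly and deterministically,
        // without waiting for GC.
        cleanable.clean();
        if (!cleaned.get()) {
            throw new AssertionError("cleanup action did not run");
        }
        System.out.println("cleaned=" + cleaned.get());
    }
}
```

Calling clean() a second time is a no-op, which makes the pattern safe for both explicit shutdown paths and GC-driven cleanup.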

Contributor

Cool.

Member Author

I've got a separate PR that adds similar functionality to CleanupReferenceProcessor; #6213

public static IcebergCatalogAdapter createAdapter(
@Nullable final String name,
@NotNull final Map<String, String> properties,
@NotNull final Map<String, String> hadoopConfig,
Contributor

Is hadoopConfig used in an S3-backed adapter with S3 file IO? Can we drop this as a parameter and create an empty config internal to this function?

Member Author

It's a good question; org.apache.iceberg.aws.s3.S3FileIO does not use hadoop conf. That said, this method is not imposing S3FileIO. It looks like GlueCatalog is technically written in a way where hadoopConf can be passed along if something besides S3FileIO is used... maybe it's possible to use GlueCatalog and not use S3 as the warehouse... for example, maybe some sort of other AWS NFS storage, I'm not sure. I'm going to add more explicit documentation about this.
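To make the S3FileIO-vs-Hadoop distinction concrete, here is a sketch of the kind of catalog properties involved. The property keys ("catalog-impl", "io-impl", "warehouse") and the class names are real Iceberg identifiers; the bucket path is a placeholder:

```java
import java.util.HashMap;
import java.util.Map;

public class CatalogPropsSketch {
    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        // A hypothetical Glue-backed catalog configuration. These class names
        // are real Iceberg classes; the warehouse location is a placeholder.
        properties.put("catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog");
        properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
        properties.put("warehouse", "s3://my-bucket/warehouse");

        // With io-impl set to S3FileIO, the Hadoop Configuration goes unused;
        // an empty config would suffice for this combination.
        Map<String, String> hadoopConfig = new HashMap<>();

        System.out.println("io-impl=" + properties.get("io-impl")
                + ", hadoopConfig entries=" + hadoopConfig.size());
    }
}
```

The Hadoop config only matters when a Hadoop-backed FileIO is plugged in instead of S3FileIO, which is the case the comment above is hedging against.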


malhotrashivam previously approved these changes Oct 15, 2024
@devinrsmith devinrsmith enabled auto-merge (squash) October 18, 2024 19:57
Comment on lines 274 to 278
In cases where the caller prefers to use Iceberg's AWS properties, the parity of construction logic will be limited
to what Deephaven is able to infer; in advanced cases, it's possible that there will be a difference in construction
logic between the Iceberg-managed and Deephaven-managed AWS clients which manifests itself as being able to browse
Catalog metadata, but not retrieve Table data.
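One hypothetical way the divergence can show up: the Iceberg-managed client resolves one endpoint while the Deephaven-managed client is configured with another. The sketch below is purely illustrative; "s3.endpoint" is a real Iceberg S3FileIO property key, and both endpoint values are placeholders:

```java
import java.util.Map;

public class EndpointParitySketch {
    public static void main(String[] args) {
        // Iceberg-managed client: configured through Iceberg catalog properties.
        Map<String, String> icebergProps =
                Map.of("s3.endpoint", "https://s3.us-east-1.amazonaws.com");
        // Deephaven-managed client: configured through data instructions.
        // This placeholder shows a mismatched endpoint override.
        String deephavenEndpoint = "http://localhost:9000";

        // When the two disagree, catalog metadata (read via the Iceberg-managed
        // client) may resolve fine while table data reads (via the
        // Deephaven-managed client) fail against the wrong endpoint.
        boolean parity = icebergProps.get("s3.endpoint").equals(deephavenEndpoint);
        System.out.println("clients in parity: " + parity);
    }
}
```

In that state, listing namespaces and tables can succeed while reading the underlying data files fails, since the two operations go through differently constructed clients.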

Contributor

A little unclear without an example. Also it might be nice to point out a possible solution for the problem of not being able to read table data.

Member Author

The problem is that it's a very open-ended thing, and it's tough to provide a generalizable example.

Member

@rcaudy rcaudy left a comment

I'm approving, but since auto-merge is enabled, I'm leaving this for a subsequent approval.

@devinrsmith devinrsmith requested a review from chipkent October 29, 2024 19:29
@devinrsmith devinrsmith requested a review from chipkent October 29, 2024 21:44
Member

@chipkent chipkent left a comment

Python LGTM

@devinrsmith devinrsmith merged commit 59f226d into deephaven:main Oct 29, 2024
16 checks passed
@devinrsmith devinrsmith deleted the dh-iceberg-s3-client branch October 29, 2024 22:15
@github-actions github-actions bot locked and limited conversation to collaborators Oct 29, 2024
@deephaven-internal
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#349
