Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add generic Iceberg catalog adapter creation to Java / Python #5754

Merged
merged 18 commits into from
Sep 6, 2024

Conversation

lbooker42
Copy link
Contributor

@lbooker42 lbooker42 commented Jul 10, 2024

Java, connecting to a RESTCatalog using MinIO

import io.deephaven.iceberg.util.*;

properties = new HashMap<>();
properties.put("type", "rest");
properties.put("uri", "http://rest:8181");

properties.put("client.region", "us-east-1");

properties.put("s3.access-key-id", "admin");
properties.put("s3.secret-access-key", "password");
properties.put("s3.endpoint", "http://minio:9000");

adapter = IcebergTools.createAdapter("generic-adapter", properties);

Python, connecting to a RESTCatalog using MinIO

from deephaven.experimental import iceberg

adapter = iceberg.adapter(name="generic-adapter", properties={
    "type" : "rest",
    "uri" : "http://rest:8181",
    "client.region" : "us-east-1",
    "s3.access-key-id" : "admin",
    "s3.secret-access-key" : "password",
    "s3.endpoint" : "http://minio:9000"
});

Java, connecting to AWS Glue

NOTE: credentials set in local environment

import io.deephaven.iceberg.util.*;

properties = new HashMap<>();
properties.put("type", "glue");
properties.put("uri", "s3://lab-warehouse/sales");

adapter = IcebergTools.createAdapter("generic-adapter", properties);

Python, connecting to AWS Glue

NOTE: credentials set in local environment

from deephaven.experimental import iceberg

adapter = iceberg.adapter(name="generic-adapter", properties={
    "type" : "glue",
    "uri" : "s3://lab-warehouse/sales",
    "warehouse" : "s3://lab-warehouse/sales",
});

@lbooker42 lbooker42 self-assigned this Jul 10, 2024
@lbooker42 lbooker42 added this to the 0.36.0 milestone Jul 10, 2024
@lbooker42
Copy link
Contributor Author

We include the support libraries for REST and Glue catalogs but others will need the user to have the libraries in the class path. These are the catalogs that will need additional files:

  • org.apache.iceberg.hive.HiveCatalog
  • org.apache.iceberg.hadoop.HadoopCatalog
  • org.apache.iceberg.nessie.NessieCatalog
  • org.apache.iceberg.jdbc.JdbcCatalog

Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious; how does pyiceberg deduce the catalog in the following example:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "uri": "http://rest:8181/",
        "s3.endpoint": "http://minio:9000/",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)

taxi_dataset = catalog.load_table("default.taxi_dataset").to_arrow()

Maybe we could use the same strategy? (I'm looking into it and will report back.)

@devinrsmith
Copy link
Member

Looks like they infer the type if the uri property is provided. https://github.com/apache/iceberg-python/blob/pyiceberg-0.6.1/pyiceberg/catalog/__init__.py#L155C5-L182

            if uri.startswith("http"):
                return CatalogType.REST
            elif uri.startswith("thrift"):
                return CatalogType.HIVE
            elif uri.startswith(("sqlite", "postgresql")):
                return CatalogType.SQL

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed. Seems like a nice PR, if it works.

Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviwed python

py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to try and use it and report back.

Comment on lines 255 to 258
def adapter(
properties: Dict[str, str],
name: Optional[str] = None
) -> IcebergCatalogAdapter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should compare this directly to https://py.iceberg.apache.org/reference/pyiceberg/catalog/#pyiceberg.catalog.load_catalog, https://github.com/apache/iceberg-python/blob/pyiceberg-0.6.1/pyiceberg/catalog/__init__.py#L185-L200

def load_catalog(name: Optional[str] = None, **properties: Optional[str]) -> Catalog:

Should we call this load_catalog_adapter, load_catalog, or load_adapter instead? Should we match the same sort of signature, ie using **properties?

In addition, I think from python standpoint, we also need to compare to the wider pyiceberg API to compare / contrast, https://py.iceberg.apache.org/api/

I don't think we necessarily need to dive deep into the details today, but we could potentially add a layer of yaml config in the future like pyiceberg does with ~/.pyiceberg.yaml (in which case, the config is looked up by name).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the order of args, we are not able to do adapter("name", { "properties": ... }), which seems like the more natural order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new call is IcebergCatalogAdapter.adapter(), I think that is pretty clear but I'm open to new name.

py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
Comment on lines 69 to 70
throw new IllegalArgumentException(String.format("Catalog type or implementation property '%s' is required",
CatalogProperties.CATALOG_IMPL));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this and it will bring us closer to PyIceberg configuration. My approach was not to repeat all the hand-holding that pyiceberg does, but to mirror the Java Iceberg API which is more barebones.

Not sure which approach is better, but I lean toward less code and more user responsibility (esp. where it matches Iceberg Java API).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with introducing these kinds of helpers, as long as we can be sure that they are:

  1. Unambiguous
  2. Unlikely to break later because of changes to the expected properties

That said, we can merge property-driven catalog support independently and add these kinds of "nice to have" features later; they will not change the interface, as far as I can tell.

…to-create data-instructions from property collection.
@lbooker42 lbooker42 requested a review from devinrsmith July 30, 2024 04:44
@malhotrashivam malhotrashivam self-requested a review August 27, 2024 17:02
@rcaudy rcaudy dismissed their stale review September 6, 2024 16:16

Accidental approve.

@lbooker42 lbooker42 requested a review from chipkent September 6, 2024 17:38
@@ -245,3 +246,73 @@ def adapter_aws_glue(
except Exception as e:
raise DHError(e, "Failed to build Iceberg Catalog Adapter") from e


def adapter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be unit tested yet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like it is not testable yet.

chipkent
chipkent previously approved these changes Sep 6, 2024
rcaudy
rcaudy previously approved these changes Sep 6, 2024
@lbooker42 lbooker42 merged commit 1bb5f09 into deephaven:main Sep 6, 2024
16 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Sep 6, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#305

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants