Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Iceberg catalog warehouse property in REST #15762

Merged
merged 2 commits into from
Feb 21, 2023

Conversation

danielcweeks
Copy link
Contributor

Description

This PR adds support for configuring the warehouse catalog property for the REST Catalog which is part of the REST Spec.

Additional context and related issues

N/A

Release notes

(X) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Jan 18, 2023
@github-actions github-actions bot added the docs label Jan 18, 2023
@krvikash
Copy link
Contributor

Hi @danielcweeks, Could you check the CI failure?

@findepi
Copy link
Member

findepi commented Jan 19, 2023

LGTM

@nastra nastra force-pushed the iceberg-catalog-warehouse-property branch from c4cdc9e to 8fa7519 Compare January 19, 2023 10:52
@nastra
Copy link
Contributor

nastra commented Jan 19, 2023

I've fixed CI failures and tests should pass now

Copy link
Contributor

@krvikash krvikash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. CC: @colebow for the doc review.

@@ -95,6 +95,9 @@ Property Name Description
``iceberg.rest-catalog.uri`` REST server API endpoint URI (required).
Example: ``http://iceberg-with-rest:8181``

``iceberg.rest-catalog.warehouse`` Warehouse identifier/location for the catalog (optional).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in JDBC catalog we call it iceberg.jdbc-catalog.default-warehouse-dir
with description "The default warehouse directory to use for JDBC"

the "identifier" is misleading
the config name is unnecessarily different

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosabua do we write "(optional)" for optional properties, or just for required ones (as above), or ?

Comment on lines +92 to +96
@Config("iceberg.rest-catalog.warehouse")
@ConfigDescription("The warehouse location/identifier to use with the REST catalog server")
public IcebergRestCatalogConfig setWarehouse(String warehouse)
{
this.warehouse = Optional.ofNullable(warehouse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's sync this with io.trino.plugin.iceberg.catalog.jdbc.IcebergJdbcCatalogConfig#setDefaultWarehouseDir

(config name, description, property name, field name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment below, the REST catalog makes this a little more flexible and doesn't necessarily require an actual directory (it could just be a name that the catalog maps to a directory).

@@ -38,6 +39,7 @@
private final CatalogName catalogName;
private final String trinoVersion;
private final URI serverUri;
private final Optional<String> warehouse;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultWarehouseDir ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi this isn't exactly the same thing as with JDBC since the warehouse property isn't required to be a directory (it could just be an identifier that may map to a location/bucket/etc.). The property itself is really just 'warehouse' and we probably don't want to restrict that here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is warehouse identifier?

are there any docs for that that we should reference from our docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An identifier could be anything that the REST Server would use to map to the warehouse being addressed. This could be as simple as something like prod which the server then would map to whatever the canonical location is for prod. For the docs, we could just reference the spec, but there's no strict requirement on format or how a server uses the value.

A good example of why this is useful is if you have something like the JDBCCatalog where you have to explicitly define the location in the config, but if something like Flink is also using the same catalog backend, but unintentionally points to a different location, you have a single catalog with data that's split across multiple locations. This can even result in individual tables having data end up split across multiple locations depending on writes/overwrites.

@findepi findepi requested a review from ebyhr January 19, 2023 12:34
@@ -71,6 +74,7 @@ public synchronized TrinoCatalog create(ConnectorIdentity identity)
if (icebergCatalog == null) {
ImmutableMap.Builder<String, String> properties = ImmutableMap.builder();
properties.put(CatalogProperties.URI, serverUri.toString());
warehouse.ifPresent(location -> properties.put(CatalogProperties.WAREHOUSE_LOCATION, location));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielcweeks I set random characters with iceberg.rest-catalog.warehouse in IcebergRestQueryRunnerMain, but I didn't find any change. I expected the REST catalog throws an error or log something because the behind JDBC catalogs uses the different warehouse catalog property. Can you share how REST catalog behaves in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@ebyhr ebyhr Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but the link doesn't answer the above question in my opinion. What's the expected behavior in the above random warehouse case?

Copy link
Contributor Author

@danielcweeks danielcweeks Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr I tried to explain in this comment, but this can't really be compared to how Hive works with locations and the expectation is not that the value should necessarily dictate anything about the path.

Because Iceberg has multiple implementations for Catalog, all with different behaviors, the warehouse property is just a standard mechanism by which to indicate intent to the implementation. For simple implementations, like Hadoop or Hive Catalog, that could be just a physical location to store data. However, it is more of an abstraction that can be used by the implementation.

For example, the catalog service at Netflix has multiple catalogs (prod, test, shadow, etc.), but there's just a single service endpoint (say metacat.data.netflix.net). To address each of those independently, you need to either have different domains for each or expose different ports to the same service. With RESTCatalog implementation, you're just providing a request to address a specific one without having to change the surrounding infrastructure (and something like prod is a lot easier to reason about than port 337033).

The REST spec defines this as an option, but does not dictate how it is handled (and ignoring it is also an option, since that is a catalog behavior).

We could write tests, but at that point we'd be writing both the test and the expected behavior, so the test would largely be meaningless.

@nastra nastra closed this Jan 20, 2023
@nastra nastra reopened this Jan 20, 2023
@findepi
Copy link
Member

findepi commented Jan 20, 2023

image

@nastra FYI i push empty commit called empty when i want a CI re-run.

@github-actions
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 20, 2023
@danielcweeks
Copy link
Contributor Author

@findepi or @ebyhr is there anything blocking to get this in?

@bitsondatadev
Copy link
Member

@findepi is currently on vacation, @ebyhr any thoughts on this being merge ready?

@github-actions github-actions bot removed the stale label Feb 21, 2023
@ebyhr ebyhr merged commit fba39d6 into trinodb:master Feb 21, 2023
@github-actions github-actions bot added this to the 408 milestone Feb 22, 2023
@danielcweeks danielcweeks deleted the iceberg-catalog-warehouse-property branch February 24, 2023 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants