[Feature Request]Support jdbc catalog #1459

Closed
melin opened this issue Oct 26, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@melin

melin commented Oct 26, 2022

Reference: the Iceberg JDBC catalog: https://iceberg.apache.org/docs/latest/jdbc/

Store metadata directly in a relational database, independent of Hive Metastore (HMS). It could also be customized based on the JDBC catalog.
@zsxwing

melin added the enhancement label Oct 26, 2022
@zsxwing
Member

zsxwing commented Oct 27, 2022

Delta by design stores its metadata on the storage. Could you explain why you want to move the metadata to a relational database?

@melin
Author

melin commented Oct 27, 2022

> Delta by design stores its metadata on the storage. Could you explain why you want to move the metadata to a relational database?

Only the table name and location are stored in the relational database; all other metadata stays on the storage system.
Iceberg's implementation: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java

Iceberg also lets users plug in a custom catalog, such as the JDBC catalog, so that table names can be stored in a catalog system of their choice.
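
For readers following along, here is a minimal sketch of the idea in this comment: a relational table that maps a table name to its storage location and nothing else. It is an illustration under stated assumptions, not Iceberg's actual JDBC catalog schema and not anything Delta ships. The `table_registry` table, the function names, and the example location are all hypothetical, and SQLite stands in for whatever database would be reached over JDBC.

```python
import sqlite3

# Hypothetical schema for illustration only; not Iceberg's JDBC catalog tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS table_registry (
    namespace  TEXT NOT NULL,
    table_name TEXT NOT NULL,
    location   TEXT NOT NULL,
    PRIMARY KEY (namespace, table_name)
)
"""


def open_registry(database=":memory:"):
    """Open (or create) the registry database and make sure the table exists."""
    conn = sqlite3.connect(database)
    conn.execute(SCHEMA)
    return conn


def register_table(conn, namespace, table_name, location):
    """Record where a table lives. Raises sqlite3.IntegrityError on a duplicate name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO table_registry (namespace, table_name, location) "
            "VALUES (?, ?, ?)",
            (namespace, table_name, location),
        )


def lookup_location(conn, namespace, table_name):
    """Return the stored location for a table, or None if it is not registered."""
    row = conn.execute(
        "SELECT location FROM table_registry "
        "WHERE namespace = ? AND table_name = ?",
        (namespace, table_name),
    ).fetchone()
    return row[0] if row else None


conn = open_registry()
register_table(conn, "sales", "orders", "s3a://example-bucket/sales/orders")
print(lookup_location(conn, "sales", "orders"))
```

The registry holds only the name and the location; everything else about the table (schema, files, versions) would still be read from the table's own log at that location.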

@zsxwing
Member

zsxwing commented Oct 27, 2022

Adding a new catalog implementation is not a mission of Delta Lake. We would like to focus on the storage format and integrate with popular catalog systems (such as Hive Metastore, AWS Glue) instead.

@melin
Author

melin commented Oct 27, 2022

> Adding a new catalog implementation is not a mission of Delta Lake. We would like to focus on the storage format and integrate with popular catalog systems (such as Hive Metastore, AWS Glue) instead.

Delta does not provide a concrete implementation, but provides an interface that the user can customize.

@zsxwing
Member

zsxwing commented Oct 27, 2022

> Delta does not provide a concrete implementation, but provides an interface that the user can customize.

I think Spark has already provided an interface for custom catalog implementation.

I'm not super familiar with Iceberg. But I think Iceberg introduced this because the catalog is a fundamental concept in Iceberg and Iceberg is heavily coupled with it. Delta has a different design principle: it is decoupled from the catalog. Adding a catalog interface to Delta would break our design principle.

@melin
Author

melin commented Oct 27, 2022

> > Delta does not provide a concrete implementation, but provides an interface that the user can customize.
>
> I think Spark has already provided an interface for custom catalog implementation.
>
> I'm not super familiar with Iceberg. But I think Iceberg introduced this because the catalog is a fundamental concept in Iceberg and Iceberg is heavily coupled with it. Delta has a different design principle: it is decoupled from the catalog. Adding a catalog interface to Delta would break our design principle.

Consider a scenario where multiple different Delta tables are written to Dell ECS storage. To manage table names, the Iceberg JDBC catalog can write each table name into a relational database and record the table location.
Does Delta have a solution for this?

@zsxwing
Member

zsxwing commented Oct 27, 2022

> Does Delta have a solution for this?

A catalog is the solution. For example, can you use Hive Metastore? Hive Metastore itself just uses a relational database.

@melin
Author

melin commented Oct 28, 2022

> > Does Delta have a solution for this?
>
> A catalog is the solution. For example, can you use Hive Metastore? Hive Metastore itself just uses a relational database.

HMS relies on the Hadoop ecosystem, which makes it a heavy solution. Direct JDBC storage is simpler.

@zsxwing
Member

zsxwing commented Nov 11, 2022

> HMS relies on the Hadoop ecosystem, which makes it a heavy solution. Direct JDBC storage is simpler.

Totally agree that HMS is heavier. But it's a de facto standard. In addition, Delta mostly just relies on Spark's catalog APIs. You can implement Spark's APIs and just use Delta Lake with that.

@melin
Author

melin commented Nov 17, 2022

> > HMS relies on the Hadoop ecosystem, which makes it a heavy solution. Direct JDBC storage is simpler.
>
> Totally agree that HMS is heavier. But it's a de facto standard. In addition, Delta mostly just relies on Spark's catalog APIs. You can implement Spark's APIs and just use Delta Lake with that.

Hudi is also developing a JDBC catalog.

melin closed this as completed Nov 17, 2022
@dennyglee
Contributor

While Iceberg and Hudi are developing a JDBC catalog, this is because they rely on the catalog for their metadata. As @zsxwing noted, Delta does not require a catalog for its metadata, and there are architectural advantages to this approach. In addition to using the Spark APIs, you can also use Delta Standalone (Scala/Java), Delta Rust (delta-rs), and/or the delta-rs Python bindings to query the metadata.
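
To illustrate the point that Delta's metadata lives with the table rather than in a catalog, here is a rough sketch that reads the table metadata straight from the `_delta_log` directory using only the Python standard library. It assumes the log's JSON commit files are all still present and ignores Parquet checkpoints, so it is not a complete log reader; the libraries listed above handle the full protocol. The function name `read_table_metadata` is made up for this example.

```python
import json
from pathlib import Path


def read_table_metadata(table_path):
    """Return the most recent `metaData` action from the JSON commit files.

    Simplification: Parquet checkpoints are ignored, so this is only correct
    while every JSON commit since the last metadata change is still on storage.
    Returns None if no `metaData` action is found.
    """
    log_dir = Path(table_path) / "_delta_log"
    latest = None
    # Commit files are named with zero-padded version numbers,
    # so sorting by name visits them in version order.
    for commit_file in sorted(log_dir.glob("*.json")):
        with commit_file.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)  # one JSON action per line
                if "metaData" in action:
                    latest = action["metaData"]
    return latest
```

Given only a table location, for example one looked up from a catalog of any kind, `read_table_metadata(location)` returns a dict with fields such as `schemaString` and `partitionColumns` when the path is on a local or mounted filesystem.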

3 participants