Add deltalake feature #2025

Closed
matthewmturner opened this issue Mar 16, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@matthewmturner
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As part of working on datafusion-tui, I would like to be able to register a Delta Lake table as a table from SQL. For example:

CREATE EXTERNAL TABLE dt
STORED AS DELTATABLE
LOCATION 's3://bucket/schema/table'

From what I've seen, this would require adding a FileType and FileFormat for deltatable under a deltalake feature, similar to the existing avro feature.

While I understand a delta table isn't exactly a file type or format, I think it meets the definition for our purposes. I've played with querying delta tables before, and they use register_table as opposed to register_listing_table. So I think we would just need to match on FileType and, for a delta table, call register_table instead.
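
The dispatch described above can be sketched in plain Rust. This is a toy model under stated assumptions, not DataFusion code: `FileType`, `RegistrationPath`, `parse_file_type`, and `registration_path` are hypothetical names chosen for illustration, and the `DELTATABLE` keyword is taken from the example statement.

```rust
/// Hypothetical stand-in for the file types a CREATE EXTERNAL TABLE
/// statement could name; `DeltaTable` is the proposed addition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Csv,
    Parquet,
    Avro,
    DeltaTable,
}

/// Which registration call a statement would be routed to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegistrationPath {
    /// File-based formats: the register_listing_table route.
    ListingTable,
    /// Delta tables: the register_table route with a custom provider.
    TableProvider,
}

/// Parse the keyword that follows STORED AS (case-insensitive).
pub fn parse_file_type(keyword: &str) -> Option<FileType> {
    match keyword.to_ascii_uppercase().as_str() {
        "CSV" => Some(FileType::Csv),
        "PARQUET" => Some(FileType::Parquet),
        "AVRO" => Some(FileType::Avro),
        "DELTATABLE" => Some(FileType::DeltaTable),
        _ => None,
    }
}

/// Match on the file type and pick the registration route.
pub fn registration_path(file_type: FileType) -> RegistrationPath {
    match file_type {
        FileType::DeltaTable => RegistrationPath::TableProvider,
        FileType::Csv | FileType::Parquet | FileType::Avro => RegistrationPath::ListingTable,
    }
}
```

With this model, `parse_file_type("deltatable")` followed by `registration_path` selects the `TableProvider` route, while the file-based formats keep the listing-table route.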

Describe the solution you'd like
Enable a deltatable FileFormat and FileType behind a deltalake feature.


@matthewmturner matthewmturner added the enhancement New feature or request label Mar 16, 2022
@matthewmturner
Contributor Author

@houqp I imagine you'll have a view on this.

@houqp
Member

houqp commented Mar 20, 2022

I think you should be able to add deltalake support to datafusion-tui by leveraging the existing table provider directly without touching datafusion core, see: https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs.

There is no need to add a new file type or file format, because deltalake only uses parquet, which we already support in datafusion.

@matthewmturner
Contributor Author

Interesting, OK. I thought there was some more magic going on under the hood (though I hadn't really had a chance to look into it), but maybe that only comes into play with some of Delta Lake's more advanced features like time travel, which I don't think is doable without SQL extensions.

I'll try it out and get back to you. Thanks!

@avantgardnerio
Contributor

I think @matthewmturner is on to something. The SQL in this issue is straightforward and makes sense from an intuitive user perspective. Why it doesn't work seems like a limitation of DataFusion:

  1. Additional TableProviders can be registered in Rust applications (i.e. datafusion-tui or ballista)
  2. Files can be registered in SQL dynamically at run time - but for built-in TableProviders only
  3. However, there is no dynamic (SQL) way to register a new table with a custom table provider

Our use-case is running a Ballista server, with delta-rs compiled in, with the intention of allowing users to register tables in locations we can't know at compile time. Unfortunately, I think the way FileFormats work currently doesn't make this possible?
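
One way to picture the missing piece is a registry that maps the STORED AS keyword to a factory the application compiled in. The sketch below is a toy model in plain Rust, not DataFusion's or delta-rs's API: `TableHandle`, `TableFactory`, `DeltaFactory`, and `FactoryRegistry` are all hypothetical names, and the factory only records its inputs instead of opening a real table.

```rust
use std::collections::HashMap;

/// Hypothetical description of a table created from SQL.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableHandle {
    pub name: String,
    pub location: String,
    pub provider: &'static str,
}

/// Hypothetical factory trait: turns a table name and LOCATION into a table.
pub trait TableFactory {
    fn create(&self, name: &str, location: &str) -> TableHandle;
}

/// Stand-in for a factory the server compiles in (for example one wrapping delta-rs).
pub struct DeltaFactory;

impl TableFactory for DeltaFactory {
    fn create(&self, name: &str, location: &str) -> TableHandle {
        TableHandle {
            name: name.to_string(),
            location: location.to_string(),
            provider: "delta",
        }
    }
}

/// Maps the STORED AS keyword to a factory registered at startup.
#[derive(Default)]
pub struct FactoryRegistry {
    factories: HashMap<String, Box<dyn TableFactory>>,
}

impl FactoryRegistry {
    /// Register a factory under a case-insensitive keyword.
    pub fn register(&mut self, keyword: &str, factory: Box<dyn TableFactory>) {
        self.factories.insert(keyword.to_ascii_uppercase(), factory);
    }

    /// Resolve a CREATE EXTERNAL TABLE request at run time.
    pub fn create_external_table(
        &self,
        name: &str,
        keyword: &str,
        location: &str,
    ) -> Result<TableHandle, String> {
        self.factories
            .get(&keyword.to_ascii_uppercase())
            .map(|factory| factory.create(name, location))
            .ok_or_else(|| format!("no table factory registered for STORED AS {}", keyword))
    }
}
```

The factory is fixed at compile time, but the table name and location arrive only with the SQL statement, which is the split this use case needs. An unknown keyword returns an error instead of falling back to a built-in format.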

@Swoorup

Swoorup commented Oct 12, 2024

What's currently remaining for this feature?

@matthewmturner
Contributor Author

I don't think deltalake will be added as a "feature" to core datafusion (i.e. this repo). However, deltalake provides a TableProviderFactory that makes it very easy to register delta tables with a datafusion SessionContext. It's added as a feature in dft if you're interested in playing with it or want to see how to add it to your own application.

I am going to close this issue.

4 participants