Improvements to Ballista extensibility #8
Comments
It may be useful to see how Substrait is handling extensions as well: https://substrait.io/extensions/
This would be a great improvement 👍 I will follow the design and PRs.
Thanks @thinkharderdev for proposing these potential improvements.
Our team has actually implemented this for HDFS. To avoid registering a new object store, our workaround is to make the path self-describing through its scheme, like hdfs://localhost:15050/..../file.parquet. With the scheme, we know which kind of remote object store we need.
Maybe it's better to add the Substrait integration to the roadmap.
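The scheme-based workaround described above could be sketched roughly as follows. This is only an illustration, not the team's actual implementation and not Ballista's API: `StoreKind` and `store_for_path` are hypothetical names, and treating a path without a `scheme://` prefix as a local file is an assumption.

```rust
/// Hypothetical set of storage backends; these are not Ballista types.
#[derive(Debug, PartialEq)]
enum StoreKind {
    LocalFile,
    Hdfs,
    S3,
}

/// Pick a backend from the scheme prefix of a self-describing path.
/// Assumption: a path without a `scheme://` prefix is a local file.
fn store_for_path(path: &str) -> Result<StoreKind, String> {
    match path.split_once("://") {
        None => Ok(StoreKind::LocalFile),
        Some((scheme, _rest)) => match scheme.to_ascii_lowercase().as_str() {
            "file" => Ok(StoreKind::LocalFile),
            "hdfs" => Ok(StoreKind::Hdfs),
            "s3" => Ok(StoreKind::S3),
            other => Err(format!("no object store known for scheme '{}'", other)),
        },
    }
}
```

Because the path carries its own scheme, a scheduler or executor can resolve the store without any prior registration step, at the cost of hard-coding the set of supported schemes.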
+1
FWIW, Andy wrote a Substrait Rust implementation: https://github.com/andygrove/substrait-rs.
Agree on the Substrait integration. It would definitely be nice to have a universal serializable representation and a way to configure extensions declaratively. I posted a draft PR, apache/datafusion#1677, which I think can solve the immediate-term issues with extensibility and will also be useful in migrating to a Substrait-based implementation. By decoupling the representation from the core execution engine we can avoid a "Big Bang" migration (not to mention an endless parade of painful rebases while in development :))
I had a comment on apache/datafusion#1677 (review) which I think is worth considering.
Related questions after tinkering a bit more today: Should
I can see a rationale for making schema provider methods However, in general if one has to do network IO to figure out what tables exist or their schemas, it may be hard to get adequate performance.
Thanks for the CC.
I think all the points in this issue have been addressed, so we can close it.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, we are working with DataFusion/Ballista as a query execution engine. One of the primary selling points of DataFusion is extensibility, but it is not currently possible to use DataFusion's many extension points with Ballista.
This is primarily due to the constraints of serializing all logical and physical plans as Protobuf messages.
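To make the constraint concrete: a fixed Protobuf schema can only describe the plan nodes it already knows about, so a user-defined node has no wire form. One common way around this, sketched below under stated assumptions, is to carry extension nodes as a type tag plus opaque bytes and let users register decoders. All names here (`ExtensionEnvelope`, `ExtensionNode`, `ExtensionRegistry`, `TopK`) are hypothetical; this is not the design in apache/datafusion#1677 or any existing Ballista API.

```rust
use std::collections::HashMap;

/// Hypothetical wire form of an extension node: a type tag plus opaque bytes.
struct ExtensionEnvelope {
    type_name: String,
    payload: Vec<u8>,
}

/// Hypothetical trait implemented by user-defined plan nodes.
trait ExtensionNode {
    fn describe(&self) -> String;
}

/// Hypothetical user-defined node that the core schema knows nothing about.
struct TopK {
    k: u32,
}

impl ExtensionNode for TopK {
    fn describe(&self) -> String {
        format!("TopK(k={})", self.k)
    }
}

/// Encode a TopK node as little-endian bytes inside an envelope.
fn encode_top_k(node: &TopK) -> ExtensionEnvelope {
    ExtensionEnvelope {
        type_name: "TopK".to_string(),
        payload: node.k.to_le_bytes().to_vec(),
    }
}

/// Decode the payload written by `encode_top_k`.
fn decode_top_k(payload: &[u8]) -> Result<Box<dyn ExtensionNode>, String> {
    if payload.len() != 4 {
        return Err("TopK payload must be exactly 4 bytes".to_string());
    }
    let k = u32::from_le_bytes([payload[0], payload[1], payload[2], payload[3]]);
    Ok(Box::new(TopK { k }))
}

type Decoder = fn(&[u8]) -> Result<Box<dyn ExtensionNode>, String>;

/// Maps type tags to user-registered decoders.
struct ExtensionRegistry {
    decoders: HashMap<String, Decoder>,
}

impl ExtensionRegistry {
    fn new() -> Self {
        ExtensionRegistry { decoders: HashMap::new() }
    }

    fn register(&mut self, type_name: &str, decoder: Decoder) {
        self.decoders.insert(type_name.to_string(), decoder);
    }

    /// Look up the decoder for the envelope's tag and apply it to the payload.
    fn decode(&self, env: &ExtensionEnvelope) -> Result<Box<dyn ExtensionNode>, String> {
        let decoder = self
            .decoders
            .get(&env.type_name)
            .ok_or_else(|| format!("unknown extension type '{}'", env.type_name))?;
        (*decoder)(&env.payload)
    }
}
```

With this shape, the core message format only needs one generic "extension" variant, and every process that must deserialize the plan needs the same decoders registered.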
Ideally we would like to use Ballista to execute:
Describe the solution you'd like
There are two things ideally:
ExecutionContext so we can leverage optimizers, extension planners, and udf/udaf
Describe alternatives you've considered
There is currently no workaround for this, but we have been prototyping possible solutions which we'd be interested in upstreaming.