-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add udf/udaf plugin #1881
add udf/udaf plugin #1881
Conversation
I have meet the same issue, Thanks for your work! |
fixed in #1880 |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @gaojun2048 . I haven't had a chance to go through this entire PR yet, but I do wonder if we need a dynamic plugin manager to support UDF in ballista.
There are at least two distinct use cases:
- You are compiling a custom version of ballista and need to use udfs
- You want to use a unmodified version of ballista and register udfs that you compiled into your own shared library
A plugin manager is required for the second usecase but not the first. I wonder if you need the flexibility of the second usecase or if we could get away with less of a change if you are building your own version of ballista.
If we need a plugin manager, I would like to see it more fully integrated so that it is covered by existing (as well as the new test). This would mean remove the list of scalar_functions
and aggegate_functions
on ExecutionContext
and replace them with a plugin manager that was always present
Thank you for your advice @alamb . In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
Thanks a lot, can you give me more advice on these? |
Yes. from this issue rust-lang/rfcs#600 I sea Rust doesn’t have a stable ABI, meaning different compiler versions can generate incompatible code. For these reasons, the UDF plug-in must be compiled using the same version of rustc as datafusion. |
I sea in this pr : #1887 the scalar_function and aggregate_function deserialization and serialization is move to |
That makes sense -- it might help to add a comment to the source code explaining that rationale (so that future readers understand as well) |
I agree and this makes sense
💯 agree here too
I think the idea of moving the plugin module into ballista makes a lot of sense to me
Thank you for your clear explination and justification 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your efforts @gaojun2048 . I skimmed through the code again and I think if we move the plugin manager to ballista it would be good to go from my perspective 👍 .
cc @andygrove @thinkharderdev @edrevo @matthewmturner @liukun4515 and @realno (I am sorry if you work together or already know about this work)
use std::any::Any; | ||
use std::sync::Arc; | ||
|
||
/// this examples show how to implements a udf plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// this examples show how to implements a udf plugin | |
/// this examples show how to implements a udf plugin for Ballista |
For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own |
Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion? |
Agree with your opinion. |
Maybe I need to take time to look at this. |
@alamb @gaojun2048 this is an interesting point, could you explain a bit more? I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF? |
I don’t know if my understanding is wrong. I always think that DF is just a computing library, which cannot be directly deployed in production. Those who use DF will use DF as a dependency of the project and then develop their computing engine based on DF. For example, Ballista is a distributed computing engine developed based on DF. Ballista is a mature computing engine just like Presto/spark. People who use Ballista only need to download and deploy Ballista to their machines to start the ballista service. They rarely care about how Ballista is implemented, so a A udf plugin that supports dynamic loading allows these people to define their own udf functions without modifying Ballista's source code.
Yes, it is important and required for DF to support udf. But for those who use DF, it is not necessary to support the So I would say that the people who need the Finally, if we define Ballista's goal as a distributed implementation of datafusion, a library that needs to be used as a dependency of other projects, rather than a distributed computing engine (like presto/spark) that can be directly downloaded and deployed and used. It seems to me that the udf plugin is not necessary, because the core goal of the udf plugin is to provide an opportunity for those udfs that have not been compiled into the project to be discovered and registered in DF. Those projects that use ballista as depencency can manage their own udf and decide when to register their udf into DF. |
@alamb CI get stuck . Can you help me retry? |
Hi @gaojun2048 -- I think github has been having some issues: https://www.githubstatus.com/history I re kicked off the jobs here: https://github.com/apache/arrow-datafusion/actions/runs/2040619251 Hopefully they will complete this time |
@liukun4515 @alamb @thinkharderdev Everything is ok. And the udaf test success now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The datafusion changes look good to me. Thank you very much @gaojun2048. Can someone please review the Ballista changes?
Perhaps @liukun4515 @mingmwang @thinkharderdev has the time and expertise?
I can review today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Today I tried to resolve the conflict, but I found it very difficult. In |
It should take a FuntionRegisrty now (which will be a TaskContext) at runtime. I think we should use that since we can setup the TaskContext with any preloaded functions |
Ok. I will update code and use TaskContext to serialization and deserialization UDF |
Hi @gaojun2048. I had to implement this on our fork for our project so I went ahead and PR'd it here #2130. Hope that helps! |
Ok, Can I submit the plugin related code first, regardless of the serialization and deserialization parts of UDF? |
@thinkharderdev I push a sub PR of this PR. please help me review. |
Yes, there are several changes to SessionContext in those days. The Executor does not have a global SessionContext now.
In Ballista Scheduler side, there is no global SessionContext either, SessionContext is created on users' requests.
|
From #2130 I found serialize is changing. So I push a sub PR #2131 which is only includes plugin manager. |
1 similar comment
From #2130 I found serialize is changing. So I push a sub PR #2131 which is only includes plugin manager. |
I believe this feature was added in #2131 and so this PR is no longer needed so closing. Please let me know if I got that wrong / reopen it so |
closes #1882
In this PR, I have implemented the plug-in of UDF. In the next PR, I will complete the serialization and deserialization of UDF / udaf by ballista relying on UDF plugin.