
add udf/udaf plugin #1881

Closed

Conversation

EricJoy2048
Member

@EricJoy2048 EricJoy2048 commented Feb 25, 2022

closes #1882

In this PR, I have implemented the UDF plugin. In the next PR, I will complete the serialization and deserialization of UDF/UDAF in Ballista, relying on the UDF plugin.
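For readers skimming the conversation, here is a minimal std-only sketch of the plugin surface this PR proposes. The real trait lives in the PR diff; `ScalarUdf`, `UdfPlugin`, and `MyPlugin` below are illustrative stand-ins for DataFusion's actual types, not the merged API:

```rust
use std::collections::HashMap;

// Stand-in for DataFusion's ScalarUDF type.
#[derive(Clone, Debug, PartialEq)]
pub struct ScalarUdf {
    pub name: String,
}

// Hypothetical plugin trait: a dynamic library exports an implementation,
// and Ballista discovers UDFs through it by name.
pub trait UdfPlugin {
    /// Names of all UDFs provided by this plugin.
    fn udf_names(&self) -> Vec<String>;
    /// Look up a UDF by name, as Ballista would when deserializing a plan.
    fn get_scalar_udf_by_name(&self, fun_name: &str) -> Option<ScalarUdf>;
}

pub struct MyPlugin {
    udfs: HashMap<String, ScalarUdf>,
}

impl MyPlugin {
    pub fn new() -> Self {
        let mut udfs = HashMap::new();
        udfs.insert(
            "my_add".to_string(),
            ScalarUdf { name: "my_add".to_string() },
        );
        Self { udfs }
    }
}

impl UdfPlugin for MyPlugin {
    fn udf_names(&self) -> Vec<String> {
        self.udfs.keys().cloned().collect()
    }
    fn get_scalar_udf_by_name(&self, fun_name: &str) -> Option<ScalarUdf> {
        self.udfs.get(fun_name).cloned()
    }
}
```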

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Feb 25, 2022
@EricJoy2048
Member Author

EricJoy2048 commented Feb 25, 2022

The clippy check failed because of the fuzz-utils package, but I didn't modify that package.

[screenshot of the clippy failure]

@Ted-Jiang
Member

I have met the same issue. Thanks for your work!

@Ted-Jiang
Member

> The clippy check failed because package fuzz-utils. But I don't modify this package.

Fixed in #1880

@Ted-Jiang
Member

retest this please

Review comment on datafusion/build.rs (outdated, resolved)
@EricJoy2048 EricJoy2048 mentioned this pull request Feb 27, 2022
Contributor

@alamb alamb left a comment


Thank you for this contribution @gaojun2048 . I haven't had a chance to go through this entire PR yet, but I do wonder whether we need a dynamic plugin manager to support UDFs in Ballista.

There are at least two distinct use cases:

  1. You are compiling a custom version of ballista and need to use udfs
  2. You want to use an unmodified version of ballista and register udfs that you compiled into your own shared library

A plugin manager is required for the second use case but not the first. I wonder if you need the flexibility of the second use case, or if we could get away with a smaller change if you are building your own version of ballista.

If we need a plugin manager, I would like to see it more fully integrated so that it is covered by the existing tests (as well as the new ones). This would mean removing the lists of scalar_functions and aggregate_functions on ExecutionContext and replacing them with a plugin manager that is always present.

Review comment on datafusion/build.rs (outdated, resolved)
@EricJoy2048
Member Author

EricJoy2048 commented Mar 2, 2022

Thank you for your advice @alamb .
Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.

In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.

  1. People who use DataFusion generally develop their own computing engines on top of it. In that case they usually do not need UDF plugins: they put the UDF code directly into their own engine and decide for themselves when to call register_udf to register UDFs with DataFusion. If needed, they can handle serialization and deserialization of custom UDFs in their own engine to achieve distributed scheduling.
  2. People who use Ballista generally use it only as a computing engine. They often do not have a deep understanding of the DataFusion source code, so directly modifying the source of Ballista and DataFusion is very difficult for them. They may update Ballista frequently, and modifying its source means that each upgrade requires merging code and recompiling, which is a heavy burden. In particular, there is currently no way for UDFs to work in Ballista at all, because serializing and deserializing a UDF requires knowing its concrete implementation, which cannot be done without modifying the source of Ballista and DataFusion. The value of the UDF plugin here is obvious: users only maintain their own UDF code and do not need to track code changes in Ballista or DataFusion. In Ballista, we can serialize a UDF by its name, and then deserialize it by looking the name up via get_scalar_udf_by_name(&self, fun_name: &str). These operations go through the trait UDFPlugin, so Ballista does not need to know who implemented the UDF plugin.
  3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified, as these are for people who use DataFusion but not Ballista. So I think I should modify the code and move the plugin mod into the ballista crate instead of leaving it in datafusion.
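The name-based round trip described in point 2 can be modeled in a few lines. This is a toy sketch: `UdfImpl` and `PluginRegistry` are hypothetical stand-ins, not Ballista's real types; the point is only that the plan carries the UDF's name and the executor resolves it through the plugin at deserialization time:

```rust
use std::collections::HashMap;

// Stand-in for a scalar UDF implementation.
type UdfImpl = fn(i64) -> i64;

struct PluginRegistry {
    udfs: HashMap<String, UdfImpl>,
}

impl PluginRegistry {
    fn serialize_udf(&self, name: &str) -> String {
        // Only the name crosses the wire; the implementation stays
        // inside the plugin on each node.
        name.to_string()
    }

    fn deserialize_udf(&self, serialized: &str) -> Option<UdfImpl> {
        // Resolve the name back to an implementation via the registry,
        // analogous to get_scalar_udf_by_name.
        self.udfs.get(serialized).copied()
    }
}
```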

Thanks a lot, can you give me more advice on these?

@EricJoy2048
Member Author

> Can you explain the need for getting the rustc version?
> #1881 (comment)

Yes. From rust-lang/rfcs#600 I see that Rust doesn't have a stable ABI, meaning different compiler versions can generate incompatible code. For this reason, the UDF plugin must be compiled using the same version of rustc as datafusion.
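The check this implies can be as simple as comparing the rustc version baked into the plugin at build time against the host's. A sketch; how the versions get captured (e.g. a build.rs running `rustc --version` and exporting it) is an assumption about the mechanism, not the PR's exact code:

```rust
// Because Rust has no stable ABI, a loader can refuse plugins built with a
// different rustc than the host binary. The version strings are assumed to
// be captured at build time (e.g. by a build.rs) on both sides.
fn abi_compatible(host_rustc: &str, plugin_rustc: &str) -> Result<(), String> {
    if host_rustc == plugin_rustc {
        Ok(())
    } else {
        Err(format!(
            "plugin built with `{}` but host uses `{}`; rebuild the plugin",
            plugin_rustc, host_rustc
        ))
    }
}
```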

@EricJoy2048
Member Author

> Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. ... Thanks a lot, can you give me more advice on these?

I see in PR #1887 that scalar_function and aggregate_function serialization and deserialization are moved to datafusion-serialization. In datafusion-serialization we would need the UDF plugin to serialize and deserialize UDFs. So, where should we put the plugin mod?

@alamb
Contributor

alamb commented Mar 2, 2022

> Yes. From rust-lang/rfcs#600 I see that Rust doesn't have a stable ABI, meaning different compiler versions can generate incompatible code. ... the UDF plug-in must be compiled using the same version of rustc as datafusion.

That makes sense -- it might help to add a comment to the source code explaining that rationale (so that future readers understand as well)

@alamb
Contributor

alamb commented Mar 2, 2022

> Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. ... I believe this is a more friendly way for those who actually use ballista as a computing engine.

I agree and this makes sense

> In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.

💯 agree here too

> I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.

I think the idea of moving the plugin module into ballista makes a lot of sense to me

> Thanks a lot, can you give me more advice on these?

Thank you for your clear explanation and justification 👍

Contributor

@alamb alamb left a comment


Thank you for your efforts @gaojun2048 . I skimmed through the code again and I think if we move the plugin manager to ballista it would be good to go from my perspective 👍 .

cc @andygrove @thinkharderdev @edrevo @matthewmturner @liukun4515 and @realno (I am sorry if you work together or already know about this work)

use std::any::Any;
use std::sync::Arc;

/// this examples show how to implements a udf plugin
Contributor


Suggested change
/// this examples show how to implements a udf plugin
/// this examples show how to implements a udf plugin for Ballista

@thinkharderdev
Contributor

> Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. ... Thanks a lot, can you give me more advice on these?

For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the Ballista source. You can just use the ballista crate as a dependency and define your own main function which registers the desired UDF/UDAFs in the global execution context.
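A toy model of that approach might look like the following. This is std-only: `Context` is a stand-in for the real ExecutionContext, and in real code the registration would go through DataFusion's register_udf API rather than a plain function pointer:

```rust
use std::collections::HashMap;

// Stand-in for ExecutionContext: holds registered scalar functions by name.
struct Context {
    scalar_functions: HashMap<String, fn(i64) -> i64>,
}

impl Context {
    fn new() -> Self {
        Self { scalar_functions: HashMap::new() }
    }

    // Analogous to DataFusion's register_udf.
    fn register_udf(&mut self, name: &str, f: fn(i64) -> i64) {
        self.scalar_functions.insert(name.to_string(), f);
    }

    fn call(&self, name: &str, arg: i64) -> Option<i64> {
        self.scalar_functions.get(name).map(|f| f(arg))
    }
}

// Your own "main": set up the context with custom UDFs before handing it
// to the engine, instead of patching the engine's source.
fn main_like() -> Context {
    let mut ctx = Context::new();
    ctx.register_udf("square", |x| x * x);
    ctx
}
```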

@EricJoy2048
Member Author

EricJoy2048 commented Mar 3, 2022

> Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. ... Thanks a lot, can you give me more advice on these?

> For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own main function which registers desired UDF/UDAF in the global execution context.

Ugh, I always thought that Ballista was an out-of-the-box computing engine, like Presto/Impala, not a computing library, so I didn't realize that using Ballista also means depending on the ballista crate and defining your own main function. Of course, for those who want to develop their own computing engine based on Ballista, this is indeed a good way. It means the UDF plugin does not need to be placed in the ballista crate, because they can maintain the UDF plugin in their own projects, load UDF plugins in their own main function, and then register them in the global ExecutionContext. When serializing and deserializing a LogicalPlan, the implementation of a UDF can be found through the passed-in ExecutionContext.


fn try_into_logical_plan(
    &self,
    ctx: &ExecutionContext,
    extension_codec: &dyn LogicalExtensionCodec,
) -> Result<LogicalPlan, BallistaError>;

But I'm still not quite sure: is Ballista an out-of-the-box compute engine like Presto/Impala, or is it just a library for someone else to implement their own computing engine, like DataFusion?

@liukun4515
Contributor

liukun4515 commented Mar 3, 2022

> Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library ... But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?

Agree with your opinion.
Ballista is a distributed compute engine like Spark and others.
Users who want to use UDFs shouldn't need to recompile the code.

@liukun4515
Contributor

> Thank you for your efforts @gaojun2048 . I skimmed through the code again and I think if we move the plugin manager to ballista it would be good to go from my perspective 👍 .

Maybe I need to take time to look at this.
I can finish reviewing this today.

@realno
Contributor

realno commented Mar 3, 2022

> In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
>
> 💯 agree here too

@alamb @gaojun2048 this is an interesting point, could you explain a bit more?

I feel ideally they should use the same programming interface (SQL or DataFrame): DataFusion provides computation on a single node and Ballista adds a distributed layer. With this assumption, since DF is the compute core, wouldn't it make sense to have UDF support in DF?

@EricJoy2048
Member Author

> I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?

I don't know if my understanding is wrong. I have always thought of DF as just a computing library, which cannot be deployed in production directly. Those who use DF take it as a project dependency and develop their own computing engine on top of it; for example, Ballista is a distributed computing engine developed on DF. Ballista is a mature computing engine, just like Presto/Spark. People who use Ballista only need to download and deploy it to their machines to start the Ballista service. They rarely care how Ballista is implemented, so a UDF plugin that supports dynamic loading lets these people define their own UDF functions without modifying Ballista's source code.

> I feel ideally they should use the same programing interface (SQL or DataFrame) ... wouldn't it make sense to have udf support in DF?

Yes, it is important and required for DF to support UDFs. But for those who use DF directly, a plugin that dynamically loads UDFs is not necessary, because they use DF as a dependency to develop their own calculation engine, such as Ballista. Imagine that Ballista and DF were not in the same repository but were two separate projects. As a Ballista developer needing custom UDFs for special analysis needs, what I would most likely do is manage the UDFs myself: either write their implementations directly in the Ballista crate, or add a UDF plugin to Ballista, as in this PR, that supports dynamically loading UDFs developed by Ballista users (not Ballista developers). Then I decide when to call DF's register_udf method to register these UDFs in the ExecutionContext so that DF can use them for computation. Of course, we could put the UDF plugin directly in DF, but this feature is not necessary for DF, and doing so would make the register_udf method look redundant and make the design of DF's UDFs harder to understand.

So I would say that the people who need the UDF plugin the most are those who use Ballista as a full-fledged computing engine: they just download and deploy Ballista. They don't modify the source code of Ballista and DF, because that would require a deeper understanding of both, and once the source is modified, each Ballista upgrade costs extra merge and build effort. But today, if a user just downloads and deploys Ballista, there is no way to register a custom UDF into DF. The core goal of the UDF plugin is to give UDFs that were not compiled into the project a chance to be discovered and registered in DF.

Finally, if we define Ballista's goal as a distributed implementation of datafusion, a library to be used as a dependency of other projects rather than a distributed computing engine (like Presto/Spark) that can be downloaded, deployed, and used directly, then the UDF plugin does not seem necessary to me, because its core goal is to give UDFs that were not compiled into the project a chance to be discovered and registered in DF. Projects that use Ballista as a dependency can manage their own UDFs and decide when to register them into DF.

@EricJoy2048
Member Author

@alamb CI got stuck. Can you help me retry?

@alamb
Contributor

alamb commented Mar 25, 2022

Hi @gaojun2048 -- I think github has been having some issues: https://www.githubstatus.com/history

I re-kicked off the jobs here: https://github.com/apache/arrow-datafusion/actions/runs/2040619251

Hopefully they will complete this time

@EricJoy2048 EricJoy2048 requested review from alamb and jimexist March 26, 2022 10:53
@EricJoy2048
Member Author

@liukun4515 @alamb @thinkharderdev Everything is OK, and the UDAF test succeeds now.

Contributor

@alamb alamb left a comment


The datafusion changes look good to me. Thank you very much @gaojun2048. Can someone please review the Ballista changes?

Perhaps @liukun4515 @mingmwang @thinkharderdev has the time and expertise?

@thinkharderdev
Contributor

> The datafusion changes look good to me. Thank you very much @gaojun2048. Can someone please review the Ballista changes?
>
> Perhaps @liukun4515 @mingmwang @thinkharderdev has the time and expertise?

I can review today

Contributor

@thinkharderdev thinkharderdev left a comment


Awesome!

@EricJoy2048
Member Author

> Awesome!

Today I tried to resolve the conflicts, but I found it very difficult. In AsExecutionPlan::try_into_physical_plan, SessionContext has been removed, so I can no longer serialize and deserialize UDFs with SessionContext. I will update my code to serialize and deserialize UDFs with the udf_plugin instead.

@thinkharderdev
Contributor

> Today I tried to resolve the conflict, but I found it very difficult. In AsExecutionPlan.try_into_physical_plan SessionContext is removed. ... So I will update my code and serialization and deserialization UDF with udf_plugin.

It should take a FunctionRegistry now (which will be a TaskContext) at runtime. I think we should use that, since we can set up the TaskContext with any preloaded functions.

@EricJoy2048
Member Author

> It should take a FunctionRegistry now (which will be a TaskContext) at runtime. I think we should use that since we can set up the TaskContext with any preloaded functions

OK. I will update the code and use TaskContext to serialize and deserialize UDFs.

@thinkharderdev
Contributor

> Ok. I will update code and use TaskContext to serialize and deserialize UDF

Hi @gaojun2048. I had to implement this on our fork for our project, so I went ahead and PR'd it here: #2130. Hope that helps!

@EricJoy2048
Member Author

> #2130

OK. Can I submit the plugin-related code first, regardless of the UDF serialization and deserialization parts?

@EricJoy2048
Member Author

@thinkharderdev I pushed a sub-PR of this PR. Please help me review it.

#2131

@mingmwang
Contributor

It should take a FunctionRegistry now (which will be a TaskContext) at runtime. I think we should use that, since we can set up the TaskContext with any preloaded functions.

Ok. I will update my code and use the TaskContext to serialize and deserialize UDFs.

Yes, there have been several changes to SessionContext recently. The Executor no longer has a global SessionContext.
You can have your UDF Plugin Manager load all the dynamic UDFs/UDAFs into the Executor's members. I have added a TODO note:

impl Executor {
    /// Create a new executor instance
    pub fn new(
        metadata: ExecutorRegistration,
        work_dir: &str,
        runtime: Arc<RuntimeEnv>,
    ) -> Self {
        Self {
            metadata,
            work_dir: work_dir.to_owned(),
            // TODO add logic to dynamically load UDF/UDAFs libs from files
            scalar_functions: HashMap::new(),
            aggregate_functions: HashMap::new(),
            runtime,
        }
    }
}

On the Ballista Scheduler side there is no global SessionContext either; a SessionContext is created per user request.
You can add the UDF Plugin Manager to the Ballista SchedulerServer, and when a new session context is created,
register the UDFs/UDAFs on that context.

/// Create a DataFusion session context that is compatible with Ballista Configuration
pub fn create_datafusion_context(
    config: &BallistaConfig,
    session_builder: SessionBuilder,
) -> Arc<SessionContext> {
    let config = SessionConfig::new()
        .with_target_partitions(config.default_shuffle_partitions())
        .with_batch_size(config.default_batch_size())
        .with_repartition_joins(config.repartition_joins())
        .with_repartition_aggregations(config.repartition_aggregations())
        .with_repartition_windows(config.repartition_windows())
        .with_parquet_pruning(config.parquet_pruning());
    let session_state = session_builder(config);
    // TODO: add logic to register UDFs/UDAFs on the context here
    Arc::new(SessionContext::with_state(session_state))
}
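The scheduler-side hook described above — copy every plugin-provided UDF into each freshly created session context — reduces to a small registration loop. The following std-only sketch models that flow; `UdfPluginManager`, `SessionCtx`, and the signatures are illustrative stand-ins, not the real Ballista or DataFusion types:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical scalar UDF shape, as in the registry sketch above.
type ScalarUdf = Arc<dyn Fn(&[i64]) -> i64 + Send + Sync>;

// Stand-in for the plugin manager after it has loaded UDFs from dynamic libraries.
struct UdfPluginManager {
    scalar_udfs: HashMap<String, ScalarUdf>,
}

// Stand-in for the per-request session context's function registry.
#[derive(Default)]
struct SessionCtx {
    scalar_functions: HashMap<String, ScalarUdf>,
}

impl SessionCtx {
    fn register_udf(&mut self, name: String, f: ScalarUdf) {
        self.scalar_functions.insert(name, f);
    }
}

// The hook a context factory would invoke after building the context:
// every plugin-provided UDF is cloned (cheaply, via Arc) into the new session.
fn register_plugin_udfs(ctx: &mut SessionCtx, plugins: &UdfPluginManager) {
    for (name, f) in &plugins.scalar_udfs {
        ctx.register_udf(name.clone(), Arc::clone(f));
    }
}

fn main() {
    let mut plugins = UdfPluginManager { scalar_udfs: HashMap::new() };
    plugins
        .scalar_udfs
        .insert("abs_i64".to_string(), Arc::new(|a: &[i64]| a[0].abs()));

    let mut ctx = SessionCtx::default();
    register_plugin_udfs(&mut ctx, &plugins);
    let f = ctx.scalar_functions.get("abs_i64").unwrap();
    println!("{}", f(&[-5])); // prints 5
}
```

Because the plugin manager is loaded once at process start while contexts are created per request, this keeps the per-request cost to a handful of Arc clones.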

@EricJoy2048
Member Author


From #2130 I saw that the serialization code is still changing, so I pushed a sub-PR, #2131, which includes only the plugin manager.


@alamb
Contributor

alamb commented Apr 15, 2022

I believe this feature was added in #2131, so this PR is no longer needed and I am closing it. Please let me know if I got that wrong and reopen it.

Labels
datafusion Changes in the datafusion crate
Development

Successfully merging this pull request may close these issues.

UDF/UDAF plugin
9 participants