[Ballista] Add ballista plugin manager and UDF plugin #2131

Merged 1 commit on Apr 7, 2022


EricJoy2048
Member

A sub-PR of #1881.
#1881 covers the plugin, plugin loading, and serialization and deserialization. However, the community keeps changing the serialization and deserialization implementation for LogicalPlan and PhysicalPlan, so I am pushing this PR separately. This PR only includes the plugin and the plugin loader; it does not include serialization and deserialization for UDF/UDAF.

@mingmwang
Contributor

Currently, I believe "plugin_dir" is a local directory. I think it would be better to support distributed file systems (HDFS/object store) so that both the Executors and the Scheduler can load the plugin files from a single place.
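
To make the local-directory behavior under discussion concrete, here is a minimal sketch of scanning a local plugin directory for dynamic libraries. It is not the code from this PR: the function names `is_plugin_library` and `list_plugin_libraries` are hypothetical, and the sketch only discovers candidate files with the standard library; it does not load them.

```rust
use std::ffi::OsStr;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: true if the path has a dynamic-library extension.
fn is_plugin_library(path: &Path) -> bool {
    matches!(
        path.extension().and_then(OsStr::to_str),
        Some("so") | Some("dylib") | Some("dll")
    )
}

/// Hypothetical helper: list candidate plugin libraries in a local directory.
/// Every process calling this must see the same files on its own local disk,
/// which is the limitation raised in the comment above.
fn list_plugin_libraries(plugin_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut libs = Vec::new();
    for entry in fs::read_dir(plugin_dir)? {
        let path = entry?.path();
        if path.is_file() && is_plugin_library(&path) {
            libs.push(path);
        }
    }
    // Sort so the result does not depend on directory iteration order.
    libs.sort();
    Ok(libs)
}
```

With a purely local directory, the Scheduler and every Executor each need their own copy of the files this function finds.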

@EricJoy2048
Member Author

Good idea. I will add this feature in the future.

@EricJoy2048
Member Author

@thinkharderdev

@andygrove
Member

Currently, I believe "plugin_dir" is a local directory. I think it would be better to support distributed file systems (HDFS/object store) so that both the Executors and the Scheduler can load the plugin files from a single place.

Alternatively, users could package up dependencies in a Docker container and deploy that way. This could be more efficient when multiple executors are running on the same node, since the image will be downloaded once and cached. It also provides better version control: all executors are guaranteed to be running the same code (assuming a specific version of the image is deployed).

I would be interested to hear more about the use case of loading dependencies from object store though. What would be the motivation of this approach?

@jiangzhx
Contributor

jiangzhx commented Apr 2, 2022

Currently, I believe "plugin_dir" is a local directory. I think it would be better to support distributed file systems (HDFS/object store) so that both the Executors and the Scheduler can load the plugin files from a single place.

Alternatively, users could package up dependencies in a Docker container and deploy that way. This could be more efficient when multiple executors are running on the same node, since the image will be downloaded once and cached. It also provides better version control: all executors are guaranteed to be running the same code (assuming a specific version of the image is deployed).

I would be interested to hear more about the use case of loading dependencies from object store though. What would be the motivation of this approach?

Maybe in the future we can support creating custom UDFs and UDAFs the way Hive does:

CREATE FUNCTION myfunc AS 'myclass' USING JAR 'hdfs:///path/to/jar';

@EricJoy2048
Member Author

Who can review and merge this PR? We need to use this feature in our Ballista cluster.

@mingmwang
Contributor

@liukun4515 @yjshen

@yjshen
Member

yjshen commented Apr 7, 2022

I'm not quite qualified to review the code in Ballista, but I could help merge the PR once consensus is reached.

@yahoNanJing @mingmwang, do you want to give a review pass since you are actively working on this?

@alamb
Contributor

alamb commented Apr 7, 2022

I think that since @thinkharderdev has reviewed this and we have talked about it for a while, I will merge the code in and we can iterate on it as needed.

Thank you for your patience and perseverance @gaojun2048

@alamb alamb merged commit 70f2b1a into apache:master Apr 7, 2022
@yahoNanJing
Contributor

@andygrove, if the UDF/UDAF libraries can only be loaded from local disk, we need to build a new image and redeploy the whole cluster whenever the libraries change. If the UDF/UDAF libraries can instead be loaded from shared remote storage, the image does not depend on the libraries, and changes are easier to handle.
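
One way to picture the distinction drawn here is a configuration value that can name either a local path or a remote URI. The sketch below is an assumption for illustration only: `PluginLocation` and `parse_plugin_location` are hypothetical names, not part of Ballista, and the code only classifies the string; it does not fetch anything from HDFS or an object store.

```rust
use std::path::PathBuf;

/// Hypothetical description of where plugin libraries live.
#[derive(Debug, PartialEq)]
enum PluginLocation {
    /// A directory or file on the local disk of each process.
    Local(PathBuf),
    /// A location on shared storage, identified by URI scheme (e.g. "hdfs", "s3").
    Remote { scheme: String, path: String },
}

/// Hypothetical parser: treats `scheme://rest` as remote, except for `file://`,
/// and treats anything without `://` as a plain local path.
fn parse_plugin_location(raw: &str) -> PluginLocation {
    match raw.split_once("://") {
        Some(("file", rest)) => PluginLocation::Local(PathBuf::from(rest)),
        Some((scheme, rest)) => PluginLocation::Remote {
            scheme: scheme.to_string(),
            path: rest.to_string(),
        },
        None => PluginLocation::Local(PathBuf::from(raw)),
    }
}
```

Under this sketch, a value such as `hdfs:///path/to/jar` would be classified as remote, so the image would not need to be rebuilt when the library changes, while a plain path would keep the current local-disk behavior.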

@EricJoy2048 EricJoy2048 deleted the add_ballista_plugin branch April 8, 2022 02:51
@EricJoy2048
Member Author

Thank you all. I will iterate on it later.

8 participants