-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] JIT support for DataFusion
#2703
Comments
DataFusion
DataFusion
Actually, this is basically the same as #1861 so closing in favor of that ticket |
Hi @alamb , do we need to support expression JIT for performance like ClickHouse: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit |
Hi @leoluan2009 In my opinion, I don't think DataFusion needs JIT to get good performance. In general, I find the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" to explain the tradeoffs well DataFusion is a vectorized engine and we haven't found areas where JIt would be compelling compared to vectorized code. The only area I can really think of would be to implement type specialized comparisons for sorting (to avoid the RowFormat) but we would need to have a pretty compelling benchmark showing improvements to justify I think |
Though the paper that you have mentioned admits that JIT-compilation is beneficial for OLTP workloads:
If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines. |
FWIW there is no reason you couldn't JIT compile arbitrary expressions and run them as UDFs. The API is now basically complete: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html It would also be possible to do the same with aggregates / windows / etc |
I think that compiling SQL-expressions to UDFs by hand would kinda kill the whole point of the framework, but it seems like most of the framework would be irrelevant for the in-memory transformation of Online Features, so I guess it would be easier to build the same thing from scratch, though using the same ideas, like for example SQL and Arrow format for intermediate data representation. |
FIW there is a lot more to SQL evaluation than just the expression evaluation, so that might be a reason to use DataFusion even if you had to implement your own expressions 🤔 |
Summary
TLDR: The key focus of this work is to speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. #2427)
Background
DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.
Another example can be found in these slides from this presentation
@yjshen added initial support for JIT'ing in #1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375
This ticket aims to be a central location for tracking the status of JIT compiling expressions for anyone who wants to contribute to this effort
Describe the solution you'd like
RecordBatches
#2122The text was updated successfully, but these errors were encountered: