[EPIC] JIT support for DataFusion #2703

Closed
2 tasks
alamb opened this issue Jun 6, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jun 6, 2022

Summary
TLDR: The key focus of this work is to speed up fundamentally row-oriented operations such as hash table lookups or comparisons (e.g. #2427).

Background

DataFusion, like many Arrow systems, is a classic "vectorized computation engine", which works quite well for many common operations. The following paper gives a good treatment of the tradeoffs between vectorized execution and JIT compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, some fundamentally "row-oriented" operations in a database are not typically amenable to vectorization. The "classics" are hash table updates in joins and hash aggregates, as well as tuple comparisons in sorts.
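To make the distinction concrete, here is a minimal sketch in plain Rust (standard library only). The `Column` enum and the function names are hypothetical and are not DataFusion or Arrow APIs. It contrasts a vectorized kernel, where the type is resolved once per batch, with a row-wise tuple comparison, where the column type has to be re-examined for every pair of rows.

```rust
use std::cmp::Ordering;

/// Vectorized kernel: one tight loop over a single, concretely typed column.
/// The type dispatch happens once per batch, not once per value.
fn add_columns(a: &[i64], b: &[i64]) -> Vec<i64> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Hypothetical column representation, for illustration only.
enum Column {
    Int(Vec<i64>),
    Str(Vec<String>),
}

/// Row-oriented operation: compare row `i` with row `j` across all sort columns.
/// The `match` on the column type runs again for every pair of rows compared.
fn compare_rows(cols: &[Column], i: usize, j: usize) -> Ordering {
    for col in cols {
        let ord = match col {
            Column::Int(v) => v[i].cmp(&v[j]),
            Column::Str(v) => v[i].cmp(&v[j]),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

/// Sort row indices using the row-wise comparator above.
fn sort_indices(cols: &[Column], num_rows: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..num_rows).collect();
    idx.sort_by(|&i, &j| compare_rows(cols, i, j));
    idx
}
```

Calling `sort_indices(&cols, n)` on an integer column and a string column returns the row order sorted by the first column, then the second. The per-pair `match` is the kind of overhead that a JIT-compiled, schema-specific comparator would aim to remove.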

Another example can be found in these slides from this presentation

@yjshen added initial support for JIT'ing in #1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375

This ticket aims to be a central location for tracking the status of JIT compiling expressions, for anyone who wants to contribute to this effort.

Describe the solution you'd like

@alamb alamb added the enhancement New feature or request label Jun 6, 2022
@alamb alamb changed the title [EPIC] Full JIT support for DataFusion [EPIC] JIT support for DataFusion Jun 6, 2022
@alamb
Contributor Author

alamb commented Jun 6, 2022

Actually, this is basically the same as #1861, so closing in favor of that ticket.

@alamb alamb closed this as completed Jun 6, 2022
@leoluan2009

Hi @alamb, do we need to support expression JIT for performance, like ClickHouse does? https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit

@alamb
Contributor Author

alamb commented May 20, 2024

Hi @leoluan2009

I don't think DataFusion needs JIT to get good performance.

In general, I find the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" to explain the tradeoffs well

DataFusion is a vectorized engine, and we haven't found areas where JIT would be compelling compared to vectorized code. The only area I can really think of would be type-specialized comparisons for sorting (to avoid the RowFormat), but I think we would need a pretty compelling benchmark showing improvements to justify it.
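As a rough illustration of that trade-off, the sketch below sorts rows of a fixed, hypothetical `(i64, u32)` schema in two ways: through a byte-comparable key per row, and through a comparator specialized to the column types. The encoding here is a deliberately simplified stand-in, assuming no nulls and ascending order only. It is not Arrow's actual row format, and all function names are made up for the example.

```rust
/// Encode one (i64, u32) row as a byte-comparable key.
/// Big-endian bytes compare in numeric order once the sign bit of the
/// signed value is flipped. Simplified: no nulls, ascending order only.
fn encode_row(a: i64, b: u32) -> [u8; 12] {
    let mut key = [0u8; 12];
    key[..8].copy_from_slice(&((a as u64) ^ (1u64 << 63)).to_be_bytes());
    key[8..].copy_from_slice(&b.to_be_bytes());
    key
}

/// Generic path: materialize a key per row, then compare keys bytewise.
fn sort_via_row_keys(a: &[i64], b: &[u32]) -> Vec<usize> {
    let keys: Vec<[u8; 12]> = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| encode_row(x, y))
        .collect();
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    idx.sort_by(|&i, &j| keys[i].cmp(&keys[j]));
    idx
}

/// Specialized path: the comparator is written for this exact schema,
/// which is what a JIT could generate at runtime for an arbitrary schema.
fn sort_specialized(a: &[i64], b: &[u32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..a.len()).collect();
    idx.sort_by(|&i, &j| a[i].cmp(&a[j]).then(b[i].cmp(&b[j])));
    idx
}
```

Both functions return the same ordering. Whether the specialized path is actually faster depends on the schema and data, which is why a benchmark would be needed before drawing conclusions.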

@faucct

faucct commented May 24, 2024

Though the paper that you have mentioned admits that JIT-compilation is beneficial for OLTP workloads:

> Besides OLAP performance, other factors also play an important role. Compilation-based engines have advantages in OLTP as they can create fast stored procedures

If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

@alamb
Contributor Author

alamb commented May 24, 2024

> If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

FWIW there is no reason you couldn't JIT compile arbitrary expressions and run them as UDFs. The API is now basically complete: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html

It would also be possible to do the same with aggregates / windows / etc
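One way to picture this is the sketch below, which "compiles" a small expression tree into a single callable once and then applies it to a batch. It uses closure composition as a stand-in for real machine-code generation, and the `Expr` type, `compile`, and `eval_batch` are hypothetical names that are not part of DataFusion. Wiring the resulting function into a `ScalarUDFImpl` is left out, since the exact trait methods depend on the DataFusion version.

```rust
/// Hypothetical expression tree over f64 columns, for illustration only.
enum Expr {
    Col(usize),
    Lit(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// A "compiled" expression: takes one row of column values, returns a value.
type Compiled = Box<dyn Fn(&[f64]) -> f64>;

/// Walk the tree once and build a closure, so the tree is not
/// re-interpreted for every row. A real JIT would emit machine code here.
fn compile(expr: &Expr) -> Compiled {
    match expr {
        Expr::Col(i) => {
            let i = *i;
            Box::new(move |row: &[f64]| row[i])
        }
        Expr::Lit(v) => {
            let v = *v;
            Box::new(move |_row: &[f64]| v)
        }
        Expr::Add(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move |row: &[f64]| l(row) + r(row))
        }
        Expr::Mul(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move |row: &[f64]| l(row) * r(row))
        }
    }
}

/// Apply the compiled function to every row of a columnar batch.
/// This is the part a UDF body would perform on its input arrays.
fn eval_batch(f: &Compiled, columns: &[Vec<f64>], num_rows: usize) -> Vec<f64> {
    let mut row = vec![0.0; columns.len()];
    (0..num_rows)
        .map(|r| {
            for (slot, col) in row.iter_mut().zip(columns) {
                *slot = col[r];
            }
            f(&row)
        })
        .collect()
}
```

For example, compiling `(col0 + 1.0) * col1` once and calling `eval_batch` on two columns evaluates the expression for each row without revisiting the `Expr` tree.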

@faucct

faucct commented May 24, 2024

I think that compiling SQL expressions to UDFs by hand would kind of defeat the whole point of the framework. It also seems like most of the framework would be irrelevant for the in-memory transformation of Online Features, so I guess it would be easier to build the same thing from scratch while reusing the same ideas, for example SQL and the Arrow format for intermediate data representation.

@alamb
Contributor Author

alamb commented May 24, 2024

FWIW there is a lot more to SQL evaluation than just the expression evaluation, so that might be a reason to use DataFusion even if you had to implement your own expressions 🤔
