JIT-Compile DataFusion Expressions to create RecordBatches #2122

Open · Tracked by #2703
Dandandan opened this issue Mar 29, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

Dandandan (Contributor) commented Mar 29, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should be able to JIT-compile DataFusion expressions into native code that produces RecordBatches.

The benefit of this is that we can speed up complex / nested expressions by avoiding unnecessary allocations.

Describe the solution you'd like

We should be able to take a RecordBatch (or a collection of named Arrays) and compile an expression like (a + b) / 2 into a loop that produces a new Array.

// Compile `expr` into a native function that evaluates it against
// batches matching `schema` and returns the resulting Array.
fn compile(schema: SchemaRef, expr: Expr) -> CompiledFunction {
    todo!()
}

The loop itself must also be part of the compiled code, to remove per-call overhead and allow for possible use of SIMD instructions, either explicitly by instrumenting cranelift enough or through auto-vectorization.
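
For illustration, the generated code would be roughly equivalent to a hand-written fused loop like the following sketch (plain Rust, ignoring nulls and overflow; the function name is made up for this example):

// A hand-written equivalent of the fused loop a JIT backend could emit
// for (a + b) / 2: a single pass over the columns with one output
// allocation and no intermediate array for (a + b).
fn eval_a_plus_b_div_2(a: &[i64], b: &[i64]) -> Vec<i64> {
    assert_eq!(a.len(), b.len());
    let mut out = Vec::with_capacity(a.len());
    for i in 0..a.len() {
        // The whole expression is evaluated per element, so this loop is
        // a straightforward candidate for auto-vectorization.
        out.push((a[i] + b[i]) / 2);
    }
    out
}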

Describe alternatives you've considered
n/a

Additional context

Dandandan added the enhancement (New feature or request) label on Mar 29, 2022
alamb changed the title from "JIT-Compile DataFusion Expressions" to "[EPIC] JIT-Compile DataFusion Expressions" on Jun 6, 2022
alamb changed the title from "[EPIC] JIT-Compile DataFusion Expressions" to "JIT-Compile DataFusion Expressions to create RecordBatches" on Jun 6, 2022
alamb (Contributor) commented Jun 6, 2022

DataFusion's PhysicalExpr and the arrow-rs library currently evaluate expressions by "materializing intermediate results" -- for example, (a + b) + c first evaluates (a + b) into a temporary array and then adds c to it to form the final result.
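
A rough sketch of that interpreted path, assuming the arithmetic kernels as exposed by arrow-rs around the time of this issue (the exact module path has since moved in later releases):

use arrow::array::Int32Array;
use arrow::compute::kernels::arithmetic::add;
use arrow::error::Result;

// Interpreted evaluation of (a + b) + c: each kernel call materializes a
// complete intermediate array before the next call runs.
fn eval_materialized(a: &Int32Array, b: &Int32Array, c: &Int32Array) -> Result<Int32Array> {
    let tmp = add(a, b)?; // temporary array holding (a + b)
    add(&tmp, c)          // final result
}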

Note, however, that there is a tradeoff here between the speed gained from the LLVM-optimized, vectorized kernels in arrow-rs and cranelift-generated JIT expressions: the JIT code may not actually be faster. I think this is what @Dandandan is referring to when he says "allow for possible use of SIMD instructions, either explicitly by instrumenting cranelift enough or through auto-vectorization."

alamb (Contributor) commented Jun 6, 2022

Another example can be found in these slides from this presentation
