[FEA] Abstract Syntax Tree evaluator #5493

bdice · 2020-06-17T16:29:04Z

Purpose

Discussions on #5397 and #5401 have led to a desire for constructing and evaluating abstract syntax trees (ASTs), within some reasonable limitations. This is meant to assist with inequality joins, by enabling complex join expressions like WHERE A.x = B.x AND A.y < (B.y + 100) * 2 to be resolved. An AST that could be evaluated on the device would circumvent the need to materialize intermediate results (e.g. an entire column with the value (B.y + 100) * 2, or worse, an entire cross join just to act as a filter).

Practical use

Engines like Spark Catalyst and Apache Calcite can separate join expressions into "equijoin" parts like A.x = B.x that can be efficiently performed using a hash-join approach, and a separate part for the inequalities. The inequalities can be represented as an abstract syntax tree using boolean operators, comparators, and unary/binary mathematical operations like addition. We seek a way to evaluate these ASTs without using JIT (the performance cost of JIT is large).

Even equijoins might benefit from this feature for cases like A.x - 1 = B.x * 5 AND A.y + 3 = B.y * 2 if there are multiple fields that would be memory-intensive to materialize. In that case, it may be possible to use ASTs as inputs to the hasher, which would produce the output values to be hashed and compared.

Design concepts (subject to change)

Data inputs shall be specified for every expression operand, and shall be sourced from a table column, a "literal" value, or an intermediate value produced by some subtree of the abstract syntax tree.
- The AST will be "linearized" into a series of operators and data source ids that ensure that intermediate values from a sub-expression are produced before they are consumed by a higher expression. I plan to store intermediate values in GPU shared memory; each thread will own a chunk of the shared memory for its own intermediate storage.
- The row(s) associated with a table column are specified when the AST is executed by a thread; only the column index is referenced when building the AST.
- Data resolution (fetching values) will be performed by individual threads when executing the AST in a kernel.
The ASTs may be of arbitrary depth and be composed of any combination of these allowed expressions:
- AST binary (unary) "math" expressions (e.g. +, *, abs) shall map from two (one) Elements of the same type to one Element. Subtrees of expressions must have all inputs match a single Element type.
- AST comparators (e.g. <, <=, !=) shall map from two (one) Elements of the same type to a bool.
- AST boolean operators (e.g. AND) maps two bools to a bool.

Critique is welcome. 🌲

The text was updated successfully, but these errors were encountered:

jlowe · 2020-06-17T17:40:41Z

Is the data type explicitly specified in each Element of the AST or are data types implicit based on the inputs specified to the AST when evaluated? More concretely, if I build an AST for a + b * c is that AST reusable across different column types for a, b, and c?

bdice · 2020-06-19T15:04:10Z

@jlowe I had drafted a reply to this earlier but the API is changing a lot as I discuss with @jrhemstad. We're planning to validate the AST on construction, and the validation will require a few things. In particular, operands to a binary operation will have to be of the same type. This will be validated on construction of the tree since the definition of the operand may just be a column index, and we'll have to check the data types of all inputs as we convert the AST into a linearized form that is GPU-friendly. ASTs are probably not going to be reusable in the way you're asking, because of how we bind the data sources into the AST definition. It would be possible to create a function that constructs a particular AST structure with user-defined data sources, which could allow the behavior you're describing, but it would be on the user side since the structure is user-defined. Is reusability with different column types / data sources important to you?

jlowe · 2020-06-19T15:52:28Z

Sorry, I didn't mean to imply I need reusability. I was trying to get more specifics on what Element actually looked like, e.g.: does each Element know its output datatype? It sounds like the answer is yes which is perfectly fine. The reuse scenario was just a use-case example to test which way it was designed to work. Apologies for the confusion.

bdice added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jun 17, 2020

bdice self-assigned this Jun 17, 2020

bdice mentioned this issue Jun 17, 2020

[REVIEW] Abstract Syntax Tree evaluator #5494

Merged

8 tasks

bdice mentioned this issue Jul 17, 2020

[REVIEW] Add a double type dispatcher. #5716

Merged

kkraus14 closed this as completed in #5494 Oct 6, 2020

beckernick added this to the Conditional Joins milestone Jul 23, 2021

GregoryKimball modified the milestones: Conditional Joins, Expression evaluation Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Abstract Syntax Tree evaluator #5493

[FEA] Abstract Syntax Tree evaluator #5493

bdice commented Jun 17, 2020 •

edited

Loading

jlowe commented Jun 17, 2020

bdice commented Jun 19, 2020

jlowe commented Jun 19, 2020

[FEA] Abstract Syntax Tree evaluator #5493

[FEA] Abstract Syntax Tree evaluator #5493

Comments

bdice commented Jun 17, 2020 • edited Loading

Purpose

Practical use

Design concepts (subject to change)

jlowe commented Jun 17, 2020

bdice commented Jun 19, 2020

jlowe commented Jun 19, 2020

bdice commented Jun 17, 2020 •

edited

Loading