Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Abstract Syntax Tree evaluator #5493

Closed
bdice opened this issue Jun 17, 2020 · 3 comments · Fixed by #5494
Closed

[FEA] Abstract Syntax Tree evaluator #5493

bdice opened this issue Jun 17, 2020 · 3 comments · Fixed by #5494
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@bdice
Copy link
Contributor

bdice commented Jun 17, 2020

Purpose

Discussions on #5397 and #5401 have led to a desire for constructing and evaluating abstract syntax trees (ASTs), within some reasonable limitations. This is meant to assist with inequality joins, by enabling complex join expressions like WHERE A.x = B.x AND A.y < (B.y + 100) * 2 to be resolved. An AST that could be evaluated on the device would circumvent the need to materialize intermediate results (e.g. an entire column with the value (B.y + 100) * 2, or worse, an entire cross join just to act as a filter).

Practical use

Engines like Spark Catalyst and Apache Calcite can separate join expressions into "equijoin" parts like A.x = B.x that can be efficiently performed using a hash-join approach, and a separate part for the inequalities. The inequalities can be represented as an abstract syntax tree using boolean operators, comparators, and unary/binary mathematical operations like addition. We seek a way to evaluate these ASTs without using JIT (the performance cost of JIT is large).

Even equijoins might benefit from this feature for cases like A.x - 1 = B.x * 5 AND A.y + 3 = B.y * 2 if there are multiple fields that would be memory-intensive to materialize. In that case, it may be possible to use ASTs as inputs to the hasher, which would produce the output values to be hashed and compared.

Design concepts (subject to change)

  • Data inputs shall be specified for every expression operand, and shall be sourced from a table column, a "literal" value, or an intermediate value produced by some subtree of the abstract syntax tree.
    • The AST will be "linearized" into a series of operators and data source ids that ensure that intermediate values from a sub-expression are produced before they are consumed by a higher expression. I plan to store intermediate values in GPU shared memory; each thread will own a chunk of the shared memory for its own intermediate storage.
    • The row(s) associated with a table column are specified when the AST is executed by a thread; only the column index is referenced when building the AST.
    • Data resolution (fetching values) will be performed by individual threads when executing the AST in a kernel.
  • The ASTs may be of arbitrary depth and be composed of any combination of these allowed expressions:
    • AST binary (unary) "math" expressions (e.g. +, *, abs) shall map from two (one) Elements of the same type to one Element. Subtrees of expressions must have all inputs match a single Element type.
    • AST comparators (e.g. <, <=, !=) shall map from two (one) Elements of the same type to a bool.
    • AST boolean operators (e.g. AND) maps two bools to a bool.

Critique is welcome. 🌲

@bdice bdice added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jun 17, 2020
@bdice bdice self-assigned this Jun 17, 2020
@jlowe
Copy link
Member

jlowe commented Jun 17, 2020

Is the data type explicitly specified in each Element of the AST or are data types implicit based on the inputs specified to the AST when evaluated? More concretely, if I build an AST for a + b * c is that AST reusable across different column types for a, b, and c?

@bdice
Copy link
Contributor Author

bdice commented Jun 19, 2020

@jlowe I had drafted a reply to this earlier but the API is changing a lot as I discuss with @jrhemstad. We're planning to validate the AST on construction, and the validation will require a few things. In particular, operands to a binary operation will have to be of the same type. This will be validated on construction of the tree since the definition of the operand may just be a column index, and we'll have to check the data types of all inputs as we convert the AST into a linearized form that is GPU-friendly. ASTs are probably not going to be reusable in the way you're asking, because of how we bind the data sources into the AST definition. It would be possible to create a function that constructs a particular AST structure with user-defined data sources, which could allow the behavior you're describing, but it would be on the user side since the structure is user-defined. Is reusability with different column types / data sources important to you?

@jlowe
Copy link
Member

jlowe commented Jun 19, 2020

Sorry, I didn't mean to imply I need reusability. I was trying to get more specifics on what Element actually looked like, e.g.: does each Element know its output datatype? It sounds like the answer is yes which is perfectly fine. The reuse scenario was just a use-case example to test which way it was designed to work. Apologies for the confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants