[FEA] Improved control over AST operator casting behavior #8979
Comments
This PR resolves #8979, adding support for a few casting operators in AST code. These operators can be used to perform operations between columns with mismatched data types without materializing intermediates as new columns.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Conor Hoekstra (https://github.com/codereport)
- Karthikeyan (https://github.com/karthikeyann)
- Jason Lowe (https://github.com/jlowe)

URL: #9379
@jlowe good call. What do you think would constitute sufficient work to close this issue? Introducing cast operators to cover all cudf types? Our main concern was simply ballooning the size of the operator dispatch by doing that, and the ones we added were the crucial ones to make it possible to leverage the AST without materializing new columns. Is it important to also prioritize adding the downcast operators? I think we were planning to take a wait-and-see approach on those until there was a critical use-case. CC @jrhemstad
@jlowe while you're not wrong, I don't see when this would be all that important. Just for saving memory in the final output? Unfortunately, it's not likely we'll be able to support casting any type to any other type. The goal of #9379 was to help accommodate the fact that all the binary operators require the inputs to be the same type. By upcasting to the widest type, we give you a way to avoid having to materialize all of your inputs in the same type to begin with.
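The same-type constraint and the upcast workaround described here can be illustrated with a toy type checker. This is a minimal sketch in plain Python, not the libcudf API: `Column`, `Cast`, `Add`, and `result_type` are hypothetical names, and the model deliberately ignores any implicit promotion of narrow integer types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str  # e.g. "INT16", "INT32", "INT64"


@dataclass(frozen=True)
class Cast:
    child: object
    dtype: str  # target type of the cast


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def result_type(node):
    """Return the output type of an expression, rejecting mixed-type adds."""
    if isinstance(node, Column):
        return node.dtype
    if isinstance(node, Cast):
        result_type(node.child)  # validate the subtree first
        return node.dtype
    if isinstance(node, Add):
        lhs, rhs = result_type(node.left), result_type(node.right)
        if lhs != rhs:
            raise TypeError(f"operand types differ: {lhs} vs {rhs}")
        return lhs
    raise TypeError(f"unknown node: {node!r}")


a = Column("a", "INT16")
b = Column("b", "INT32")

# A direct add of mismatched columns is rejected.
try:
    result_type(Add(a, b))
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Widening both operands to the widest type makes the expression valid.
widened_type = result_type(Add(Cast(a, "INT64"), Cast(b, "INT64")))
```

In this model the widened expression type-checks without either input being materialized as a new column, at the cost of an INT64 result.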
It would primarily be to avoid materializing intermediates before the final output. As one trivial example, I want to evaluate a Spark projection expression as an AST expression, and it's an add of inputs INT16 and INT32. Spark will insert a cast of the INT16 to INT32 and then perform the add, resulting in an INT32 output, but this can't be expressed in AST. We could perform the add by upcasting both sides to INT64 and then materialize that output. I then have to add a post-processing step to reduce the INT64 to INT32.

Closing this for now. For purposes of performing a boolean predicate in a join, which is our main use case for AST, I think we can work with the casts in #9379. It will add some complexity in translating the Spark expression to AST since we can't rely on AST cast operators to "follow along" with the Spark types in the expression tree.
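The extra post-processing step in the INT16 + INT32 example can be sketched on plain Python integers. This assumes the narrowing wraps in two's complement, as a C-style cast typically does; whether Spark expects wrapping or an overflow error in a given mode is outside this sketch. `wrap_to_int32` and `add_via_int64` are hypothetical helpers, not part of cudf or Spark.

```python
def wrap_to_int32(value: int) -> int:
    """Narrow an integer to 32 bits with two's-complement wraparound."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def add_via_int64(int16_values, int32_values):
    # Step 1: the add runs in a wide type (Python ints stand in for INT64).
    wide = [x + y for x, y in zip(int16_values, int32_values)]
    # Step 2: the additional pass that narrows the wide result to INT32.
    return [wrap_to_int32(v) for v in wide]


result = add_via_int64([1, -2, 100], [10, 20, 2_147_483_647])
```

The second pass is the cost being discussed: the wide result has to exist as a full column before it can be reduced to the type the next operation expects.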
Is it not okay to just keep the data in INT64? Or would that break the Catalyst expression?
Yes, if we don't produce the same result as Catalyst expects from an operation then we risk the next operation failing due to unexpected input types.
We're hoping that addressing #9557 will help make this less of an issue.
Is your feature request related to a problem? Please describe.
AST operators often implicitly upcast their inputs to INT32. For example, adding two INT8 inputs results in an INT32 output, and taking the absolute value of an INT8 input results in an INT32 output. If the intent is to produce an INT8 output, there's currently no mechanism to achieve this via the AST.
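This widening resembles C++ integral promotion, where arithmetic on types narrower than `int` is carried out in `int`. Assuming that is the source of the AST behavior (the issue does not say so explicitly) and that `int` is 32 bits wide, the rule can be modeled roughly as follows; `promoted_type`, `fits`, and the width table are illustrative only.

```python
# Bit widths of the signed integer types mentioned in this issue.
WIDTHS = {"INT8": 8, "INT16": 16, "INT32": 32, "INT64": 64}


def promoted_type(dtype: str) -> str:
    """Model integral promotion: types narrower than 32 bits become INT32."""
    return dtype if WIDTHS[dtype] >= 32 else "INT32"


def fits(value: int, dtype: str) -> bool:
    """Check whether a value is representable in the given signed type."""
    bits = WIDTHS[dtype]
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


add_type = promoted_type("INT8")        # type of INT8 + INT8 in this model
total = 100 + 100                       # a sum of two valid INT8 values
magnitude = abs(-128)                   # absolute value of the smallest INT8
```

Both `total` and `magnitude` fall outside the INT8 range but fit in INT32, which is one reason a widened result is a reasonable default; narrowing back to INT8 then has to be an explicit choice by the caller.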
Describe the solution you'd like
The AST could provide a CAST unary operator that could be placed in the expression tree to force a cast of the implicitly upcast INT32 output back to INT8. CAST would also come in handy when the original inputs are not the same type, since the AST requires the operand types of a binary operator to match.