
[FEA] Improved control over AST operator casting behavior #8979

Closed
jlowe opened this issue Aug 5, 2021 · 7 comments · Fixed by #9379
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS



jlowe commented Aug 5, 2021

Is your feature request related to a problem? Please describe.
AST operators are often upcasting inputs to INT32 implicitly. For example, adding two INT8 inputs results in an INT32 output, or taking the absolute value of an INT8 input results in an INT32 output. If the intent is to produce an INT8 output, there's currently no mechanism to achieve this via the AST.

Describe the solution you'd like
The AST could provide a CAST unary operator that could be placed in the AST to force a cast of the implicitly-upcasted INT32 output to an INT8. CAST would come in handy in other cases where the original inputs are not the same type yet AST requires that the binary operator types match.

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Aug 5, 2021
@vyasr vyasr added this to the Conditional Joins milestone Aug 5, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Aug 23, 2021
@vyasr vyasr self-assigned this Oct 4, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 25, 2021
This PR resolves #8979, adding support for a few casting operators in AST code. These operators can be used to perform operations between columns with mismatched data types without materializing intermediates as new columns.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Conor Hoekstra (https://github.com/codereport)
  - Karthikeyan (https://github.com/karthikeyann)
  - Jason Lowe (https://github.com/jlowe)

URL: #9379

jlowe commented Oct 25, 2021

@vyasr should this be closed? I got the impression the cast support that was added in #9379 is relatively limited and would not cover the cases enumerated in the description.


vyasr commented Oct 26, 2021

@jlowe good call. What do you think would constitute sufficient work to close this issue? Introducing cast operators to cover all cudf types? Our main concern with that was simply ballooning the size of the operator dispatch; the operators we did add were the crucial ones needed to leverage the AST without materializing new columns. Is it important to also prioritize adding the downcast operators? I think we were planning to take a wait-and-see approach on those until there was a critical use case. CC @jrhemstad

@vyasr vyasr reopened this Oct 26, 2021
@jrhemstad
Contributor

If the intent is to produce an INT8 output, there's currently no mechanism to achieve this via the AST.

@jlowe while you're not wrong, I don't see when this would be all that important. Just for saving memory in the final output?

Unfortunately, it's not likely we'll be able to support casting any type to any other type. The goal of #9379 was to help accommodate the fact that all the binary operators require the inputs to be the same type. By upcasting to the widest type, we give you a way to avoid having to materialize your inputs in the same type to begin with.


jlowe commented Oct 27, 2021

I don't see when would this be all that important. Just for saving memory in the final output?

It would primarily be to avoid materializing intermediates before the final output. As one trivial example, I want to evaluate a Spark projection expression as an AST expression, and it's an add of an INT16 input and an INT32 input. Spark will insert a cast of the INT16 to INT32 and then perform the add, producing an INT32 output, but this can't be expressed in AST. We could perform the add by upcasting both sides to INT64 and then materialize that output, but I then have to add a post-processing step to reduce the INT64 to INT32.

Closing this for now. For purposes of performing a boolean predicate in a join, which is our main use case for AST, I think we can work with the casts in #9379. It will add some complexity in translating the Spark expression to AST since we can't rely on AST cast operators to "follow along" with the Spark types in the expression tree.

@jlowe jlowe closed this as completed Oct 27, 2021
@jrhemstad
Contributor

I then have to add a post-processing step to reduce the INT64 to INT32.

Is it not okay to just keep the data in INT64? Or would that break the Catalyst expression?


jlowe commented Oct 28, 2021

Or would that break the Catalyst expression?

Yes, if we don't produce the same result as Catalyst expects from an operation then we risk the next operation failing due to unexpected input types.


vyasr commented Nov 1, 2021

We're hoping that addressing #9557 will help make this less of an issue.

@GregoryKimball GregoryKimball removed this from the Conditional Joins milestone Oct 24, 2022
@GregoryKimball GregoryKimball added this to the Expression evaluation milestone Oct 24, 2022