[FEA] String support for AST expressions #8858
Comments
Can you help answer two questions that came up during use? First: when using the C++ AST interface, is there a way in cudf to convert a C++ expression string (for example, "(a+b)*2+(c-d)/f") into cudf expression objects? Is there a corresponding interface to call, or do users need to write their own conversion function to turn the string expression into a cudf expression? Second: I encountered many string-related operations, which are not currently supported. Why are they not implemented? Could they be supported with high priority? Is this functionality difficult to implement? |
No, not yet. Most clients form these expressions at runtime, where such an interface could not be used, so it has not been a priority.
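Since no string-to-expression interface exists, callers who need one would have to write their own conversion. The following is only a minimal host-side sketch of that idea, assuming a small grammar with `+ - * /`, parentheses, numeric literals, and column names. `Node`, `Parser`, and `to_string` are hypothetical names and are not part of libcudf; mapping each resulting node onto cudf AST objects (for example `cudf::ast::column_reference`, `cudf::ast::literal`, and `cudf::ast::operation`) is left to the caller.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical host-side expression tree. None of these names come from libcudf.
struct Node {
  enum class Kind { Column, Literal, Binary };
  Kind kind = Kind::Literal;
  std::string name;    // column name when kind == Column
  double value = 0.0;  // numeric value when kind == Literal
  char op = 0;         // '+', '-', '*' or '/' when kind == Binary
  std::unique_ptr<Node> lhs;
  std::unique_ptr<Node> rhs;
};

using NodePtr = std::unique_ptr<Node>;

// Recursive-descent parser for: sum := product (('+'|'-') product)*,
// product := atom (('*'|'/') atom)*, atom := '(' sum ')' | number | identifier.
class Parser {
 public:
  explicit Parser(std::string text) : text_(std::move(text)) {}

  NodePtr parse() {
    NodePtr root = parse_sum();
    skip_spaces();
    if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
    return root;
  }

 private:
  NodePtr parse_sum() { return parse_chain("+-", &Parser::parse_product); }
  NodePtr parse_product() { return parse_chain("*/", &Parser::parse_atom); }

  // Left-associative chain of operands joined by any operator listed in `ops`.
  NodePtr parse_chain(std::string const& ops, NodePtr (Parser::*operand)()) {
    NodePtr left = (this->*operand)();
    for (;;) {
      skip_spaces();
      if (pos_ >= text_.size() || ops.find(text_[pos_]) == std::string::npos) return left;
      char op = text_[pos_++];
      NodePtr right = (this->*operand)();
      assert(left && right);
      auto node = std::make_unique<Node>();
      node->kind = Node::Kind::Binary;
      node->op = op;
      node->lhs = std::move(left);
      node->rhs = std::move(right);
      left = std::move(node);
    }
  }

  NodePtr parse_atom() {
    skip_spaces();
    if (pos_ >= text_.size()) throw std::runtime_error("unexpected end of input");
    unsigned char c = static_cast<unsigned char>(text_[pos_]);
    if (c == '(') {
      ++pos_;
      NodePtr inner = parse_sum();
      skip_spaces();
      if (pos_ >= text_.size() || text_[pos_] != ')') throw std::runtime_error("expected ')'");
      ++pos_;
      return inner;
    }
    auto node = std::make_unique<Node>();
    if (std::isdigit(c)) {
      std::size_t used = 0;
      node->kind = Node::Kind::Literal;
      node->value = std::stod(text_.substr(pos_), &used);
      pos_ += used;
    } else if (std::isalpha(c) || c == '_') {
      std::size_t start = pos_;
      while (pos_ < text_.size() &&
             (std::isalnum(static_cast<unsigned char>(text_[pos_])) || text_[pos_] == '_')) {
        ++pos_;
      }
      node->kind = Node::Kind::Column;
      node->name = text_.substr(start, pos_ - start);
    } else {
      throw std::runtime_error("unexpected character");
    }
    return node;
  }

  void skip_spaces() {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }

  std::string text_;
  std::size_t pos_ = 0;
};

// Renders the tree fully parenthesized, which makes precedence visible.
std::string to_string(Node const& n) {
  switch (n.kind) {
    case Node::Kind::Column: return n.name;
    case Node::Kind::Literal: {
      std::ostringstream os;
      os << n.value;
      return os.str();
    }
    case Node::Kind::Binary:
      return "(" + to_string(*n.lhs) + " " + n.op + " " + to_string(*n.rhs) + ")";
  }
  return {};
}
```

Calling `to_string(*Parser("(a+b)*2+(c-d)/f").parse())` returns the fully parenthesized form of the example expression, which is a convenient way to check the tree before translating it into cudf objects.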
An AST expression is accepted in only two places currently:
I don't understand the question.
As the above issue description states, we do not currently support string related operations in AST expressions. Supporting these operations is more complicated and it is something we are actively working on. |
Okay, thank you very much. We are very interested in AST support for strings, which would solve a major problem for us. We understand that supporting strings is more complicated. |
Can you be more specific on what string-based expressions you are needing? The etc part of the above sentence is a bit concerning. |
Yes. I'm currently using cudf to filter data with boolean expressions (the final result is a bool). In the current AST implementation, some data types are not supported, such as string and float, and some operations are not supported, such as match and not_match |
What is |
Sorry, my fault, I didn't make it clear: they are regular expressions |
The "etc" is sorta "as much parity as we can get with existing libcudf string column functionality," and I understand that AST support will always lag far behind what libcudf can do directly on string columns. For the purposes of this issue being resolved, I'd like string comparison and string literals to be supported. As for followup work on other AST string operations, I'd roughly rank the priority as follows (most desired to least desired) from the Spark perspective:
I don't see how regex can be supported without taking a huge hit to the AST kernel performance and memory footprint. Maybe there's a separate AST kernel that's slower to execute when this operator appears in the AST expression tree? Thinking along the lines of how regex today executes different kernels to allow simple expressions to remain fast while still supporting more complicated ones. We may need to take a similar approach just for string support in general if the AST kernel slows down significantly when even "simple" string operations are added. |
String comparison should be fairly easy. String casting and substring matching predicates maybe. String concat is extremely unlikely as the intermediate data size would be unbounded. I don't think there's a world in which we're going to try and support regex within the context of AST evaluation. |
This issue has been labeled |
Still desired |
Thanks for this discussion. Now that we've been through a similar exercise with string UDFs in cuDF-python, I'd like to take another pass over the utility of certain operators with string input types. First, I expect string support only for input types in the foreseeable future. Second, we should again reiterate that expanding string operators to include regex is not currently in scope. I would imagine only the following operations to be supported at first:
@jlowe Given the constraint that the operators must be non-regex and only return int or boolean types, would you please share which string operators would be most useful? |
I'd start with basic comparison operators first. We just ran across a join today using an OR condition on two different string equality conditions. Next in line would likely be |
I would say start with either string length or string comparison. Once the internals are in place to support operations accepting strings, adding support for operations like We will need to be extremely careful in benchmarking how adding these operators affects the performance of existing operators. |
In discussion with @GregoryKimball, we identified three measurements needed for any major changes to AST features:
Additionally, we have some potential mitigations worth mentioning. For instance, it would be possible to compile two variations of the mixed join kernel (the most expensive AST kernel) with different operator/type support. One variation would be fully-featured. The second variation would support a limited subset of operators/types (and remove the rest of the features at compile time), which enables it to have lower register usage and higher occupancy, yielding higher performance. If we dispatch to two mixed join kernels with different features (limited fast path and full-featured slow path), we will also need to split their source files as in #10671. |
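The fast-path/slow-path dispatch described above could look roughly like the host-side sketch below. It is an illustration only, not libcudf code: `DataType`, `ExprNode`, `KernelVariant`, `uses_string`, and `choose_variant` are hypothetical names, and the sketch assumes that string support is the feature separating the two kernel variants.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration; these are not libcudf names.
enum class DataType { Int32, Float64, Timestamp, String };
enum class KernelVariant { LimitedFastPath, FullFeaturedSlowPath };

// A simplified expression node: its result/input type plus child expressions.
struct ExprNode {
  DataType type;
  std::vector<ExprNode> children;
};

// Returns true if any node in the tree involves a string type.
bool uses_string(ExprNode const& node) {
  if (node.type == DataType::String) return true;
  for (auto const& child : node.children) {
    if (uses_string(child)) return true;
  }
  return false;
}

// Picks the kernel variant on the host before launch. Expressions that stay
// within the limited feature set keep the lower-register fast path.
KernelVariant choose_variant(ExprNode const& root) {
  return uses_string(root) ? KernelVariant::FullFeaturedSlowPath
                           : KernelVariant::LimitedFastPath;
}

// Usage: a numeric-only expression versus one that compares two strings.
void example_usage() {
  ExprNode numeric{DataType::Int32,
                   {ExprNode{DataType::Int32, {}}, ExprNode{DataType::Float64, {}}}};
  ExprNode with_strings{DataType::Int32,
                        {ExprNode{DataType::String, {}}, ExprNode{DataType::String, {}}}};
  assert(choose_variant(numeric) == KernelVariant::LimitedFastPath);
  assert(choose_variant(with_strings) == KernelVariant::FullFeaturedSlowPath);
}
```

Making the choice once on the host means existing numeric-only expressions never pay for the extra features, which is the motivation given for compiling two variants.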
Adding string scalar support in AST. A new generic scalar device view class is added in AST to support numeric, timestamp, duration and string scalars.
Register count did not change, and benchmark results are almost the same.
Compile time: there is a major increase in join.cu, by 15%. Other files are in the range of -2% to 7%.
Addressed part of #8858
Authors:
- Karthikeyan (https://github.com/karthikeyann)
Approvers:
- Mark Harris (https://github.com/harrism)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #13061
Closing in favor of #13358 |
AST expressions currently only accept numeric, timestamp, and duration literals. Strings are very common in ETL processing, and it would be nice to be able to process string-based expressions such as string equality, comparison, etc. along with string literals within the expression.