Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] String support for AST expressions #8858

Closed
jlowe opened this issue Jul 26, 2021 · 16 comments
Closed

[FEA] String support for AST expressions #8858

jlowe opened this issue Jul 26, 2021 · 16 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Jul 26, 2021

AST expressions currently only accept numeric, timestamp, and duration literals. Strings are very common in ETL processing, and it would be nice to be able to process string-based expressions such as string equality, comparison, etc. along with string literals within the expression.

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Jul 26, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Jul 27, 2021
@beckernick beckernick added this to the Conditional Joins milestone Jul 27, 2021
@lixiaolx
Copy link

Can you help answer the two problems encountered during use?

First: When using the c++ interface about AST, is there a conversion process for c++ expressions (for example: "(a+b)*2+(c-d)/f") into cudf-expression objects in cudf? Is there a corresponding calling interface? Also need to write the corresponding conversion function code in the use process to realize the conversion of character expression into expression in cudf?

second: In the process of using , I encountered many string-related operations. This is currently not supported (why is it not implemented?). Can you support it in a high level? Is it difficult to implement this function?

@jrhemstad
Copy link
Contributor

is there a conversion process for c++ expressions (for example: "(a+b)*2+(c-d)/f") into cudf-expression objects in cudf?

No, not yet. Most clients are forming these expressions at runtime where such an interface could not be used and as such it has has not been a priority.

Is there a corresponding calling interface?

An AST expression is accepted in only two places currently: compute_column and conditional joins.

Also need to write the corresponding conversion function code in the use process to realize the conversion of character expression into expression in cudf?

I don't understand the question.

This is currently not supported (why is it not implemented?). Can you support it in a high level? Is it difficult to implement this function?

As the above issue description states, we do not currently support string related operations in AST expressions. Supporting these operations is more complicated and it is something we are actively working on.

@lixiaolx
Copy link

As the above issue description states, we do not currently support string related operations in AST expressions. Supporting these operations is more complicated and it is something we are actively working on.

Okay, thank you very much. We are very interested in AST support for string, which can solve our big problems. Support string should be more complicated.
Can you help with high priority support? OR Can you help implement a draft version, and then I help improve it?

@davidwendt
Copy link
Contributor

it would be nice to be able to process string-based expressions such as string equality, comparison, etc. along with string literals within the expression.

Can you be more specific on what string-based expressions you are needing? The etc part of the above sentence is a bit concerning.

@lixiaolx
Copy link

Can you be more specific on what string-based expressions you are needing? The etc part of the above sentence is a bit concerning.

Yeah, I'm currently using cudf to achieve data filtering of bool expressions(last result is bool)
The main data types on the AST are: string, uint, float
The designed operations mainly include:AND(&&), OR(||), LE(<=), GE(>=), LT(<), GT(>), EQ(==), NE(!=), MATCH(for example: ~~ ".123."),NOT_MATCH(for example: !~ ".123.")

From the current ast implementation, the data types are not supported now, such as string and float, and the operations are not supported, such as match and not_ match

@davidwendt
Copy link
Contributor

What is MATCH and NOT_MATCH specifically? Are they just equal/not-equal for strings or are the regular expressions?

@lixiaolx
Copy link

sorry, My fault, I didn't make it clear, they are regular expressions

@jlowe
Copy link
Member Author

jlowe commented Sep 15, 2021

Can you be more specific on what string-based expressions you are needing?

The "etc" is sorta "as much parity as we can get with existing libcudf string column functionality," and I understand that AST support will always lag far behind what libcudf can do directly on string columns. For the purposes of this issue being resolved, I'd like string comparison and string literals to be supported.

As for followup work on other AST string operations, I'd roughly rank the priority as follows (most desired to least desired) from the Spark perspective:

  • Substring matching (i.e.: predicate function for does this string contain this other string) and the related predicate functions starts with and ends with.
  • String casting (to string and from string) between other AST supported types
  • String regular expression support
  • String concatenation

I don't see how regex can be supported without taking a huge hit to the AST kernel performance and memory footprint. Maybe there's a separate AST kernel that's slower to execute when this operator appears in the AST expression tree? Thinking along the lines of how regex today executes different kernels to allow simple expressions to remain fast while still supporting more complicated ones. We may need to take a similar approach just for string support in general if the AST kernel slows down significantly when even "simple" string operations are added.

@jrhemstad
Copy link
Contributor

String comparison should be fairly easy. String casting and substring matching predicates maybe. String concat is extremely unlikely as the intermediate data size would be unbounded.

I don't think there's a world in which we're going to try and support regex within the context of AST evaluation.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@jlowe
Copy link
Member Author

jlowe commented Dec 14, 2021

Still desired

@jrhemstad jrhemstad added the 0 - Backlog In queue waiting for assignment label Dec 15, 2021
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Jan 9, 2023
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Feb 2, 2023

Thanks for this discussion. Now that we've been through a similar exercise with string UDFs in cuDF-python, I'd like to take another pass over the utility of certain operators with string input types. First, I expect string support only for input types in the foreseeable future. Second, we should again reiterate that expanding string operators to include regex is not currently in scope. I would imagine only the following operations to be supported at first:

length, find, rfind, starts_with, contains, ends_with, ==, !=, <

@jlowe Given the constraint that the operators must be non-regex and only return int or boolean types, would you please share which string operators would be most useful?
@bdice Would you please share your thoughts on which operators belong in the first extension of libcudf ASTs to include string input types?

@jlowe
Copy link
Member Author

jlowe commented Feb 2, 2023

I'd start with basic comparison operators first. We just ran across a join today using an OR condition on two different string equality conditions. Next in line would likely be contains and starts_with.

@vyasr
Copy link
Contributor

vyasr commented Feb 4, 2023

I would say start with either string length or string comparison. Once the internals are in place to support that operations accepting strings, adding support for operations like contains or starts_with should be an easier lift. Also for reference, this earlier comment contains @jlowe's original wish list for Spark.

We will need to be extremely careful in benchmarking how adding these operators affects the performance of existing operators.

@bdice
Copy link
Contributor

bdice commented Feb 22, 2023

We will need to be extremely careful in benchmarking how adding these operators affects the performance of existing operators.

In discussion with @GregoryKimball, we identified three measurements needed for any major changes to AST features:

Additionally, we have some potential mitigations worth mentioning. For instance, it would be possible to compile two variations of the mixed join kernel (the most expensive AST kernel) with different operator/type support. One variation would be fully-featured. The second variation would support a limited subset of operators/types (and remove the rest of the features at compile time), which enables it to have lower register usage and higher occupancy, yielding higher performance.

If we dispatch to two mixed join kernels with different features (limited fast path and full-featured slow path), we will also need to split their source files as in #10671.

@bdice bdice closed this as completed Feb 22, 2023
@bdice bdice reopened this Feb 22, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 20, 2023
Adding string scalar support in AST.
A new generic scalar device view class is added in AST to support numeric, timestamp, duration and string scalars.

Register count did not change, and benchmark results are almost same.
Compile time - There is major increase in join.cu by 15%. Other files are in range of -2% to 7%
 
Addressed part of #8858

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13061
@GregoryKimball
Copy link
Contributor

Closing in favor of #13358

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

8 participants