Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add regular expression support to string_split #4714
Add regular expression support to string_split #4714
Changes from 8 commits
1cfc173
3929461
92366bd
51d870d
9c09603
5556717
14b873c
5cc7ccd
19918c2
481b356
02fad12
0ef4830
e396c69
c70390f
b7f0cba
4c94170
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of the
regex
expression completely? It is now useless since we usepattern
instead.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this can be achieved if you override
columnarEval
instead ofdoColumnar
similar to https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1432There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be switching this from a
GpuTernaryExpression
to aGpuBinaryExpression
. I personally don't see it as a big deal either way.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to stick with
GpuTernaryExpression
to match SparkThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we don't have to, right? I don't see any benefit from keeping it a
TernaryExpr
instead of justUnaryExpr
/GpuExpr
. I tried to implementGpuStringToMap
to inheritGpuExpression
and the evaluation function is super short: https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1507-R1518There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically we have evaluated the literal delimiter pattern before calling to the Gpu override, thus we only pass in ONE input string expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't I still need to pass in all of the expressions though so that I can implement
children()
correctly?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the original delimiter expression still needs to be passed in to initialize
children
, but it is not used anywhere in the evaluation later on.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delimiter expression already isn't used in the evaluation. It is only referenced in
override def second: Expression = regex
which is just used to constructchildren
infinal def children: Seq[Expression] = IndexedSeq(first, second, third)
.I'm not against making the change and am curious to see what the benefits are but I would rather do this as a follow-on issue and review how similar regexp expressions are implemented since they all follow this same pattern.