Add regular expression support to string_split #4714

andygrove · 2022-02-07T20:59:27Z

Signed-off-by: Andy Grove [email protected]

Closes #4003

Depends on rapidsai/cudf#10139

Follow-on issues:

Status:

Draft implementation
Move some logic from GpuStringSplit to GpuStringSplitMeta
Update compatibility guide
File follow-on issues for limit = 0 or 1, and supporting line and string anchors
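The limit cases called out in the status list can be illustrated with the JVM's own `String.split(regex, limit)`. This is a minimal sketch of how the limit argument changes the result on the JVM, not the plugin's GPU code path; the object and method names are hypothetical, and Spark's own handling of individual limit values may differ in detail.

```scala
// Hypothetical helper for illustration only; not part of spark-rapids.
object SplitLimits {
  // Thin wrapper over java.lang.String.split(String, int).
  def split(s: String, regex: String, limit: Int): Seq[String] =
    s.split(regex, limit).toSeq

  def main(args: Array[String]): Unit = {
    val input = "a1b2c3"
    // Negative limit: split as often as possible, keep trailing empty strings.
    println(split(input, "[0-9]", -1)) // elements: a, b, c, ""
    // Zero limit: split as often as possible, drop trailing empty strings.
    println(split(input, "[0-9]", 0))  // elements: a, b, c
    // Limit of 1: no split is performed, the input is returned whole.
    println(split(input, "[0-9]", 1))  // elements: a1b2c3
    // Limit of 2: at most one split.
    println(split(input, "[0-9]", 2))  // elements: a, b2c3
  }
}
```

Limits of 0 and 1 each behave differently from the general positive and negative cases, which is one plausible reason to handle them separately.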

Signed-off-by: Andy Grove <[email protected]>

ttnghia · 2022-02-08T16:53:51Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

+case class GpuStringSplit(str: Expression, regex: Expression, limit: Expression,
+    isRegExp: Boolean, pattern: String)


Can we get rid of the regex expression completely? It is now useless since we use pattern instead.

Maybe this can be achieved if you override columnarEval instead of doColumnar similar to https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1432

It would be switching this from a GpuTernaryExpression to a GpuBinaryExpression. I personally don't see it as a big deal either way.

I would prefer to stick with GpuTernaryExpression to match Spark.

But we don't have to, right? I don't see any benefit in keeping it a TernaryExpr instead of just a UnaryExpr/GpuExpr. I implemented GpuStringToMap to extend GpuExpression directly, and the evaluation function is very short: https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1507-R1518

Basically, the literal delimiter pattern has already been evaluated before the GPU override is called, so we only pass in ONE input string expression.

Wouldn't I still need to pass in all of the expressions though so that I can implement children() correctly?

Yes, the original delimiter expression still needs to be passed in to initialize children, but it is not used anywhere in the evaluation later on.

The delimiter expression already isn't used in the evaluation. It is only referenced in override def second: Expression = regex which is just used to construct children in final def children: Seq[Expression] = IndexedSeq(first, second, third).

I'm not against making the change and am curious to see what the benefits are, but I would rather do this as a follow-on issue and review how similar regexp expressions are implemented, since they all follow this same pattern.
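The shape being discussed in this thread can be sketched as follows. The types here (`Expr`, `Lit`, `Col`, `SplitSketch`) are hypothetical, simplified stand-ins rather than the Spark or spark-rapids classes; the sketch only shows that the delimiter expression is kept to populate `children` while evaluation reads the pre-extracted pattern.

```scala
// Hypothetical, simplified stand-ins; not the real Spark/spark-rapids types.
object SplitShape {
  trait Expr { def children: Seq[Expr] }
  case class Lit(value: Any) extends Expr { val children: Seq[Expr] = Nil }
  case class Col(name: String) extends Expr { val children: Seq[Expr] = Nil }

  // Ternary shape: all three inputs are retained so that children mirrors
  // the CPU expression, but eval only uses the already-extracted pattern
  // and limit value (both assumed to come from literals).
  case class SplitSketch(
      str: Expr,
      regex: Expr,
      limit: Expr,
      pattern: String,
      limitValue: Int) extends Expr {

    def children: Seq[Expr] = IndexedSeq(str, regex, limit)

    // The regex expression is never consulted here.
    def eval(input: String): Seq[String] =
      input.split(pattern, limitValue).toSeq
  }
}
```

Under these assumptions, dropping `regex` from the constructor would change what `children` reports but not what `eval` computes, which is the trade-off the reviewers are weighing.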

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

...src/main/301until310-nondb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceMeta.scala

revans2

Nothing big, this is looking good.

integration_tests/src/main/python/string_test.py

revans2 · 2022-02-09T15:11:33Z

andygrove · 2022-02-09T18:18:29Z

build

sql-plugin/src/main/301db/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

...plugin/src/main/311+-nondb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

sql-plugin/src/main/31xdb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

revans2

I stopped reviewing after a bit because it looks like there might be some other code in here by accident.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/conditionalExpressions.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

This reverts commit c70390f.

revans2

Looks good to me

This PR adds Java bindings for the new strings APIs `strings::split_re` and `strings::split_record_re`, which allow splitting strings by regular expression delimiters. In addition, the Java string split overloads with a default split pattern (an empty string) are removed in this PR, because with the default empty pattern Java's split API produces different results than cudf. Finally, some cleanup has been performed automatically by the IntelliJ IDE. Depends on #10128. This is a breaking change which is fixed by NVIDIA/spark-rapids#4714, so it should be merged at the same time as NVIDIA/spark-rapids#4714.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Andy Grove (https://github.com/andygrove)

URL: #10139
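For reference, the JVM side of the empty-pattern behaviour mentioned in the description above looks like this. It is a small sketch using only `java.lang.String.split` on Java 8 or later; the object and method names are hypothetical, and the cudf side of the comparison is not shown.

```scala
// Hypothetical helper for illustration only.
object EmptyPatternSplit {
  // Splits with an empty pattern using the JVM's String.split(String).
  def jvmSplit(s: String): Seq[String] = s.split("").toSeq

  def main(args: Array[String]): Unit = {
    // On Java 8+, an empty pattern yields one element per character.
    println(jvmSplit("abc")) // elements: a, b, c
  }
}
```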

revans2 · 2022-02-14T17:33:54Z

build

Draft implementation of string_split with regexp support

1cfc173

Signed-off-by: Andy Grove <[email protected]>

andygrove added this to the Jan 31 - Feb 11 milestone Feb 7, 2022

andygrove self-assigned this Feb 7, 2022

Rename maxSplit to limit and add TODO comment

3929461

ttnghia reviewed Feb 7, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

andygrove added 2 commits February 7, 2022 14:44

code cleanup and add separate tests for negative, zero, and positive limits

92366bd

fall back to CPU for limit of 0 or 1

51d870d

ttnghia mentioned this pull request Feb 7, 2022

Add JNI for strings::split_re and strings::split_record_re rapidsai/cudf#10139

Merged

andygrove added 2 commits February 7, 2022 16:13

fall back to CPU for split on regex containing string or line anchors

9c09603

move some logic from GpuStringSplit to GpuStringSplitMeta

5556717
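The anchor fallback named in the commit above could be driven by a check along these lines. This is a hypothetical sketch, not the plugin's actual logic: it treats unescaped `^` and `$` outside a character class, plus the escapes `\A`, `\Z`, and `\z`, as anchors, and it ignores corner cases such as a literal `]` at the start of a class or `\Q...\E` quoting.

```scala
// Hypothetical helper; the name and rules are assumptions for illustration.
object AnchorCheck {
  def containsAnchor(pattern: String): Boolean = {
    var i = 0
    var inClass = false // true while scanning inside [...]
    while (i < pattern.length) {
      pattern.charAt(i) match {
        case '\\' =>
          // \A, \Z and \z are string anchors; any other escape is skipped.
          if (!inClass && i + 1 < pattern.length &&
              "AZz".indexOf(pattern.charAt(i + 1)) >= 0) {
            return true
          }
          i += 1 // step over the escaped character
        case '[' if !inClass => inClass = true
        case ']' if inClass => inClass = false
        // Inside a class, ^ is negation and $ is a literal, so skip those.
        case '^' | '$' if !inClass => return true
        case _ =>
      }
      i += 1
    }
    false
  }
}
```

A caller could use the result to decide whether to keep the expression on the CPU, for example when `AnchorCheck.containsAnchor("^a")` returns true.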

This was referenced Feb 8, 2022

[FEA] GpuStringSplit: Add support for line and string anchors in regular expressions #4719

Closed

[FEA] GpuStringSplit: Add support for limit = 0 and limit =1 #4720

Closed

This was referenced Feb 8, 2022

[FEA] Support regular expression delimiters for str_to_map #4721

Closed

Support str_to_map [databricks] #4636

Merged

Additional tests

14b873c

ttnghia reviewed Feb 8, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

ttnghia reviewed Feb 8, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

ttnghia reviewed Feb 8, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

ttnghia reviewed Feb 8, 2022

...src/main/301until310-nondb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceMeta.scala

andygrove added 2 commits February 8, 2022 16:07

update shims

5cc7ccd

check that expression has been tagged

19918c2

revans2 reviewed Feb 9, 2022

update split_re tests to actually use regexp rather than simple strings

481b356

andygrove changed the title ~~WIP: Draft implementation of string_split with regexp support~~ WIP: Draft implementation of string_split with regexp support [databricks] Feb 9, 2022

andygrove added 3 commits February 10, 2022 14:13

merge from branch-22.04

02fad12

fix merge issue

0ef4830

update compatibility guide

e396c69

ttnghia reviewed Feb 10, 2022

sql-plugin/src/main/301db/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

ttnghia reviewed Feb 10, 2022

...plugin/src/main/311+-nondb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

ttnghia reviewed Feb 10, 2022

sql-plugin/src/main/31xdb/scala/com/nvidia/spark/rapids/shims/v2/GpuRegExpReplaceExec.scala

fix incorrect imports in shim layer

c70390f

andygrove changed the title ~~WIP: Draft implementation of string_split with regexp support [databricks]~~ WIP: Add regular expression support to string_split [databricks] Feb 11, 2022

revans2 reviewed Feb 11, 2022

andygrove added 2 commits February 11, 2022 09:16

Revert "fix incorrect imports in shim layer"

b7f0cba

This reverts commit c70390f.

fix incorrect imports in shim layer

4c94170

revans2 approved these changes Feb 11, 2022

sameerz added the feature request New feature or request label Feb 11, 2022

andygrove changed the title ~~WIP: Add regular expression support to string_split [databricks]~~ WIP: Add regular expression support to string_split Feb 11, 2022

andygrove changed the title ~~WIP: Add regular expression support to string_split~~ Add regular expression support to string_split Feb 14, 2022

andygrove marked this pull request as ready for review February 14, 2022 16:12

revans2 merged commit 3c48c96 into NVIDIA:branch-22.04 Feb 14, 2022

andygrove deleted the string-split-regexp branch February 14, 2022 20:56
