Transpile simple choice-type regular expressions into lists of choices to use with string replace multi #7967
Signed-off-by: Navin Kumar <[email protected]>
…-transpile-replace-multi
…-transpile-replace-multi
…e are backrefs Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
great improvements
def test_regexp_replace_fallback():
    gen = mk_str_gen('[abcdef]{0,2}')

    conf = { 'spark.rapids.sql.regexp.enabled': 'false' }
nit, here and other places, could use typed constants
Suggested change:
- conf = { 'spark.rapids.sql.regexp.enabled': 'false' }
+ conf = { 'spark.rapids.sql.regexp.enabled': False }
@@ -593,6 +593,11 @@ object GpuOverrides extends Logging {
    lit.value == null
  }

  def isSupportedStringReplacePattern(strLit: String): Boolean = {
    // check for regex special characters, except for \u0000 which we can support
    !regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))
I think it's more readable in a conjunctive form, but it is not part of your PR, so very optional:
Suggested change:
- !regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))
+ !regexList.exists(pattern => pattern != "\u0000" && strLit.contains(pattern))
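To illustrate why the two forms in this suggestion are interchangeable, here is a minimal Python sketch of the same check. `REGEX_LIST` is a hypothetical stand-in (the real `regexList` in GpuOverrides is not shown in this thread), and the function names are made up for the example.

```python
# Hypothetical stand-in for regexList; the real contents are an assumption.
REGEX_LIST = ["\\", "\u0000", "^", "$", ".", "|", "*", "+", "?"]


def is_supported_filter_form(str_lit):
    # Mirrors: !regexList.filterNot(_ == "\u0000").exists(p => strLit.contains(p))
    filtered = [p for p in REGEX_LIST if p != "\u0000"]
    return not any(p in str_lit for p in filtered)


def is_supported_conjunctive_form(str_lit):
    # Mirrors: !regexList.exists(p => p != "\u0000" && strLit.contains(p))
    return not any(p != "\u0000" and p in str_lit for p in REGEX_LIST)
```

Both functions skip `\u0000` and reject a string as soon as any other listed character occurs in it, so they return the same result for every input.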
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…-transpile-replace-multi
build
Signed-off-by: Navin Kumar <[email protected]>
build
LGTM
Fixes #7907.
This uses the updated stringReplace(Multi) API from rapidsai/cudf#12858 and rapidsai/cudf#12979 to optimize regular expressions that consist only of simple choices. For example, the regular expression aa|bb can be transpiled to the list ["aa", "bb"], which can be converted to a ColumnVector and passed to the new stringReplace(Multi) API without using any regex. This results in a speedup, especially for large strings.
Some performance numbers:
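The transpilation idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the plugin's Scala implementation or the cuDF API: `transpile_choices`, `replace_multi`, and the `_REGEX_META` character set are hypothetical names chosen for this example.

```python
import re

# Characters treated as regex metacharacters in this sketch (an assumption,
# not the plugin's actual list). '|' is handled separately as the separator.
_REGEX_META = set(".^$*+?()[]{}\\")


def transpile_choices(pattern):
    """Return the literal choices of a pattern like 'aa|bb', or None if the
    pattern is not a simple choice of non-empty literals."""
    choices = pattern.split("|")
    for choice in choices:
        if not choice or any(ch in _REGEX_META for ch in choice):
            return None
    return choices


def replace_multi(text, choices, replacement):
    """Replace every occurrence of any literal choice in one left-to-right
    pass, trying the choices in order at each position."""
    out = []
    i = 0
    while i < len(text):
        for choice in choices:
            if text.startswith(choice, i):
                out.append(replacement)
                i += len(choice)
                break
        else:
            # No choice matched at this position; copy one character.
            out.append(text[i])
            i += 1
    return "".join(out)


choices = transpile_choices("aa|bb")
if choices is not None:
    # Agrees with the regex-based result for this simple choice pattern.
    assert replace_multi("xaabbaay", choices, "-") == re.sub("aa|bb", "-", "xaabbaay")
```

A pattern containing anything beyond literal alternatives (for example `a+|bb`) returns None here, which corresponds to falling back to the regular regex path.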
This test created a Parquet file with 4096 rows of a single string column for each of several string lengths. It then called regexp_replace on the dataframe through SQL, using a regular expression that was a simple choice (e.g. aaaaa|bbbbb), and wrote the result back to Parquet. Parquet was used because keeping the entire dataset in memory results in an OOM (running out of heap space on the JVM) on the CPU as the string length is increased. The effective speedup is posted in a table. You can see this in the graph here:
One interesting result is that there is no obvious inflection point in terms of string length. In general, it looks like the multi-replace optimization either performs about the same as the GPU regexp_replace or, as the string size in the dataframe increases, its speedup over the GPU regexp_replace appears to grow. The optimized version transpiles the regex choice into a list of strings, which is then passed to the stringReplace(Multi) API and executed in parallel on the GPU.
Source Code of Benchmark