[FEA] Make RLike support consistent with Apache Spark #3797
Comments
Here are some additional notes, comparing regex support in Spark, Python, and cuDF.
Comparing Regex support between Python, cuDF, and Java/Spark
Methodology
A custom fuzzing tool was used to generate a Parquet file with a single column containing random strings.
Regular expressions were also randomly generated and verified by calling
Each expression was evaluated against the Parquet file with and without the RAPIDS Accelerator enabled and the results
Expressions that either caused exceptions in cuDF or produced different results between GPU and CPU were then
Scala / Java
This Java code matches Spark's implementation of regex in the
scala> import org.apache.commons.text.StringEscapeUtils
import org.apache.commons.text.StringEscapeUtils
scala> import java.util.regex.Pattern
import java.util.regex.Pattern
scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("foo").find(0)
res12: Boolean = true
scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("bar").find(0)
res11: Boolean = false
Spark
scala> Seq("foo", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'o{2}'")).show
+---+-----+
| c0|rlike|
+---+-----+
|foo| true|
|bar|false|
+---+-----+
Python
>>> import re
>>> print(re.compile('o{2}').search("foo"))
<re.Match object; span=(1, 3), match='oo'>
>>> print(re.compile('o{2}').search("bar"))
None
cuDF
>>> import cudf
>>> s1 = cudf.Series(['foo', 'bar'])
>>> s1.str.contains('o{2}', regex=True)
0     True
1    False
dtype: bool
Categories of Issue
Spark supports stacked quantifiers but Python and cuDF do not
scala> Seq("", "a", "b", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'a*+'")).show
+---+-----+
| c0|rlike|
+---+-----+
|   | true|
| a| true|
| b| true|
|bar| true|
+---+-----+
cuDF supports multi-line inputs
Null character in input
TBD - to be analyzed still
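For anyone who wants to reproduce this kind of comparison on a small scale, here is a rough sketch of the differential check described under Methodology. It is not the fuzzing tool used for this audit. Python's `re` stands in for the CPU reference, and `compare_engines`, `reference_match`, and the `candidate` callable are hypothetical names made up for illustration. In a real run the candidate would wrap the GPU path.

```python
import re
from typing import Callable, Iterable, List, Tuple

Matcher = Callable[[str, str], bool]


def reference_match(pattern: str, value: str) -> bool:
    """Stand-in reference engine (assumption: Python's re, not Spark)."""
    return re.search(pattern, value) is not None


def compare_engines(
    patterns: Iterable[str],
    inputs: List[str],
    candidate: Matcher,
    reference: Matcher = reference_match,
) -> List[Tuple[str, str]]:
    """Return (pattern, reason) pairs where the candidate engine either
    raises or disagrees with the reference on at least one input."""
    issues = []
    for pattern in patterns:
        try:
            expected = [reference(pattern, s) for s in inputs]
        except Exception:
            # Skip patterns that the reference itself rejects.
            continue
        try:
            actual = [candidate(pattern, s) for s in inputs]
        except Exception as e:
            issues.append((pattern, f"exception: {e}"))
            continue
        if actual != expected:
            issues.append((pattern, "mismatch"))
    return issues
```

For example, a candidate that treats the pattern as a literal substring (`lambda p, s: p in s`) agrees with the reference on `foo` but is flagged as a mismatch on `o{2}` for the inputs `foo` and `bar`.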
@revans2 Here are the results of the audit of regex behavior between Spark/Java, Python, and cuDF. These are the issues that I keep running into, and once these are resolved there may be others that I have not seen yet.
Thanks. It looks like we should start with the multi-line and null support simply because they are the same in all CPU environments.
There is an existing cuDF issue related to cuDF's stricter escaping requirements compared to Python and Java.
null support was already filed:
Empty group handling:
Is your feature request related to a problem? Please describe.
PR #3796 added an initial RLike implementation but there are a number of differences with the CPU implementation as noted in the documentation:
Describe the solution you'd like
We should make the GPU behavior consistent with CPU. We could also fall back to CPU in some cases.
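As a rough illustration of the fall-back idea, the sketch below scans a pattern for stacked quantifiers (a quantifier immediately followed by `*` or `+`, such as `a*+`) so that such expressions could be routed to the CPU. This is only a sketch under assumptions: `has_stacked_quantifier` is a hypothetical helper, not part of the plugin, and the scanner handles escapes and character classes in a simplified way rather than parsing the full Java regex grammar.

```python
def has_stacked_quantifier(pattern: str) -> bool:
    """Return True if a quantifier is directly followed by '*' or '+'.

    Simplified scanner (assumption): skips escaped characters and the
    contents of character classes; does not parse the full regex grammar.
    """
    i = 0
    in_class = False
    prev_quant = False  # was the previous token a quantifier?
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":
            # Escaped character: skip it, it is never a quantifier.
            i += 2
            prev_quant = False
            continue
        if in_class:
            # Inside [...], '*' and '+' are literals.
            if c == "]":
                in_class = False
            prev_quant = False
        elif c == "[":
            in_class = True
            prev_quant = False
        elif c in "*+":
            if prev_quant:
                return True
            prev_quant = True
        elif c == "?":
            prev_quant = True
        elif c == "{":
            close = pattern.find("}", i)
            if close != -1:
                # Treat {m}, {m,}, {m,n} as a single quantifier token.
                i = close
                prev_quant = True
            else:
                prev_quant = False
        else:
            prev_quant = False
        i += 1
    return False
```

With this check, `a*+` and `a{2}+` would be flagged for fall-back, while `o{2}`, `a+b*`, and `[*+]` would not.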
Describe alternatives you've considered
None
Additional context
N/A