[BUG] regexp_replace [^a] has different behavior between CPU and GPU for multiline strings #4229

Closed
andygrove opened this issue Nov 29, 2021 · 0 comments · Fixed by #4255
Labels
bug Something isn't working

@andygrove

Describe the bug

With the pattern [^a] (match any character except 'a'), Java/Spark matches newline characters but cuDF does not, so regexp_replace returns different results on CPU and GPU for multiline strings, as these code snippets demonstrate.

Spark/Java behavior

scala> import java.util.regex._
scala> val p = Pattern.compile("[^a]")
scala> p.matcher("a\nb\nc").replaceAll("1")
res0: String = a1111

cuDF behavior

>>> import cudf
>>> cudf.Series(['a\nb\nc']).str.replace('[^a]', '1', regex=True)
0    a\n1\n1
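
The two outputs above can be reproduced on the CPU alone, which may help when reasoning about the mismatch. The following is a minimal sketch using Python's standard `re` module, which, like Java, lets a negated character class match a newline. The second pattern, `[^a\n]`, is only an assumption chosen to mirror the reported cuDF output; it is not a description of how cuDF implements character classes.

```python
import re

text = "a\nb\nc"

# Java-style semantics: a negated class matches every character that is
# not listed, including the newline. Python's re module behaves the same way.
java_like = re.sub(r"[^a]", "1", text)

# Emulation of the reported cuDF output: the newline is treated as if it
# were also excluded from the class. This is an assumption used only to
# mirror the observed result, not cuDF's actual implementation.
cudf_like = re.sub(r"[^a\n]", "1", text)

print(repr(java_like))  # 'a1111'
print(repr(cudf_like))  # 'a\n1\n1'
```

The only positions where the two results differ are the newline characters, which matches the behavior described in this issue.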

Steps/Code to reproduce bug

Add the following integration test to string_test.py:

def test_regexp_replace_multiline():
    gen = mk_str_gen('[abcd]{0,3}\n[abcd]{0,3}')
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark: unary_op_df(spark, gen).selectExpr(
                'regexp_replace(a, "([^a])|([^b])", "1")'),
            conf={'spark.rapids.sql.expression.RegExpReplace': 'true'})
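
To see why this test is expected to fail, here is a rough CPU-only sketch of the comparison it performs, written with Python's `re` module rather than Spark or cuDF. The helper names `cpu_like` and `gpu_like` are hypothetical, and `gpu_like` assumes (based only on the output reported above) that a negated class never matches a newline on the GPU.

```python
import re

def cpu_like(s):
    # Java/Spark-style semantics: every character is either "not a" or
    # "not b", so every character, including the newline, is replaced.
    return re.sub(r"([^a])|([^b])", "1", s)

def gpu_like(s):
    # Hypothetical emulation of the reported GPU behavior: assume the
    # newline is never matched by a negated character class.
    return re.sub(r"([^a\n])|([^b\n])", "1", s)

# A sample string of the shape produced by '[abcd]{0,3}\n[abcd]{0,3}'.
sample = "ab\ncd"

print(repr(cpu_like(sample)))  # '11111'
print(repr(gpu_like(sample)))  # '11\n11'
```

Because every string from the generator pattern contains a newline, this assumption would make the CPU and GPU results differ on every row.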

Expected behavior
We either need to document the difference in the compatibility guide or we need to find a way to make the behavior consistent.

Environment details (please complete the following information)
N/A

Additional context
None

andygrove added the "bug" and "? - Needs Triage" labels on Nov 29, 2021
andygrove added this to the Nov 15 - Nov 26 milestone on Nov 29, 2021
andygrove self-assigned this on Nov 29, 2021
Salonijain27 removed the "? - Needs Triage" label on Nov 30, 2021