[BUG] regexp_replace hangs with specific inputs and patterns #8323

Closed
andygrove opened this issue May 18, 2023 · 3 comments · Fixed by #8433
Labels: bug

Describe the bug

regexp_replace hangs on the GPU with specific inputs and patterns.

Steps/Code to reproduce bug

Add this test to regexp_test.py:

def test_re_replace_all():
    gen = mk_str_gen('.{0,2}\n{0,2}.{0,2}\n{0,2}')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'REGEXP_REPLACE(a, ".*$", "PROD", 1)'),
        conf=_regexp_conf)

Expected behavior
The query should either fall back to the CPU or complete successfully on the GPU.

Environment details
Seen with CUDA 11.7 and CUDA 12, on Spark 3.3.x.

andygrove added the "bug" and "? - Needs Triage" labels on May 18, 2023
andygrove self-assigned this on May 18, 2023
mattahrens removed the "? - Needs Triage" label on May 19, 2023

andygrove commented May 19, 2023

Here is a Spark repro:

val df = Seq("one\ntwo", "three\n\n").toDF("strF").repartition(2)
df.createOrReplaceTempView("regular_match_table")
spark.sql("SELECT regexp_replace(strF, '.*$', 'scala') FROM regular_match_table").show()

Note that we are transpiling the Java pattern .*$ to a cuDF pattern of [^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$.

I have not been able to reproduce this in cuDF using the latest nightly.

>>> import cudf
>>> cudf.__version__
'23.06.00'
>>> s = cudf.Series(["one\ntwo", "three\n\n"])
>>> s.str.replace('[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$', 'scala', 1, regex=True)
0        one\nscala
1    three\nscala\n
dtype: object
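
As a CPU-side cross-check of the output above, the same replacement can be sketched with Python's standard `re` module. This is only an illustration: `re` is neither the cuDF regex engine nor Java's, and its `$` and line-terminator rules differ from both, so agreement on these two inputs says nothing about other inputs. The helper name `replace_first` is made up for this sketch.

```python
import re

# Transpiled pattern quoted in this thread for the Java pattern .*$
TRANSPILED = "[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$"


def replace_first(s, repl="scala"):
    """Replace only the first match, mirroring the count of 1 used above.

    Hypothetical helper; runs on the CPU with Python's re engine.
    """
    return re.sub(TRANSPILED, repl, s, count=1)


inputs = ["one\ntwo", "three\n\n"]
results = [replace_first(s) for s in inputs]
for s, r in zip(inputs, results):
    print(repr(s), "->", repr(r))
```

For these two strings the sketch yields `one\nscala` and `three\nscala\n`, the same values as the cuDF nightly output shown above.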

andygrove commented

Here is a Java repro, using the published cuDF 23.04 jar:

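import ai.rapids.cudf.*;

public class Main {
  public static void main(String[] args) {
    // cuDF pattern produced by transpiling the Java pattern .*$
    String pattern = "[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$";
    String repl = "scala${1}";
    RegexProgram prog = new RegexProgram(pattern);
    try (ColumnVector v = ColumnVector.fromStrings("one\ntwo", "three\n\n");
         // The replace call is the only operation in this repro; its result
         // is declared as a resource so it is closed if the call returns.
         ColumnVector replaced = v.stringReplaceWithBackrefs(prog, repl)) {
      // Nothing to do with the result; the repro only needs the call to run.
    }
  }
}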

andygrove commented

It is also reproducible with a simpler pattern that does not include the Unicode characters:

    try (ColumnVector v = ColumnVector.fromStrings("one\ntwo", "three\n\n")) {
      String pattern = "[^\n\r]*(\r|\r\n)?$";
      String repl = "scala${1}";
      RegexProgram prog = new RegexProgram(pattern);
      v.stringReplaceWithBackrefs(prog, repl);
    }
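
Because the repro hangs instead of failing, it can help to run it under a time limit so a test run does not block forever. The following is a minimal sketch using only the Python standard library; `run_with_timeout` is a hypothetical helper, the command to launch (for example the compiled Java repro) is left to the caller, and whether a process stuck in a GPU kernel can always be killed cleanly is not something this thread establishes.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of argv strings) and classify the outcome.

    Returns ("ok" | "failed" | "hung", captured stdout).
    Hypothetical helper for bounding a repro that may never return.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "hung", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


# Stand-in command for illustration; replace with the real repro command.
status, out = run_with_timeout([sys.executable, "-c", "print('done')"], 30)
print(status, out.strip())
```

A result of "hung" only means the time limit was exceeded, so the limit should be set generously compared with the expected run time of a healthy run.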
