[BUG] regexp_replace hangs with specific inputs and patterns #8323

andygrove · 2023-05-18T20:39:41Z

Describe the bug

regexp_replace hangs on the GPU with specific inputs and patterns.

Steps/Code to reproduce bug

Add this test to regexp_test.py:

def test_re_replace_all():
    gen = mk_str_gen('.{0,2}\n{0,2}.{0,2}\n{0,2}')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'REGEXP_REPLACE(a, ".*$", "PROD", 1)'),
        conf=_regexp_conf)

Expected behavior
The query should either fall back to the CPU or complete successfully on the GPU.

Environment details (please complete the following information)
Seen with CUDA 11.7 and CUDA 12, with Spark 3.3.x.

Additional context


andygrove · 2023-05-19T19:02:50Z

Here is a Spark repro:

val df = Seq("one\ntwo", "three\n\n").toDF("strF").repartition(2)
df.createOrReplaceTempView("regular_match_table")
spark.sql("SELECT regexp_replace(strF, '.*$', 'scala') FROM regular_match_table").show()

Note that we are transpiling the Java pattern .*$ to a cuDF pattern of [^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$.

I have not been able to reproduce this in cuDF using the latest nightly.

>>> import cudf
>>> cudf.__version__
'23.06.00'
>>> s = cudf.Series(["one\ntwo", "three\n\n"])
>>> s.str.replace('[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$', 'scala', 1, regex=True)
0        one\nscala
1    three\nscala\n
dtype: object
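
As a CPU-side sanity check, the same transpiled pattern can be run through Python's built-in re module. This is only a sketch: re is neither cuDF's regex engine nor Java's, so it says nothing about the hang itself. It only shows what a successful replacement of the first match looks like for these two inputs. The helper name replace_first is hypothetical and exists only for this illustration.

```python
import re

# Transpiled pattern quoted in this thread, written as in the cuDF call.
# The string is deliberately non-raw, so \n, \r and the \uXXXX escapes
# become literal characters inside the character class and the group.
PATTERN = '[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$'


def replace_first(value: str, repl: str = 'scala') -> str:
    """Replace only the first match, mirroring the count of 1 in the cuDF call.

    Hypothetical helper for illustration; it is not part of spark-rapids or cuDF.
    """
    return re.sub(PATTERN, repl, value, count=1)


if __name__ == '__main__':
    for value in ['one\ntwo', 'three\n\n']:
        # repr() keeps the embedded newlines visible in the printed result
        print(repr(replace_first(value)))
```

For these two inputs the CPU results agree with the cuDF 23.06 output shown above, which is consistent with the hang being specific to the older cuDF build rather than to the pattern being unmatchable.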

andygrove · 2023-05-19T21:10:38Z

Here is a Java repro, using the published cuDF 23.04 jar:

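import ai.rapids.cudf.*;

public class Main {
  public static void main(String[] args) {
    try (ColumnVector v = ColumnVector.fromStrings("one\ntwo", "three\n\n")) {
      // cuDF pattern produced by transpiling the Java pattern .*$
      String pattern = "[^\n\r\u0085\u2028\u2029]*(\r|\u0085|\u2028|\u2029|\r\n)?$";
      String repl = "scala${1}";
      RegexProgram prog = new RegexProgram(pattern);
      // This is the call that hangs; the result is closed if it ever returns.
      try (ColumnVector result = v.stringReplaceWithBackrefs(prog, repl)) {
        System.out.println(result.getRowCount());
      }
    }
  }
}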

andygrove · 2023-05-19T23:10:34Z

It is also reproducible with a simpler pattern that does not include the Unicode line terminators:

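    try (ColumnVector v = ColumnVector.fromStrings("one\ntwo", "three\n\n")) {
      // Same shape as the transpiled pattern, with only \n and \r handled
      String pattern = "[^\n\r]*(\r|\r\n)?$";
      String repl = "scala${1}";
      RegexProgram prog = new RegexProgram(pattern);
      // Still hangs here; close the result if the call ever returns.
      try (ColumnVector result = v.stringReplaceWithBackrefs(prog, repl)) {
        System.out.println(result.getRowCount());
      }
    }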

andygrove added the "bug" and "? - Needs Triage" labels May 18, 2023

andygrove self-assigned this May 18, 2023

mattahrens removed the "? - Needs Triage" label May 19, 2023

andygrove mentioned this issue May 22, 2023

[BUG] replace_with_backrefs hangs with some inputs rapidsai/cudf#13404

Closed

andygrove mentioned this issue May 30, 2023

Add regression test for regexp_replace hanging with some inputs #8433

Merged

andygrove closed this as completed in #8433 Jun 2, 2023
