[BUG] Handle regexp_replace inconsistency from https://issues.apache.org/jira/browse/SPARK-39107 #5456

Closed
NVnavkumar opened this issue May 10, 2022 · 1 comment · Fixed by #5740
Assignees
Labels
audit_3.3.0 (Audit related tasks for 3.3.0), bug (Something isn't working)

Comments

@NVnavkumar
Collaborator

Describe the bug
When the input is an empty string ('') and the regular expression contains only ? or * repetitions (for example A*), so that it can match the empty string, regexp_replace returns different results on Spark (CPU) and on cuDF (GPU). Note that this inconsistency is specific to Spark's regexp_replace; it does not reflect how Java's standard regular expression replacement behaves.

Steps/Code to reproduce bug
PySpark Example Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master("local[*]") \
      .appName("regex_empty_match") \
      .config("spark.rapids.sql.regexp.enabled", "true") \
      .config("spark.rapids.sql.explain","ALL") \
      .config("spark.rapids.sql.castStringToTimestamp.enabled","true") \
      .config("spark.rapids.sql.exec.CollectLimitExec", "true") \
      .config("spark.rapids.sql.castFloatToString.enabled", "true") \
      .config("spark.rapids.sql.hasExtendedYearValues", "false") \
      .getOrCreate()  

df = spark.sparkContext.parallelize([[""],[""],["AAA"]]).toDF(["a"])
df.show()

df.selectExpr("regexp_replace(a,'A*','_REPLACED_')").show()

Spark (CPU) Output:

+------------------------------------+
|regexp_replace(a, A*, _REPLACED_, 1)|
+------------------------------------+
|                                    |
|                                    |
|                _REPLACED__REPLACED_|
+------------------------------------+

Plugin (GPU) Output:

+------------------------------------+
|regexp_replace(a, A*, _REPLACED_, 1)|
+------------------------------------+
|                          _REPLACED_|
|                          _REPLACED_|
|                _REPLACED__REPLACED_|
+------------------------------------+
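
The difference between the CPU and GPU results can be illustrated without a Spark session. The following is a minimal sketch, not the Spark or plugin implementation: the function names are hypothetical, Python's re.sub (3.7+) is used as a stand-in for Java-style replacement because it treats the empty match the same way for this pattern, and the "affected Spark" function simply assumes that an empty input is returned unchanged, as the CPU output suggests.

```python
import re


def java_style_replace(source: str, pattern: str, replacement: str) -> str:
    # Stand-in for standard regex replacement: a pattern such as A* matches
    # the empty string, so an empty input still yields one replacement.
    return re.sub(pattern, replacement, source)


def affected_spark_replace(source: str, pattern: str, replacement: str) -> str:
    # Assumption for illustration: affected Spark versions return an empty
    # input unchanged instead of evaluating the pattern against it.
    if source == "":
        return source
    return re.sub(pattern, replacement, source)


for value in ["", "", "AAA"]:
    cpu_like = affected_spark_replace(value, "A*", "_REPLACED_")
    gpu_like = java_style_replace(value, "A*", "_REPLACED_")
    print(repr(cpu_like), repr(gpu_like))
```

The two functions agree on "AAA" and differ only on the empty rows, which is the mismatch reported here.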

Expected behavior

The plugin should short-circuit empty string input and return it unchanged, because that is the result the CPU produces on certain versions of Spark.

Additional context

This bug was originally reported to Spark in https://issues.apache.org/jira/browse/SPARK-39107 and was fixed in apache/spark#36457 for newer patch versions of Spark 3.1, 3.2, and 3.3, as well as master. Shims will therefore need to be created to reproduce the original faulty behavior on the Spark versions that do not include the fix.
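
One way to picture the shim idea is a selector that picks the replacement behavior depending on whether the running Spark version includes the SPARK-39107 fix. This is only a sketch in Python to match the reproduction script; the actual plugin shims are not shown in this issue, all names below are hypothetical, and the caller is assumed to already know whether the fix is present, since the exact patch versions are not listed here.

```python
import re
from typing import Callable

ReplaceFn = Callable[[str, str, str], str]


def replace_all(source: str, pattern: str, replacement: str) -> str:
    # Behavior assumed for Spark versions that include the fix:
    # the pattern is evaluated even when the input is empty.
    return re.sub(pattern, replacement, source)


def replace_all_legacy(source: str, pattern: str, replacement: str) -> str:
    # Behavior assumed for Spark versions without the fix:
    # empty input is short-circuited and returned unchanged.
    if source == "":
        return source
    return replace_all(source, pattern, replacement)


def select_regexp_replace(has_spark_39107_fix: bool) -> ReplaceFn:
    # Hypothetical shim selector: the flag would come from a version check
    # that is outside the scope of this sketch.
    return replace_all if has_spark_39107_fix else replace_all_legacy


legacy = select_regexp_replace(has_spark_39107_fix=False)
fixed = select_regexp_replace(has_spark_39107_fix=True)
print(repr(legacy("", "A*", "_REPLACED_")), repr(fixed("", "A*", "_REPLACED_")))
```

Keeping the short-circuit in a separate function means the default path stays untouched and only the affected versions pay for the extra check.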

@NVnavkumar NVnavkumar added bug Something isn't working ? - Needs Triage Need team to review and classify audit_3.3.0 Audit related tasks for 3.3.0 labels May 10, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 17, 2022
@anthony-chang anthony-chang self-assigned this Jun 3, 2022
@anthony-chang
Contributor

Depends on #5450 being merged first
