Enable regular expression support based on whether UTF-8 is in the current locale (#5776)

* Regular expression support handling via UTF-8 in the locale
  Signed-off-by: Navin Kumar <[email protected]>

* Fixup some tests, including a typo in transpiler unicode fuzz test
  Signed-off-by: Navin Kumar <[email protected]>

* Update fuzz tests to not include \b or \B in fuzz testing because of known issues with unicode
  Signed-off-by: Navin Kumar <[email protected]>

* Fix issue in fuzz tests with \Z followed by $
  Signed-off-by: Navin Kumar <[email protected]>

* Fix issue with word boundaries and negative character classes \D,\W,\S
  Signed-off-by: Navin Kumar <[email protected]>

* Add reference to issue regarding \b and \B unicode issue
  Signed-off-by: Navin Kumar <[email protected]>

* Fall back to CPU when negated character class is next to word boundary
  Signed-off-by: Navin Kumar <[email protected]>

* Add \H and \V to fallback scenario with word boundaries
  Signed-off-by: Navin Kumar <[email protected]>

* remove this test since it was removed in the upstream branch
  Signed-off-by: Navin Kumar <[email protected]>

* move word boundary fuzz testing logic to a separate flag skipUnicodeIssues which will skip when testing full unicode characters but will use when using a smaller ASCII subset
  Signed-off-by: Navin Kumar <[email protected]>

* Update the jenkins scripts here to set the locale
  Signed-off-by: Navin Kumar <[email protected]>

* need to export LC_ALL in mvn_verify stage here
  Signed-off-by: Navin Kumar <[email protected]>

* add comment for LC_ALL
  Signed-off-by: Navin Kumar <[email protected]>

* Regexp compatibility doc update
  Signed-off-by: Navin Kumar <[email protected]>

* Update scalatests and premerge build script
  Signed-off-by: Navin Kumar <[email protected]>

* update build scripts to test regexp separately from other tests because of locale requirement
  Signed-off-by: Navin Kumar <[email protected]>

* Feedback: code cleanup
  Signed-off-by: Navin Kumar <[email protected]>

* Fix syntax errors in RegularExpressionSuite that prevent it from loading in non-UTF-8 environments
  Signed-off-by: Navin Kumar <[email protected]>

* register custom regexp mark
  Signed-off-by: Navin Kumar <[email protected]>

* updates to build script and test script
  Signed-off-by: Navin Kumar <[email protected]>

* revert the nightly build script updates
  Signed-off-by: Navin Kumar <[email protected]>

NVIDIA · Jul 18, 2022 · 9d39953
1 parent 3b1bcbe
commit 9d39953
Showing 11 changed files with 957 additions and 717 deletions.
diff --git a/docs/compatibility.md b/docs/compatibility.md
@@ -574,15 +574,14 @@ The following Apache Spark regular expression functions and expressions are supp
 - `string_split`
 - `str_to_map`
 
-Regular expression evaluation on the GPU is enabled by default. Execution will fall back to the CPU for
-regular expressions that are not yet supported on the GPU. However, there are some edge cases that will
-still execute on the GPU and produce different results to the CPU. To disable regular expressions on the GPU,
-set `spark.rapids.sql.regexp.enabled=false`.
+Regular expression evaluation on the GPU is enabled by default when the UTF-8 character set is used
+by the current locale. Execution will fall back to the CPU for regular expressions that are not yet
+supported on the GPU, and in environments where the locale does not use UTF-8. However, there are
+some edge cases that will still execute on the GPU and produce different results to the CPU. To
+disable regular expressions on the GPU, set `spark.rapids.sql.regexp.enabled=false`.
 
 These are the known edge cases where running on the GPU will produce different results to the CPU:
 
-- Using regular expressions with Unicode data can produce incorrect results if the system `LANG` is not set
- to `en_US.UTF-8` ([#5549](https://github.com/NVIDIA/spark-rapids/issues/5549))
 - Regular expressions that contain an end of line anchor '$' or end of string anchor '\Z' or '\z' immediately
  next to a newline or a repetition that produces zero or more results
  ([#5610](https://github.com/NVIDIA/spark-rapids/pull/5610))`
@@ -596,14 +595,19 @@ The following regular expression patterns are not yet supported on the GPU and w
   or more results
 - Line anchor `$` and string anchors `\z` and `\Z` are not supported in patterns containing `\W` or `\D`
 - Line and string anchors are not supported by `string_split` and `str_to_map`
-- Word and non-word boundaries, `\b` and `\B`
 - Lazy quantifiers, such as `a*?`
 - Possessive quantifiers, such as `a*+`
 - Character classes that use union, intersection, or subtraction semantics, such as `[a-d[m-p]]`, `[a-z&&[def]]`, 
   or `[a-z&&[^bc]]`
 - Empty groups: `()`
 - `regexp_replace` does not support back-references
 
+The following regular expression patterns are known to potentially produce different results on the GPU
+vs the CPU.
+
+- Word and non-word boundaries, `\b` and `\B`
+
+
 Work is ongoing to increase the range of regular expressions that can run on the GPU.
 
 ## Timestamps
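
The updated paragraph in `docs/compatibility.md` describes two preconditions for GPU regular expression evaluation: `spark.rapids.sql.regexp.enabled` must not be set to `false`, and the current locale must use UTF-8. A minimal Python sketch of that decision follows. It is not the plugin's implementation; the function names `regexp_may_run_on_gpu` and `current_codeset` are hypothetical, and the exact string comparison against `'UTF-8'` is an assumption borrowed from the check in `regexp_no_unicode_test.py`.

```python
import locale


def current_codeset():
    # Character encoding name of the current locale (Unix only),
    # the same call the new integration test uses.
    return locale.nl_langinfo(locale.CODESET)


def regexp_may_run_on_gpu(conf, codeset):
    """Hypothetical helper: True only if both documented preconditions hold.

    conf    -- dict of Spark settings (string values)
    codeset -- encoding name of the current locale, e.g. 'UTF-8'
    """
    # The doc states regexp support is enabled by default, so a missing
    # key is treated as 'true' (an assumption of this sketch).
    enabled = conf.get('spark.rapids.sql.regexp.enabled', 'true').lower() == 'true'
    return enabled and codeset == 'UTF-8'
```

Calling `regexp_may_run_on_gpu({}, current_codeset())` gives a quick indication of whether a session is even eligible. A `True` result is necessary but not sufficient: patterns listed as unsupported still fall back to the CPU.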

diff --git a/integration_tests/pytest.ini b/integration_tests/pytest.ini
@@ -30,5 +30,6 @@ markers =
     nightly_host_mem_consuming_case: case in nightly_resource_consuming_test that consume much more host memory than normal cases
     fuzz_test: Mark fuzz tests
     iceberg: Mark a test that requires Iceberg has been configured, skipping if tests are not configured for Iceberg
+    regexp: Mark a test that tests regular expressions on the GPU (only works when UTF-8 is enabled)
 filterwarnings =
     ignore:.*pytest.mark.order.*:_pytest.warning_types.PytestUnknownMarkWarning
diff --git a/integration_tests/src/main/python/regexp_no_unicode_test.py b/integration_tests/src/main/python/regexp_no_unicode_test.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import locale
+import pytest
+
+from asserts import assert_gpu_fallback_collect
+from data_gen import *
+from marks import *
+from pyspark.sql.types import *
+
+if locale.nl_langinfo(locale.CODESET) == 'UTF-8':
+    pytestmark = pytest.mark.skip(reason=str("Current locale uses UTF-8, fallback will not occur"))
+
+_regexp_conf = { 'spark.rapids.sql.regexp.enabled': 'true' }
+
+def mk_str_gen(pattern):
+    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
+
+@allow_non_gpu('ProjectExec', 'RLike')
+def test_rlike_no_unicode_fallback():
+    gen = mk_str_gen('[abcd]{1,3}')
+    assert_gpu_fallback_collect(
+        lambda spark: unary_op_df(spark, gen).selectExpr(
+            'a rlike "ab"'),
+        'RLike',
+        conf=_regexp_conf)
+
+@allow_non_gpu('ProjectExec', 'RegExpReplace')
+def test_re_replace_no_unicode_fallback():
+    gen = mk_str_gen('.{0,5}TEST[\ud720 A]{0,5}')
+    assert_gpu_fallback_collect(
+        lambda spark: unary_op_df(spark, gen).selectExpr(
+            'REGEXP_REPLACE(a, "TEST", "PROD")'),
+        'RegExpReplace',
+        conf=_regexp_conf)
+
+@allow_non_gpu('ProjectExec', 'StringSplit')
+def test_split_re_no_unicode_fallback():
+    data_gen = mk_str_gen('([bf]o{0,2}:){1,7}') \
+        .with_special_case('boo:and:foo')
+    assert_gpu_fallback_collect(
+        lambda spark : unary_op_df(spark, data_gen).selectExpr(
+            'split(a, "[o]", 2)'),
+        'StringSplit',
+        conf=_regexp_conf)
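
The module-level guard in `regexp_no_unicode_test.py` compares the locale's codeset to the exact string `'UTF-8'`. Whether other spellings of the name ever occur depends on the platform, so the following is only a sketch of a more tolerant comparison, not something this commit does. The helper name `is_utf8_codeset` is hypothetical; it relies on the standard library's `codecs.lookup`, which resolves encoding aliases to a canonical name.

```python
import codecs


def is_utf8_codeset(name):
    """Hypothetical helper: True if `name` is any alias of UTF-8."""
    try:
        # codecs.lookup resolves aliases such as 'utf8' or 'UTF-8'
        # to the canonical codec name 'utf-8'.
        return codecs.lookup(name).name == 'utf-8'
    except LookupError:
        # Names Python does not recognise are treated as non-UTF-8.
        return False
```

With such a helper the guard could be written as `if is_utf8_codeset(locale.nl_langinfo(locale.CODESET)):`, keeping the rest of the module unchanged.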