Enable regular expression support based on whether UTF-8 is in the current locale (#5776)

* Regular expression support handling via UTF-8 in the locale

Signed-off-by: Navin Kumar <[email protected]>

* Fixup some tests, including a typo in transpiler unicode fuzz test

Signed-off-by: Navin Kumar <[email protected]>

* Update fuzz tests to not include \b or \B in fuzz testing because of
known issues with unicode

Signed-off-by: Navin Kumar <[email protected]>

* Fix issue in fuzz tests with \Z followed by $

Signed-off-by: Navin Kumar <[email protected]>

* Fix issue with word boundaries and negative character classes \D,\W,\S

Signed-off-by: Navin Kumar <[email protected]>

* Add reference to issue regarding \b and \B unicode issue

Signed-off-by: Navin Kumar <[email protected]>

* Fall back to CPU when negated character class is next to word boundary

Signed-off-by: Navin Kumar <[email protected]>

* Add \H and \V to fallback scenario with word boundaries

Signed-off-by: Navin Kumar <[email protected]>

* remove this test since it was removed in the upstream branch

Signed-off-by: Navin Kumar <[email protected]>

* move word boundary fuzz testing logic to a separate flag skipUnicodeIssues which will skip when testing full unicode characters but will use when using a smaller ASCII subset

Signed-off-by: Navin Kumar <[email protected]>

* Update the jenkins scripts here to set the locale

Signed-off-by: Navin Kumar <[email protected]>

* need to export LC_ALL in mvn_verify stage here

Signed-off-by: Navin Kumar <[email protected]>

* add comment for LC_ALL

Signed-off-by: Navin Kumar <[email protected]>

* Regexp compatibility doc update

Signed-off-by: Navin Kumar <[email protected]>

* Update scalatests and premerge build script

Signed-off-by: Navin Kumar <[email protected]>

* update build scripts to test regexp separately from other tests because of locale requirement

Signed-off-by: Navin Kumar <[email protected]>

* Feedback: code cleanup

Signed-off-by: Navin Kumar <[email protected]>

* Fix syntax errors in RegularExpressionSuite that prevent it from loading in non-UTF-8 environments

Signed-off-by: Navin Kumar <[email protected]>

* register custom regexp mark

Signed-off-by: Navin Kumar <[email protected]>

* updates to build script and test script

Signed-off-by: Navin Kumar <[email protected]>

* revert the nightly build script updates

Signed-off-by: Navin Kumar <[email protected]>
NVnavkumar authored Jul 18, 2022
1 parent 3b1bcbe commit 9d39953
Showing 11 changed files with 957 additions and 717 deletions.
18 changes: 11 additions & 7 deletions docs/compatibility.md
@@ -574,15 +574,14 @@ The following Apache Spark regular expression functions and expressions are supp
 - `string_split`
 - `str_to_map`

-Regular expression evaluation on the GPU is enabled by default. Execution will fall back to the CPU for
-regular expressions that are not yet supported on the GPU. However, there are some edge cases that will
-still execute on the GPU and produce different results to the CPU. To disable regular expressions on the GPU,
-set `spark.rapids.sql.regexp.enabled=false`.
+Regular expression evaluation on the GPU is enabled by default when the UTF-8 character set is used
+by the current locale. Execution will fall back to the CPU for regular expressions that are not yet
+supported on the GPU, and in environments where the locale does not use UTF-8. However, there are
+some edge cases that will still execute on the GPU and produce different results to the CPU. To
+disable regular expressions on the GPU, set `spark.rapids.sql.regexp.enabled=false`.

 These are the known edge cases where running on the GPU will produce different results to the CPU:

-- Using regular expressions with Unicode data can produce incorrect results if the system `LANG` is not set
-to `en_US.UTF-8` ([#5549](https://github.com/NVIDIA/spark-rapids/issues/5549))
 - Regular expressions that contain an end of line anchor '$' or end of string anchor '\Z' or '\z' immediately
 next to a newline or a repetition that produces zero or more results
 ([#5610](https://github.com/NVIDIA/spark-rapids/pull/5610))`
@@ -596,14 +595,19 @@ The following regular expression patterns are not yet supported on the GPU and w
 or more results
 - Line anchor `$` and string anchors `\z` and `\Z` are not supported in patterns containing `\W` or `\D`
 - Line and string anchors are not supported by `string_split` and `str_to_map`
-- Word and non-word boundaries, `\b` and `\B`
 - Lazy quantifiers, such as `a*?`
 - Possessive quantifiers, such as `a*+`
 - Character classes that use union, intersection, or subtraction semantics, such as `[a-d[m-p]]`, `[a-z&&[def]]`,
 or `[a-z&&[^bc]]`
 - Empty groups: `()`
 - `regexp_replace` does not support back-references

+The following regular expression patterns are known to potentially produce different results on the GPU
+vs the CPU.
+
+- Word and non-word boundaries, `\b` and `\B`
+
+
 Work is ongoing to increase the range of regular expressions that can run on the GPU.

 ## Timestamps
1 change: 1 addition & 0 deletions integration_tests/pytest.ini
@@ -30,5 +30,6 @@ markers =
     nightly_host_mem_consuming_case: case in nightly_resource_consuming_test that consume much more host memory than normal cases
     fuzz_test: Mark fuzz tests
     iceberg: Mark a test that requires Iceberg has been configured, skipping if tests are not configured for Iceberg
+    regexp: Mark a test that tests regular expressions on the GPU (only works when UTF-8 is enabled)
 filterwarnings =
     ignore:.*pytest.mark.order.*:_pytest.warning_types.PytestUnknownMarkWarning
57 changes: 57 additions & 0 deletions integration_tests/src/main/python/regexp_no_unicode_test.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import locale
+import pytest
+
+from asserts import assert_gpu_fallback_collect
+from data_gen import *
+from marks import *
+from pyspark.sql.types import *
+
+if locale.nl_langinfo(locale.CODESET) == 'UTF-8':
+    pytestmark = pytest.mark.skip(reason=str("Current locale uses UTF-8, fallback will not occur"))
+
+_regexp_conf = { 'spark.rapids.sql.regexp.enabled': 'true' }
+
+def mk_str_gen(pattern):
+    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
+
+@allow_non_gpu('ProjectExec', 'RLike')
+def test_rlike_no_unicode_fallback():
+    gen = mk_str_gen('[abcd]{1,3}')
+    assert_gpu_fallback_collect(
+        lambda spark: unary_op_df(spark, gen).selectExpr(
+            'a rlike "ab"'),
+        'RLike',
+        conf=_regexp_conf)
+
+@allow_non_gpu('ProjectExec', 'RegExpReplace')
+def test_re_replace_no_unicode_fallback():
+    gen = mk_str_gen('.{0,5}TEST[\ud720 A]{0,5}')
+    assert_gpu_fallback_collect(
+        lambda spark: unary_op_df(spark, gen).selectExpr(
+            'REGEXP_REPLACE(a, "TEST", "PROD")'),
+        'RegExpReplace',
+        conf=_regexp_conf)
+
+@allow_non_gpu('ProjectExec', 'StringSplit')
+def test_split_re_no_unicode_fallback():
+    data_gen = mk_str_gen('([bf]o{0,2}:){1,7}') \
+        .with_special_case('boo:and:foo')
+    assert_gpu_fallback_collect(
+        lambda spark : unary_op_df(spark, data_gen).selectExpr(
+            'split(a, "[o]", 2)'),
+        'StringSplit',
+        conf=_regexp_conf)
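
Note (not part of the commit): the skip condition at the top of regexp_no_unicode_test.py depends on the codeset reported by the current locale. A minimal standalone sketch of that check follows. The helper names `is_utf8_codeset` and `current_locale_uses_utf8` are hypothetical. Unlike the test module, which compares against the exact string 'UTF-8', this sketch also accepts spellings such as 'utf8', and the fallback for platforms without `nl_langinfo` is likewise an assumption of the sketch.

```python
import locale


def is_utf8_codeset(codeset: str) -> bool:
    """Return True if a codeset name denotes UTF-8 (e.g. 'UTF-8' or 'utf8')."""
    # Normalizing case and separators is an assumption of this sketch; the
    # test module in the commit compares against the exact string 'UTF-8'.
    normalized = codeset.replace('-', '').replace('_', '').lower()
    return normalized == 'utf8'


def current_locale_uses_utf8() -> bool:
    """Report whether the current process locale uses a UTF-8 codeset."""
    # nl_langinfo is not available on every platform, so fall back to the
    # preferred encoding there (hypothetical fallback, not from the commit).
    if hasattr(locale, 'nl_langinfo'):
        codeset = locale.nl_langinfo(locale.CODESET)
    else:
        codeset = locale.getpreferredencoding(False)
    return is_utf8_codeset(codeset)


if __name__ == '__main__':
    print('UTF-8 locale:', current_locale_uses_utf8())
```

Running the script in the shell that launches the tests shows which codeset the Python process sees, and therefore whether the fallback tests above would run or be skipped.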