Enable regular expression support based on whether UTF-8 is in the current locale #5776

Merged
merged 30 commits into from
Jul 18, 2022
Commits
a39b36b
Regular expression support handling via UTF-8 in the locale
NVnavkumar Jun 7, 2022
142bbca
Merge branch 'branch-22.08' into regexp_unicode_fix
NVnavkumar Jun 14, 2022
4b88f8c
Fixup some tests, including a typo in transpiler unicode fuzz test
NVnavkumar Jun 14, 2022
80062c0
Update fuzz tests to not include \b or \B in fuzz testing because of
NVnavkumar Jun 21, 2022
17612f5
Fix issue in fuzz tests with \Z followed by $
NVnavkumar Jun 21, 2022
e141562
Fix issue with word boundaries and negative character classes \D,\W,\S
NVnavkumar Jun 21, 2022
598634b
Add reference to issue regarding \b and \B unicode issue
NVnavkumar Jun 21, 2022
2919fac
Fall back to CPU when negated character class is next to word boundary
NVnavkumar Jun 22, 2022
91c5407
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jun 22, 2022
e1f4fbe
Add \H and \V to fallback scenario with word boundaries
NVnavkumar Jun 23, 2022
f217eed
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jun 30, 2022
6cd302b
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 6, 2022
963f245
remove this test since it was removed in the upstream branch
NVnavkumar Jul 6, 2022
dc9d1be
move word boundary fuzz testing logic to a separate flag skipUnicodeI…
NVnavkumar Jul 6, 2022
6ea8e99
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 7, 2022
2f4536e
Update the jenkins scripts here to set the locale
NVnavkumar Jul 7, 2022
a3d2d9f
need to export LC_ALL in mvn_verify stage here
NVnavkumar Jul 8, 2022
1453387
add comment for LC_ALL
NVnavkumar Jul 8, 2022
4d33f85
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 11, 2022
da12d28
Regexp compatibility doc update
NVnavkumar Jul 11, 2022
2724802
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 11, 2022
84139c2
Update scalatests and premerge build script
NVnavkumar Jul 12, 2022
889ba7a
update build scripts to test regexp separately from other tests becau…
NVnavkumar Jul 12, 2022
c1e184c
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 12, 2022
6b21fcb
Feedback: code cleanup
NVnavkumar Jul 14, 2022
e2d0d8d
Fix syntax errors in RegularExpressionSuite that prevent it from load…
NVnavkumar Jul 14, 2022
7f8f7aa
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into re…
NVnavkumar Jul 14, 2022
652cf94
register custom regexp mark
NVnavkumar Jul 14, 2022
158a70e
updates to build script and test script
NVnavkumar Jul 15, 2022
16fb328
revert the nightly build script updates
NVnavkumar Jul 15, 2022
57 changes: 57 additions & 0 deletions integration_tests/src/main/python/regexp_no_unicode_test.py
@@ -0,0 +1,57 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import locale
import pytest

from asserts import assert_gpu_fallback_collect
from data_gen import *
from marks import *
from pyspark.sql.types import *

if locale.nl_langinfo(locale.CODESET) == 'UTF-8':
    pytestmark = pytest.mark.skip(reason=str("Current locale uses UTF-8, fallback will not occur"))

_regexp_conf = { 'spark.rapids.sql.regexp.enabled': 'true' }

def mk_str_gen(pattern):
    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_no_unicode_fallback():
    gen = mk_str_gen('[abcd]{1,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "ab"'),
        'RLike',
        conf=_regexp_conf)

@allow_non_gpu('ProjectExec', 'RegExpReplace')
def test_re_replace_no_unicode_fallback():
    gen = mk_str_gen('.{0,5}TEST[\ud720 A]{0,5}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'REGEXP_REPLACE(a, "TEST", "PROD")'),
        'RegExpReplace',
        conf=_regexp_conf)

@allow_non_gpu('ProjectExec', 'StringSplit')
def test_split_re_no_unicode_fallback():
    data_gen = mk_str_gen('([bf]o{0,2}:){1,7}') \
        .with_special_case('boo:and:foo')
    assert_gpu_fallback_collect(
        lambda spark : unary_op_df(spark, data_gen).selectExpr(
            'split(a, "[o]", 2)'),
        'StringSplit',
        conf=_regexp_conf)
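
Note on the module-level skip in regexp_no_unicode_test.py: the file decides whether to run by comparing the codeset reported by the current locale against the string 'UTF-8'. The standalone sketch below isolates that comparison so it can be tried outside the integration test harness. The helper names `codeset_is_utf8` and `current_codeset` are hypothetical and are not part of this PR, and 'ANSI_X3.4-1968' is used only as an example of a non-UTF-8 codeset name that a C/POSIX locale may report on glibc systems.

```python
import locale


def codeset_is_utf8(codeset: str) -> bool:
    """Hypothetical helper: same exact-match comparison the test module uses."""
    return codeset == 'UTF-8'


def current_codeset() -> str:
    """Hypothetical helper: codeset of the current locale.

    locale.nl_langinfo is only available on Unix-like platforms.
    """
    return locale.nl_langinfo(locale.CODESET)


if __name__ == '__main__':
    cs = current_codeset()
    # When this prints True, the fallback tests in the module would be skipped.
    print(f"codeset={cs!r}, fallback tests skipped={codeset_is_utf8(cs)}")
```

Running this once with a UTF-8 locale and once with a non-UTF-8 locale (for example by changing LC_ALL in the shell before starting Python) should show the skip decision flip, which is presumably why the build script commits above export LC_ALL.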