Enable regular expression support based on whether UTF-8 is in the current locale #5776

NVnavkumar · 2022-06-07T21:47:17Z

This relaxes the requirement that the LANG=en_US.UTF-8 environment variable be set. Instead, the plugin checks the default Charset that the JVM is using (which can be set indirectly via the LANG environment variable anyway). This allows users to have a different language in their locale while still using the UTF-8 character set, which is a much easier requirement for most users to fulfill.
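
For illustration, a check against the JVM default Charset could look roughly like the sketch below. This is a minimal sketch, not the code from this PR: the object name `UnicodeCheck` and the method name `isUtf8Default` are hypothetical, and only the JDK classes `Charset` and `StandardCharsets` are real.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper; the names here are assumptions, not taken from the plugin source.
object UnicodeCheck {
  // Returns true when the given charset (by default the JVM's default charset) is UTF-8.
  // Two Charset instances are equal when their canonical names match.
  def isUtf8Default(cs: Charset = Charset.defaultCharset()): Boolean =
    cs == StandardCharsets.UTF_8
}
```

Calling `UnicodeCheck.isUtf8Default()` with no argument inspects the running JVM, so the result depends on how the JVM was started. Note that on JDK 18 and later the default charset is UTF-8 regardless of the locale, so a check like this would be expected to pass there.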

This also splits the regular expression Python integration tests into two files: one that checks for Unicode support, and another whose tests should fall back to the CPU when UTF-8 is not detected in the current locale. A couple of fuzz tests for full Unicode input are provided as well.

Signed-off-by: Navin Kumar [email protected]

NVnavkumar · 2022-06-21T20:56:54Z

build

andygrove · 2022-06-21T21:52:23Z

tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala

+  test("AST fuzz test - regexp_find - full unicode input") {
+    assume(isUnicodeEnabled())


What is the plan for ensuring that we get coverage here in CI?

We have 2 sets of integration tests: regexp_test.py and regexp_no_unicode_test.py. The second file checks for fallback when UTF-8 is not in the locale.

The other point of these tests is that the language and country requirement is removed, and Java is now used to check the locale information.
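
To make the "language and country no longer matter" point concrete, here is a small sketch that looks only at the codeset part of a POSIX-style locale name (`language[_territory][.codeset][@modifier]`). It is purely illustrative: `LocaleCodeset`, `codesetOf`, and `usesUtf8` are hypothetical names, and the PR itself asks the JVM for its default Charset rather than parsing environment variables like this.

```scala
// Hypothetical illustration; not code from the plugin.
object LocaleCodeset {
  // Extracts the codeset from a POSIX-style locale name, e.g. "UTF-8" from "de_DE.UTF-8".
  def codesetOf(localeName: String): Option[String] = {
    val withoutModifier = localeName.takeWhile(_ != '@')
    val dot = withoutModifier.indexOf('.')
    if (dot >= 0 && dot < withoutModifier.length - 1) {
      Some(withoutModifier.substring(dot + 1))
    } else {
      None
    }
  }

  // True when the codeset is some spelling of UTF-8 ("UTF-8", "utf8", ...),
  // whatever the language and territory are.
  def usesUtf8(localeName: String): Boolean =
    codesetOf(localeName).exists { cs =>
      cs.toLowerCase(java.util.Locale.ROOT).replace("-", "").replace("_", "") == "utf8"
    }
}
```

Under this sketch, `en_US.UTF-8` and `de_DE.UTF-8` are treated the same way, while `en_US.iso88591` (the value used for the fallback tests) is not considered UTF-8.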

NVnavkumar · 2022-06-23T17:21:31Z

build

NVnavkumar · 2022-07-06T23:00:08Z

build

NVnavkumar · 2022-07-12T04:20:36Z

build

NVnavkumar · 2022-07-12T20:17:24Z

build

NVnavkumar · 2022-07-12T22:00:59Z

build

pxLi · 2022-07-13T00:57:49Z

jenkins/spark-nightly-build.sh

@@ -95,6 +95,12 @@ for buildver in "${SPARK_SHIM_VERSIONS[@]:1}"; do
    $MVN -U -B clean install -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
        -Dcuda.version=$CUDA_CLASSIFIER \
        -Dbuildver="${buildver}"
+    # enable UTF-8 and run regular expression tests
+    env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \


nit: can we do mvn test in this case? verify may be heavy if we only need to run the tests

I tried this with mvn test, but it would never run the integration tests, which also need to run. This way it runs the 3 scalatests and the integration tests as well.

Actually, we run the integration tests in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-tests.sh as downstream testing of our nightly build to cover all test scenarios (Spark releases + OS, over 10 pipelines). I think adding the test to spark-tests.sh could be a follow-up.

For this script, we only run the build and Scala UTs.

pxLi · 2022-07-13T00:59:07Z

jenkins/spark-nightly-build.sh

@@ -111,6 +117,13 @@ $MVN -B clean install -pl '!tools' \
    -Dmaven.repo.local=$M2DIR \
    -Dcuda.version=$CUDA_CLASSIFIER

+# enable UTF-8 and run regular expression tests
+env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \


same as above. also wildcardSuites string could be a var

addressed wildcardSuites issue in last commit

pxLi · 2022-07-13T01:11:34Z

jenkins/spark-premerge-build.sh

+    # export 'LC_ALL' to set locale with UTF-8 so regular expressions are enabled
+    LC_ALL="en_US.UTF-8" TEST="regexp_test.py" ./integration_tests/run_pyspark_from_build.sh
+    # export 'LC_ALL' to set locale without UTF-8 so regular expressions are disabled to test fallback
+    LC_ALL="en_US.iso88591" TEST="regexp_no_unicode_test.py" ./integration_tests/run_pyspark_from_build.sh


looks like this one should already be covered in normal tests run w/o explicitly setting LC_ALL?

removed this redundant last line in the last commit (it was an artifact of the previous code organization)

NVnavkumar · 2022-07-14T06:10:59Z

build

NVnavkumar · 2022-07-14T16:28:50Z

build

NVnavkumar · 2022-07-14T17:57:02Z

build

NVnavkumar · 2022-07-15T06:26:43Z

build

pxLi

LGTM. Let's have a followup for nightly build and tests later

pxLi · 2022-07-15T06:29:58Z

all premerge CI is blocked by #6003

pxLi · 2022-07-18T06:10:45Z

build

Regular expression support handling via UTF-8 in the locale

a39b36b

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar requested a review from andygrove June 7, 2022 21:47

anthony-chang mentioned this pull request Jun 8, 2022

[BUG] regexp: Build fails on CI when more characters added to fuzzer but not locally #5711

Closed

NVnavkumar added 2 commits June 14, 2022 15:43

Merge branch 'branch-22.08' into regexp_unicode_fix

142bbca

Fixup some tests, including a typo in transpiler unicode fuzz test

4b88f8c

Signed-off-by: Navin Kumar <[email protected]>

sameerz added the bug Something isn't working label Jun 15, 2022

NVnavkumar self-assigned this Jun 16, 2022

NVnavkumar added 2 commits June 21, 2022 09:25

Update fuzz tests to not include \b or \B in fuzz testing because of known issues with unicode

80062c0

Signed-off-by: Navin Kumar <[email protected]>

Fix issue in fuzz tests with \Z followed by $

17612f5

Signed-off-by: Navin Kumar <[email protected]>

andygrove mentioned this pull request Jun 21, 2022

WIP: Regexp Unicode support #5662

Closed

Fix issue with word boundaries and negative character classes \D,\W,\S

e141562

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar marked this pull request as ready for review June 21, 2022 21:08

NVnavkumar changed the title ~~WIP: Enable regular expression support based on whether UTF-8 is in the current locale~~ Enable regular expression support based on whether UTF-8 is in the current locale Jun 21, 2022

andygrove reviewed Jun 21, 2022

tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala

andygrove reviewed Jun 21, 2022

Add reference to issue regarding \b and \B unicode issue

598634b

Signed-off-by: Navin Kumar <[email protected]>

anthony-chang reviewed Jun 22, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala (outdated)

NVnavkumar added 2 commits June 22, 2022 14:19

Fall back to CPU when negated character class is next to word boundary

2919fac

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

91c5407

NVnavkumar added 5 commits June 23, 2022 13:52

Add \H and \V to fallback scenario with word boundaries

e1f4fbe

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

f217eed

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

6cd302b

remove this test since it was removed in the upstream branch

963f245

Signed-off-by: Navin Kumar <[email protected]>

move word boundary fuzz testing logic to a separate flag skipUnicodeIssues which will skip when testing full unicode characters but will use when using a smaller ASCII subset

dc9d1be

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

6ea8e99

anthony-chang reviewed Jul 7, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

NVnavkumar added 3 commits July 11, 2022 13:13

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

2724802

Update scalatests and premerge build script

84139c2

Signed-off-by: Navin Kumar <[email protected]>

update build scripts to test regexp separately from other tests because of locale requirement

889ba7a

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

c1e184c

pxLi reviewed Jul 13, 2022

Feedback: code cleanup

6b21fcb

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar added 2 commits July 14, 2022 09:22

Fix syntax errors in RegularExpressionSuite that prevent it from loading in non-UTF-8 environments

e2d0d8d

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into regexp_unicode_fix

7f8f7aa

register custom regexp mark

652cf94

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar added 2 commits July 14, 2022 23:10

updates to build script and test script

158a70e

Signed-off-by: Navin Kumar <[email protected]>

revert the nightly build script updates

16fb328

Signed-off-by: Navin Kumar <[email protected]>

pxLi approved these changes Jul 15, 2022

jlowe approved these changes Jul 18, 2022

anthony-chang approved these changes Jul 18, 2022

NVnavkumar merged commit 9d39953 into NVIDIA:branch-22.08 Jul 18, 2022

anthony-chang mentioned this pull request Jul 18, 2022

[BUG] Investigate regexp failures with unicode input #5521

Closed

jlowe mentioned this pull request Jul 19, 2022

[BUG] regexp_test is failing in nightly tests #6028

Closed

pxLi mentioned this pull request Jul 20, 2022

[BUG] Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec failure #6032

Closed

NVnavkumar mentioned this pull request Jul 25, 2022

[FEA] Add UTF-8 environment for regular expression testing in nightly CI testing #6080

Open
