Enable regular expression support based on whether UTF-8 is in the current locale #5776
Signed-off-by: Navin Kumar <[email protected]>
known issues with unicode Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build |
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala
test("AST fuzz test - regexp_find - full unicode input") {
  assume(isUnicodeEnabled())
What is the plan for ensuring that we get coverage here in CI?
We have 2 sets of integration tests: `regexp_test.py` and `regexp_no_unicode_test.py`. The second file checks for fallback when UTF-8 is not in the locale.
Another point of these tests is that the language and country requirement has been removed; the check now uses Java to read the locale information.
Signed-off-by: Navin Kumar <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Outdated
Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…ssues which will skip when testing full unicode characters but will use when using a smaller ASCII subset Signed-off-by: Navin Kumar <[email protected]>
build |
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Signed-off-by: Navin Kumar <[email protected]>
…se of locale requirement Signed-off-by: Navin Kumar <[email protected]>
build |
build |
build |
jenkins/spark-nightly-build.sh
Outdated
@@ -95,6 +95,12 @@ for buildver in "${SPARK_SHIM_VERSIONS[@]:1}"; do
$MVN -U -B clean install -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
  -Dcuda.version=$CUDA_CLASSIFIER \
  -Dbuildver="${buildver}"
# enable UTF-8 and run regular expression tests
env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
nit: can we use `mvn test` in this case? `verify` may be heavy if we only need to run tests.
I tried this with `mvn test`, but it would never run the integration tests, which also need to run. This way it runs the 3 scalatests and the integration tests as well.
Actually, we run the integration tests in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-tests.sh as downstream testing of our nightly build, to cover all test scenarios (Spark releases + OS, over 10 pipelines). I think adding the test to spark-tests.sh could be a follow-up.
In this script, we only run the build and the Scala UTs.
jenkins/spark-nightly-build.sh
Outdated
@@ -111,6 +117,13 @@ $MVN -B clean install -pl '!tools' \
  -Dmaven.repo.local=$M2DIR \
  -Dcuda.version=$CUDA_CLASSIFIER

# enable UTF-8 and run regular expression tests
env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
Same as above. Also, the `wildcardSuites` string could be a variable.
Addressed the `wildcardSuites` issue in the last commit.
jenkins/spark-premerge-build.sh
Outdated
# export 'LC_ALL' to set locale with UTF-8 so regular expressions are enabled
LC_ALL="en_US.UTF-8" TEST="regexp_test.py" ./integration_tests/run_pyspark_from_build.sh
# export 'LC_ALL' to set locale without UTF-8 so regular expressions are disabled to test fallback
LC_ALL="en_US.iso88591" TEST="regexp_no_unicode_test.py" ./integration_tests/run_pyspark_from_build.sh
Looks like this one should already be covered by the normal test run, without explicitly setting `LC_ALL`?
Removed this redundant last line in the last commit (it was an artifact of the previous code organization).
Signed-off-by: Navin Kumar <[email protected]>
build |
…ing in non-UTF-8 environments Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build |
LGTM. Let's have a followup for nightly build and tests later
All premerge CI is blocked by #6003.
build |
Fixes #5629.

This relaxes the requirement that the `LANG=en_US.UTF-8` environment variable be set. Instead, the requirement is now a check of the default Charset that the JVM is using (which can be set indirectly via the `LANG` environment variable anyway). This allows users to have different languages in their locale while still using the UTF-8 character set, which is a much easier requirement for most users to fulfill.

This also splits the regular expression Python integration tests into a separate test file that checks for Unicode support, and adds another set of regular expression Python integration tests that should fall back to the CPU when UTF-8 is not detected in the current locale. A couple of fuzz tests for full Unicode input are provided as well.
Signed-off-by: Navin Kumar [email protected]
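
For readers unfamiliar with the JVM side of this, here is a minimal Java sketch of a charset-based check of the kind the description outlines. The class and method names (`CharsetCheck`, `isUtf8`, `defaultCharsetIsUtf8`) are hypothetical and chosen for illustration only; they are not the plugin's actual implementation. The sketch relies only on the standard `java.nio.charset.Charset.defaultCharset()` call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the plugin's actual class or method names.
final class CharsetCheck {
    private CharsetCheck() {}

    // True when the given charset is UTF-8. The locale's language and
    // country play no part in this decision.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    // Applies the same test to the charset the running JVM uses by default.
    static boolean defaultCharsetIsUtf8() {
        return isUtf8(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 detected: " + defaultCharsetIsUtf8());
    }
}
```

Because only the charset is compared, a locale with a different language and country but a UTF-8 encoding would pass the same check. Note that on JDK 18 and later the default charset is UTF-8 unless `file.encoding` is overridden, so the locale influences the result mainly on older JDKs.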