Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable regular expression support based on whether UTF-8 is in the current locale #5776

Merged
merged 30 commits into from
Jul 18, 2022

Conversation

NVnavkumar
Copy link
Collaborator

Fixes #5629.

This relaxes the requirement of needing the LANG=en_US.UTF-8 environment variable to be set by switching the requirement to checking the default Charset that the JVM is using (which can be set via the LANG environment indirectly anyways. This allows users to have different languages in their locale but using the UTF-8 character set, which is much easier requirement to fulfill for most users.

This also splits off the regular expression Python integration tests into a separate test file that checks for unicode support, and another set of regular expression Python integration tests that should fall back to the CPU when UTF-8 is not detected in the current locale. Also, a couple of fuzz tests for full unicode input are provided as well.

Signed-off-by: Navin Kumar [email protected]

@sameerz sameerz added the bug Something isn't working label Jun 15, 2022
@NVnavkumar NVnavkumar self-assigned this Jun 16, 2022
@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar NVnavkumar marked this pull request as ready for review June 21, 2022 21:08
@NVnavkumar NVnavkumar changed the title WIP: Enable regular expression support based on whether UTF-8 is in the current locale Enable regular expression support based on whether UTF-8 is in the current locale Jun 21, 2022
Comment on lines +592 to +593
test("AST fuzz test - regexp_find - full unicode input") {
assume(isUnicodeEnabled())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan for ensuring that we get coverage here in CI?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 2 sets of integration tests regexp_test.py and regexp_no_unicode_test.py. The second file checks for fallback when UTF-8 is not in the locale.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point also in these tests is that the language and country requirement is also removed, and it's using Java to check locale information.

@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar
Copy link
Collaborator Author

build

1 similar comment
@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar
Copy link
Collaborator Author

build

@@ -95,6 +95,12 @@ for buildver in "${SPARK_SHIM_VERSIONS[@]:1}"; do
$MVN -U -B clean install -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
-Dcuda.version=$CUDA_CLASSIFIER \
-Dbuildver="${buildver}"
# enable UTF-8 and run regular expression tests
env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we do mvn test in this case? verify maybe heave if we require test run only

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this with mvn test but it would never run the integration test that are also necessary to run. This way it runs the 3 scalatests and the integration test as well.

Copy link
Collaborator

@pxLi pxLi Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we run integration test in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-tests.sh as downstream testing of our nightly build to cover all test scenarios (spark releases+OS, over 10 pipelines), I think add test to spark-tests.sh could be a follow-up one

For this script, we only run build and scala UTs.

@@ -111,6 +117,13 @@ $MVN -B clean install -pl '!tools' \
-Dmaven.repo.local=$M2DIR \
-Dcuda.version=$CUDA_CLASSIFIER

# enable UTF-8 and run regular expression tests
env LC_ALL="en_US.UTF-8" $MVN verify -pl '!tools' $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. also wildcardSuites string could be a var

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed wildcardSuites issue in last commit

# export 'LC_ALL' to set locale with UTF-8 so regular expressions are enabled
LC_ALL="en_US.UTF-8" TEST="regexp_test.py" ./integration_tests/run_pyspark_from_build.sh
# export 'LC_ALL' to set locale without UTF-8 so regular expressions are disabled to test fallback
LC_ALL="en_US.iso88591" TEST="regexp_no_unicode_test.py" ./integration_tests/run_pyspark_from_build.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this one should already be covered in normal tests run w/o explicitly setting LC_ALL?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this redundant last line in last commit (it was artifact of the previous code organization)

Signed-off-by: Navin Kumar <[email protected]>
@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar
Copy link
Collaborator Author

build

Signed-off-by: Navin Kumar <[email protected]>
@NVnavkumar
Copy link
Collaborator Author

build

@NVnavkumar
Copy link
Collaborator Author

build

Copy link
Collaborator

@pxLi pxLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's have a followup for nightly build and tests later

@pxLi
Copy link
Collaborator

pxLi commented Jul 15, 2022

all premerge CI is blocked by #6003

@pxLi
Copy link
Collaborator

pxLi commented Jul 18, 2022

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] regexp unicode tests require LANG=en_US.UTF-8 to pass
6 participants