Rewrite regex pattern literal[a-b]{x,y} to custom kernel in rlike #8

Merged
merged 32 commits into from
May 21, 2024


thirtiseven

PR in main repo: NVIDIA#10822

Depends on the JNI PR: wjxiz1992/spark-rapids-jni#1

This PR rewrites regex patterns of the form literal[a-b]{x,y} in rlike to a custom kernel for a performance gain.

For example, the pattern abc[0-9]{2,3} will match the data xxxabc12345, and [\u4e00-\u9fa5]+ (which matches Chinese characters) will match the data spark-rapids上海.
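To make the matching semantics concrete, here is a minimal CPU-side sketch in Python. It is not the kernel from this PR; the function name `literal_range_match` and its parameters are hypothetical. It relies on one observation: rlike only asks whether a match exists anywhere in the string, so for a pattern of this shape it is enough to find the literal followed by at least x characters in the range. The upper bound y does not change the result.

```python
import re


def literal_range_match(s: str, literal: str, lo: str, hi: str, min_count: int) -> bool:
    """Hypothetical helper: True if `literal` occurs in `s` and is immediately
    followed by at least `min_count` characters in the inclusive range [lo, hi]."""
    start = s.find(literal)
    while start != -1:
        end = start + len(literal)
        run = s[end:end + min_count]
        # The slice is shorter than min_count when the string ends too early.
        if len(run) == min_count and all(lo <= ch <= hi for ch in run):
            return True
        # Try the next occurrence of the literal.
        start = s.find(literal, start + 1)
    return False


# The two examples from the description, checked against Python's own regex engine.
print(literal_range_match("xxxabc12345", "abc", "0", "9", 2))
print(re.search(r"abc[0-9]{2,3}", "xxxabc12345") is not None)
print(literal_range_match("spark-rapids上海", "", "\u4e00", "\u9fa5", 1))
```

An empty literal covers the range-only case such as [\u4e00-\u9fa5]+, where the minimum count is 1.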

A quick performance test: 2,000,000 strings with lengths from 0 to 2000, about 1/3 of which contain the substring "abcdefghijklmnopqrstuvwxyz", running 100 queries like 'abcdefghijklmnopqrstuvwxy[0-9]{5}'.

| cpu | gluten | gpu before | gpu after | speedup vs gpu before |
| --- | --- | --- | --- | --- |
| 6676.7 ms | 2024.00 ms | 4106.7 ms | 821.7 ms | 5.00x |
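For anyone who wants to approximate the test data at a smaller scale, a sketch of a generator follows. The PR does not say how the data was produced, so everything here is an assumption: the function name `make_strings`, the alphabet, the seed, and where the substring is placed. Inserting the substring also pushes some strings past the nominal maximum length.

```python
import random
import string

NEEDLE = "abcdefghijklmnopqrstuvwxyz"


def make_strings(n, max_len=2000, needle_ratio=1 / 3, seed=0):
    """Hypothetical generator: n random strings of length 0..max_len,
    with roughly `needle_ratio` of them containing NEEDLE."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    out = []
    for _ in range(n):
        length = rng.randint(0, max_len)
        s = "".join(rng.choices(alphabet, k=length))
        if rng.random() < needle_ratio:
            # Insert the needle at a random position (assumed placement).
            pos = rng.randint(0, len(s))
            s = s[:pos] + NEEDLE + s[pos:]
        out.append(s)
    return out


data = make_strings(300)
print(len(data), sum(NEEDLE in s for s in data))
```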

thirtiseven and others added 30 commits April 7, 2024 19:42
Signed-off-by: Haoyang Li <[email protected]>
* Add NVTX ranges to identify Spark stages and tasks

Signed-off-by: Jason Lowe <[email protected]>

* scalastyle

---------

Signed-off-by: Jason Lowe <[email protected]>
@wjxiz1992 wjxiz1992 merged commit 3c8daac into nvliyuan:0520-base-local May 21, 2024
1 of 2 checks passed
nvliyuan pushed a commit that referenced this pull request May 28, 2024
* A hacky approach for regexpr rewrite

Signed-off-by: Haoyang Li <[email protected]>

* Use contains instead for that case

Signed-off-by: Haoyang Li <[email protected]>

* add config to switch

Signed-off-by: Haoyang Li <[email protected]>

* Rewrite some rlike expression to StartsWith/EndsWith/Contains

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* Draft code to adapt RegexParser in regex rewrite

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* Apply suggestions from code review

Co-authored-by: Gera Shegalov <[email protected]>

* A checkpoint before removing endsWith rewrite

Signed-off-by: Haoyang Li <[email protected]>

* Remove equalsTo and endsWith

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* address a comment

Signed-off-by: Haoyang Li <[email protected]>

* address a comment

Signed-off-by: Haoyang Li <[email protected]>

* address comments

Signed-off-by: Haoyang Li <[email protected]>

* fix 2.13 build

Signed-off-by: Haoyang Li <[email protected]>

* checkpoint before pattern matching => if

Signed-off-by: Haoyang Li <[email protected]>

* Add prefix range in regex parser rewrite

Signed-off-by: Haoyang Li <[email protected]>

* Address comments

Signed-off-by: Haoyang Li <[email protected]>

* wip

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* change some names

Signed-off-by: Haoyang Li <[email protected]>

* checkpoint before upmerge

Signed-off-by: Haoyang Li <[email protected]>

* add tests

Signed-off-by: Haoyang Li <[email protected]>

* Catch exceptions when trying to examine Iceberg scan for metadata queries (NVIDIA#10836)

Signed-off-by: Jason Lowe <[email protected]>

* Add NVTX ranges to identify Spark stages and tasks (NVIDIA#10826)

* Add NVTX ranges to identify Spark stages and tasks

Signed-off-by: Jason Lowe <[email protected]>

* scalastyle

---------

Signed-off-by: Jason Lowe <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
Co-authored-by: Gera Shegalov <[email protected]>
Co-authored-by: Jason Lowe <[email protected]>
3 participants