Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework on substring index #11149

Merged
merged 25 commits into from
Jul 25, 2024

Conversation

Feng-Jiang28
Copy link
Collaborator

@Feng-Jiang28 Feng-Jiang28 commented Jul 7, 2024

Depends on NVIDIA/spark-rapids-jni#2205
This PR is for closing issue #8750, by replace the regex implementation underneath with cudf::strings::slice_strings() with cudf::strings::find()/rfind(), new implementation achieved 12X speed up comparing to the old version NVIDIA/spark-rapids-jni#2205 (comment)

@Feng-Jiang28 Feng-Jiang28 marked this pull request as draft July 7, 2024 16:05
@winningsix winningsix changed the title [WIP]Reworkonsubstringindex [WIP]Rework on substring index Jul 8, 2024
@thirtiseven
Copy link
Collaborator

Depends on NVIDIA/spark-rapids-jni#2205

@Feng-Jiang28 Feng-Jiang28 changed the title [WIP]Rework on substring index Rework on substring index Jul 15, 2024
@Feng-Jiang28 Feng-Jiang28 marked this pull request as ready for review July 15, 2024 01:48
Signed-off-by: fejiang <[email protected]>
Signed-off-by: fejiang <[email protected]>
@sameerz sameerz added the performance A performance related task/issue label Jul 15, 2024
Copy link
Collaborator

@thirtiseven thirtiseven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have some new integration tests for new covered cases so we can test them on all spark versions.

Signed-off-by: fejiang <[email protected]>
@Feng-Jiang28
Copy link
Collaborator Author

build

Signed-off-by: fejiang <[email protected]>
@Feng-Jiang28
Copy link
Collaborator Author

build

@Feng-Jiang28
Copy link
Collaborator Author

build

1 similar comment
@Feng-Jiang28
Copy link
Collaborator Author

build

@Feng-Jiang28
Copy link
Collaborator Author

build

Signed-off-by: fejiang <[email protected]>
Signed-off-by: fejiang <[email protected]>
Copy link
Collaborator

@thirtiseven thirtiseven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Feng-Jiang28
Copy link
Collaborator Author

build

@Feng-Jiang28 Feng-Jiang28 requested review from jlowe and revans2 July 23, 2024 05:42
revans2
revans2 previously approved these changes Jul 23, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance is much much better and is super fast compared to the CPU.

@Feng-Jiang28
Copy link
Collaborator Author

build

1 similar comment
@Feng-Jiang28
Copy link
Collaborator Author

build

@Feng-Jiang28
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in my comment, docs need to be regenerated. Premerge checks are failing because of this. You can run mvn verify -DskipTests to regenerate the docs and then commit the changes.

[2024-07-24T13:11:36.608Z] found modified files during mvn verify:
[2024-07-24T13:11:36.609Z]  M docs/supported_ops.md

Signed-off-by: fejiang <[email protected]>
@jlowe
Copy link
Member

jlowe commented Jul 24, 2024

build

@Feng-Jiang28 Feng-Jiang28 merged commit 37beb24 into NVIDIA:branch-24.08 Jul 25, 2024
43 checks passed
@Feng-Jiang28 Feng-Jiang28 deleted the reworkonsubstringindex branch July 25, 2024 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants