Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multi string contains [databricks] #11413

Merged
merged 7 commits into from
Nov 19, 2024

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Aug 30, 2024

Combine multi targets for multiple string contains, and invoke one kernel to get multiple bool columns results.

Depends on cuDF PR

Verified: contains in case when were combined when running pytest.

Perf test

summary

strings pattern total kernel time change kernel speedup end to end time change end to end speedup
short strings 125ms -> 38ms 3.5x 12s->12s not obvious
long strings 24s -> 11s 2.18x 59s-53s 1.11x

details

short strings

image
The above part is kernel time of new kernel
The bottom part is base line.

long strings

Kernel time:
Base line:
image

new:
image

@res-life res-life self-assigned this Aug 30, 2024
@res-life res-life force-pushed the multi-string-contians branch from 27654c6 to 4444aa4 Compare September 2, 2024 05:10
@res-life res-life force-pushed the multi-string-contians branch from 4444aa4 to 3677ff6 Compare September 2, 2024 05:20
@res-life res-life changed the title [Do not review] Support multi string contians Support multi string contians Sep 2, 2024
@res-life res-life marked this pull request as ready for review September 2, 2024 09:29
@res-life
Copy link
Collaborator Author

res-life commented Sep 2, 2024

Building failed, because it's depending on rapidsai/cudf#16641

@res-life res-life requested a review from revans2 September 2, 2024 09:30
@sameerz sameerz added the performance A performance related task/issue label Sep 2, 2024
@res-life res-life changed the title Support multi string contians Support multi string contians [databricks] Sep 3, 2024
revans2
revans2 previously approved these changes Sep 4, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good and the performance numbers look good.

gerashegalov
gerashegalov previously approved these changes Sep 7, 2024
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nits

@@ -379,3 +379,30 @@ def test_case_when_all_then_values_are_scalars_with_nulls():
"tab",
sql_without_else,
conf = {'spark.rapids.sql.case_when.fuse': 'true'})

@pytest.mark.parametrize('combine_string_contains_enabled', ['true', 'false'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer Python constants

Suggested change
@pytest.mark.parametrize('combine_string_contains_enabled', ['true', 'false'])
@pytest.mark.parametrize('combine_string_contains_enabled', [True, False])

However, the pytest case id will be more readable if, instead of a boolean, parameters are strings

@pytest.mark.parametrize('string_contains_mode', ['multiContains', 'singleContains'], ids=idfn)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +471 to +473
override def equals(o: Any): Boolean = o match {
case other: ContainsCombiner => exp.left.semanticEquals(other.exp.left) &&
exp.right.isInstanceOf[GpuLiteral] && other.exp.right.isInstanceOf[GpuLiteral]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you make ContainsCombiner a case class you can use pattern matching instead of manual instanceof checks:

Suggested change
override def equals(o: Any): Boolean = o match {
case other: ContainsCombiner => exp.left.semanticEquals(other.exp.left) &&
exp.right.isInstanceOf[GpuLiteral] && other.exp.right.isInstanceOf[GpuLiteral]
override def equals(o: Any): Boolean = (o, exp) match {
case (ContainsCombiner(GpuContains(combLeft, GpuLiteral(_, _))), GpuContains(expLeft, GpuLiteral(_, _))) =>
expLeft.semanticEquals(combLeft)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not significantly reduce the num of code lines, from 5 lines to 4 lines.
Let me keep the original versoin, since it's a nit.

@ttnghia ttnghia changed the title Support multi string contians [databricks] Support multi string contains [databricks] Sep 27, 2024
@res-life res-life changed the base branch from branch-24.10 to branch-24.12 October 9, 2024 02:09
@res-life res-life dismissed stale reviews from gerashegalov and revans2 October 9, 2024 02:09

The base branch was changed.

@res-life
Copy link
Collaborator Author

build

1 similar comment
@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

res-life commented Nov 18, 2024

The previous commit fixed the casting from String to UTF8String error when running test_rlike_multi_line test case

E                   : java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
E                   	at org.apache.spark.sql.rapids.ContainsCombiner.$anonfun$multiContains$3(stringFunctions.scala:506)
E                   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)

@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

Filed a follow-up issue optimize the multi-contains generated by rlike

@res-life
Copy link
Collaborator Author

The perf test in the description was updated, please look again.
For the long strings, kernel speedup is 2.18x now.
CI was passed, please review again.

@res-life
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator

Looks good to me.

@res-life res-life merged commit b16d107 into NVIDIA:branch-24.12 Nov 19, 2024
48 of 49 checks passed
@res-life res-life deleted the multi-string-contians branch November 19, 2024 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants