[FEA] Add regular expression support to GPU implementation of StringSplit #4003

Closed
andygrove opened this issue Nov 2, 2021 · 3 comments · Fixed by #4714
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
feature request: New feature or request
P1: Nice to have for release

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.
We currently implement StringSplit but only support non-regexp patterns.

Describe the solution you'd like
We should support regular expressions in the split pattern, compatible with Spark.
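To make the gap concrete, here is a minimal sketch of the two behaviours on the JVM. It assumes that Spark's split follows `java.lang.String.split` semantics with a limit of -1; the object and method names are hypothetical and are not part of the plugin.

```scala
import java.util.regex.Pattern

object SplitSemantics {
  // Regex split: the pattern is interpreted as a Java regular expression.
  // Limit -1 keeps trailing empty strings (assumed to match Spark's default).
  def regexSplit(s: String, pattern: String): Seq[String] =
    s.split(pattern, -1).toSeq

  // Literal split: quote the delimiter so regex metacharacters lose their meaning.
  // This corresponds to the non-regexp case that is already supported.
  def literalSplit(s: String, delimiter: String): Seq[String] =
    s.split(Pattern.quote(delimiter), -1).toSeq
}
```

For example, `SplitSemantics.regexSplit("a1b22c", "[0-9]+")` splits on runs of digits, which a literal delimiter cannot express.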

Describe alternatives you've considered
None

Additional context
None

andygrove added the feature request and ? - Needs Triage (Need team to review and classify) labels on Nov 2, 2021
Salonijain27 removed the ? - Needs Triage label on Nov 2, 2021
Salonijain27 added the P1 label on Nov 12, 2021
andygrove self-assigned this on Dec 2, 2021
andygrove added this to the Nov 30 - Dec 10 milestone on Dec 2, 2021
@andygrove
Contributor Author

andygrove commented Dec 7, 2021

Depends on rapidsai/cudf#3584

andygrove added the cudf_dependency label on Dec 7, 2021
andygrove removed this from the Nov 30 - Dec 10 milestone on Dec 8, 2021
andygrove removed their assignment on Jan 7, 2022
@ttnghia
Collaborator

ttnghia commented Jan 26, 2022

Depends on rapidsai/cudf#10128 and rapidsai/cudf#10139.

@sameerz
Collaborator

sameerz commented Feb 1, 2022

Example from #4658 (comment)

I wish we supported split with regular expressions.

Mini repro:

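// Intended for spark-shell, where `spark` is already defined.
val address = Seq(
  (1, "abc.com"),
  (2, "...abc"),
  (3, ".a.b.c"))

import spark.implicits._
val df = address.toDF("id", "txt")

// Round-trip through Parquet so the query reads from a file scan.
df.write.mode("overwrite").format("parquet").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")

// Split on the pattern and take the first element.
spark.sql("select id, txt, split(txt, '\\.')[0] AS new_txt from df2").show()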

Not-supported-messages:

!Expression split(txt#209, ., -1) cannot run on GPU because regular expressions are not supported yet
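Note that the message prints the pattern as `.` rather than `\.`, which suggests that one level of escaping was consumed before the expression reached the plugin. As a rough CPU reference, the sketch below compares the two patterns on the repro rows using plain `String.split`. It assumes Spark's split matches `java.lang.String.split` with limit -1; `ReproSplit` and `firstField` are hypothetical names.

```scala
object ReproSplit {
  // Rows from the mini repro above.
  val rows: Seq[(Int, String)] = Seq((1, "abc.com"), (2, "...abc"), (3, ".a.b.c"))

  // First element of a regex split with limit -1
  // (assumed to correspond to split(txt, regex)[0] in Spark SQL).
  def firstField(txt: String, regex: String): String = txt.split(regex, -1)(0)

  def main(args: Array[String]): Unit =
    for ((id, txt) <- rows) {
      val escaped = firstField(txt, "\\.") // regex \. : matches a literal dot
      val anyChar = firstField(txt, ".")   // regex .  : matches any character
      println(s"$id $txt escaped='$escaped' anyChar='$anyChar'")
    }
}
```

With the escaped pattern, only `abc.com` yields a non-empty first element (`abc`). With the bare `.`, every row yields an empty string, so a GPU implementation would need to reproduce whichever pattern the expression actually carries.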
