Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support reverse #6885

Closed
viadea opened this issue Oct 21, 2022 · 11 comments · Fixed by #7285 or #7380
Closed

[FEA] Support reverse #6885

viadea opened this issue Oct 21, 2022 · 11 comments · Fixed by #7285 or #7380
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request

Comments

@viadea
Copy link
Collaborator

viadea commented Oct 21, 2022

I wish we can support reverse function.

select reverse(a) from test_diff_schema_evo;
    ! <Reverse> reverse(a#0) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.Reverse
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 21, 2022
@revans2
Copy link
Collaborator

revans2 commented Nov 2, 2022

I don't see any existing cudf functionality that will do what we want. @viadea do you want revers for string, array, or both?

We will need separate kernels or APIs for each of these. This is because just reversing the bits in a UTF-8 string will not reverse the string because strings are multi-byte.

@revans2 revans2 added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Nov 2, 2022
@cindyyuanjiang cindyyuanjiang self-assigned this Nov 2, 2022
@viadea
Copy link
Collaborator Author

viadea commented Nov 9, 2022

@revans2 String is good for now. i can show you a real example offline

@cindyyuanjiang cindyyuanjiang removed their assignment Nov 9, 2022
@ttnghia
Copy link
Collaborator

ttnghia commented Nov 18, 2022

In cudf string is just a special case of array: array of chars. So we can support both at the same time.

@ttnghia
Copy link
Collaborator

ttnghia commented Nov 18, 2022

Some example:

scala> val df = Seq("this is a string", "another test string").toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]

scala> df.show
+-------------------+
|                  s|
+-------------------+
|   this is a string|
|another test string|
+-------------------+


scala> df.selectExpr("reverse(s)").show
+-------------------+
|         reverse(s)|
+-------------------+
|   gnirts a si siht|
|gnirts tset rehtona|
+-------------------+


scala> val df2 = Seq(Seq(1, 2, 3), Seq(4, 5, 6)).toDF("i")
df2: org.apache.spark.sql.DataFrame = [i: array<int>]

scala> df2.show
+---------+
|        i|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
+---------+


scala> df2.selectExpr("reverse(i)").show
+----------+
|reverse(i)|
+----------+
| [3, 2, 1]|
| [6, 5, 4]|
+----------+

@ttnghia
Copy link
Collaborator

ttnghia commented Nov 18, 2022

I was worried if reverse("this is a string") could produce "string a is this" but it didn't, which is very good.

@revans2
Copy link
Collaborator

revans2 commented Nov 18, 2022

The problem is that strings are stored in UTF-8, which supports multi-byte characters. ASCII we can probably do it very simply, but for anything that is not ASCII it will potentially corrupt the string.

@ttnghia
Copy link
Collaborator

ttnghia commented Nov 18, 2022

I see-so the same implementation can't be used for both, but the implementation should be very similar and straightforward.

@revans2
Copy link
Collaborator

revans2 commented Nov 18, 2022

For a list reverse a gather map would likely work well. It would be even better if you could not have to materialize the gather map, because for some list types the gather map is going to be larger than the list column itself. But that is probably a minor optimization.

If you look at https://github.com/rapidsai/cudf/blob/branch-23.02/cpp/include/cudf/strings/detail/utf8.hpp it provides APIs to be able to see if the current byte is part of a multi-byte character or not, and what the length of it is. If you can find a good way to do this type of thing in parallel, then great. But from what I have seen most of the string APIs work by having a single thread per string. Even then we might be able to play some games to coalesce reads and writes to speed things up.

@ttnghia
Copy link
Collaborator

ttnghia commented Nov 30, 2022

We also have a duplicate FEA issue: #4375.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 1, 2022
Adds `cudf::strings::reverse` function.
This is to support NVIDIA/spark-rapids#6885

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Nghia Truong (https://github.com/ttnghia)
  - Christopher Harris (https://github.com/cwharris)
  - Jake Awe (https://github.com/AyodeAwe)
  - Bradley Dice (https://github.com/bdice)

URL: #12227
@cindyyuanjiang cindyyuanjiang self-assigned this Dec 2, 2022
@sameerz
Copy link
Collaborator

sameerz commented Dec 2, 2022

Also depends on rapidsai/cudf#12283

@cindyyuanjiang cindyyuanjiang linked a pull request Dec 8, 2022 that will close this issue
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 13, 2022
This implements `lists::reverse` that output a lists column in which each list is generated by reversing the order of the elements in the corresponding input list.

Example:
```
 s = [ [1, 2, 3], [], null, [4, 5, null] ]
 r = reverse(s)
 r is now [ [3, 2, 1], [], null, [null, 5, 4] ]
```

This is to support NVIDIA/spark-rapids#4375 and NVIDIA/spark-rapids#6885.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Jason Lowe (https://github.com/jlowe)

URL: #12336
@ttnghia
Copy link
Collaborator

ttnghia commented Dec 14, 2022

The libcudf PR for array input type was merged (rapidsai/cudf#12336). We can continue to complete our plugin work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
5 participants