[QST] Question about using UDF to implement operations. #2535
Comments
We support UDFs (sort of). If the UDF is really simple and can be translated into a Catalyst expression, then we can do some things with that if you turn it on (very experimental). I don't think what you are doing is something we support yet for translation to Catalyst. The other option is that you can write your own UDFs, either using CUDA directly or using the Java cuDF API; those give you a lot of control. But it looks like your UDF is really just a join: you have an array mapping partition ids to some other number, and you want to look up a value based on that partition id. That is a join. As a side note, we are working on …
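For illustration, that kind of lookup can usually be expressed as a plain DataFrame join rather than a UDF. This is only a minimal sketch with made-up column names, not code from the issue:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// The array mapping partition ids to some other number becomes a small lookup table.
val mapping = Seq((0, 100L), (1, 200L), (2, 300L)).toDF("partition_id", "mapped_value")

// The main data only needs to carry the partition id it wants to look up.
val data = Seq(("seq-a", 0), ("seq-b", 2)).toDF("sequence", "partition_id")

// Instead of a udf(id => lookupArray(id)), join against the mapping table.
// broadcast() hints that the small side should be shipped to every executor,
// keeping this a simple broadcast hash join.
val result = data.join(broadcast(mapping), Seq("partition_id"), "left")
```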
Thank you for the quick and detailed response! How about collect_set in windowing? I am under the impression the cuDF library supports collect_set in its Java API.
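As a sketch of what is being asked about, a windowed collect_set in Spark looks roughly like this (hypothetical column names, for illustration only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_set

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val edges = Seq(("ACGT", "CGTA"), ("ACGT", "CGTT"), ("ACGT", "CGTA")).toDF("kmer", "next_kmer")

// collect_set over a window partitioned by kmer; the duplicate "CGTA" collapses to one entry.
val w = Window.partitionBy("kmer")
val withSets = edges.withColumn("neighbor_set", collect_set("next_kmer").over(w))
```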
cuDF just did a code freeze for our next release, and we will be doing our own code freeze shortly, so remembering what is in previous releases gets to be a bit complicated. https://nvidia.github.io/spark-rapids/docs/supported_ops.html should list all of the operations for the current release on Apache Spark 3.0.0.
@jlowe I don't think we support UDAFs yet for RapidsUDFs. Do we?
@revans2 correct, UDAFs are not yet supported.
Closing this as answered. Feel free to reopen if there's more to discuss.
This issue was moved to a discussion. You can continue the conversation there.
Hey all!

As part of my thesis I am doing research on spark-rapids, comparing GPU and CPU processing on biological sequencing, essentially constructing De Bruijn graphs from a large text file. The part of the code I want to accelerate is fairly simple; the only complicated operations I require that are not already implemented are collect_set and zipWithIndex. Is there any method using UDFs to implement these in a GPU-accelerated way?

The data I want to use both functions on is (String, Long), making the code look something like this:
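A minimal sketch of that kind of step, assuming (String, Long) rows and hypothetical column names (not the original snippet from the issue), might be:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// (String, Long) pairs: a k-mer and some associated numeric id.
val pairs = Seq(("ACGT", 1L), ("ACGT", 3L), ("CGTA", 2L)).toDF("kmer", "value")

// Deduplicate the values seen for each k-mer...
val grouped = pairs.groupBy("kmer").agg(collect_set("value").as("values"))

// ...and give every distinct k-mer a unique index, which is where zipWithIndex comes in.
val indexed = grouped.rdd.zipWithIndex
```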
As far as I can tell, collect_list is supported in windowing, however dropDuplicates is not supported on list types, and zipWithIndex is not supported on DataFrames; however, I have been using a function that mostly transfers the operation to a DataFrame:
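The helper itself is not shown above, but the usual shape of such a function (a hypothetical sketch, not necessarily the exact one used here) drops to the RDD API, zips each row with an index, and rebuilds a DataFrame with an extra LongType column:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def dfZipWithIndex(df: DataFrame, indexColName: String = "row_id"): DataFrame = {
  // Zip every row with a stable 0-based index at the RDD level.
  val rddWithIndex = df.rdd.zipWithIndex.map {
    case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
  }
  // Append the index column to the original schema and rebuild the DataFrame.
  val schema = StructType(df.schema.fields :+ StructField(indexColName, LongType, nullable = false))
  df.sparkSession.createDataFrame(rddWithIndex, schema)
}
```

One consequence of this pattern is that the zipWithIndex step runs through the RDD API, which, as far as I understand, the RAPIDS accelerator does not operate on, so that stage falls back to the CPU.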