Register `partd` encode dispatch in `dask_cudf` #14287
Makes sense, thanks!
Hmm, looks like the disk shuffle in dask wants some internal property of an
Ah, right, sorry. The test requires a newer version of dask, so I'll need to add a version check to that test.
/ok to test
import pickle
from functools import partial

import partd
Can we add `partd` to our package requirements and conda recipes?
I think it comes in transitively through dask? Which is a dependency of dask-cudf.
Right, exactly: dask depends on partd. I think it should be safe for us to let dask worry about the partd dependency. If dask suddenly stops using partd for `shuffle="disk"`, it will also stop using `partd_encode_dispatch`.
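For readers unfamiliar with the idea, here is a rough, standard-library-only sketch of what a pickle-based encoding layer over an append-only byte store could look like. It is not the code in this PR and does not use partd or cudf: `PickleEncodedStore`, `frame`, `split_frames`, and `join_rows` are hypothetical names, and plain Python lists of rows stand in for `cudf.DataFrame` partitions.

```python
import pickle
import struct
from collections import defaultdict
from functools import partial


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as an 8-byte little-endian integer."""
    return struct.pack("<Q", len(payload)) + payload


def split_frames(blob: bytes):
    """Yield each length-prefixed payload from a concatenated byte string."""
    offset = 0
    while offset < len(blob):
        (size,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        yield blob[offset:offset + size]
        offset += size


class PickleEncodedStore:
    """Hypothetical in-memory stand-in for an append-only byte store
    wrapped in an encode/decode/join layer."""

    def __init__(self, dumps, loads, join):
        self._dumps = dumps
        self._loads = loads
        self._join = join
        self._blobs = defaultdict(bytes)

    def append(self, data: dict) -> None:
        # Serialize each object and append it to the bytes stored under its key.
        for key, obj in data.items():
            self._blobs[key] += frame(self._dumps(obj))

    def get(self, key):
        # Decode every appended frame, then join the pieces into one object.
        parts = [self._loads(p) for p in split_frames(self._blobs[key])]
        return self._join(parts)


def join_rows(parts):
    """Concatenate row lists; an empty input yields an empty 'frame'."""
    return [row for part in parts for row in part]


# Assumption: lists of tuples stand in for DataFrame partitions.
dumps = partial(pickle.dumps, protocol=pickle.HIGHEST_PROTOCOL)
store = PickleEncodedStore(dumps, pickle.loads, join_rows)
store.append({"part-0": [(1, "a"), (2, "b")]})
store.append({"part-0": [(3, "c")], "part-1": [(4, "d")]})
```

Calling `store.get("part-0")` decodes each stored frame and concatenates the pieces in append order. Every partition passes through a full serialize and deserialize round trip, which is presumably part of why the description calls the pickle path slow.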
/ok to test
/merge
Description
This PR enables "disk"-based shuffling of `cudf`-backed Dask-DataFrame collections, but does not yet add the `shuffle="disk"` option to the `dask_cudf.DataFrame.shuffle/sort_values` APIs.

We now use basic (slow) `pickle` logic to convert `cudf.DataFrame` objects to/from `bytes` here, so I'd like to consider further optimizations before making the `shuffle="disk"` option "official".

Checklist