Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dask serializers for cuDF objects #4153

Merged
merged 1 commit into from
Feb 14, 2020

Conversation

jakirkham
Copy link
Member

In the event that serializing CUDA objects directly is not possible, this performs the next best thing, which is to serialize objects using Dask's serialization protocol. The protocol requires that data be on the host in 1-D contiguous memoryviews. So we perform serialization as we otherwise would. As a last step we perform a device-to-host transfer of the frames. Then we hand this off to Dask to serialize. When deserializing the data, all of the deserializers already work as frames are turned into Buffers, which perform a host-to-device transfer if needed. This provides us an option that avoids pickling. As a result we are able to serialize things with Dask more efficiently using this protocol.

In the event that serializing CUDA objects directly is not possible,
this performs the next best thing, which is to serialize objects using
Dask's serialization protocol. The protocol requires that data be on the
host in 1-D contiguous `memoryviews`. So we perform serialization as we
otherwise would. As a last step we perform a device-to-host transfer of
the frames. Then we hand this off to Dask to serialize. When
deserializing the data, all of the deserializers already work as frames
are turned into `Buffer`s, which perform a host-to-device transfer if
needed. This provides us an option that avoids pickling. As a result we
are able to serialize things with Dask more efficiently using this
protocol.
@jakirkham jakirkham requested a review from a team as a code owner February 14, 2020 02:28
@codecov
Copy link

codecov bot commented Feb 14, 2020

Codecov Report

Merging #4153 into branch-0.13 will decrease coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.13    #4153      +/-   ##
===============================================
- Coverage        86.74%   86.67%   -0.08%     
===============================================
  Files               50       50              
  Lines             9810     9818       +8     
===============================================
  Hits              8510     8510              
- Misses            1300     1308       +8     

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85983b0...e4b488c. Read the comment docs.

@kkraus14
Copy link
Collaborator

Would like someone more familiar with dask serialization dispatching to review as well before merging 😄

@jakirkham
Copy link
Member Author

@quasiben, would you be able to take a look? 🙂

@quasiben
Copy link
Member

When would serializing cuda objects directly not be possible ? As I understand it, if we are transporting over TCP we will be calling pickled on the objects which should trigger a device to host copy. Has this now changed with the addition Buffer over numba ?

@jakirkham
Copy link
Member Author

jakirkham commented Feb 14, 2020

Right this is the TCP case.

So this would be useful for people who either lack UCX, or the hardware to really take advantage of it. This could come up on some cloud service providers or for our users that are not currently using UCX.

Should also make for more realistic comparisons between UCX and TCP. Particularly as operations on the host Dask won't pickle in the first place. Also would be more realistic when comparing to other language implementations (Java, C++, etc.) that don't have this overhead.

@quasiben
Copy link
Member

So currently the TCP case is not supported? Or it is but not performant/well understood ?

@jakirkham
Copy link
Member Author

It's supported by falling back to pickle, which adds unnecessary performance overhead. This avoids that overhead.

@quasiben
Copy link
Member

Ah, that makes sense. Thanks @jakirkham

@quasiben quasiben merged commit 16ebc35 into rapidsai:branch-0.13 Feb 14, 2020
@jakirkham jakirkham deleted the add_dask_serializers branch February 14, 2020 18:05
@jakirkham
Copy link
Member Author

Thanks for the reviews! 😄

@jakirkham
Copy link
Member Author

Added PR ( dask/distributed#3478 ) to make sure we are registering this support with Dask. Things behave the same for cuDF versions lacking this support (in other words we fallback to pickle), but cuDF versions with this feature will perform more efficient serialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond dask Dask issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants