
Serializing CUDA objects in iterables #110

Closed
pentschev opened this issue Aug 9, 2019 · 4 comments

@pentschev
Member

We have improved serialization performance in #98, but it introduced an issue: serializing tuples (and potentially all other iterable types) containing CUDA objects now fails. The issue is that distributed.protocol.serialize will only serialize tuples when the serializer is pickle; for instance, serialize((np_array1, np_array2), serializers=["dask"]) will also fail.
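
A minimal sketch of the failure, using plain NumPy arrays in place of CUDA objects (the tuple restriction doesn't depend on the element type) and assuming a distributed version from around this time:

```python
import numpy as np
from distributed.protocol import serialize

np_array1 = np.arange(10)
np_array2 = np.arange(10)

# Works: "pickle" is in the list, so the tuple can fall back to it.
header, frames = serialize((np_array1, np_array2),
                           serializers=["dask", "pickle"])

# Fails: with only the "dask" serializer there is no handler for the
# tuple itself, so serialize() raises instead of recursing into it.
header, frames = serialize((np_array1, np_array2), serializers=["dask"])
```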

I found out about this issue when trying to run the sample from #57; more specifically, it happens during the df = df.sort_values(['A','B','C']) call, where it tries to serialize a tuple of (<cudf.DataFrame ncols=11 nrows=2759411 >, <cudf.DataFrame ncols=11 nrows=2759411 >).

@mrocklin maybe you have an idea on how to solve this? I thought of checking for iterables in device_to_host and serializing each object individually, perhaps returning an iterable of DeviceSerialized objects, but then we could have tuples of tuples, and so on, which would make things more complicated, so it doesn't seem like a resilient solution. Also, I'm not sure whether the solution belongs here or in dask-distributed, or maybe there's already one that I just don't know of.

@mrocklin
Contributor

mrocklin commented Aug 9, 2019

For now I think that your idea of implementing cuda_serialize on standard collections (tuple, list, set, dict) makes sense. If they encounter no CUDA-serializable objects then they should raise NotImplementedError so that lower-priority serializers take over. See this code:

https://github.com/dask/distributed/blob/a55515569d4c5da734e5b14ae414cd342c37ed7b/distributed/protocol/serialize.py#L140-L150
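
A rough, hypothetical sketch of that suggestion for tuples, assuming distributed's cuda_serialize/cuda_deserialize dispatch; the function names and the header keys ("sub-headers", "frame-start", "frame-count") are illustrative, not the fix that eventually landed:

```python
from distributed.protocol.cuda import cuda_serialize, cuda_deserialize
from distributed.protocol.serialize import serialize, deserialize


@cuda_serialize.register(tuple)
def cuda_serialize_tuple(x):
    # Serialize each element with the highest-priority serializer that
    # accepts it, recording which frames belong to which element.
    headers, all_frames = [], []
    for item in x:
        header, frames = serialize(item, serializers=["cuda", "dask", "pickle"])
        header["frame-start"] = len(all_frames)
        header["frame-count"] = len(frames)
        headers.append(header)
        all_frames.extend(frames)
    # If nothing in the tuple needed the CUDA family, bail out so a
    # lower-priority serializer (dask/pickle) takes over instead.
    if not any(h.get("serializer") == "cuda" for h in headers):
        raise NotImplementedError("no CUDA-serializable objects in tuple")
    return {"sub-headers": headers}, all_frames


@cuda_deserialize.register(tuple)
def cuda_deserialize_tuple(header, frames):
    # Rebuild each element from its own header and its slice of frames.
    items = []
    for sub in header["sub-headers"]:
        start, count = sub["frame-start"], sub["frame-count"]
        items.append(deserialize(sub, frames[start:start + count]))
    return tuple(items)
```

The key piece is the NotImplementedError fallback: tuples without CUDA objects get passed on to the lower-priority serializers unchanged.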

@pentschev
Member Author

But doing it the way you're suggesting, the case I mentioned at the beginning (a tuple of cuDF dataframes) would be serialized using pickle, which is slow. Is that what you're saying should happen?

Just to make it clear in case my initial comment wasn't: what I meant was more complex handling that would serialize each element of the tuple using the CUDA serializer, but that would be more involved and probably wouldn't cover all possible object combinations (e.g., cuDF dataframes inside a tuple, inside another tuple).

@pentschev
Member Author

Ignore my previous comment, I had misunderstood your suggestion. It does make sense to handle it the way you're saying; I'll try doing that.

@pentschev
Member Author

This has been fixed in dask/distributed#2948 and #118, closing.
