
Serialize data within tasks #2110

Open
mrocklin opened this issue Jul 12, 2018 · 6 comments

@mrocklin
Member

Currently if we submit data within a task like the following:

x = np.array([1, 2, 3])
future = client.submit(func, x)

Then we construct a task like (func, x) and call pickle on it. We don't do custom serialization once we have constructed tasks, mostly because data embedded in tasks is rarely large and traversing tasks would add overhead in the common case. Typically we encourage people to use dask.delayed or something similar to mark the data:

x = np.array([1, 2, 3])
x = dask.delayed(x)
future = client.submit(func, x)

But this is error-prone and requires expertise that few users have.

We might again consider traversing arguments.
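For reference, the workaround today is to scatter the data first so that it does go through the custom serialization path rather than being pickled inside the task. A minimal sketch, assuming a running client and a trivial func:

import numpy as np
from dask.distributed import Client

client = Client()

def func(a):
    return a.sum()

x = np.array([1, 2, 3])
x_future = client.scatter(x)             # data travels via distributed's serialization
future = client.submit(func, x_future)   # the worker resolves the future to the array
print(future.result())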

@dhirschfeld
Contributor

Just a note to mention that none of the below work:

fut = client.submit(echo, obj)

obj_fut = client.scatter(obj)
fut = client.submit(echo, obj_fut)

delayed_obj = delayed(obj)
fut = client.submit(echo, delayed_obj)

All three methods of submitting a function to run on the cluster fail, going through the exact same warn_dumps ⟶ pickle.dumps code path.

Failing test cases for each are currently part of PR #2115
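For context, the snippets above assume a setup roughly like the following (hypothetical sketch; obj is a pyarrow RecordBatch, per the errors below):

import pyarrow as pa
from dask import delayed
from dask.distributed import Client

client = Client()

def echo(x):
    # return the argument unchanged so only serialization is exercised
    return x

obj = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=['a'])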

@dhirschfeld
Contributor

Update:
Scattering does work, but only if you ensure the custom serialization is imported on all workers.

obj_fut = client.scatter(obj)

...will fail with the below error if distributed.protocol.arrow isn't explicitly imported.

  File "C:\dev\src\distributed\distributed\protocol\serialize.py", line 157, in serialize
    raise TypeError(msg)
TypeError: Could not serialize object of type RecordBatch

If you import distributed.protocol.arrow in the client process but not in the workers it fails with the below error:

  File "c:\dev\src\distributed\distributed\core.py", line 448, in send_recv
    raise Exception(response['text'])
Exception: Serialization for type pyarrow.lib.RecordBatch not found

...so to get it to actually work I need to run:

def init_arrow():
    # importing the module registers the arrow serialization functions
    from distributed.protocol import arrow
    return None

init_arrow()             # register in the client process
client.run(init_arrow)   # register on every worker
obj_fut = client.scatter(obj)
fut = client.submit(echo, obj_fut)
result = fut.result()
assert obj.equals(result)

@dhirschfeld
Contributor

The question I have is: is this by design? I.e. is it intended that the user has to initialise the serialisation support on all of the workers?

In the case of an adaptive cluster I guess this could be supported by using the --preload option for any new workers.
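A rough sketch of how that might look (the module name arrow_preload is made up for illustration):

# arrow_preload.py -- importing distributed.protocol.arrow registers the serializers
from distributed.protocol import arrow  # noqa: F401

and then start each worker with dask-worker <scheduler-address> --preload arrow_preload so that the serializers are registered before any tasks arrive.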

@mrocklin
Member Author

Yes, something like preload is probably the right way to handle this today, assuming that it's not already in library code.

Eventually it would be nice to allow clients to register functions with the scheduler to be run at worker start time, so that they can be passed to new workers as they start up.

  1. Client registers func with the scheduler as a preload operation
  2. Scheduler holds on to func
  3. Scheduler also sends func to all workers to have them run func
  4. Workers run func
  5. New worker arrives, tells Scheduler that it exists
  6. As part of saying "Hi" the scheduler also gives the worker func and tells it to run it

If anyone is interested in implementing this let me know and I'll point to the various locations in the client, scheduler, and worker, where these changes would have to be made. It's a modest amount of work and would be a good introduction to the distributed system.
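A hypothetical sketch of how that flow might look from the client side (register_preload does not exist today; the name is made up for illustration):

def init_arrow():
    from distributed.protocol import arrow
    return None

# steps 1-2: client registers the function, scheduler holds on to it
client.register_preload(init_arrow)
# steps 3-4: scheduler sends the function to all current workers, which run it
# steps 5-6: a worker that joins later receives the function during its handshake and runs it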

@dhirschfeld
Contributor

I moved the above to a new issue as I think it's a separate concern.

@jakirkham
Member

The idea of handling external serialization more simply is brought up in issue #3831 as well.
