Serialize data within tasks #2110
Just a note to mention that none of the below work:

```python
fut = client.submit(echo, obj)

obj_fut = client.scatter(obj)
fut = client.submit(echo, obj_fut)

delayed_obj = delayed(obj)
fut = client.submit(echo, delayed_obj)
```

All three methods of submitting a function to run on the cluster fail in the same way. Failing test cases for each are currently part of PR #2115.
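For context, a minimal sketch of the setup these snippets assume; the `echo` helper and the pyarrow `obj` are not defined in the comment above, so the definitions here are illustrative guesses rather than the reporter's actual code:

```python
import pyarrow as pa
from dask import delayed
from dask.distributed import Client

client = Client()  # assumed: a running cluster or local client


def echo(x):
    # trivial round-trip function: returns its argument unchanged
    return x


# assumed: obj is a pyarrow RecordBatch, matching the errors reported below
obj = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["a"])
```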
Update:

```python
obj_fut = client.scatter(obj)
```

...will fail with the below error if the pyarrow serialization support in `distributed.protocol.arrow` has not been imported:

```
  File "C:\dev\src\distributed\distributed\protocol\serialize.py", line 157, in serialize
    raise TypeError(msg)
TypeError: Could not serialize object of type RecordBatch
```

If you import it on the client only, it instead fails on the worker side with:

```
  File "c:\dev\src\distributed\distributed\core.py", line 448, in send_recv
    raise Exception(response['text'])
Exception: Serialization for type pyarrow.lib.RecordBatch not found
```

...so to get it to actually work I need to run:

```python
def init_arrow():
    from distributed.protocol import arrow
    return None


init_arrow()
client.run(init_arrow)

obj_fut = client.scatter(obj)
fut = client.submit(echo, obj_fut)
result = fut.result()
assert obj.equals(result)
```
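A possibly cleaner variant of this workaround, depending on the distributed version installed, is to register the import as a worker setup callback so it also runs on workers that join later; this is only a sketch and assumes `Client.register_worker_callbacks` is available:

```python
def init_arrow():
    # import registers distributed's pyarrow serialization support
    from distributed.protocol import arrow  # noqa: F401


init_arrow()                                   # register locally on the client
client.register_worker_callbacks(init_arrow)   # run on current and future workers

obj_fut = client.scatter(obj)
fut = client.submit(echo, obj_fut)
assert obj.equals(fut.result())
```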
The question I have is: is this by design? i.e. is it intended that the user has to initialise the serialisation support on all of the workers? In the case of an adaptive cluster I guess this could be supported by using the worker preload mechanism.
Yes, something like preload is probably the right way to handle this today, assuming that it's not already in library code. Eventually it would be nice to allow clients to register functions to be run at worker start time with the scheduler that could be passed to workers as they start up.
If anyone is interested in implementing this let me know and I'll point to the various locations in the client, scheduler, and worker where these changes would have to be made. It's a modest amount of work and would be a good introduction to the distributed system.
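As a concrete illustration of the preload route mentioned above, a worker preload script can perform the import at worker start-up; a minimal sketch (the file name and the pyarrow import are just for this example):

```python
# init_arrow_preload.py -- hypothetical preload script
from distributed.protocol import arrow  # noqa: F401  registers pyarrow serialization
```

Each worker would then be started with it, e.g. `dask-worker tcp://scheduler:8786 --preload init_arrow_preload.py`; the preload mechanism accepts a module name or a file path.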
I moved the above to a new issue as I think it's a separate concern.
The idea of handling external serialization more simply is brought up in issue #3831 as well.
Currently if we submit data within a task like the following:
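The original example is not reproduced in this extract; a minimal sketch of the kind of call being described, assuming `func` is some user function and `x` is a concrete data object passed inline, would be:

```python
future = client.submit(func, x)  # x is embedded directly in the task
```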
Then we construct a task like `(func, x)` and call pickle on this task. We don't do custom serialization once we construct tasks. This is mostly because stuff in tasks is rarely large, and traversing tasks can increase our overhead in the common case.

Typically we encourage people to use dask.delayed or something similar to mark data. But this is error prone and requires rarely held expertise.

We might again consider traversing arguments.
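To make the contrast concrete, a hedged sketch of the two styles discussed above; the data, function, and cluster here are stand-ins, not taken from the issue:

```python
import dask
import numpy as np
from dask.distributed import Client

client = Client()                      # assumed: a local cluster for illustration

x = np.random.random(1_000_000)        # stand-in for a large data object
func = np.sum                          # stand-in for a user function

# Inline submission: x is embedded in the pickled task (func, x)
fut_inline = client.submit(func, x)

# Marking x as data so it goes through distributed's custom serialization:
x_fut = client.scatter(x)              # ship the data to a worker first
fut_scatter = client.submit(func, x_fut)

# ...or via dask.delayed, which also marks x as data in the task graph
fut_delayed = client.compute(dask.delayed(func)(x))

print(fut_inline.result(), fut_scatter.result(), fut_delayed.result())
```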