Single pass serialization #4699
Changes from 15 commits
@@ -1,11 +1,17 @@

```python
import logging
import threading

import msgpack

from ..utils import LRU
from . import pickle
from .compression import decompress, maybe_compress
from .serialize import (
    MsgpackList,
    Serialize,
    Serialized,
    SerializedCallable,
    TaskGraphValue,
    merge_and_deserialize,
    msgpack_decode_default,
    msgpack_encode_default,
```
@@ -15,6 +21,39 @@

```python
logger = logging.getLogger(__name__)

cache_dumps = LRU(maxsize=100)
cache_loads = LRU(maxsize=100)
cache_dumps_lock = threading.Lock()
cache_loads_lock = threading.Lock()


def dumps_function(func):
    """Dump a function to bytes, cache functions"""
    try:
        with cache_dumps_lock:
            result = cache_dumps[func]
    except KeyError:
        result = pickle.dumps(func, protocol=4)
        if len(result) < 100000:
            with cache_dumps_lock:
                cache_dumps[func] = result
    except TypeError:  # Unhashable function
        result = pickle.dumps(func, protocol=4)
    return result


def loads_function(bytes_object):
    """Load a function from bytes, cache bytes"""
    if len(bytes_object) < 100000:
        with cache_dumps_lock:
            if bytes_object in cache_loads:
                return cache_loads[bytes_object]
            else:
                result = pickle.loads(bytes_object)
                cache_loads[bytes_object] = result
                return result
    return pickle.loads(bytes_object)


def dumps(msg, serializers=None, on_error="message", context=None) -> list:
    """Transform Python message to bytestream suitable for communication"""
```

**Review comment** (on the caching in `dumps_function`): I still don't have a good feeling for what timescales we're currently optimizing, or whether this is a particularly performance-critical section. Therefore, this comment might be irrelevant.

**Review comment** (on `result = pickle.dumps(func, protocol=4)`): Any reason why `protocol=4`?

**Reply:** I'm also curious about this.

**Reply:** I am curious too :) (see `distributed/worker.py` line 3527 in `fa5d993`). @jakirkham, do you know?

**Reply (@jakirkham):** That line comes from @mrocklin's PR (#4019), which allowed connections to dynamically determine what compression and pickle protocols are supported and then use them in communication. In a few places I think Matt found it easier to simply force pickle protocol 4 than to allow it to be configurable. So if this is coming from that worker code, that is the history.

**Review comment** (on `if len(result) < 100000:`): I think we can be a bit more generous with the cache size. Currently we're at […]

**Reply:** Related: in […]

**Reply:** It sounds like that was just copied and moved over from `distributed/worker.py` line 3528 in `fa5d993`. Perhaps we can make a new issue and revisit?

**Reply:** My plan is to remove […]

**Reply:** Ah, ok. In that case I don't think the […]

**Review comment** (on `if len(bytes_object) < 100000:` in `loads_function`): Personal preference: I would put the size of the cache into a constant so that the two functions don't drift apart.
```diff
@@ -47,25 +86,32 @@ def _inplace_compress_frames(header, frames):

     def _encode_default(obj):
         typ = type(obj)
-        if typ is Serialize or typ is Serialized:
-            offset = len(frames)
-            if typ is Serialized:
-                sub_header, sub_frames = obj.header, obj.frames
-            else:
-                sub_header, sub_frames = serialize_and_split(
-                    obj, serializers=serializers, on_error=on_error, context=context
-                )
-                _inplace_compress_frames(sub_header, sub_frames)
-            sub_header["num-sub-frames"] = len(sub_frames)
-            frames.append(
-                msgpack.dumps(
-                    sub_header, default=msgpack_encode_default, use_bin_type=True
-                )
-            )
-            frames.extend(sub_frames)
-            return {"__Serialized__": offset}
-        else:
-            return msgpack_encode_default(obj)
+        ret = msgpack_encode_default(obj)
+        if ret is not obj:
+            return ret
+
+        if typ is Serialize:
+            obj = obj.data  # TODO: remove Serialize/to_serialize completely
+
+        offset = len(frames)
+        if typ in (Serialized, SerializedCallable):
+            sub_header, sub_frames = obj.header, obj.frames
+        elif callable(obj):
+            sub_header, sub_frames = {"callable": dumps_function(obj)}, []
+        else:
+            sub_header, sub_frames = serialize_and_split(
+                obj, serializers=serializers, on_error=on_error, context=context
+            )
+            _inplace_compress_frames(sub_header, sub_frames)
+
+        sub_header["num-sub-frames"] = len(sub_frames)
+        frames.append(
+            msgpack.dumps(
+                sub_header, default=msgpack_encode_default, use_bin_type=True
+            )
+        )
+        frames.extend(sub_frames)
+        return {"__Serialized__": offset}

     frames[0] = msgpack.dumps(msg, default=_encode_default, use_bin_type=True)
     return frames
```

**Review comment** (on `sub_header, sub_frames = {"callable": dumps_function(obj)}, []`): Functions can be quite large sometimes, for example if users close over large variables out of function scope. Msgpack may not handle this well in some cases:

```python
x = np.arange(1000000000)

def f(y):
    return y + x.sum()
```

Obviously users shouldn't do this, but they will.

**Review comment:** It looks like we're bypassing the list of serializers here. This allows users to get past configurations where pickle has specifically been turned off.

**Review comment** (on `_inplace_compress_frames(sub_header, sub_frames)`): The inplace stuff always makes me uncomfortable. Thoughts on making new header/frames dict/lists here instead? For reference, it was these sorts of inplace operations that previously caused us to run into the msgpack tuple vs list difference. I think that avoiding them when we can is useful, unless there is a large performance boost (which I wouldn't expect here).
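The shape of `_encode_default` above is the core of the single-pass design: the codec calls the hook for any object it cannot encode itself, the hook appends that object's payload to an out-of-band frame list, and a small `{"__Serialized__": offset}` placeholder goes into the header frame instead. The pattern can be illustrated with the stdlib `json` codec standing in for msgpack (all names here are illustrative, and pickle stands in for the real serializer machinery):

```python
import json
import pickle


def dumps(msg):
    """Encode msg in a single pass: one header frame plus out-of-band payload frames."""
    frames = [None]  # frames[0] is reserved for the header

    def _encode_default(obj):
        # Called by json only for objects it cannot encode itself.
        offset = len(frames)
        frames.append(pickle.dumps(obj))        # out-of-band payload frame
        return {"__Serialized__": offset}       # placeholder left in the header

    frames[0] = json.dumps(msg, default=_encode_default).encode()
    return frames


def loads(frames):
    """Decode the header frame, swapping placeholders back for their payloads."""
    def _decode_default(obj):
        if "__Serialized__" in obj:
            return pickle.loads(frames[obj["__Serialized__"]])
        return obj

    return json.loads(frames[0], object_hook=_decode_default)
```

For example, a message containing a `set` (not JSON-encodable) round-trips through one header frame plus one payload frame, with no second serialization pass over the message.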
@@ -75,9 +121,18 @@ def _encode_default(obj):

```python
        raise


class DelayedExceptionRaise:
    def __init__(self, err):
        self.err = err


def loads(frames, deserialize=True, deserializers=None):
    """Transform bytestream back into Python value"""

    if deserializers is None:
        # TODO: get from configuration both here and in protocol.serialize()
        deserializers = ("dask", "pickle")

    try:

        def _decode_default(obj):
```
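`DelayedExceptionRaise` is a holder type: rather than failing the entire `loads()` call when a single value cannot be deserialized, the error is captured and re-raised later, at the point where the value is actually used. A minimal sketch of that pattern follows; `deserialize_one` and `unwrap` are illustrative helpers, not part of the PR:

```python
class DelayedExceptionRaise:
    """Wraps an exception raised during deserialization so it can be re-raised later."""

    def __init__(self, err):
        self.err = err


def deserialize_one(payload, deserialize="delay-exception"):
    """Deserialize a single payload; optionally defer failures instead of raising."""
    try:
        if not isinstance(payload, bytes):
            raise TypeError(f"expected bytes, got {type(payload).__name__}")
        return payload.decode()
    except Exception as e:
        if deserialize == "delay-exception":
            return DelayedExceptionRaise(e)  # capture now, raise at point of use
        raise


def unwrap(value):
    """Re-raise a deferred error at the point of use."""
    if isinstance(value, DelayedExceptionRaise):
        raise value.err
    return value
```

The payoff is that one bad value in a large message does not poison the healthy values decoded alongside it.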
```diff
@@ -87,19 +142,55 @@ def _decode_default(obj):
                 frames[offset],
                 object_hook=msgpack_decode_default,
                 use_list=False,
-                **msgpack_opts
+                **msgpack_opts,
             )
             offset += 1
             sub_frames = frames[offset : offset + sub_header["num-sub-frames"]]
+            if "callable" in sub_header:
+                if deserialize:
+                    try:
+                        if "pickle" not in deserializers:
+                            raise TypeError(
+                                f"Cannot deserialize {sub_header['callable']}, "
+                                "pickle isn't in deserializers"
+                            )
+                        return loads_function(sub_header["callable"])
+                    except Exception as e:
+                        if deserialize == "delay-exception":
+                            return DelayedExceptionRaise(e)
+                        else:
+                            raise
+                else:
+                    return SerializedCallable(sub_header, sub_frames)
             if deserialize:
                 if "compression" in sub_header:
                     sub_frames = decompress(sub_header, sub_frames)
-                return merge_and_deserialize(
-                    sub_header, sub_frames, deserializers=deserializers
-                )
+                try:
+                    return merge_and_deserialize(
+                        sub_header, sub_frames, deserializers=deserializers
+                    )
+                except Exception as e:
+                    if deserialize == "delay-exception":
+                        return DelayedExceptionRaise(e)
+                    else:
+                        raise
             else:
                 return Serialized(sub_header, sub_frames)
         else:
+            # Notice, even though `msgpack_decode_default()` supports
+            # `__MsgpackList__`, we decode it here explicitly. This way
+            # we can delay the conversion to a regular `list` until it
+            # gets to a worker.
+            if "__MsgpackList__" in obj:
+                if deserialize:
+                    return list(obj["as-tuple"])
+                else:
+                    return MsgpackList(obj["as-tuple"])
+            if "__TaskGraphValue__" in obj:
+                if deserialize:
+                    return obj["data"]
+                else:
+                    return TaskGraphValue(obj["data"])
             return msgpack_decode_default(obj)

     return msgpack.loads(
```

**Review comment** (on lines +168 to +169, the `delay-exception` branch): I am confused about when this is necessary and why it wasn't before. I'm wary of creating new systems like this if we can avoid it.

**Reply:** I think I understand this now that I've seen the […]

**Review comment** (on `if "__MsgpackList__" in obj:`): What is the type of `obj` here? Is […]
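The `__MsgpackList__` branch exists because `loads()` passes `use_list=False` to msgpack, so every decoded sequence comes back as a tuple; lists that must stay lists are wrapped on encode so they can be rebuilt on decode. A stand-alone sketch of that wrap/unwrap step, using plain dict hooks in place of the real msgpack encode/decode defaults (the names mirror the PR, but the codec plumbing is simplified):

```python
class MsgpackList(list):
    """Marks a list that must survive a codec that decodes all sequences as tuples."""


def encode_obj(obj):
    # Before handing a list to the codec, record that it really is a list.
    if isinstance(obj, list):
        return {"__MsgpackList__": True, "as-tuple": tuple(obj)}
    return obj


def decode_obj(obj, deserialize=True):
    # On the way out: either materialize a real list (on a worker), or keep
    # the typed wrapper so the distinction survives further scheduler hops.
    if isinstance(obj, dict) and "__MsgpackList__" in obj:
        if deserialize:
            return list(obj["as-tuple"])
        return MsgpackList(obj["as-tuple"])
    return obj
```

Keeping the wrapper until a worker finally needs a real `list` is what the in-code comment above means by delaying the conversion.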
**Review comment:** Since we are discussing getting rid of `dumps_function`, will these still be needed, or will they go away as well?