RFC: handle serializing objects with cycles using a Serializer object #10170
Conversation
@amitmurthy What's the best way to combine this with BufferedAsyncStream? We need …
As a standalone implementation, … Same thing if we decide to move the implementation of …
I don't need seeking for this, just …
The recv buffer in …
No, as I said, the Serializer does not need to seek or re-read the buffer. It keeps a table mapping offsets to objects, so it can look up already-(de)serialized objects when it sees a back-reference later in the stream. However, some kind of message boundary is needed, because there has to be some limit to what a back-reference can refer to. The end of a "message" (however defined) is when it is OK to clear the back-reference table. Messages can be implemented on top of the stream; they don't have to be part of the stream object. Basically they need to be managed by whoever calls …
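To make the idea concrete, here is a minimal sketch (my own illustration in 0.4-era syntax, not code from this PR) of a back-reference table keyed by stream offset, with an explicit message-boundary reset:

```julia
# Objects are recorded at the offset where they were first (de)serialized,
# back-references look that offset up, and the table is cleared at message
# boundaries so a back-reference cannot refer past the current message.
type BackRefTable
    objects::Dict{Int,Any}   # stream offset => already-(de)serialized object
end
BackRefTable() = BackRefTable(Dict{Int,Any}())

# remember the object whose serialized form started at `offset`
remember!(t::BackRefTable, offset::Int, x) = (t.objects[offset] = x; x)

# a back-reference token names an earlier offset; just look it up
lookup(t::BackRefTable, offset::Int) = t.objects[offset]

# end of a "message": it is now safe to forget all back-references
end_of_message!(t::BackRefTable) = empty!(t.objects)
```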
Force-pushed from 3b9c65c to 3337f0e
I have changed it not to require …
Does not specializing … should the …
We should look at performance. It would be nice if we didn't have to specialize I/O code on every kind of stream. You're right about … What would be another way to avoid recompiling when the same function is sent repeatedly?
Either `Serializer` state or `serialize_cycle` …
More comments:
- the keys of the table would ideally be `WeakRef` objects
- the lazy initialization of the table feels like a premature (de)optimization, since it should almost always be getting created
```diff
-deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s, UInt8)))
-deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s, Int32))
+deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s.io, UInt8)))
+deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s.io, Int32))
```
Would be nice to have annotations on these, since this will fail if `s` is not a `Serializer` object.
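For illustration, the annotated form being suggested might look like this (assuming the state type is the `SerializationState` introduced later in this PR; the annotations themselves are not part of the diff):

```julia
# Constrain `s` so a non-serializer argument fails with a MethodError
# instead of an error deep inside `read`.
deserialize(s::SerializationState, ::Type{Expr})     = deserialize_expr(s, int32(read(s.io, UInt8)))
deserialize(s::SerializationState, ::Type{LongExpr}) = deserialize_expr(s, read(s.io, Int32))
```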
I like this approach. I think this transparently handles 99% of what people want. If you want to share state across the serialization of multiple objects, do you just manually construct a …?
I thought "having the same serializer state" was an implied given. We could now even have `Serializer` write a header to verify this. The lifetime bound tends to be pretty simple (likely just the same as the underlying stream).
The lifetime bound is important when you think of long-held streams (think a …). As suggested by Jeff, this currently handles the case of self-referencing objects; the optimization for long-held streams and for sending the same object repeatedly (e.g., darrays being sent across multiple times) can be discussed separately.
But I guess this can merge and deal with the other optimizations later. As it stands, this is very simply a drop-in replacement of the existing deserializer and can be merged as such.
That's right, this is not intended to optimize distributed computing cases, just handle cycles.
Force-pushed from afcdd40 to 6d0c1a9
Ok, I checked #7893 with this, and the performance is simply awful. Will investigate.
Force-pushed from a446338 to 5877c37
Ok, the performance is no longer awful. Just a few percent slower than what we have now.
Force-pushed from 8893f94 to 1ef0f32
Status of performance with a 240MB DataFrame:
On master: …
On this branch: …
Ok, back to parity with master for lots of pointer-free objects: …
But, for large graphs of mutable objects the performance can be really bad. We might need an option to disable cycle support.
That does, unfortunately, seem a bit like a "be broken, maybe" option :-\
Well, I certainly think handling cycles should be enabled by default. This would only be an escape hatch in case somebody's code suddenly takes 20x longer unnecessarily.
Force-pushed from be0029d to 9a03c9b
Force-pushed from 9a03c9b to 78b999f
Time for a final review. With this plus @jakebolewski's changes, things are generally faster than 0.3 even with cycle handling. Since the module is called …
👍
Master: …
This branch: …
```
@@ -351,6 +351,17 @@ end

copy(o::ObjectIdDict) = ObjectIdDict(o)

# SerializationState type needed as soon as ObjectIdDict is available

type SerializationState{I<:IO}
```
Shouldn't the `known_lambda_data` state be moved in here too?
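For context, a hedged sketch of what such a state type could look like in 0.4-era syntax; only the `io` field is visible in the hunk above, and the other fields (and whether `known_lambda_data` belongs here, as asked) are my guesses, not the PR's actual definition:

```julia
type SerializationState{I<:IO}
    io::I                  # the underlying stream; raw reads/writes go through s.io
    counter::Int           # next back-reference id to hand out
    table::ObjectIdDict    # object => id, for already-(de)serialized objects
    # known_lambda_data could plausibly live here too, per the comment above
    SerializationState(io::I) = new(io, 0, ObjectIdDict())
end

# convenience outer constructor
SerializationState{I<:IO}(io::I) = SerializationState{I}(io)
```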
Been meaning to write this up for a while.

Serializing now uses a `Serializer` object that keeps the needed state to handle cycles. In some cases it also handles repeated references to the same object. As in my last attempt, when defining `serialize` you use … and when defining `deserialize` you use …
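The two snippets that originally followed "you use" did not survive here. As a rough, hedged sketch of the pattern (the type `MyType` and its fields are hypothetical, and type tagging / cycle bookkeeping are omitted): recursive (de)serialization still goes through the serializer `s`, while raw byte I/O goes through its underlying stream `s.io`, as in the diff shown earlier.

```julia
# Hypothetical user type, for illustration only.
type MyType
    n::Int32
    data::Vector{Float64}
end

# When defining serialize: recurse via `s`, write raw bytes via `s.io`.
function serialize(s, x::MyType)
    write(s.io, x.n)
    serialize(s, x.data)
end

# When defining deserialize: likewise, read raw bytes from `s.io`.
function deserialize(s, ::Type{MyType})
    n = read(s.io, Int32)
    MyType(n, deserialize(s))
end
```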