
RFC: handle serializing objects with cycles using a Serializer object #10170

Merged: 1 commit merged into master from jb/serializer, Jun 2, 2015

Conversation

@JeffBezanson (Member, Author)

Been meaning to write this up for a while.

Serializing now uses a Serializer object that keeps the needed state to handle cycles. In some cases it also handles repeated references to the same object.

As in my last attempt, when defining serialize you use

    serialize_cycle(s, obj) && return

and when defining deserialize you use

    deserialize_cycle(s, newobj, pos)
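For concreteness, here is a minimal sketch of a custom definition under this scheme. The Node type and the serialize_type helper are hypothetical, and since the pos argument is dropped later in this thread, the two-argument deserialize_cycle form is shown:

    # Hypothetical self-referential type; illustrative only.
    type Node
        value::Int
        next::Any            # may point back at an earlier Node, forming a cycle
        Node(v) = new(v)
    end

    function serialize(s, n::Node)
        serialize_cycle(s, n) && return     # already seen: back-reference written
        serialize_type(s, Node)             # assumed tag-writing helper
        serialize(s, n.value)
        serialize(s, isdefined(n, :next) ? n.next : nothing)
    end

    function deserialize(s, ::Type{Node})
        n = Node(deserialize(s))
        deserialize_cycle(s, n)             # register n before fields that may refer back
        nxt = deserialize(s)
        nxt !== nothing && (n.next = nxt)
        return n
    end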

@JeffBezanson (Member, Author)

@amitmurthy What's the best way to combine this with BufferedAsyncStream? We need position in order to do back-references.

@amitmurthy (Contributor)

As a standalone implementation, BufferedAsyncStream can be modified to use a seekable buffer. It seemed more natural to use a PipeBuffer since it was an AsyncStream.

The same applies if we decide to move the implementation of BufferedAsyncStream into all AsyncStream types: AsyncStreams are not seekable.

@JeffBezanson (Member, Author)

I don't need seeking for this, just position. We could keep a running byte counter in [Buffered]AsyncStreams and use offsets relative to that for each message.
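Something like the following wrapper would give such a running count; the CountingStream name and its fields are made up for this sketch:

    import Base: write, position

    # Counts bytes written, so back-references can be offsets from stream start.
    type CountingStream{T<:IO} <: IO
        io::T
        nwritten::Int
    end
    CountingStream(io::IO) = CountingStream(io, 0)

    function write(s::CountingStream, b::UInt8)
        s.nwritten += 1
        write(s.io, b)
    end
    position(s::CountingStream) = s.nwritten     # running offset, no seeking needed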

@amitmurthy (Contributor)

The recv buffer in TCPSocket, as well as the send buffer in BufferedAsyncStream, is bypassed for large arrays. The send buffer is also flushed if its size becomes greater than 100K. Wouldn't that be a problem for a Serializer object? They are also not aware of any "message" construct, just a stream of bytes.

@JeffBezanson (Member, Author)

No, as I said, the Serializer does not need to seek or re-read the buffer. It keeps a table mapping offsets to objects, so it can look up already-(de)serialized objects when it sees a back-reference later in the stream.

However, some kind of message boundary is needed, because there has to be some limit to what a back-reference can refer to. The end of a "message" (however defined) is when it is OK to clear the back-reference table. Messages can be implemented on top of the stream; they don't have to be part of the stream object. Basically they need to be managed by whoever calls serialize and deserialize.
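Roughly, the serialize-side half of that table can work like this (a sketch; the tag byte and field names are assumptions, not the PR's actual constants):

    const BACKREF_TAG = 0x1f                # hypothetical wire tag

    # First sighting: record an id and serialize normally (return false).
    # Repeat sighting: emit a back-reference instead (return true).
    function serialize_cycle(s, x)
        id = get(s.table, x, -1)            # assumed: s.table::ObjectIdDict, s.counter::Int
        if id != -1
            write(s.io, BACKREF_TAG)
            write(s.io, Int64(id))
            return true
        end
        s.table[x] = s.counter
        s.counter += 1
        return false
    end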

@amitmurthy (Contributor)

  • Since serialize/deserialize needs to work across all types of IO, and async streams do not have a concept of a position or reset, we could add counters to Serializer itself and track bytes received/sent via the read/write calls.
  • reset on a Serializer object will clear all state. We should also have an auto-reset member, to avoid memory leaks from user code that inadvertently never calls reset.
  • An issue with using a Serializer as the only interface to serialize/deserialize would be that mixing in-memory buffers and tcpsockets/file io could become complicated. For example, currently it is straightforward to
    • serialize some objects to an IOBuffer (to avoid a large number of syscalls)
    • takebuffer, write to the fd, then serialize a large object directly to the fd
    • serialize some more to the IOBuffer
    • finally takebuffer, write to the fd, and flush
  • We could probably layer the serialize(s::Serializer, ....) methods above the regular serialize(s::IO, ....) methods (see the sketch after this list). Only user code that needs the specific optimizations provided by a Serializer object would opt in, using a copy constructor (to set different fds) so that the copies track the same counters where applicable.
  • There could be other scenarios where serializing mixes in-memory buffers and system IO calls. I cannot think of a concrete one, and I am not sure two interfaces are a good idea (e.g., bugs due to serializing via a Serializer and restoring directly from the fd), but we probably should support both.
  • All IO, wherever applicable, should support a send buffer. I'll start by submitting a PR (as discussed towards the end of optimized send - direct writes for large bitstype arrays #10073) merging the functionality of BufferedAsyncStream into all AsyncStreams.
  • Serializer could also implement the functionality of BufferedAsyncStream (for serialize only). AsyncStreams already have a recv buffer. Not really in favor of this though.
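The layering mentioned above could be as simple as this sketch (assuming a Serializer(io) constructor):

    # Plain-IO entry points construct a fresh Serializer and delegate to it,
    # so existing callers keep working unchanged.
    serialize(io::IO, x) = serialize(Serializer(io), x)
    deserialize(io::IO)  = deserialize(Serializer(io))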

@JeffBezanson (Member, Author)

I have changed it not to require position. The parallel test now passes for me, so we could use this implementation and leave further issues for later. Currently you can still simply call serialize(io, x) and it will work no worse than before, except that it also handles cycles within x.
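For example, a self-referential array should now round-trip (a sketch):

    a = Any[1]
    push!(a, a)                  # now a[2] === a
    io = IOBuffer()
    serialize(io, a)
    seekstart(io)
    b = deserialize(io)
    @assert b[2] === b           # the cycle is preserved in the copy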

@JeffBezanson (Member, Author)

cc @StefanKarpinski @vtjnash

@vtjnash (Member) commented Feb 20, 2015

does not specializing Serializer on the type of io have any performance implications relative to the previous incarnation?

should serialize_cycle handle the isimmutable check instead of the callsites? also, it perhaps shouldn't branch on isimmutable as you do now, but on isbits (for example, an immutable containing an array containing itself seems like it might be suboptimal, no?)
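A sketch of that case (Wrap is a hypothetical type): an immutable need not be isbits, so branching on isimmutable would skip cycle tracking for objects that can still sit on a cycle.

    immutable Wrap               # 0.4-era syntax
        a::Vector{Any}
    end

    v = Any[]
    w = Wrap(v)
    push!(v, w)                  # w.a[1] === w: a cycle through an immutable
    isbits(Wrap)                 # false, since Wrap holds a reference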

the lambda_numbers and known_lambda_data dictionaries should go away

@JeffBezanson (Member, Author)

We should look at performance. It would be nice if we didn't have to specialize I/O code on every kind of stream.

You're right about isbits.

What would be another way to avoid recompiling when the same function is sent repeatedly?

@vtjnash (Member) commented Feb 20, 2015

Either Serializer state or serialize_cycle

@vtjnash (Member) commented Feb 20, 2015

More comments: the keys of the table would ideally be WeakRef objects

The lazy initialization of the table feels like a premature (de)optimization, since it should almost always be getting created
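A sketch of the WeakRef-keyed idea: with weak keys, the table alone does not keep an already-serialized object alive.

    table = WeakKeyDict{Any,Int}()
    obj = Any[1, 2, 3]
    table[obj] = 0               # record without pinning obj
    obj = nothing
    gc()                         # the entry is now eligible for collection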

-deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s, UInt8)))
-deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s, Int32))
+deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s.io, UInt8)))
+deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s.io, Int32))
Inline review comment (Member):

Would be nice to have annotations on these since this will fail if s is not a Serializer object.

@StefanKarpinski (Member)

I like this approach. I think this transparently handles 99% of what people want. If you want to share state across the serialization of multiple objects, do you just manually construct a Serializer object and then serialize to that instead of to a bare IO object? The catch is that the receiving side would then need a Serializer object with exactly the same lifetime, which seems a little tricky to orchestrate. But maybe that's an advanced use case, and we can assume anyone doing that knows what they're doing.

@vtjnash (Member) commented Feb 20, 2015

I thought "having the same serializer state" was an implied given. We could now even have Serializer write a header to verify this. The lifetime bound tends to be pretty simple (likely just the same as the underlying stream)

@amitmurthy (Contributor)

The lifetime bound is important when you think of long-held streams (think multi.jl). This is when having the same serializer state on either side becomes important, and we will need some sort of periodic reset (to avoid unnecessarily holding references).

A reset tag sent on the stream (whenever either side calls a reset method on the Serializer) should handle it.
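A sketch of that reset-tag idea (the tag value and names are made up here):

    const RESET_TAG = 0xfe                  # hypothetical wire tag

    # Sender: emit the tag, then drop local back-reference state.
    function reset(s)
        write(s.io, RESET_TAG)
        empty!(s.table)
    end

    # Receiver: on reading RESET_TAG, mirror the reset before continuing.
    handle_tag(s, tag::UInt8) = tag == RESET_TAG && empty!(s.table)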

As suggested by Jeff, this currently handles only the case of self-referencing objects; the optimizations for long-held streams and for sending the same object repeatedly (e.g., darrays being sent across multiple times) can be discussed separately.

@vtjnash (Member) commented Feb 21, 2015

a reset tag sounds very finicky. i think it would do much better to periodically sweep all of the WeakRef objects in the table (perhaps via a finalizer callback from the gc), and serialize to the stream a list of any that can be discarded on the other end.

but i guess this can merge and we can deal with the other optimizations later. as it stands, this is very simply a drop-in replacement for the existing deserializer and can be merged as such

@JeffBezanson (Member, Author)

That's right, this is not intended to optimize distributed computing cases, just handle cycles.

@JeffBezanson (Member, Author)

Ok I checked #7893 with this, and the performance is simply awful. Will investigate.

@JeffBezanson (Member, Author)

Ok the performance is no longer awful. Just a few percent slower than what we have now.

@JeffBezanson (Member, Author)

Status of performance with a 240MB DataFrame:

On master:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"))
elapsed time: 13.120145274 seconds (1158 MB allocated, 2.07% gc time in 6 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 7.78289567 seconds (643 MB allocated, 5.31% gc time in 1 pauses with 1 full sweep)

on this branch:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"))
elapsed time: 14.836951264 seconds (1162 MB allocated, 2.04% gc time in 7 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 9.000394767 seconds (650 MB allocated, 4.63% gc time in 1 pauses with 1 full sweep)

@JeffBezanson (Member, Author)

Ok, back to parity with master for lots of pointer-free objects:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
elapsed time: 12.809747142 seconds (1161 MB allocated, 2.35% gc time in 7 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 7.926651548 seconds (650 MB allocated, 5.35% gc time in 1 pauses with 1 full sweep)

But, for large graphs of mutable objects the performance can be really bad. We might need an option to disable cycle support.

@StefanKarpinski (Member)

That does, unfortunately, seem a bit like a "be broken, maybe" option :-\

@JeffBezanson (Member, Author)

Well, I certainly think handling cycles should be enabled by default. This would only be an escape hatch in case somebody's code suddenly takes 20x longer unnecessarily.

@JeffBezanson (Member, Author)

Time for a final review. With this plus @jakebolewski's changes, things are generally faster than 0.3 even with cycle handling.

Since the module is called Serializer I called the type SerializationState. Ideas for anything shorter are welcome.
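For reference, a sketch of the type's shape (the field set here is inferred from this thread, not copied from the commit):

    # Parameterized on the IO type so field access stays concrete.
    type SerializationState{I<:IO}
        io::I
        counter::Int             # ids handed out as objects are first seen
        table::ObjectIdDict      # back-reference table
        SerializationState(io::I) = new(io, 0, ObjectIdDict())
    end
    SerializationState(io::IO) = SerializationState{typeof(io)}(io)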

@jakebolewski (Member)

👍

@JeffBezanson (Member, Author)

Master:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
  13.414 seconds      (29950 k allocations: 1309 MB, 5.14% gc time)

julia> @time serialize(open("out.jld","w"), d);
   4.082 seconds      (10796 k allocations: 188 MB)

this branch:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
  10.633 seconds      (30089 k allocations: 1314 MB, 5.91% gc time)

julia> @time serialize(open("out.jld","w"), d);
   4.494 seconds      (11034 k allocations: 197 MB)

JeffBezanson added a commit that referenced this pull request Jun 2, 2015
RFC: handle serializing objects with cycles using a Serializer object
@JeffBezanson merged commit ab7224b into master Jun 2, 2015
@StefanKarpinski mentioned this pull request Jun 2, 2015
@tkelman deleted the jb/serializer branch June 2, 2015 17:54
@@ -351,6 +351,17 @@ end
 copy(o::ObjectIdDict) = ObjectIdDict(o)
+
+# SerializationState type needed as soon as ObjectIdDict is available
+
+type SerializationState{I<:IO}
Inline review comment (Member):

shouldn't the known_lambda_data state be moved in here too?
