
RFC: handle serializing objects with cycles using a Serializer object #10170

Merged: 1 commit merged into master from jb/serializer, Jun 2, 2015

Conversation

@JeffBezanson (Member, Author)

Been meaning to write this up for a while.

Serializing now uses a Serializer object that keeps the needed state to handle cycles. In some cases it also handles repeated references to the same object.

As in my last attempt, when defining serialize you use

    serialize_cycle(s, obj) && return

and when defining deserialize you use

    deserialize_cycle(s, newobj, pos)
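For concreteness, here is a minimal sketch of a custom definition under this scheme. The Node type and the serialize_type helper are hypothetical, and since the pos argument is dropped later in this thread, the two-argument deserialize_cycle form is shown:

    # Hypothetical self-referential type; illustrative only.
    type Node
        value::Int
        next::Any            # may point back at an earlier Node, forming a cycle
        Node(v) = new(v)
    end

    function serialize(s, n::Node)
        serialize_cycle(s, n) && return     # already seen: back-reference written
        serialize_type(s, Node)             # assumed tag-writing helper
        serialize(s, n.value)
        serialize(s, isdefined(n, :next) ? n.next : nothing)
    end

    function deserialize(s, ::Type{Node})
        n = Node(deserialize(s))
        deserialize_cycle(s, n)             # register n before fields that may refer back
        nxt = deserialize(s)
        nxt !== nothing && (n.next = nxt)
        return n
    end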

@JeffBezanson (Member, Author)

@amitmurthy What's the best way to combine this with BufferedAsyncStream? We need position in order to do back-references.

@amitmurthy (Contributor)

As a standalone implementation, BufferedAsyncStream can be modified to use a seekable buffer. It seemed more natural to use a PipeBuffer since it was an AsyncStream.

The same applies if we decide to move the implementation of BufferedAsyncStream into all AsyncStream types: AsyncStreams are not seekable.

@JeffBezanson (Member, Author)

I don't need seeking for this, just position. We could keep a running byte counter in [Buffered]AsyncStreams and use offsets relative to that for each message.
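Something like the following wrapper would give such a running count; the CountingStream name and its fields are made up for this sketch:

    import Base: write, position

    # Counts bytes written, so back-references can be offsets from stream start.
    type CountingStream{T<:IO} <: IO
        io::T
        nwritten::Int
    end
    CountingStream(io::IO) = CountingStream(io, 0)

    function write(s::CountingStream, b::UInt8)
        s.nwritten += 1
        write(s.io, b)
    end
    position(s::CountingStream) = s.nwritten     # running offset, no seeking needed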

@amitmurthy (Contributor)

The recv buffer in TCPSocket, as well as the send buffer in BufferedAsyncStream, is bypassed for large arrays. The send buffer is also flushed if its size becomes greater than 100K. Wouldn't that be a problem for a Serializer object? They are also not aware of any "message" construct, just a stream of bytes.

@JeffBezanson (Member, Author)

No, as I said, the Serializer does not need to seek or re-read the buffer. It keeps a table mapping offsets to objects, so it can look up already-(de)serialized objects when it sees a back-reference later in the stream.

However, some kind of message boundary is needed, because there has to be some limit to what a back-reference can refer to. The end of a "message" (however defined) is when it is OK to clear the back-reference table. Messages can be implemented on top of the stream; they don't have to be part of the stream object. Basically they need to be managed by whoever calls serialize and deserialize.
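Roughly, the serialize-side half of that table can work like this (a sketch; the tag byte and field names are assumptions, not the PR's actual constants):

    const BACKREF_TAG = 0x1f                # hypothetical wire tag

    # First sighting: record an id and serialize normally (return false).
    # Repeat sighting: emit a back-reference instead (return true).
    function serialize_cycle(s, x)
        id = get(s.table, x, -1)            # assumed: s.table::ObjectIdDict, s.counter::Int
        if id != -1
            write(s.io, BACKREF_TAG)
            write(s.io, Int64(id))
            return true
        end
        s.table[x] = s.counter
        s.counter += 1
        return false
    end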

@amitmurthy (Contributor)

  • Since serialize/deserialize needs to work across all types of IO, and async streams do not have a concept of a position or reset, we could add counters to Serializer itself and track bytes received/sent via the read/write calls.
  • reset on a Serializer object will clear all state. We should also have an auto-reset member, to avoid memory leaks from user code that inadvertently never calls reset.
  • An issue with using a Serializer as the only interface to serialize/deserialize would be that mixing in-memory buffers and tcpsockets/file io could become complicated. For example, currently it is straightforward to
    • serialize some objects to an IOBuffer (to avoid a large number of syscalls)
    • takebuffer, write to the fd, then serialize a large object directly to the fd
    • serialize some more to the IOBuffer
    • finally takebuffer, write to the fd, and flush
  • We could probably layer the serialize(s::Serializer, ....) methods above the regular serialize(s::IO, ....) methods (see the sketch after this list). Only user code that needs the specific optimizations provided by a Serializer object would opt in, using a copy constructor (to set different fds) so that the copies track the same counters where applicable.
  • There could be other scenarios where serializing mixes in-memory buffers and system IO calls. I cannot think of a concrete one, and I am not sure two interfaces are a good idea (e.g., bugs due to serializing via a Serializer and restoring directly from the fd), but we probably should support both.
  • All IO, wherever applicable, should support a send buffer. I'll start by submitting a PR (as discussed towards the end of optimized send - direct writes for large bitstype arrays #10073) merging the functionality of BufferedAsyncStream into all AsyncStreams.
  • Serializer could also implement the functionality of BufferedAsyncStream (for serialize only). AsyncStreams already have a recv buffer. Not really in favor of this though.
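The layering mentioned above could be as simple as this sketch (assuming a Serializer(io) constructor):

    # Plain-IO entry points construct a fresh Serializer and delegate to it,
    # so existing callers keep working unchanged.
    serialize(io::IO, x) = serialize(Serializer(io), x)
    deserialize(io::IO)  = deserialize(Serializer(io))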

@JeffBezanson (Member, Author)

I have changed it not to require position. The parallel test now passes for me, so we could use this implementation and leave further issues for later. Currently you can still simply call serialize(io, x) and it will work no worse than before, except that it also handles cycles within x.
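For example, a self-referential array should now round-trip (a sketch):

    a = Any[1]
    push!(a, a)                  # now a[2] === a
    io = IOBuffer()
    serialize(io, a)
    seekstart(io)
    b = deserialize(io)
    @assert b[2] === b           # the cycle is preserved in the copy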

@JeffBezanson (Member, Author)

cc @StefanKarpinski @vtjnash

@vtjnash (Member) commented Feb 20, 2015

does not specializing Serializer on the type of io have any performance implications relative to the previous incarnation?

should serialize_cycle handle the isimmutable check instead of the callsites? also, it perhaps shouldn't branch on isimmutable as you do now, but on isbits (for example, an immutable containing an array containing itself seems like it might be suboptimal, no?)
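A sketch of that case (Wrap is a hypothetical type): an immutable need not be isbits, so branching on isimmutable would skip cycle tracking for objects that can still sit on a cycle.

    immutable Wrap               # 0.4-era syntax
        a::Vector{Any}
    end

    v = Any[]
    w = Wrap(v)
    push!(v, w)                  # w.a[1] === w: a cycle through an immutable
    isbits(Wrap)                 # false, since Wrap holds a reference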

the lambda_numbers and known_lambda_data dictionaries should go away

@JeffBezanson (Member, Author)

We should look at performance. It would be nice if we didn't have to specialize I/O code on every kind of stream.

You're right about isbits.

What would be another way to avoid recompiling when the same function is sent repeatedly?

@vtjnash (Member) commented Feb 20, 2015

Either Serializer state or serialize_cycle

@vtjnash (Member) commented Feb 20, 2015

More comments: the keys of the table would ideally be WeakRef objects

The lazy initialization of the table feels like a premature (de)optimization, since it should almost always be getting created
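A sketch of the WeakRef-keyed idea: with weak keys, the table alone does not keep an already-serialized object alive.

    table = WeakKeyDict{Any,Int}()
    obj = Any[1, 2, 3]
    table[obj] = 0               # record without pinning obj
    obj = nothing
    gc()                         # the entry is now eligible for collection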

-deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s, UInt8)))
-deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s, Int32))
+deserialize(s, ::Type{Expr}) = deserialize_expr(s, int32(read(s.io, UInt8)))
+deserialize(s, ::Type{LongExpr}) = deserialize_expr(s, read(s.io, Int32))
Inline review comment (Member):

Would be nice to have annotations on these since this will fail if s is not a Serializer object.

@StefanKarpinski (Member)

I like this approach. I think this transparently handles 99% of what people want. If you want to share state across the serialization of multiple objects, do you just manually construct a Serializer object and then serialize to that instead of to a bare IO object? The catch is that the receiving side would then need a Serializer object with exactly the same lifetime, which seems a little tricky to orchestrate. But maybe that's an advanced use case, and we can assume anyone doing that knows what they're doing.

@vtjnash (Member) commented Feb 20, 2015

I thought "having the same serializer state" was an implied given. We could now even have Serializer write a header to verify this. The lifetime bound tends to be pretty simple (likely just the same as the underlying stream)

@amitmurthy (Contributor)

The lifetime bound is important when you think of long-held streams (think multi.jl). This is when having the same serializer state on either side becomes important, and we will need some sort of periodic reset (to avoid unnecessarily holding references).

A reset tag sent on the stream (whenever either side calls a reset method on the Serializer) should handle it.
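A sketch of that reset-tag idea (the tag value and names are made up here):

    const RESET_TAG = 0xfe                  # hypothetical wire tag

    # Sender: emit the tag, then drop local back-reference state.
    function reset(s)
        write(s.io, RESET_TAG)
        empty!(s.table)
    end

    # Receiver: on reading RESET_TAG, mirror the reset before continuing.
    handle_tag(s, tag::UInt8) = tag == RESET_TAG && empty!(s.table)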

As suggested by Jeff, this currently handles only the case of self-referencing objects; the optimizations for long-held streams and for sending the same object repeatedly (e.g., darrays being sent across multiple times) can be discussed separately.

@vtjnash (Member) commented Feb 21, 2015

a reset tag sounds very finicky. i think it would do much better to periodically sweep all of the WeakRef objects in the table (perhaps via a finalizer callback from the gc), and serialize to the stream a list of any that can be discarded on the other end.

but i guess this can merge and we can deal with the other optimizations later. as it stands, this is very simply a drop-in replacement for the existing deserializer and can be merged as such

@JeffBezanson (Member, Author)

That's right, this is not intended to optimize distributed computing cases, just handle cycles.

@JeffBezanson (Member, Author)

Ok I checked #7893 with this, and the performance is simply awful. Will investigate.

@JeffBezanson (Member, Author)

Ok the performance is no longer awful. Just a few percent slower than what we have now.

@JeffBezanson (Member, Author)

Status of performance with a 240MB DataFrame:

On master:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"))
elapsed time: 13.120145274 seconds (1158 MB allocated, 2.07% gc time in 6 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 7.78289567 seconds (643 MB allocated, 5.31% gc time in 1 pauses with 1 full sweep)

on this branch:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"))
elapsed time: 14.836951264 seconds (1162 MB allocated, 2.04% gc time in 7 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 9.000394767 seconds (650 MB allocated, 4.63% gc time in 1 pauses with 1 full sweep)

@JeffBezanson (Member, Author)

Ok, back to parity with master for lots of pointer-free objects:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
elapsed time: 12.809747142 seconds (1161 MB allocated, 2.35% gc time in 7 pauses with 4 full sweep)

julia> @time serialize(open("out.jld","w"), d);
elapsed time: 7.926651548 seconds (650 MB allocated, 5.35% gc time in 1 pauses with 1 full sweep)

But, for large graphs of mutable objects the performance can be really bad. We might need an option to disable cycle support.

@StefanKarpinski (Member)

That does, unfortunately, seem a bit like a "be broken, maybe" option :-\

@JeffBezanson (Member, Author)

Well, I certainly think handling cycles should be enabled by default. This would only be an escape hatch in case somebody's code suddenly takes 20x longer unnecessarily.

@JeffBezanson (Member, Author)

Time for a final review. With this plus @jakebolewski's changes, things are generally faster than 0.3 even with cycle handling.

Since the module is called Serializer I called the type SerializationState. Ideas for anything shorter are welcome.
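For reference, a sketch of the type's shape (the field set here is inferred from this thread, not copied from the commit):

    # Parameterized on the IO type so field access stays concrete.
    type SerializationState{I<:IO}
        io::I
        counter::Int             # ids handed out as objects are first seen
        table::ObjectIdDict      # back-reference table
        SerializationState(io::I) = new(io, 0, ObjectIdDict())
    end
    SerializationState(io::IO) = SerializationState{typeof(io)}(io)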

@jakebolewski (Member)

👍

@JeffBezanson (Member, Author)

Master:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
  13.414 seconds      (29950 k allocations: 1309 MB, 5.14% gc time)

julia> @time serialize(open("out.jld","w"), d);
   4.082 seconds      (10796 k allocations: 188 MB)

this branch:

julia> @time d = deserialize(open("/home/jeff/src/julia/ioe.jld"));
  10.633 seconds      (30089 k allocations: 1314 MB, 5.91% gc time)

julia> @time serialize(open("out.jld","w"), d);
   4.494 seconds      (11034 k allocations: 197 MB)

JeffBezanson added a commit that referenced this pull request Jun 2, 2015
RFC: handle serializing objects with cycles using a Serializer object
@JeffBezanson merged commit ab7224b into master Jun 2, 2015
@StefanKarpinski mentioned this pull request Jun 2, 2015
@tkelman deleted the jb/serializer branch June 2, 2015 17:54
@@ -351,6 +351,17 @@ end
 copy(o::ObjectIdDict) = ObjectIdDict(o)
+
+# SerializationState type needed as soon as ObjectIdDict is available
+
+type SerializationState{I<:IO}
Inline review comment (Member):

shouldn't the known_lambda_data state be moved in here too?
