[IMPROVED] Reuse buffers in Conn.processOpError #341

Merged
merged 1 commit from charlievieth:master into nats-io:master on Mar 1, 2018

Conversation

@charlievieth (Contributor)

This significantly reduces memory usage when the nats Conn enters an error loop.

In a Cloud Foundry test (route-emitter/cmd/route-emitter), Conn.processOpError was responsible for ~95% of memory allocations because it allocated a new bufio.Writer each time it was called. This commit changes Conn.processOpError to reuse and reset its existing bufio.Writer.
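
At a high level, the change swaps the per-error allocation for a reset of the existing buffers. The standalone sketch below illustrates the two patterns; the conn type and method names are hypothetical stand-ins for the relevant parts of nats.Conn, not the actual diff.

package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// conn is a hypothetical stand-in for the relevant fields of nats.Conn;
// it is not the actual library type.
type conn struct {
	bw      *bufio.Writer
	pending *bytes.Buffer
}

// errorPathOld mirrors the original pattern: a brand new writer, sized to
// ReconnectBufSize (8MB by default), is allocated on every trip through
// the error path.
func (c *conn) errorPathOld(reconnectBufSize int) {
	c.pending = &bytes.Buffer{}
	c.bw = bufio.NewWriterSize(c.pending, reconnectBufSize)
}

// errorPathNew mirrors the reuse pattern of this change: reset the existing
// buffers instead of reallocating them.
func (c *conn) errorPathNew() {
	if c.pending == nil {
		c.pending = new(bytes.Buffer)
	} else {
		c.pending.Reset()
	}
	c.bw.Reset(c.pending)
}

func main() {
	c := &conn{bw: bufio.NewWriterSize(bytes.NewBuffer(nil), 32*1024)}
	c.errorPathNew()
	c.bw.WriteString("PUB foo 2\r\nhi\r\n")
	c.bw.Flush()
	fmt.Printf("pending now holds %d bytes\n", c.pending.Len())
}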

Below is the output from go tool pprof implicating Conn.processOpError:

File: route-emitter.test
Type: inuse_space
Time: Feb 17, 2018 at 5:54pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5 -cum
Showing nodes accounting for 560MB, 94.65% of 591.68MB total
Dropped 30 nodes (cum <= 2.96MB)
Showing top 5 nodes out of 11
      flat  flat%   sum%        cum   cum%
         0     0%     0%   563.50MB 95.24%  github.com/nats-io/nats.(*Conn).readLoop
     560MB 94.65% 94.65%      560MB 94.65%  github.com/nats-io/nats.(*Conn).processOpErr
         0     0% 94.65%    25.67MB  4.34%  io.(*multiWriter).Write
         0     0% 94.65%    25.67MB  4.34%  io.Copy
         0     0% 94.65%    25.67MB  4.34%  io.copyBuffer

Output of go tool pprof after applying this patch (note the absence of nats-io frames and the overall reduction in memory):

File: route-emitter.test
Type: inuse_space
Time: Feb 17, 2018 at 6:54pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5 -cum
Showing nodes accounting for 0, 0% of 31.14MB total
Showing top 5 nodes out of 90
      flat  flat%   sum%        cum   cum%
         0     0%     0%    21.88MB 70.28%  io.(*multiWriter).Write
         0     0%     0%    21.88MB 70.28%  io.Copy
         0     0%     0%    21.88MB 70.28%  io.copyBuffer
         0     0%     0%    21.88MB 70.28%  os/exec.(*Cmd).Start.func1
         0     0%     0%    21.88MB 70.28%  os/exec.(*Cmd).writerDescriptor.func1

I would have provided the output of running the benchmarks in nats-io/go-nats/test, but they do not appear to hit this code path. That said, I'd be happy to send them if you think they'd be useful.

Commit e23ea03: This significantly reduces memory usage when the nats Conn enters an error loop.
@coveralls

Coverage Status

Coverage increased (+0.006%) to 94.655% when pulling e23ea03 on charlievieth:master into c7e16c2 on nats-io:master.

@charlievieth (Contributor, Author)

Oops, this commit (nats-io/go-nats@5406b8c: nats: simplify error check in Conn.connect) briefly leaked into this PR because it was on my master branch. It has been removed, so only nats-io/go-nats@e23ea03 is part of this PR.

@kozlovic (Member) left a comment

This needs a bit more investigation, and regardless, if we make a change here we need to take the ReconnectBufSize value into consideration.

Inline comment on the following lines of the diff:

nc.bw = bufio.NewWriterSize(nc.pending, nc.Opts.ReconnectBufSize)
// Reset pending buffers before reconnecting.
if nc.pending == nil {
	nc.pending = new(bytes.Buffer)

@kozlovic (Member)

This does not account for nc.Opts.ReconnectBufSize, and it is worrisome that the tests still pass; that suggests this path is not properly tested.
More importantly, in what situation would this be called so many times, and so fast, that the GC would not collect those buffers? Maybe we need to better understand why this is a problem in the first place.

@charlievieth (Contributor, Author)

@kozlovic thanks for the feedback!

The tests likely passed because bufio.NewWriterSize(W, N) does nothing to limit the amount of data written to the io.Writer W; N is only used to size the internal buffer. Properly limiting the amount of data written to W would need to be handled by W itself: implementing a LimitedWriter (analogous to io.LimitedReader) and wrapping the buffer assigned to nc.pending would work, but this would likely require updating the error handling around reconnects.
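
For illustration only, a LimitedWriter along those lines might look like the sketch below (the type, its reject-outright semantics, and the error value are hypothetical, not go-nats API):

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// errLimitExceeded is a hypothetical error for writes that would blow the budget.
var errLimitExceeded = errors.New("write limit exceeded")

// LimitedWriter writes to W but rejects writes once the byte budget N is
// exhausted, mirroring the shape of io.LimitedReader.
type LimitedWriter struct {
	W io.Writer // underlying writer (e.g. the pending bytes.Buffer)
	N int64     // bytes remaining in the budget
}

func (l *LimitedWriter) Write(p []byte) (int, error) {
	if int64(len(p)) > l.N {
		return 0, errLimitExceeded
	}
	n, err := l.W.Write(p)
	l.N -= int64(n)
	return n, err
}

func main() {
	var pending bytes.Buffer
	lw := &LimitedWriter{W: &pending, N: 8} // tiny budget for demonstration
	fmt.Println(lw.Write([]byte("12345")))  // 5 <nil>
	fmt.Println(lw.Write([]byte("678910"))) // 0 write limit exceeded
}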

The reason this was spinning so hard and generating loads of garbage is that we use Ginkgo for our tests, and it encourages some interesting patterns, like spinning up a NATS listener for each test regardless of whether it has anything to connect to. The listener is spun down at the end of each test, but the connection state at that point is anyone's guess. Because of the churn Ginkgo introduces, it tends to highlight situations where applications do not perform well when erroring.

That said, this is an edge case that only occurs when erroring in a tight loop in test code, but an 8MB memory allocation per reconnect is still a pretty heavy hit (regardless of the efficacy of Go's GC).

Thanks again for reviewing, and please let me know if there is anything I can do to help.

@kozlovic (Member)

Yes, the underlying bytes buffer never limits the amount of data that can be written. We actually check the length in the publish() call to enforce that limit (this is to avoid unbounded growth).
Let me investigate a bit more to see why the original code would need so much memory, but otherwise I think your changes are good. Thanks!

@kozlovic (Member) commented Mar 1, 2018

Sorry for the delay. @derekcollison, would you want to have a look? I think the change is OK.

@kozlovic (Member) commented Mar 1, 2018

@charlievieth Would you want to have a go at what @derekcollison commented on? If you don't have the time, I can try to update the PR (or we can merge and then I will submit a new PR). Let me know. Thanks!

@derekcollison (Member)

Let's merge and have a quick follow-on PR. Thanks @charlievieth, much appreciated.

@kozlovic changed the title from "nats: reuse buffers in Conn.processOpError" to "[IMPROVED] Reuse buffers in Conn.processOpError" on Mar 1, 2018
@kozlovic merged commit 0718c1c into nats-io:master on Mar 1, 2018
@charlievieth (Contributor, Author)

Thanks for taking the PR!

@kozlovic (Member) commented Mar 2, 2018

@charlievieth You are welcome. But I had another look and something is not quite right. I don't see how nc.pending can be non-nil, since we set it to nil on a successful reconnect, and I don't think a connection can go through successive processOpErr calls without first reconnecting. In other words, I think nc.pending should always be nil at that point, which means we always end up creating a new nc.pending.
Would it be possible for you to add a stack trace/panic there if nc.pending is not nil, so we can see how that happens?

By the way, the original code would create an nc.bw with a 32KB buffer and the net.Conn as the underlying writer. When starting the reconnect process, nc.bw would be recreated with a buffer of ReconnectBufSize (8MB by default) and a bytes.Buffer as the backing writer. When a connection was successfully re-established, nc.bw would again be created with the default 32KB buffer and the new net.Conn as the writer, and nc.pending would be set to nil.
With this PR, nc.bw basically always keeps a 32KB buffer and simply changes its writer.

@charlievieth (Contributor, Author) commented Mar 2, 2018

@kozlovic I don't have a ton of time to look at this, and while it's certainly possible that nc.pending cannot be nil, the tests say otherwise and panic.

Regarding your second point, I don't see how the size of nc.bw affects anything: you're double buffering (nc.bw => nc.pending), the size of nc.bw has no impact on the amount of data that can be written to nc.pending, and the double buffering actually has a negative, though tiny, impact on performance. The size of nc.bw only matters when writing to a net.Conn, which may be unbuffered, so why change it needlessly?

Or to put it more tersely, changing the size of nc.bw is the bug here. Remember that the bytes.Buffer grows as needed, but the bufio.Writer allocates its scratch space immediately.
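
As a standalone illustration of that last point (this is demonstration code, not go-nats code): bufio.NewWriterSize allocates its full internal buffer up front, while a bytes.Buffer only grows once data is actually written.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

func main() {
	const reconnectBufSize = 8 * 1024 * 1024 // go-nats default, 8MB

	// bufio.NewWriterSize allocates its scratch space immediately: this
	// writer holds an 8MB []byte even before anything is written to it.
	bw := bufio.NewWriterSize(io.Discard, reconnectBufSize)
	fmt.Println("bufio.Writer buffer available right away:", bw.Available())

	// A bytes.Buffer starts empty and only grows as data is written.
	var pending bytes.Buffer
	fmt.Println("bytes.Buffer capacity before writes:", pending.Cap())
	pending.WriteString("PING\r\n")
	fmt.Println("bytes.Buffer capacity after a small write:", pending.Cap())
}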

@kozlovic (Member) commented Mar 2, 2018

> I don't have a ton of time to look at this, and while it's certainly possible that nc.pending cannot be nil, the tests say otherwise and panic.

If it panics, that is the trace I would want to see: how we get to processOpErr with a non-nil nc.pending.

> you're double buffering (nc.bw => nc.pending)

True. But this is so that the rest of the library behaves the same regardless of whether we are connected or not.

I agree that the size of bw does not really matter, except that writes may be copied back to pending more often with this change, since the bw buffer may now be smaller than ReconnectBufSize (and I realize that we flush, to pending, at each Publish() call).

> Or to put it more tersely, changing the size of nc.bw is the bug here.

Changing it where? Are you saying in the original code, as opposed to what you did in the PR?

@charlievieth (Contributor, Author)

Disregard the first part about nc.pending not being nil; I misread that, and it is nil.

@charlievieth (Contributor, Author)

I was referring to this PR.

@charlievieth (Contributor, Author)

Again, thanks for taking the time to look at this stuff.
