[IMPROVED] Reuse buffers in Conn.processOpError #341

Merged
merged 1 commit from charlievieth:master into nats-io:master on Mar 1, 2018

Conversation

@charlievieth (Contributor)

This significantly reduces memory usage when the nats Conn enters an error loop.

In a Cloud Foundry test (route-emitter/cmd/route-emitter), Conn.processOpError was responsible for ~95% of memory allocations because it allocated a new bufio.Writer each time it was called. This commit changes Conn.processOpError to reuse and reset its existing bufio.Writer.
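
At a high level, the change swaps the per-error allocation for a reset of the existing buffers. The standalone sketch below illustrates the two patterns; the conn type and method names are hypothetical stand-ins for the relevant parts of nats.Conn, not the actual diff.

package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// conn is a hypothetical stand-in for the relevant fields of nats.Conn;
// it is not the actual library type.
type conn struct {
	bw      *bufio.Writer
	pending *bytes.Buffer
}

// errorPathOld mirrors the original pattern: a brand new writer, sized to
// ReconnectBufSize (8MB by default), is allocated on every trip through
// the error path.
func (c *conn) errorPathOld(reconnectBufSize int) {
	c.pending = &bytes.Buffer{}
	c.bw = bufio.NewWriterSize(c.pending, reconnectBufSize)
}

// errorPathNew mirrors the reuse pattern of this change: reset the existing
// buffers instead of reallocating them.
func (c *conn) errorPathNew() {
	if c.pending == nil {
		c.pending = new(bytes.Buffer)
	} else {
		c.pending.Reset()
	}
	c.bw.Reset(c.pending)
}

func main() {
	c := &conn{bw: bufio.NewWriterSize(bytes.NewBuffer(nil), 32*1024)}
	c.errorPathNew()
	c.bw.WriteString("PUB foo 2\r\nhi\r\n")
	c.bw.Flush()
	fmt.Printf("pending now holds %d bytes\n", c.pending.Len())
}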

Below is the output from go tool pprof implicating Conn.processOpError:

File: route-emitter.test
Type: inuse_space
Time: Feb 17, 2018 at 5:54pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5 -cum
Showing nodes accounting for 560MB, 94.65% of 591.68MB total
Dropped 30 nodes (cum <= 2.96MB)
Showing top 5 nodes out of 11
      flat  flat%   sum%        cum   cum%
         0     0%     0%   563.50MB 95.24%  github.com/nats-io/nats.(*Conn).readLoop
     560MB 94.65% 94.65%      560MB 94.65%  github.com/nats-io/nats.(*Conn).processOpErr
         0     0% 94.65%    25.67MB  4.34%  io.(*multiWriter).Write
         0     0% 94.65%    25.67MB  4.34%  io.Copy
         0     0% 94.65%    25.67MB  4.34%  io.copyBuffer

Output of go tool pprof after applying this patch (note the absence of nats-io frames and the overall reduction in memory):

File: route-emitter.test
Type: inuse_space
Time: Feb 17, 2018 at 6:54pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5 -cum
Showing nodes accounting for 0, 0% of 31.14MB total
Showing top 5 nodes out of 90
      flat  flat%   sum%        cum   cum%
         0     0%     0%    21.88MB 70.28%  io.(*multiWriter).Write
         0     0%     0%    21.88MB 70.28%  io.Copy
         0     0%     0%    21.88MB 70.28%  io.copyBuffer
         0     0%     0%    21.88MB 70.28%  os/exec.(*Cmd).Start.func1
         0     0%     0%    21.88MB 70.28%  os/exec.(*Cmd).writerDescriptor.func1

I would have provided the output of running the benchmarks in nats-io/go-nats/test, but they do not appear to hit this code path. That said, I'd be happy to send them if you think they'd be useful.

Commit e23ea03: This significantly reduces memory usage when the nats Conn enters an error loop.
@coveralls

Coverage Status

Coverage increased (+0.006%) to 94.655% when pulling e23ea03 on charlievieth:master into c7e16c2 on nats-io:master.

@charlievieth (Contributor, Author)

Oops, this commit (nats-io/go-nats@5406b8c: nats: simplify error check in Conn.connect) briefly leaked into this PR because it was on my master branch. It has been removed, so only nats-io/go-nats@e23ea03 is part of this PR.

@kozlovic (Member) left a comment

This needs a bit more investigation, and regardless, if we make a change here we need to take the ReconnectBufSize value into consideration.

Inline comment on the following lines of the diff:

nc.bw = bufio.NewWriterSize(nc.pending, nc.Opts.ReconnectBufSize)
// Reset pending buffers before reconnecting.
if nc.pending == nil {
	nc.pending = new(bytes.Buffer)

@kozlovic (Member)

This does not account for nc.Opts.ReconnectBufSize, and it is worrisome that the tests still pass; that suggests this path is not properly tested.
More importantly, in what situation would this be called so many times, and so fast, that the GC would not collect those buffers? Maybe we need to better understand why this is a problem in the first place.

@charlievieth (Contributor, Author)

@kozlovic thanks for the feedback!

The tests likely passed because bufio.NewWriterSize(W, N) does nothing to limit the amount of data written to the io.Writer W; N is only used to size the internal buffer. Properly limiting the amount of data written to W would need to be handled by W itself: implementing a LimitedWriter (analogous to io.LimitedReader) and wrapping the buffer assigned to nc.pending would work, but this would likely require updating the error handling around reconnects.
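
For illustration only, a LimitedWriter along those lines might look like the sketch below (the type, its reject-outright semantics, and the error value are hypothetical, not go-nats API):

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// errLimitExceeded is a hypothetical error for writes that would blow the budget.
var errLimitExceeded = errors.New("write limit exceeded")

// LimitedWriter writes to W but rejects writes once the byte budget N is
// exhausted, mirroring the shape of io.LimitedReader.
type LimitedWriter struct {
	W io.Writer // underlying writer (e.g. the pending bytes.Buffer)
	N int64     // bytes remaining in the budget
}

func (l *LimitedWriter) Write(p []byte) (int, error) {
	if int64(len(p)) > l.N {
		return 0, errLimitExceeded
	}
	n, err := l.W.Write(p)
	l.N -= int64(n)
	return n, err
}

func main() {
	var pending bytes.Buffer
	lw := &LimitedWriter{W: &pending, N: 8} // tiny budget for demonstration
	fmt.Println(lw.Write([]byte("12345")))  // 5 <nil>
	fmt.Println(lw.Write([]byte("678910"))) // 0 write limit exceeded
}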

The reason this was spinning so hard and generating loads of garbage is that we use Ginkgo for our tests, and it encourages some interesting patterns, like spinning up a NATS listener for each test regardless of whether it has anything to connect to. The listener is spun down at the end of each test, but the connection state at that point is anyone's guess. Because of the churn Ginkgo introduces, it tends to highlight situations where applications do not perform well when erroring.

That said, this is an edge case that only occurs when erroring in a tight loop in test code, but an 8MB memory allocation per reconnect is still a pretty heavy hit (regardless of the efficacy of Go's GC).

Thanks again for reviewing, and please let me know if there is anything I can do to help.

@kozlovic (Member)

Yes, the underlying bytes buffer never limits the amount of data that can be written. We actually check the length in the publish() call to enforce that limit (this is to avoid unbounded growth).
Let me investigate a bit more to see why the original code would need so much memory, but otherwise I think your changes are good. Thanks!

@kozlovic (Member) commented Mar 1, 2018

Sorry for the delay. @derekcollison, would you want to have a look? I think the change is OK.

@kozlovic (Member) commented Mar 1, 2018

@charlievieth Would you want to have a go at what @derekcollison commented on? If you don't have the time, I can try to update the PR (or we can merge and then I will submit a new PR). Let me know. Thanks!

@derekcollison (Member)

Let's merge and have a quick follow-on PR. Thanks @charlievieth, much appreciated.

@kozlovic changed the title from "nats: reuse buffers in Conn.processOpError" to "[IMPROVED] Reuse buffers in Conn.processOpError" on Mar 1, 2018
@kozlovic merged commit 0718c1c into nats-io:master on Mar 1, 2018
@charlievieth (Contributor, Author)

Thanks for taking the PR!

@kozlovic (Member) commented Mar 2, 2018

@charlievieth You are welcome. But I had another look and something is not quite right. I don't see how nc.pending can be non-nil, since we set it to nil on a successful reconnect, and I don't think a connection can go through successive processOpErr calls without first reconnecting. In other words, I think nc.pending should always be nil at that point, which means we always end up creating a new nc.pending.
Would it be possible for you to add a stack trace/panic there if nc.pending is not nil, so we can see how that happens?

By the way, the original code would create an nc.bw with a 32KB buffer and the net.Conn as the underlying writer. When starting the reconnect process, nc.bw would be recreated with a buffer of ReconnectBufSize (8MB by default) and a bytes.Buffer as the backing writer. When a connection was successfully re-established, nc.bw would again be created with the default 32KB buffer and the new net.Conn as the writer, and nc.pending would be set to nil.
With this PR, nc.bw basically always keeps a 32KB buffer and simply changes its writer.

@charlievieth (Contributor, Author) commented Mar 2, 2018

@kozlovic I don't have a ton of time to look at this, and while it's certainly possible that nc.pending cannot be nil, the tests say otherwise and panic.

Regarding your second point, I don't see how the size of nc.bw affects anything: you're double buffering (nc.bw => nc.pending), the size of nc.bw has no impact on the amount of data that can be written to nc.pending, and the double buffering actually has a negative, though tiny, impact on performance. The size of nc.bw only matters when writing to a net.Conn, which may be unbuffered, so why change it needlessly?

Or to put it more tersely, changing the size of nc.bw is the bug here. Remember that the bytes.Buffer grows as needed, but the bufio.Writer allocates its scratch space immediately.
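
As a standalone illustration of that last point (this is demonstration code, not go-nats code): bufio.NewWriterSize allocates its full internal buffer up front, while a bytes.Buffer only grows once data is actually written.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

func main() {
	const reconnectBufSize = 8 * 1024 * 1024 // go-nats default, 8MB

	// bufio.NewWriterSize allocates its scratch space immediately: this
	// writer holds an 8MB []byte even before anything is written to it.
	bw := bufio.NewWriterSize(io.Discard, reconnectBufSize)
	fmt.Println("bufio.Writer buffer available right away:", bw.Available())

	// A bytes.Buffer starts empty and only grows as data is written.
	var pending bytes.Buffer
	fmt.Println("bytes.Buffer capacity before writes:", pending.Cap())
	pending.WriteString("PING\r\n")
	fmt.Println("bytes.Buffer capacity after a small write:", pending.Cap())
}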

@kozlovic (Member) commented Mar 2, 2018

> I don't have a ton of time to look at this, and while it's certainly possible that nc.pending cannot be nil, the tests say otherwise and panic.

If it panics, that is the trace I would want to see: how we get to processOpErr with a non-nil nc.pending.

> you're double buffering (nc.bw => nc.pending)

True. But this is so that the rest of the library behaves the same regardless of whether we are connected or not.

I agree that the size of bw does not really matter, except that writes may be copied back to pending more often with this change, since the bw buffer may now be smaller than ReconnectBufSize (and I realize that we flush, to pending, at each Publish() call).

> Or to put it more tersely, changing the size of nc.bw is the bug here.

Changing it where? Are you saying in the original code, as opposed to what you did in the PR?

@charlievieth (Contributor, Author)

Disregard the first part about nc.pending not being nil; I misread that, and it is nil.

@charlievieth (Contributor, Author)

I was referring to this PR.

@charlievieth (Contributor, Author)

Again, thanks for taking the time to look at this stuff.
