-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race in StreamFramer and truncation in api/AllocFS.Logs #4234
Conversation
} | ||
s.l.Unlock() | ||
} | ||
|
||
// send takes a StreamFrame, encodes and sends it | ||
func (s *StreamFramer) send(f *StreamFrame) error { | ||
sending := *f | ||
f.Data = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't let this fool you as it fooled me and everyone who touched it before! 😅
Nil'ing the slice doesn't prevent a race. sending.Data will have the same backing array as f.Data used to. This means the next call to s.readData operates on the same backing array as sending.Data is reading in another goroutine!
At least that's my best explanation (and reverting this file immediately induces the bug).
api/fs.go
Outdated
@@ -315,8 +320,11 @@ func (a *AllocFS) Logs(alloc *Allocation, follow bool, task, logType, origin str | |||
// Decode the next frame | |||
var frame StreamFrame | |||
if err := dec.Decode(&frame); err != nil { | |||
errCh <- err | |||
close(frames) | |||
if err == io.EOF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put my review comment in the wrong place so repasting it here:
Shouldn't we also close the channel when the error is ErrClosedPipe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I think due to how we currently Close the pipe that can't happen, but there's no reason not to be defensive.
|
||
//TODO There should be a way to connect the client to the servers in | ||
//makeClient above | ||
require.NoError(c.Agent().SetServers([]string{fmt.Sprintf("127.0.0.1:%d", rpcPort)})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't makeClient already do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't! This makeClient
helper is very strange and all of the tests in api/
that expect a client node seem to be commented out! For example: https://github.com/hashicorp/nomad/blob/master/api/allocations_test.go#L33-L59
I'm honestly not sure what the right approach is here. Either we need to improve this test client or we need to remove it and use one of the other test client helpers we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That test client just serializes a config and starts the agent with it. By default it only creates the server. You may want to consider just using a dev agent.
d0cc37d
to
1b34922
Compare
Closing the frames chan is the only race-free way to signal to receivers that all frames have been sent and no errors have occurred. If EOF is sent on error chan receivers may not receive the last frame (or frames since the chan is buffered) before receiving the error. Closing frames is the idiomatic way of signaling there is no more data to be read from a chan.
According to go/codec's docs, Reset(...) should be called on Decoders/Encoders before reuse: https://godoc.org/github.com/ugorji/go/codec I could find no evidence that *not* calling Reset() caused bugs, but might as well do what the docs say?
In the old code `sending` in the `send()` method shared the Data slice's underlying backing array with its caller. Clearing StreamFrame.Data didn't break the reference from the sent frame to the StreamFramer's data slice.
return | ||
} | ||
decoder.Reset(httpPipe) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look related to the race, was this all general cleanup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and I was torn on whether to include it or not.
When testing adding or removing these lines had no impact on truncation or ordering issues.
However, they also didn't hurt and the docs say Reset should be called when "reusing" an encoder/decoder? https://godoc.org/github.com/ugorji/go/codec (search for Reset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chatted with the team, and we're leaning toward keeping it Just In Case.
|
||
//TODO There should be a way to connect the client to the servers in | ||
//makeClient above | ||
require.NoError(c.Agent().SetServers([]string{fmt.Sprintf("127.0.0.1:%d", rpcPort)})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That test client just serializes a config and starts the agent with it. By default it only creates the server. You may want to consider just using a dev agent.
n := new(StreamFrame) | ||
*n = *s | ||
n.Data = make([]byte, len(s.Data)) | ||
copy(n.Data, s.Data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the race on the backing slice I understand how this fixes the issue but it is rather unfortunate that we are copying the data! I believe another fix would be for the out channel of the framer to be a wrapper:
type FrameFuture struct {
Frame *StreamFrame
FrameSent chan struct{}
}
func (f *FrameFuture) Sent() {
close(f.FrameSent)
}
And then the caller can call Sent on the future after it has been flushed: https://github.com/hashicorp/nomad/pull/4234/files#diff-03a99c783d6606b25fccfc5f3d26092bR470
Then send() can select on that and the exit channel. This way we can avoid copying.
@@ -423,20 +422,23 @@ func (f *FileSystem) logs(conn io.ReadWriteCloser) { | |||
go func() { | |||
for { | |||
if _, err := conn.Read(nil); err != nil { | |||
if err == io.EOF { | |||
if err == io.EOF || err == io.ErrClosedPipe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done here 51440e2
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Fixes #4159
This was extremely difficult to debug since there is an incredible amount of indirection, proxying, and copying all in code that tries to avoid unnecessary copying because it's proxying a potentially significant amount of data! Luckily I added a test to
api/fs_test.go
that demonstrated the bug. Sadly it's slow (6-8s on my Vagrantbox).The two bugs affecting #4159 are easiest to review by looking at their commits alone:
api/
between sending frames and closing: 41f05dc