Fix race/deadlock issues with logs when a pod is not ready #719
Conversation
```diff
@@ -146,7 +157,6 @@ func (s *SonobuoyClient) LogReader(cfg *LogConfig) (*Reader, error) {
 	go func(wg *sync.WaitGroup, agg chan *message, errc chan error) {
 		wg.Wait()
 		close(agg)
-		errc <- io.EOF
```
No need to do this if we add the 'assume EOF' logic above. Otherwise I think we have to worry about another odd edge case: what if every container threw an error, filling the buffered channel, and then we tried to add an EOF here?
I think that would still work, since this send may block, but eventually the Read() call would pull a value off, unblocking it and allowing the error channel to close. Whether or not the error channel closed, the program would end rather than hang. Or so I think, without checking.
The simpler thing seems to be to just close the channels and let the logic happen at the read site, where assuming EOF makes sense once both the message and error channels have been closed.
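To make that read-site idea concrete, here is a minimal, self-contained Go sketch (not the actual Sonobuoy code; `message`, `readNext`, and the channel setup here are invented for illustration) of draining the message channel before the error channel and treating "both channels closed" as EOF instead of sending a sentinel value:

```go
package main

import (
	"fmt"
	"io"
)

type message struct{ data string }

// readNext drains messages before errors and assumes EOF once both
// channels have been closed by the producers.
func readNext(agg <-chan *message, errc <-chan error) (*message, error) {
	if m, ok := <-agg; ok {
		return m, nil
	}
	if err, ok := <-errc; ok {
		return nil, err
	}
	return nil, io.EOF // both channels closed: nothing left to read
}

func main() {
	agg := make(chan *message, 1)
	errc := make(chan error, 1)
	agg <- &message{data: "some log line"}
	errc <- fmt.Errorf("container foo: could not stream logs")
	close(agg)
	close(errc)

	for {
		m, err := readNext(agg, errc)
		if err == io.EOF {
			return
		}
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("message:", m.data)
	}
}
```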
@ashish-amarnath Here is the fix I finally landed on.
I talked to you about not wanting to increase the buffer size on a whim, but after a lot of deliberation I think that is actually the right design here, given the rest of the code that already exists.
Codecov Report
```
@@            Coverage Diff            @@
##           master     #719    +/-  ##
=========================================
+ Coverage   39.22%   39.33%   +0.1%
=========================================
  Files          68       68
  Lines        3816     3821      +5
=========================================
+ Hits         1497     1503      +6
  Misses       2221     2221
+ Partials       98       97      -1
```
Continue to review full report at Codecov.
/lgtm
I think this looks good; it's difficult to write a test that truly validates it. =)
What this PR does / why we need it:
There was an issue where, if there was a problem reading the logs of a pod, the log reader would try to put a message onto the logReader's error channel, which was unbuffered. That send would block, and because the reading routine drains the message channel before it reads errors, the two sides deadlocked.

The solution was to buffer the error channel enough to handle these bad-reader cases. Trying to stream from both channels at once introduced issues where the messages had not been drained yet but an error would short-circuit the logic, causing the logs to not be read fully.
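As a rough illustration (a stripped-down sketch, not the real LogReader code; the single fake reader and the channel names are invented), the shape of the hang looks like this. In this minimal program the Go runtime notices that every goroutine is blocked and aborts with "all goroutines are asleep - deadlock!"; in the real client the call simply hangs:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type message struct{ data string }

func main() {
	agg := make(chan *message)
	errc := make(chan error) // unbuffered: the root of the problem

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// A log reader that hits an error (e.g. the pod is not ready) tries to
		// report it, but nobody is reading errc yet, so this send blocks forever
		// and wg.Done() never runs.
		errc <- errors.New("container not ready")
	}()

	go func() {
		// agg is only closed once every reader finishes, which never happens here.
		wg.Wait()
		close(agg)
	}()

	// The consumer drains all messages before it ever looks at errc,
	// so it waits here for agg to close: deadlock.
	for m := range agg {
		fmt.Println(m.data)
	}
	fmt.Println(<-errc)
}
```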
Which issue(s) this PR fixes
Fixes #708
Special notes for your reviewer:
Note on the solution: I am generally highly skeptical when the solution to a race is "increase the buffer size of the channel", but after thinking on it a bit I don't think there is really a better way to do this. I did some googling and found that setting a minimum buffer size for error channels like this is not an anomaly. The only alternative seemed more awkward: it would involve creating other channels and layers of logic, which made things more confusing. This seemed the best approach: it is sure to remove the race, it changes the existing logic minimally, and it still prioritizes the messages before reading the errors.
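A minimal sketch of my reading of the fix (again, not the actual Sonobuoy code; the container names and channel setup are made up): the error channel is buffered with room for every reader, so a failing reader records its error and returns, wg.Wait() completes, the message channel closes, and errors are only surfaced after the messages have been drained:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type message struct{ data string }

func main() {
	containers := []string{"container-a", "container-b", "container-c"}

	agg := make(chan *message)
	// Buffered so that even if every reader fails, no send on errc can block.
	errc := make(chan error, len(containers))

	var wg sync.WaitGroup
	for _, name := range containers {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if name == "container-b" {
				// A reader that cannot stream logs records its error and returns.
				errc <- errors.New(name + ": pod not ready")
				return
			}
			agg <- &message{data: name + ": some log line"}
		}(name)
	}

	// Close both channels once every reader has finished.
	go func() {
		wg.Wait()
		close(agg)
		close(errc)
	}()

	// Messages are still drained first; errors are only reported afterwards.
	for m := range agg {
		fmt.Println(m.data)
	}
	for err := range errc {
		fmt.Println("error:", err)
	}
}
```

Sizing the buffer to the number of readers is a hard upper bound on how many errors can ever be sent, so this is not a "pick a big number and hope" fix.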
With respect to testing the change: it is a bit difficult to test, but you can do so manually. I used two simple YAML files and checked a few cases:
The command I used to check the initial error reporting was:
That should hit a race with the pod and report an error (currently it would hang forever).
Then try to just grab the logs a few more times with
./sonobuoy logs -n default
and see that the pod starts normally and you eventually get the logs as expected. Then I copied that YAML and changed the pod name in order to get one pod that was ready while racing the other, so that I could see I got both the logs and the error:
For the yaml file I used:
Release note: