Fix race/deadlock issues with logs when a pod is not ready #719

Merged
merged 1 commit into vmware-tanzu:master from logRaceFix on May 15, 2019
Conversation

johnSchnake
Contributor

What this PR does / why we need it:
There was an issue where, if reading the logs of a pod failed, the reader
would try to put a message onto the error channel of the logReader, which
was unbuffered. That send would block, but the reading routine was draining
the message channel before it read any errors, so the two sides deadlocked.

The solution was to buffer the error channel enough to handle these
bad-reader cases. Trying to stream from both channels at once introduced
issues where the messages wouldn't be drained yet but the error would
short-circuit the logic, causing the logs to not be read fully.
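
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above (an illustration, not the actual Sonobuoy code): one goroutine per container feeds a shared message channel, a failing reader reports on the error channel, and the consumer drains messages before errors. With an unbuffered error channel the failing reader and the consumer deadlock; giving the error channel one slot per reader removes the deadlock while keeping message priority.

package main

import (
	"fmt"
	"sync"
)

type message struct{ text string }

func main() {
	const readers = 3
	agg := make(chan *message)
	// The fix: give the error channel one slot per reader. If errc were
	// unbuffered, a reader that hits an error would block on its send,
	// wg.Wait() would never return, agg would never be closed, and the
	// consumer below (which drains agg before touching errc) would hang.
	errc := make(chan error, readers)

	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if id == 0 {
				// Simulate a pod whose container is not ready yet.
				errc <- fmt.Errorf("reader %d: container not ready", id)
				return
			}
			agg <- &message{text: fmt.Sprintf("log line from reader %d", id)}
		}(i)
	}

	go func() {
		wg.Wait()
		close(agg)
		close(errc)
	}()

	// Messages are still drained first; errors are read afterwards.
	for m := range agg {
		fmt.Println(m.text)
	}
	for err := range errc {
		fmt.Println("error:", err)
	}
}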

Which issue(s) this PR fixes
Fixes #708

Special notes for your reviewer:
Note on the solution: I am generally highly skeptical when the solution to a race is "increase the buffer size of the channel", but after thinking on it a bit I don't think there is really a better way to do this. I did some googling and found that this minimum buffer size for errors was not an anomaly. The only alternative seemed more awkward: it would involve creating other channels and layers of logic, which made it more confusing. This seemed the best approach: it is sure to remove the race, it minimally changes the existing logic, and it still prioritizes the messages before reading the errors.

w.r.t. testing the change:
It is a bit difficult to test, but you can do so manually; I used two simple YAML files and checked a few cases:

  • can read normal logs in a standard, ready state
  • can properly report the error about the container not being ready
  • can report the error about one pod not being ready while at the same time reporting logs of other ready pods in the namespace.

The command I used to check the initial error reporting was:

kubectl apply -f test1.yaml && ./sonobuoy logs -n default

That should hit a race with the pod and report an error (without this fix, it would hang forever).

Then try to just grab the logs a few more times with ./sonobuoy logs -n default and see that the pod starts normally and you eventually get the logs as expected.

Then I copied that YAML and changed the pod name so that one pod was ready while the other hit the race, letting me see that I got both the logs and the error:

kubectl apply -f test2.yaml && ./sonobuoy logs -n default

The YAML file I used:

apiVersion: v1
kind: Pod
metadata:
  name: testpod2
spec:
  containers:
  - name: sise
    image: mhausenblas/simpleservice:0.5.0
    ports:
    - containerPort: 9876

Release note:

Fixed a bug which caused `sonobuoy logs` to hang if a pod was not in a ready state.

johnSchnake requested a review from stevesloka on May 14, 2019 03:02
@@ -146,7 +157,6 @@ func (s *SonobuoyClient) LogReader(cfg *LogConfig) (*Reader, error) {
go func(wg *sync.WaitGroup, agg chan *message, errc chan error) {
wg.Wait()
close(agg)
errc <- io.EOF
johnSchnake (Contributor Author)

No need to do this if we add the 'assume EOF' logic above. Otherwise I think we have to worry about another odd edge case: what if every container threw an error, filling the buffered channel, and then we tried to add an EOF here?

I think that would work OK, since this call may block but eventually the Read() call would pull a value off, unblocking this and closing the error channel. Regardless of whether the error channel closed, the program would end rather than hang. Or so I think without checking.

It seems the simpler thing is to just close the channels and let the logic happen at the read site, where assuming EOF makes sense once both the message and error channels have been closed.
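
Here is a hedged sketch of the read-side idea in this comment (the next() helper and types are illustrative, not Sonobuoy's API): once both the message channel and the error channel have been closed, assume io.EOF rather than having the producer push an explicit io.EOF onto errc, while still draining messages before errors.

package main

import (
	"fmt"
	"io"
)

type message struct{ text string }

// next drains the message channel before the error channel and reports
// io.EOF once both channels have been closed.
func next(agg <-chan *message, errc <-chan error) (*message, error) {
	if m, ok := <-agg; ok {
		return m, nil // messages still take priority
	}
	if err, ok := <-errc; ok {
		return nil, err // surface any buffered reader errors
	}
	return nil, io.EOF // both channels closed: end of stream
}

func main() {
	agg := make(chan *message, 1)
	errc := make(chan error, 1)
	agg <- &message{text: "one log line"}
	close(agg)
	close(errc)

	for {
		m, err := next(agg, errc)
		if err == io.EOF {
			fmt.Println("EOF")
			return
		}
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m.text)
	}
}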

@johnSchnake
Contributor Author

@ashish-amarnath Here is the fix I finally ended at.

Selecting off of the two channels led to problems because the messages weren't prioritized over the errors like they were before, causing a race between the normal logs and the standard EOF condition (or another error, like the pods not being ready).

I talked to you about not wanting to increase the buffer size on a whim, but after a lot of deliberation, I think that is actually the right design here, considering the rest of the code that already exists.
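
For illustration, a small self-contained program (assumed names, not Sonobuoy code) showing why selecting over both channels at once is racy: select picks pseudo-randomly among ready cases, so the error case can fire while log messages are still queued, ending the read before the logs are fully drained.

package main

import (
	"errors"
	"fmt"
)

func main() {
	agg := make(chan string, 2)
	errc := make(chan error, 1)
	agg <- "log line 1"
	agg <- "log line 2"
	errc <- errors.New("EOF-like condition")

	for {
		select {
		case m := <-agg:
			fmt.Println(m)
		case err := <-errc:
			// This branch can be chosen while agg still holds messages,
			// so "log line 1" and "log line 2" may never be printed.
			fmt.Println("stopping on:", err)
			return
		}
	}
}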

@codecov-io

Codecov Report

Merging #719 into master will increase coverage by 0.1%.
The diff coverage is 33.33%.


@@            Coverage Diff            @@
##           master     #719     +/-   ##
=========================================
+ Coverage   39.22%   39.33%   +0.1%     
=========================================
  Files          68       68             
  Lines        3816     3821      +5     
=========================================
+ Hits         1497     1503      +6     
  Misses       2221     2221             
+ Partials       98       97      -1
Impacted Files                          Coverage Δ
pkg/client/logs.go                      28.84% <33.33%> (-0.45%) ⬇️
pkg/plugin/aggregation/aggregator.go    72.16% <0%> (+5.15%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 770b85b...e0800a1.

stevesloka (Contributor) left a comment:

/lgtm

I think this looks good; it's difficult to test in a way that truly validates it. =)

johnSchnake merged commit 9a19511 into vmware-tanzu:master on May 15, 2019
johnSchnake deleted the logRaceFix branch on May 15, 2019 01:41