Fix race/deadlock issues with logs when a pod is not ready #719

Merged
merged 1 commit into vmware-tanzu:master from logRaceFix on May 15, 2019
Conversation

johnSchnake
Contributor

What this PR does / why we need it:
There was an issue where, if reading the logs of a pod failed, the reader
would try to put a message onto the error channel of the logReader, which
was unbuffered. That send would block, but the reading routine was draining
the message channel before it read any errors, so the two sides deadlocked.

The solution was to buffer the error channel enough to handle these
bad-reader cases. Trying to stream from both channels at once introduced
issues where the messages wouldn't be drained yet but the error would
short-circuit the logic, causing the logs to not be read fully.
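
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above (an illustration, not the actual Sonobuoy code): one goroutine per container feeds a shared message channel, a failing reader reports on the error channel, and the consumer drains messages before errors. With an unbuffered error channel the failing reader and the consumer deadlock; giving the error channel one slot per reader removes the deadlock while keeping message priority.

package main

import (
	"fmt"
	"sync"
)

type message struct{ text string }

func main() {
	const readers = 3
	agg := make(chan *message)
	// The fix: give the error channel one slot per reader. If errc were
	// unbuffered, a reader that hits an error would block on its send,
	// wg.Wait() would never return, agg would never be closed, and the
	// consumer below (which drains agg before touching errc) would hang.
	errc := make(chan error, readers)

	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if id == 0 {
				// Simulate a pod whose container is not ready yet.
				errc <- fmt.Errorf("reader %d: container not ready", id)
				return
			}
			agg <- &message{text: fmt.Sprintf("log line from reader %d", id)}
		}(i)
	}

	go func() {
		wg.Wait()
		close(agg)
		close(errc)
	}()

	// Messages are still drained first; errors are read afterwards.
	for m := range agg {
		fmt.Println(m.text)
	}
	for err := range errc {
		fmt.Println("error:", err)
	}
}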

Which issue(s) this PR fixes
Fixes #708

Special notes for your reviewer:
Note on the solution: I am generally highly skeptical when the solution to a race is "increase the buffer size of the channel", but after thinking on it a bit I don't think there is really a better way to do this. I did some googling and found that this minimum buffer size for errors was not an anomaly. The only alternative seemed more awkward: it would involve creating other channels and layers of logic, which made it more confusing. This seemed the best approach: it is sure to remove the race, it minimally changes the existing logic, and it still prioritizes the messages before reading the errors.

w.r.t. testing the change:
It is a bit difficult to test, but you can do so manually; I used two simple YAML files and checked a few cases:

  • can read normal logs in a standard, ready state
  • can properly report the error about the container not being ready
  • can report the error about one pod not being ready while at the same time reporting logs of other ready pods in the namespace.

The command I used to check the initial error reporting was:

kubectl apply -f test1.yaml && ./sonobuoy logs -n default

That should hit a race with the pod and report an error (without this fix, it would hang forever).

Then try to just grab the logs a few more times with ./sonobuoy logs -n default and see that the pod starts normally and you eventually get the logs as expected.

Then I copied that YAML and changed the pod name so that one pod was ready while the other hit the race, letting me see that I got both the logs and the error:

kubectl apply -f test2.yaml && ./sonobuoy logs -n default

The YAML file I used:

apiVersion: v1
kind: Pod
metadata:
  name: testpod2
spec:
  containers:
  - name: sise
    image: mhausenblas/simpleservice:0.5.0
    ports:
    - containerPort: 9876

Release note:

Fixed a bug which caused `sonobuoy logs` to hang if a pod was not in a ready state.

johnSchnake requested a review from stevesloka on May 14, 2019 03:02
@@ -146,7 +157,6 @@ func (s *SonobuoyClient) LogReader(cfg *LogConfig) (*Reader, error) {
go func(wg *sync.WaitGroup, agg chan *message, errc chan error) {
wg.Wait()
close(agg)
errc <- io.EOF
johnSchnake (Contributor Author)

No need to do this if we add the 'assume EOF' logic above. Otherwise I think we have to worry about another odd edge case: what if every container threw an error, filling the buffered channel, and then we tried to add an EOF here?

I think that would work OK, since this call may block but eventually the Read() call would pull a value off, unblocking this and closing the error channel. Regardless of whether the error channel closed, the program would end rather than hang. Or so I think without checking.

It seems the simpler thing is to just close the channels and let the logic happen at the read site, where assuming EOF makes sense once both the message and error channels have been closed.
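
Here is a hedged sketch of the read-side idea in this comment (the next() helper and types are illustrative, not Sonobuoy's API): once both the message channel and the error channel have been closed, assume io.EOF rather than having the producer push an explicit io.EOF onto errc, while still draining messages before errors.

package main

import (
	"fmt"
	"io"
)

type message struct{ text string }

// next drains the message channel before the error channel and reports
// io.EOF once both channels have been closed.
func next(agg <-chan *message, errc <-chan error) (*message, error) {
	if m, ok := <-agg; ok {
		return m, nil // messages still take priority
	}
	if err, ok := <-errc; ok {
		return nil, err // surface any buffered reader errors
	}
	return nil, io.EOF // both channels closed: end of stream
}

func main() {
	agg := make(chan *message, 1)
	errc := make(chan error, 1)
	agg <- &message{text: "one log line"}
	close(agg)
	close(errc)

	for {
		m, err := next(agg, errc)
		if err == io.EOF {
			fmt.Println("EOF")
			return
		}
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m.text)
	}
}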

@johnSchnake
Contributor Author

@ashish-amarnath Here is the fix I finally ended at.

Selecting off of the two channels led to problems because the messages weren't prioritized over the errors like they were before, causing a race between the normal logs and the standard EOF condition (or another error, like the pods not being ready).

I talked to you about not wanting to increase the buffer size on a whim, but after a lot of deliberation, I think that is actually the right design here, considering the rest of the code that already exists.
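
For illustration, a small self-contained program (assumed names, not Sonobuoy code) showing why selecting over both channels at once is racy: select picks pseudo-randomly among ready cases, so the error case can fire while log messages are still queued, ending the read before the logs are fully drained.

package main

import (
	"errors"
	"fmt"
)

func main() {
	agg := make(chan string, 2)
	errc := make(chan error, 1)
	agg <- "log line 1"
	agg <- "log line 2"
	errc <- errors.New("EOF-like condition")

	for {
		select {
		case m := <-agg:
			fmt.Println(m)
		case err := <-errc:
			// This branch can be chosen while agg still holds messages,
			// so "log line 1" and "log line 2" may never be printed.
			fmt.Println("stopping on:", err)
			return
		}
	}
}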

@codecov-io

Codecov Report

Merging #719 into master will increase coverage by 0.1%.
The diff coverage is 33.33%.


@@            Coverage Diff            @@
##           master     #719     +/-   ##
=========================================
+ Coverage   39.22%   39.33%   +0.1%     
=========================================
  Files          68       68             
  Lines        3816     3821      +5     
=========================================
+ Hits         1497     1503      +6     
  Misses       2221     2221             
+ Partials       98       97      -1
Impacted Files                          Coverage Δ
pkg/client/logs.go                      28.84% <33.33%> (-0.45%) ⬇️
pkg/plugin/aggregation/aggregator.go    72.16% <0%> (+5.15%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 770b85b...e0800a1.

stevesloka (Contributor) left a comment:

/lgtm

I think this looks good; it's difficult to test in a way that truly validates it. =)

johnSchnake merged commit 9a19511 into vmware-tanzu:master on May 15, 2019
johnSchnake deleted the logRaceFix branch on May 15, 2019 01:41