-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix data race between Agent.Run() and Agent.Stop() #1625
Fix data race between Agent.Run() and Agent.Stop() #1625
Conversation
@@ -22,7 +22,7 @@ import ( | |||
"github.com/uber/jaeger-lib/metrics" | |||
"go.uber.org/zap" | |||
|
|||
"github.com/jaegertracing/jaeger/cmd/agent/app/customtransports" | |||
customtransport "github.com/jaegertracing/jaeger/cmd/agent/app/customtransports" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is needed otherwise line 66 will not compile. I am not sure how it was compiling before. I may be missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Rename the directory the the package is in or the package itself to match the directory name. Right now the package is named customtransport
, which the directory is customertransports
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would be the right approach, otherwise it breaks a common naming practice with no apparent good reason.
However, I am reluctant to make this change. If any other external code (not in this repo) imports this package renaming it will break that external code. I'd rather let this repo's maintainers decide if that's what they want to do.
I am reverting this change since apparently it compiles fine the way it was, not sure why I thought it doesn't.
Codecov Report
@@ Coverage Diff @@
## master #1625 +/- ##
=========================================
Coverage ? 98.74%
=========================================
Files ? 191
Lines ? 9182
Branches ? 0
=========================================
Hits ? 9067
Misses ? 89
Partials ? 26
Continue to review full report at Codecov.
|
We had a data race with Agent.Stop() failing if called immediately after Agent.Run() returns. The root cause of the race was that Stop() called WaitGroup.Wait() before Run() called WaitGroup.Add() for the first time which is prohibited: https://golang.org/pkg/sync/#WaitGroup.Add This change moves WaitGroup.Add() to earlier stage which guarantees that WaitGroup.Wait() will be called after that. Github issue: #1624 Testing done: Added automated tests which was failing before the bug was fixed and does not fail any more after the fix. Also verified that `make test` passes. Signed-off-by: Tigran Najaryan <[email protected]>
@@ -147,6 +147,47 @@ func withRunningAgent(t *testing.T, testcase func(string, chan error)) { | |||
t.Fatal("Expecting server exit log") | |||
} | |||
|
|||
func TestStartStopRace(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't reproduce this on my mac open-telemetry/opentelemetry-collector#43
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think wg.Wait()
doesn't block if count=0
, not sure if i missed something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't reproduce this on my mac open-telemetry/opentelemetry-collector#43
Yes, unfortunately it is a non-deterministic race, no guarantee that it will be reproducible. The test may catch the problem and it does sometimes when running on my machine.
I think wg.Wait() doesn't block if count=0, not sure if i missed something
The documentation of WaitGroup.Add() says:
Note that calls with a positive delta that occur when the counter is zero must happen before a Wait.
This means it is not valid to call Wait with zero counter before Add is called. However this is precisely what happens in our code.
You are right that in production mode Wait() does not block when count=0 however it is considered a mistake and go test race detector catches this mistake, see this comment https://github.com/golang/go/blob/master/src/sync/waitgroup.go#L124-L127
@@ -76,12 +76,12 @@ func NewThriftProcessor( | |||
numProcessors: numProcessors, | |||
} | |||
metrics.Init(&res.metrics, mFactory, nil) | |||
res.processing.Add(res.numProcessors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a hunch that deadlock might've been in func processBuffer
due to not executing s.processing.Done()
, maybe it should be defer s.processing.Done()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, and thats not the case unless Serve()
was called
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not seeing a deadlock, I am observing a data race. Here is the output of go test -v -race github.com/jaegertracing/jaeger/cmd/agent/app
without my fix applied (but with my new test included):
==================
WARNING: DATA RACE
Write at 0x00c000346118 by goroutine 19:
sync.(*WaitGroup).Wait()
/usr/local/go/src/internal/race/race.go:41 +0xef
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Stop()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:102 +0x11d
Previous read at 0x00c000346118 by goroutine 18:
sync.(*WaitGroup).Add()
/usr/local/go/src/internal/race/race.go:37 +0x169
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Serve()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:84 +0x62
Goroutine 19 (running) created at:
github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Stop()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:88 +0xa1
github.com/jaegertracing/jaeger/cmd/agent/app.TestStartStopRace()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent_test.go:180 +0x4ca
testing.tRunner()
/usr/local/go/src/testing/testing.go:865 +0x163
Goroutine 18 (running) created at:
github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Run()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:75 +0x2bf
github.com/jaegertracing/jaeger/cmd/agent/app.TestStartStopRace()
/home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent_test.go:176 +0x442
testing.tRunner()
/usr/local/go/src/testing/testing.go:865 +0x163
==================
--- FAIL: TestStartStopRace (0.01s)
testing.go:809: race detected during execution of test
It shows a race between sync.(*WaitGroup).Wait() and sync.(*WaitGroup).Add() which is prohibited by WaitGroup documentation. The documentation says Add() must be called before Wait().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a hunch that deadlock might've been in func processBuffer due to not executing s.processing.Done(), maybe it should be defer s.processing.Done()?
I only see one exit path in processBuffer(), so s.processing.Done() is guaranteed be called when the function returns (I am ignoring the possible panic case, since that is not happening in this particular case).
I agree with you it would be more idiomatic to use defer s.processing.Done()
which would make it more resilient if in the future additional exit path is added to processBuffer(), but I do not think in this particular case it cases any problems.
@@ -76,12 +76,12 @@ func NewThriftProcessor( | |||
numProcessors: numProcessors, | |||
} | |||
metrics.Init(&res.metrics, mFactory, nil) | |||
res.processing.Add(res.numProcessors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion....
res.processing.Add(res.numProcessors)
go func() {
s.processBuffer()
s.processing.Done()
}()
Why? It consolidates Add() and Done() close together.
the processBuffer goroutines will immediately block as there is nothing in the server's data channel.
This also nicely removes the ability to accidentally spin up multiple sets of processors for the same ThriftProcessor value that aren't tracked in the waitgroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the suggestion, it makes WaitGroup operation more visible. It needs to be slightly modified to account for numProcessors:
res.processing.Add(res.numProcessors)
for i := 0; i < res.numProcessors; i++ {
go func() {
res.processBuffer()
res.processing.Done()
}()
}
I'll push an update to this PR.
Reviewers, please check again. |
Signed-off-by: Tigran Najaryan <[email protected]>
@yurishkuro @pavolloffay can you review this, please? |
thanks @tigrannajaryan @freeformz ! |
Added a Sleep to work around a data race bug in Jaeger (jaegertracing/jaeger#1625) caused by stopping immediately after starting. Without this Sleep we were observing this bug on our side: open-telemetry#43 The Sleep ensures that Jaeger Start() is fully completed before we call Jaeger Stop(). TODO: Jaeger bug is already fixed, remove this once we update Jaeger to latest version. Testing done: make test
Added a Sleep to work around a data race bug in Jaeger (jaegertracing/jaeger#1625) caused by stopping immediately after starting. Without this Sleep we were observing this bug on our side: #43 The Sleep ensures that Jaeger Start() is fully completed before we call Jaeger Stop(). TODO: Jaeger bug is already fixed, remove this once we update Jaeger to latest version. Testing done: make test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grouping Add() and Done() closer is nice, but doesn't have a functional impact.
How does moving the go-routines from Serve() to constructor solve the data race?
@yurishkuro the race was happening when Stop() was called immediately after Run(). It was happening because when Run() was returning there was no guarantee that This change guarantees that |
Which problem is this PR solving?
We had a data race with Agent.Stop() failing if called immediately
after Agent.Run() returns.
Short description of the changes
The root cause of the race was that Stop() called WaitGroup.Wait()
before Run() called WaitGroup.Add() for the first time which is
prohibited: https://golang.org/pkg/sync/#WaitGroup.Add
This change moves WaitGroup.Add() to earlier stage which guarantees
that WaitGroup.Wait() will be called after that.
Github issue: #1624
Testing done:
Added an automated test which was failing before the bug was fixed
and does not fail any more after the fix.
Also verified that
make test
passes.Signed-off-by: Tigran Najaryan [email protected]