Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd-runner: remove mutex on validate() and release() in global.go #7902

Merged
merged 2 commits into from
May 10, 2017

Conversation

fanminshi
Copy link
Member

election runner can deadlock in atomic release().

suppose election runner has two clients A and B.
if A is a leader and B is a follower, B obtains lock
for release() and waits for A to close(nextc) which signal
next round is ready. However, A can only close(nextc) if it
obtains lock for release(); hence deadlock.

this pr removes atomicity of validate() and release() in global.go
and gives the responsibility of locking to each runner.

FIXES #7891

@@ -69,12 +72,16 @@ func runRacerFunc(cmd *cobra.Command, args []string) {
m := concurrency.NewMutex(s, racers)
rcs[i].acquire = func() error { return m.Lock(ctx) }
rcs[i].validate = func() error {
mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this functionally different from what's in the code now?

The current code:

mu.Lock()
f()
mu.Unlock()

does the same thing as this new code:

func f() {
    mu.Lock()
    defer mu.Lock()
    ...
}

(panics not withstanding)

Copy link
Contributor

@heyitsanthony heyitsanthony May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK nevermind the mutex isn't being applied to the election. Is this the right fix, though? The follower acquires the election but the doRounds code is assuming exclusive ownership and that's causing a deadlock. Should acquire always mean exclusive ownership then and so the reuse for elections is a hack that should be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just tear our the entire doRounds acquire/release thing for elections. There's no way to sanely reason about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The follower acquires the election but the doRounds code is assuming exclusive ownership and that's causing a deadlock.

Follower can't win the election if it is a follower. Only one client can win the election. The deadlock is caused when the follower block on rcNextc when it obtained mu.Lock() before leader does. However, if leader obtains mu.Lock() before follower does, then everything is fine.

remove the lock fixes the above issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about whether the doRounds loop is the right abstraction here. I know how elections work.

@heyitsanthony
Copy link
Contributor

Tried this with -race?

==================
WARNING: DATA RACE
Write at 0x00c4201c2460 by goroutine 154:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func2()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:93 +0x392
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds.func1()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:72 +0x1f2

Previous write at 0x00c4201c2460 by goroutine 161:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func2.1()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:83 +0x165

Goroutine 154 (running) created at:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:86 +0x23e
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:142 +0xbec
  github.com/spf13/cobra.(*Command).execute()
      /Users/anthony/go/src/github.com/spf13/cobra/command.go:572 +0x801
  github.com/spf13/cobra.(*Command).ExecuteC()
      /Users/anthony/go/src/github.com/spf13/cobra/command.go:662 +0x458
  github.com/spf13/cobra.(*Command).Execute()
      /Users/anthony/go/src/github.com/spf13/cobra/command.go:618 +0x38
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.Start()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/root.go:66 +0x107
  main.main()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/main.go:21 +0x2f

Goroutine 161 (finished) created at:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func2()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:90 +0x161
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds.func1()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:72 +0x1f2
==================

@fanminshi
Copy link
Member Author

@heyitsanthony good catch on the data race. let me see what's going on.

@fanminshi
Copy link
Member Author

The race is on var [observedLeader ](https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-runner/command/election_command.go#L560 which is a local variable for the each client. It is not caused by this pr.

I'll have another pr to fix that.

@heyitsanthony
Copy link
Contributor

@fanminshi the race should be fixed here

@fanminshi
Copy link
Member Author

@heyitsanthony alright will fix the race #7903 in this pr too.

@codecov-io
Copy link

codecov-io commented May 9, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@a53a9e1). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #7902   +/-   ##
=========================================
  Coverage          ?   75.65%           
=========================================
  Files             ?      332           
  Lines             ?    26316           
  Branches          ?        0           
=========================================
  Hits              ?    19910           
  Misses            ?     4975           
  Partials          ?     1431

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a53a9e1...b44bd6d. Read the comment docs.

election runner can deadlock in atomic release().

suppose election runner has two clients A and B.
if A is a leader and B is a follower, B obtains lock
for release() and waits for A to close(nextc) which signal
next round is ready. However, A can only close(nextc) if it
obtains lock for release(); hence deadlock.

this pr removes atomicity of validate() and release() in global.go
and gives the responsibility of locking to each runner.

FIXES etcd-io#7891
@fanminshi
Copy link
Member Author

fixed #7903.

@heyitsanthony
Copy link
Contributor

==================
WARNING: DATA RACE
Read at 0x00c420078118 by goroutine 161:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func1()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:60 +0x67
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func3()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:115 +0x132
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds.func1()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:75 +0x1ff

Previous write at 0x00c420078118 by goroutine 151:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc.func4()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:133 +0x3ea
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds.func1()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:83 +0x314

Goroutine 161 (running) created at:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:86 +0x23e
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:142 +0xbec
  github.com/spf13/cobra.(*Command).execute()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:572 +0x801
  github.com/spf13/cobra.(*Command).ExecuteC()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:662 +0x458
  github.com/spf13/cobra.(*Command).Execute()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:618 +0x38
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.Start()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/root.go:66 +0x107
  main.main()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/main.go:21 +0x2f

Goroutine 151 (running) created at:
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.doRounds()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/global.go:86 +0x23e
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.runElectionFunc()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/election_command.go:142 +0xbec
  github.com/spf13/cobra.(*Command).execute()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:572 +0x801
  github.com/spf13/cobra.(*Command).ExecuteC()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:662 +0x458
  github.com/spf13/cobra.(*Command).Execute()
      /Users/anthony/hm/src/github.com/spf13/cobra/command.go:618 +0x38
  github.com/coreos/etcd/tools/functional-tester/etcd-runner/command.Start()
      /Users/anthony/hm/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/command/root.go:66 +0x107
  main.main()
      /Users/anthony/go/src/github.com/coreos/etcd/tools/functional-tester/etcd-runner/main.go:21 +0x2f
==================

@fanminshi
Copy link
Member Author

fanminshi commented May 9, 2017

I am unsure how the above race can happen.

setRcNextc() races against nextc = make(chan struct{})

The validateWaiters loop ensures that all followers have been validated in which setRcNextc() has executed.

for validateWaiters > 0 {
	select {
	case <-validatec:
		validateWaiters--
	default:
		return fmt.Errorf("waiting on followers")
	}
}

https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-runner/command/election_command.go

before leader sets nextc

if observedLeader == v {
	close(nextc)
	nextc = make(chan struct{}) // race here
} 

https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-runner/command/election_command.go#L133

edit: I observe the race locally as well. Investigating...

@fanminshi
Copy link
Member Author

fanminshi commented May 10, 2017

Fixed race on nextc

if observedLeader == v {
	close(nextc)
	nextc = make(chan struct{}) // race here
} 

leader close(nextc) before nextc = make(chan struct{} can cause follower to start next around and execute setRcNextc() at same time the leader executing nextc = make(chan struct{}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

etcd-runner: fails on etcd-tester (election, no progress after 1 minute)
3 participants