
Prevent the router from deadlocking itself when calling Commit() #13717

Merged

Conversation

@knobunc knobunc commented Apr 11, 2017

The router reload function (Commit()) was changed so that it could be
rate limited. The rate limiter tracks pending work in a kcache.Queue
object and coalesces repeated requests for the same call: if Invoke()
is called on the rate-limited function three times before it is next
allowed to run, only one invocation actually occurs.
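
Conceptually, the coalescing behaves like this generic sketch (illustrative only, not the actual kcache.Queue-backed rate limiter):

```go
package main

import "time"

// coalescer collapses many Invoke() calls that arrive before the next tick
// into a single execution of fn (an illustrative stand-in for the rate limiter).
type coalescer struct {
	pending chan struct{}
}

func newCoalescer(interval time.Duration, fn func()) *coalescer {
	c := &coalescer{pending: make(chan struct{}, 1)}
	go func() {
		for range time.Tick(interval) {
			select {
			case <-c.pending: // at least one Invoke() happened since the last run
				fn()
			default: // nothing queued; skip this tick
			}
		}
	}()
	return c
}

// Invoke marks work as pending; repeated calls before the next tick coalesce.
func (c *coalescer) Invoke() {
	select {
	case c.pending <- struct{}{}:
	default: // a run is already pending
	}
}

func main() {
	c := newCoalescer(time.Second, func() { println("reload") })
	c.Invoke()
	c.Invoke()
	c.Invoke() // three calls, one "reload" after the next tick
	time.Sleep(2 * time.Second)
}
```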

Our problem was that we were doing:
Thread 1 (the rate-limiter background process):

  • Wake up
  • The rate limiter sees there is work to be done and calls fifo.go's Pop() function
  • fifo.go acquires the fifo lock and calls the processing function
  • router.go's commitAndReload() function acquires a lock on the router object

Thread 2 (triggered by the event handler that commits changes to the router):

  • Get the event and process it
  • Since there are changes to be made, call router.Commit()
  • Commit() grabs a lock on the router object
  • Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
  • In order to queue the work... it acquires a lock on the fifo

In summary: thread 1 locks fifo then router; thread 2 locks router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
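
To make the inverted lock ordering concrete, here is a minimal self-contained sketch (the lock and function names are illustrative, not the actual fifo.go/router.go symbols):

```go
package main

import "sync"

var routerLock, fifoLock sync.Mutex

// Thread 1: the rate-limiter background process.
// Pop() takes the fifo lock first, then commitAndReload() takes the router lock.
func popAndProcess() {
	fifoLock.Lock()
	defer fifoLock.Unlock()
	routerLock.Lock()
	defer routerLock.Unlock()
	// ... write config and reload ...
}

// Thread 2: the event handler path.
// Commit() takes the router lock first, then Invoke() takes the fifo lock to queue work.
func commit() {
	routerLock.Lock()
	defer routerLock.Unlock()
	fifoLock.Lock()
	defer fifoLock.Unlock()
	// ... queue the rate-limited commit ...
}

func main() {
	go popAndProcess()
	go commit()
	select {} // with unlucky interleaving the two goroutines deadlock on each other's lock
}
```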

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function. The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.
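
In code, the reshuffled Commit() is roughly the following. The field names match the snippets discussed in the review below, but the type definitions here are illustrative stand-ins, not the merged implementation:

```go
package main

import "sync"

// rateLimitedFunction stands in for the kcache.Queue-backed rate limiter wrapper.
type rateLimitedFunction struct{}

// Invoke queues a coalesced call; internally it takes the fifo lock.
func (f *rateLimitedFunction) Invoke(item interface{}) { /* ... */ }

type router struct {
	lock                      sync.Mutex
	stateChanged              bool
	rateLimitedCommitFunction *rateLimitedFunction
}

// Commit reads (and consumes) the dirty flag under the router lock, then
// releases the lock before queuing the reload. The fifo lock is therefore
// never acquired while the router lock is held, so the router->fifo ordering
// can no longer cross with the rate limiter's fifo->router ordering.
func (r *router) Commit() {
	r.lock.Lock()
	needsCommit := r.stateChanged
	r.stateChanged = false
	r.lock.Unlock()

	if needsCommit {
		r.rateLimitedCommitFunction.Invoke(r.rateLimitedCommitFunction)
	}
}

func main() {
	r := &router{rateLimitedCommitFunction: &rateLimitedFunction{}}
	r.stateChanged = true
	r.Commit() // queues exactly one reload without holding the router lock
}
```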

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

knobunc commented Apr 11, 2017

@openshift/networking PTAL

knobunc commented Apr 11, 2017

[test]


if r.stateChanged {
if needCommit {

If I had time to review more I'd suggest simply refactoring rate limited commit function - it's not really well factored for the end goal. But I don't have time for that.

}
needCommit := false

func() {

Why try so hard for defer? I would simply reorganize this with an inline lock. Defer makes it harder to reason about in this case.

knobunc commented Apr 11, 2017

@smarterclayton do you prefer it this way?

@rajatchopra rajatchopra left a comment


Maybe we should move the 'stateChanged = false' statement into the commitAndReload function, under the r.lock that actually clears state changes.
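
Roughly something like this (an illustrative sketch continuing the router type from the description above, not an actual diff):

```go
// commitAndReload clears the dirty flag itself, under the router lock it
// already holds for the reload, so any Commit() that races into the gap
// simply marks state that the next reload will pick up.
func (r *router) commitAndReload() error {
	r.lock.Lock()
	defer r.lock.Unlock()

	r.stateChanged = false
	// ... write the config and reload ...
	return nil
}
```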

@pecameron pecameron left a comment


LGTM

knobunc commented Apr 12, 2017

@rajatchopra I didn't like the gap between Commit() releasing the lock and commitAndReload() picking up the work in another thread... but I guess that is safe, because anything that calls Commit() again before commitAndReload() runs will just change the state that the subsequent commitAndReload() processes. And correspondingly, since commitAndReload() holds the router lock while it works, no other call to Commit() can cause something to be queued that would then be ignored. So, okay, I think I'm convinced that's safe.

r.lock.Unlock()

if needsCommit {
go r.rateLimitedCommitFunction.Invoke(r.rateLimitedCommitFunction)

I changed this to call it as a goroutine because this will block under two circumstances:

  1. If another thread has the lock while pushing (called from another Invoke), which should be fast
  2. If another thread has the lock while processing the function call... which could take a while

If the reload is being processed, there is no reason to block the event processing while we wait for that to occur, so we may as well fire off multiple Invokes() since there is no state associated with them and it doesn't matter if a bunch back up. They will all process quickly and replace one another as soon as the lock is released by the code doing the processing.


I don't see a downside to Invoke getting blocked


Nope... stupid idea. It blocks earlier because the long-running job holds the router lock. Pulling the change.

@rajatchopra rajatchopra left a comment


LGTM

Squash the commits maybe

@knobunc force-pushed the bug/bz1440977-router-deadlock-fix branch from cfee269 to 7dc887d on April 12, 2017 19:20
knobunc commented Apr 12, 2017

Squashed. [merge]

knobunc added a commit to knobunc/origin that referenced this pull request Apr 12, 2017
Backport of openshift#13717

smarterclayton commented Apr 13, 2017 via email

knobunc added a commit to knobunc/origin that referenced this pull request Apr 13, 2017
Backport of openshift#13717

@knobunc force-pushed the bug/bz1440977-router-deadlock-fix branch from 7dc887d to fad6679 on April 13, 2017 18:29
knobunc commented Apr 13, 2017

@smarterclayton yeah, fixed it. It was a problem with the testing needing too much information about the locking. All the more reason to replace the rate limiter soon.

knobunc commented Apr 14, 2017

[test] [merge] Last flaked on #8427 logs at https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/344/

@openshift-bot

Evaluated for origin test up to fad6679

@openshift-bot

Evaluated for origin merge up to fad6679

openshift-bot commented Apr 14, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/347/) (Base Commit: 7044e57) (Image: devenv-rhel7_6143)

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/747/) (Base Commit: 7044e57)

@openshift-bot openshift-bot merged commit bf9854f into openshift:master Apr 14, 2017