-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent the router from deadlocking itself when calling Commit() #13717
Prevent the router from deadlocking itself when calling Commit() #13717
Conversation
@openshift/networking PTAL |
[test] |
pkg/router/template/router.go
Outdated
|
||
if r.stateChanged { | ||
if needCommit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I had time to review more I'd suggest simply refactoring rate limited commit function - it's not really well factored for the end goal. But I don't have time for that.
pkg/router/template/router.go
Outdated
} | ||
needCommit := false | ||
|
||
func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why try so hard for defer? I would simply reorganize this with an inline lock. Defer makes it harder to reason about in this case.
@smarterclayton do you prefer it this way? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move the 'stateChanged = false' statement into the commitAndReload function, under the r.lock that actually clears state changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@rajatchopra I didn't like the gap between the Commit() releasing the lock and the commitAndReload() picking up the work in another thread to do the work... but I guess that is safe because anything that called Commit() again before commitAndReload() will just change the state that the subsequent commitAndReload() will process. And correspondingly, since commitAndReload() is holding the router lock while it works, no other call to Commit() can cause something to be queued that would then be ignored. So, okay, I think I'm convinced that's safe. |
pkg/router/template/router.go
Outdated
r.lock.Unlock() | ||
|
||
if needsCommit { | ||
go r.rateLimitedCommitFunction.Invoke(r.rateLimitedCommitFunction) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to call it as a goroutine because this will block under two circumstances:
- If another thread has the lock pushing (called from another Invoke) which should be fast
- If another thread has the lock doing the processing of the function call... which could take a while
If the reload is being processed, there is no reason to block the event processing while we wait for that to occur, so we may as well fire off multiple Invokes() since there is no state associated with them and it doesn't matter if a bunch back up. They will all process quickly and replace one another as soon as the lock is released by the code doing the processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a downside to Invoke getting blocked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope... stupid idea. It blocks earlier because the long-running job holds the router lock. Pulling the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Squash the commits maybe
cfee269
to
7dc887d
Compare
Squashed. [merge] |
Backport of openshift#13717 The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It will coalesce changes to the same call so that if three calls to Invoke() the rate limited function happen before the next time it is allowed to process, then only one will occur. Our problem was that we were doing: Thread 1 (the rate-limiter background process): - Wake up - Ratelimit sees there is work to be done and calls fifo.go's Pop() function - Fifo.go acquires a fifo lock and call the processing function - Router.go's commitAndReload() function acquires a lock on the router object Thread 2 (triggered by the event handler that commit's changes to the router): - Get the event and process it - Since there are changes to be made, call router.Commit() - Commit() grabs a lock on the router object - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work - In order to queue the work... it acquires a lock on the fifo So thread 1 locks: fifo then router; thread 2 locks: router then fifo. If you get unlucky, those threads deadlock and you never process another event. The fix is to release the lock on the router object in our Commit() function before we call Invoke on the rate limited function. The lock is not actually protecting anything at that point since the rate limited function does its own locking, and is run in a separate thread anyway. Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
Uh oh:
controller.go:308] Event: MODIFIED
I0413 00:06:51.524669 27327 status.go:245] admit: route already admitted
I0413 00:06:51.524721 27327 controller.go:305] Processing Route:
namespace-8wmj4/route-070n0 -> invalid-service
I0413 00:06:51.524729 27327 controller.go:306] Alias:
www-0-4.example.com
I0413 00:06:51.524732 27327 controller.go:307] Path:
I0413 00:06:51.524735 27327 controller.go:308] Event: MODIFIED
I0413 00:06:51.526902 27327 unique_host.go:138] Route
namespace-8wmj4/route-070n0 cannot take www-0-4.example.com from
namespace-8wmj4/route-p7qws
I0413 00:06:51.526919 27327 status.go:288] reject: no changes to
route needed: namespace-8wmj4/route-070n0
E0413 00:06:51.526927 27327 controller.go:311] route route-p7qws
already exposes www-0-4.example.com and is older--- FAIL:
TestRouterReloadSuppressionOnSync
(5.10s) router_without_haproxy_test.go:195: One or more routers
reloaded more than once
etcd.go:99: dumping etcd to
"/tmp/openshift/test-integration/etcd-dump-runtime.call32.json"
FAIL
|
Backport of openshift#13717 The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It will coalesce changes to the same call so that if three calls to Invoke() the rate limited function happen before the next time it is allowed to process, then only one will occur. Our problem was that we were doing: Thread 1 (the rate-limiter background process): - Wake up - Ratelimit sees there is work to be done and calls fifo.go's Pop() function - Fifo.go acquires a fifo lock and call the processing function - Router.go's commitAndReload() function acquires a lock on the router object Thread 2 (triggered by the event handler that commit's changes to the router): - Get the event and process it - Since there are changes to be made, call router.Commit() - Commit() grabs a lock on the router object - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work - In order to queue the work... it acquires a lock on the fifo So thread 1 locks: fifo then router; thread 2 locks: router then fifo. If you get unlucky, those threads deadlock and you never process another event. The fix is to release the lock on the router object in our Commit() function before we call Invoke on the rate limited function. The lock is not actually protecting anything at that point since the rate limited function does its own locking, and is run in a separate thread anyway. Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It will coalesce changes to the same call so that if three calls to Invoke() the rate limited function happen before the next time it is allowed to process, then only one will occur. Our problem was that we were doing: Thread 1 (the rate-limiter background process): - Wake up - Ratelimit sees there is work to be done and calls fifo.go's Pop() function - Fifo.go acquires a fifo lock and call the processing function - Router.go's commitAndReload() function acquires a lock on the router object Thread 2 (triggered by the event handler that commit's changes to the router): - Get the event and process it - Since there are changes to be made, call router.Commit() - Commit() grabs a lock on the router object - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work - In order to queue the work... it acquires a lock on the fifo So thread 1 locks: fifo then router; thread 2 locks: router then fifo. If you get unlucky, those threads deadlock and you never process another event. The fix is to release the lock on the router object in our Commit() function before we call Invoke on the rate limited function. The lock is not actually protecting anything at that point since the rate limited function does its own locking, and is run in a separate thread anyway. Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
7dc887d
to
fad6679
Compare
@smarterclayton yeah, fixed it. It was a problem with the testing needing too much information about the locking. All the more reason to replace the rate limiter soon. |
[test] [merge] Last flaked on #8427 logs at https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/344/ |
Evaluated for origin test up to fad6679 |
Evaluated for origin merge up to fad6679 |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/347/) (Base Commit: 7044e57) (Image: devenv-rhel7_6143) |
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/747/) (Base Commit: 7044e57) |
The router reload function (Commit()) was changed so that it could be
rate limited. The rate limiter tracks changes in a kcache.Queue
object. It will coalesce changes to the same call so that if three
calls to Invoke() the rate limited function happen before the next
time it is allowed to process, then only one will occur.
Our problem was that we were doing:
Thread 1 (the rate-limiter background process):
Thread 2 (triggered by the event handler that commit's changes to the router):
In summary: thread 1 locks fifo then router; thread 2 locks router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function. The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.
Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)