-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Back-port: Prevent the router from deadlocking itself when calling Commit() #13744
Back-port: Prevent the router from deadlocking itself when calling Commit() #13744
Conversation
[test] |
LGTM [merge] |
continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 3 |
@stevekuznetsov is this also flaking in master?
On Apr 13, 2017, at 5:37 AM, OpenShift Bot <[email protected]> wrote:
continuous-integration/openshift-jenkins/merge FAILURE (
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/333/)
(Base Commit: 0b3bc05
<0b3bc05>
)
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#13744 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_px2S0kgAissUhwlIjiyEIbOJrCK7ks5rvezTgaJpZM4M78aJ>
.
|
I wonder if this actually regresses a previous fix for duplicate resync (possibly a bug in rate limited function). |
You mean the |
Backport of openshift#13717 The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It will coalesce changes to the same call so that if three calls to Invoke() the rate limited function happen before the next time it is allowed to process, then only one will occur. Our problem was that we were doing: Thread 1 (the rate-limiter background process): - Wake up - Ratelimit sees there is work to be done and calls fifo.go's Pop() function - Fifo.go acquires a fifo lock and call the processing function - Router.go's commitAndReload() function acquires a lock on the router object Thread 2 (triggered by the event handler that commit's changes to the router): - Get the event and process it - Since there are changes to be made, call router.Commit() - Commit() grabs a lock on the router object - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work - In order to queue the work... it acquires a lock on the fifo So thread 1 locks: fifo then router; thread 2 locks: router then fifo. If you get unlucky, those threads deadlock and you never process another event. The fix is to release the lock on the router object in our Commit() function before we call Invoke on the rate limited function. The lock is not actually protecting anything at that point since the rate limited function does its own locking, and is run in a separate thread anyway. Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
9f0e757
to
92ede84
Compare
Evaluated for origin merge up to 92ede84 |
Evaluated for origin test up to 92ede84 |
The problem was that I needed to update the background commit function in the test. I've done that now. I'm pretty sure the double reload problem is not with the rate-limiter. I put the rate limiter code through its paces to make sure it worked correctly: https://gist.github.com/knobunc/c03ec0c70ec23f79129b8d7f1584aaa1 |
continuous-integration/openshift-jenkins/test UNSTABLE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/738/) (Base Commit: 0b3bc05) |
The UNSTABLE is due to |
Backported in #13764 |
Backport of #13717
The router reload function (Commit()) was changed so that it could be
rate limited. The rate limiter tracks changes in a kcache.Queue
object. It will coalesce changes to the same call so that if three
calls to Invoke() the rate limited function happen before the next
time it is allowed to process, then only one will occur.
Our problem was that we were doing:
Thread 1 (the rate-limiter background process):
Thread 2 (triggered by the event handler that commit's changes to the router):
So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function. The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.
Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)