Fix RefreshWorker dequeue race condition #17187
Conversation
Checked commit tumido@73fd5ff with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
👍 Awesome. It's a small change, but it was tough to isolate the issue.
@miq-bot add_label gaprindashvili/yes
Wow, it sounds like the drb prefetcher isn't handling this case. Since the drb prefetcher is going away SOON with containers... I'm fine with making this use sql... thoughts @carbonin @gtanzillo
@jrafanie yes, you do. However, this is not a problem in the normal case, when the job in question is simply updated once. It's just skipped for the time being and gets picked up again later.
I think this is still an open question unfortunately. That said, I don't think refresh is what causes the majority of the contention on the queue that we are trying to avoid by prefetching. @Fryguy can check me on that, but I'm okay with this. Good find @tumido! 💯
Wow, this really is an amazing find. Thanks for documenting it so thoroughly and clearly @tumido.
I'm good with this change 👍
@carbonin correct, the maximum number of jobs for refresh workers is very small (at most 1 in dequeue and 1 in ready state per manager), so pre-fetch does not bring any real benefit
YOLO MERGE. @tumido awesome detective work and thank you for clearly documenting it!!! 🏆 ❤️
WOW... I'm concerned this might put pressure on the queue from multiple workers. While pre-fetch might not have any benefit, this may have now introduced a growing prefetch-queue on the server side that never gets reaped, because the caller never takes it from the server. Can we verify that is not the case?
I'm ok with this as a band-aid, but I'm wondering if it's better to remove the lock_version from the prefetch, or perhaps from the actual query. The idea behind the lock_version being part of the prefetch was to avoid multiple workers trying to work on the same record. However, that can also be done by checking the state of the queue message before working on it. So, instead of the query including lock_version, we may be able to just include state instead.
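For illustration only, the state-based idea could look roughly like the sketch below. QueueMessage is a hypothetical stand-in for the real queue model, and this is not the actual query being proposed:

```ruby
# Sketch of the state-based alternative: prefetch without lock_version and
# re-check the message state right before working on it.
# QueueMessage is a hypothetical ActiveRecord model used only for illustration.
batch = QueueMessage.where(:queue_name => "ems_1", :state => "ready")
                    .order(:priority, :id)
                    .limit(100)
                    .pluck(:id)

batch.each do |id|
  msg = QueueMessage.find_by(:id => id)
  # If another worker already took the message or its state changed,
  # skip it instead of relying on a cached lock_version.
  next unless msg && msg.state == "ready"
  # ... deliver/process the message here ...
end
```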
@Fryguy, how do you imagine such "growing prefetch-queue" to be happening? The prefetch in the Dequeue module is local to the worker, so I don't see how this can affect other workers or the amount of jobs in the queue. Can you please explain?
Fix RefreshWorker dequeue race condition (cherry picked from commit 4d6b447)
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1560699
Gaprindashvili backport details:
A race condition is present on RefreshWorker when the worker is consolidated on related provider managers and one refresh tries to queue another refresh of all related managers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1559442
DRB dequeue

The default dequeue method on any QueueWorker is :drb. This method uses a special Dequeue module. It is an optimization which prefetches jobs (in batches) for a worker: it peeks at the MiqQueue and finds a batch of jobs matching the worker's queue_name and priority, all in one select query. Then, when a job from this batch is selected by the worker, the worker fetches the whole record from the database and proceeds to work on it. This lowers the database load, because MiqQueue is scanned and filtered only once per batch of jobs.
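To make the prefetch idea concrete, here is a minimal sketch of the pattern described above. It is not the real Dequeue module; QueueMessage, its columns, and the batch size are assumptions used only for illustration:

```ruby
# Illustrative sketch of the prefetching (:drb-style) dequeue, not ManageIQ code.
# QueueMessage is a hypothetical ActiveRecord model with :queue_name, :priority,
# :state and an optimistic-locking :lock_version column.
class PrefetchingDequeue
  BATCH_SIZE = 100

  def initialize(queue_name)
    @queue_name = queue_name
    @batch      = []
  end

  # One SELECT fills a local batch of (id, lock_version) tuples.
  def prefetch
    @batch = QueueMessage.where(:queue_name => @queue_name, :state => "ready")
                         .order(:priority, :id)
                         .limit(BATCH_SIZE)
                         .pluck(:id, :lock_version)
  end

  # Hand out the next job; entries whose lock_version changed since the
  # prefetch no longer match the lookup and are silently skipped.
  def get_message
    prefetch if @batch.empty?
    until @batch.empty?
      id, lock_version = @batch.shift
      msg = QueueMessage.find_by(:id => id, :lock_version => lock_version)
      return msg if msg
    end
    nil
  end
end
```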
SQL dequeue

On the other hand, the :sql dequeue strategy filters the MiqQueue on every request, scanning for a viable job and returning the first record available. This might be slower because the queue has to be scanned each time.
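By contrast, a :sql-style dequeue can be sketched as a single query per request (same hypothetical QueueMessage model as above, not the actual ManageIQ implementation):

```ruby
# Illustrative sketch of a per-call (:sql-style) dequeue.
def get_message_via_sql(queue_name)
  # Every call re-scans the queue, so a job that was updated (and had its
  # lock_version bumped) in the meantime is still found and returned.
  QueueMessage.where(:queue_name => queue_name, :state => "ready")
              .order(:priority, :id)
              .first
end
```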
Race condition

In RefreshWorkers we're trying to achieve a chained reaction on consolidated workers (for example here on Azure and Amazon). If a user requests a full CloudManager refresh, we want to schedule a NetworkManager (or any other manager) refresh as well. This brings a new challenge and a risk of job starvation when the :drb strategy is used:

1. RefreshWorker, when started, schedules an initial refresh for all its ExtManagementSystems. In other words, a new job is added to MiqQueue for each ems. This job belongs to the queue of the ems's RefreshWorker, so all of these jobs will be processed by this worker.
2. The :drb strategy prefetches some amount of jobs for the worker and for each job it remembers :id, :lock_version, :priority and :role. When the worker requests a new job to work on, it is selected from this batch.
3. The worker picks the CloudManager initial refresh first, because the provider addition strategy prefers cloud over network managers and the cloud manager record naturally precedes by :id in the database, so its job is queued first.
4. The running cloud refresh queues a refresh of the related managers via a MiqQueue.put_or_update() call. The call finds the already existing initial refresh jobs for these managers, so instead of putting new jobs in the queue it updates the targets of the refresh.
5. MiqQueue supports an optimistic locking mechanism. The access and update done by put_or_update increment the :lock_version counter.
6. RefreshWorker tries to pick a new job. The :drb strategy has already prefetched the list of jobs, which includes the initial refresh for the other managers as well. But these jobs have a different :lock_version now, so they are skipped and the next job is picked (see the sketch after this list).
7. If new CloudManager refreshes keep arriving while a different cloud refresh of the same ems is being processed, the jobs for the other managers (which are queued by the cloud refresh) starve, because a suitable cloud refresh job is present every time.
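The skipping behavior in steps 2-6 can be reproduced with plain ActiveRecord optimistic locking; this is a hedged, self-contained sketch using the hypothetical QueueMessage model, with a stand-in update rather than the real put_or_update:

```ruby
# Sketch of the stale-prefetch effect: an integer lock_version column enables
# ActiveRecord optimistic locking, so any update bumps the counter.

# Step 2: the worker prefetches (id, lock_version) of the network manager refresh.
id, cached_lock_version = QueueMessage.where(:state => "ready")
                                      .order(:priority, :id)
                                      .pluck(:id, :lock_version)
                                      .first

# Steps 4-5: the running cloud refresh updates that same row (in ManageIQ the
# analogous put_or_update merges the refresh targets); the update bumps lock_version.
QueueMessage.find(id).update!(:priority => 20)

# Step 6: the worker tries to claim the job with the cached tuple; the lookup
# comes back empty and the job is silently skipped.
QueueMessage.find_by(:id => id, :lock_version => cached_lock_version)
# => nil, so the prefetched entry is dropped and the worker moves on
```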
Proposed solution

The RefreshWorker is designed for heavy lifting on one provider, and the time it takes to get a new job is a low priority compared to fast refresh resolution. Using the :sql dequeue strategy eliminates the race condition, because it searches the MiqQueue each time: a fresh record is acquired on every dequeue and the possibility of stalling is lowered.

@miq-bot add_label bug, core/workers
cc @Ladas
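Purely for illustration, and reusing the two dequeue sketches from the description above, the effect of the proposed switch can be pictured as follows; RefreshWorkerSketch and its dequeue_method argument are assumptions, not the actual ManageIQ worker API:

```ruby
# Illustrative wiring of the two sketches above; not ManageIQ code.
class RefreshWorkerSketch
  def initialize(queue_name, dequeue_method = :sql)
    @queue_name     = queue_name
    @dequeue_method = dequeue_method
    @drb_dequeue    = PrefetchingDequeue.new(queue_name)
  end

  # With :sql every dequeue re-reads the queue, so a refresh job whose
  # lock_version was bumped by put_or_update is still picked up instead
  # of being skipped (and eventually starved).
  def next_job
    case @dequeue_method
    when :sql then get_message_via_sql(@queue_name)
    when :drb then @drb_dequeue.get_message
    end
  end
end
```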