Fix RefreshWorker dequeue race condition #17187

Merged: 1 commit merged into ManageIQ:master from tumido/fix_refresh_race_cond on Mar 23, 2018

Conversation

@tumido (Member) commented Mar 22, 2018

A race condition is present in RefreshWorker when the worker is consolidated across related provider managers and one refresh tries to queue another refresh of all related managers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1559442

DRB dequeue

The default dequeue method for any QueueWorker is :drb. This method uses a special Dequeue module as an optimization: it prefetches jobs for a worker in batches by peeking at the MiqQueue and finding a batch of jobs matching the worker's queue_name and priority. That is done as a single select query. Then, when a job from this batch is selected by the worker, the worker fetches the whole record from the database and proceeds to work on it.

This means the database load is lowered, because MiqQueue is scanned and filtered only once for a batch of jobs.
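As a rough illustration of the prefetch idea, here is a minimal, self-contained sketch in plain Ruby; the Job struct, QUEUE array and method names are made up for this example and are not the actual Dequeue module:

```ruby
# Toy model of the :drb-style dequeue: scan the queue once, keep only a light
# batch of (:id, :priority, :lock_version) tuples, and re-read the full record
# later when the worker actually picks a job.
Job = Struct.new(:id, :queue_name, :priority, :lock_version, :payload)

QUEUE = [
  Job.new(1, "ems_1", 100, 0, "EmsRefresh for CloudManager"),
  Job.new(2, "ems_1", 100, 0, "EmsRefresh for NetworkManager")
]

def peek_batch(queue_name, limit)
  # one scan of the queue for the whole batch
  QUEUE.select { |j| j.queue_name == queue_name }
       .sort_by { |j| [j.priority, j.id] }
       .first(limit)
       .map { |j| { :id => j.id, :priority => j.priority, :lock_version => j.lock_version } }
end

def fetch(id)
  # the full record is re-read only when the worker actually picks the job
  QUEUE.find { |j| j.id == id }
end

batch = peek_batch("ems_1", 10)       # one "select" for the whole batch
puts fetch(batch.first[:id]).payload  # => EmsRefresh for CloudManager
```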

SQL dequeue

On the other hand, the :sql dequeue strategy scans and filters the MiqQueue for a viable job and returns the first available record. This can be slower because the queue has to be scanned on every dequeue.
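The :sql-style equivalent in the same toy model (again, illustrative names only, not the real implementation) scans the current queue on every call:

```ruby
# Toy model of the :sql-style dequeue: every call re-scans the queue and claims
# the best viable record as it exists right now.
Job = Struct.new(:id, :queue_name, :priority, :state, :payload)

QUEUE = [
  Job.new(1, "ems_1", 100, "ready", "EmsRefresh for CloudManager"),
  Job.new(2, "ems_1", 100, "ready", "EmsRefresh for NetworkManager")
]

def sql_dequeue(queue_name)
  job = QUEUE.select { |j| j.queue_name == queue_name && j.state == "ready" }
             .min_by { |j| [j.priority, j.id] }
  job.state = "dequeue" if job      # claim it so the next scan skips it
  job
end

puts sql_dequeue("ems_1").payload   # => EmsRefresh for CloudManager
puts sql_dequeue("ems_1").payload   # => EmsRefresh for NetworkManager
```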

Race condition

In RefreshWorkers we're trying to achieve a chain reaction on consolidated workers (for example here on Azure and Amazon). If a user requests a full CloudManager refresh, we want to schedule a NetworkManager (or any other manager) refresh as well. This brings a new challenge and a risk of job starvation when the :drb strategy is used:

  1. Each RefreshWorker, when started, schedules an initial refresh for all its ExtManagementSystems. In other words, a new job is added to MiqQueue for each EMS. This job belongs to the queue of the EMS's RefreshWorker, so all of these jobs will be processed by that worker.
  2. The :drb strategy prefetches a number of jobs for the worker and, for each job, remembers its :id, :lock_version, :priority and :role. When the worker requests a new job to work on, it is selected from this batch.
  3. The first job picked is the CloudManager initial refresh, because the provider addition flow prefers cloud over network managers, so the cloud manager's record naturally comes first by :id in the database and its job is queued first.
  4. When this job is done, it automatically tries to queue jobs for the other managers (as described above). This addition to the queue is done via a MiqQueue.put_or_update() call. The call finds the already existing initial refresh jobs for these managers, so instead of putting new jobs in the queue, it updates the targets of those refreshes.
  5. MiqQueue uses an optimistic locking mechanism, so the update done by put_or_update increments the :lock_version counter.
  6. The RefreshWorker tries to pick a new job. The :drb strategy has already prefetched the list of jobs, which includes the initial refreshes for the other managers as well. But these jobs have a different :lock_version now, so they are skipped and the next job is picked (see the toy reproduction after this list).
  7. If the user keeps scheduling new CloudManager refreshes while a cloud refresh of the same EMS is being processed, the jobs for the other managers (which are queued by the cloud refresh) are starved, because a suitable cloud refresh job is present every time.
  8. The other managers can be refreshed if and only if the queues processed by the worker are free of any other cloud refreshes.
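The whole sequence can be reproduced with a small, self-contained toy model (made-up structures, not the actual ManageIQ code):

```ruby
# Steps 2-7 in miniature: the prefetched batch pins lock_version 0 for both
# jobs; finishing the cloud refresh bumps the network job's lock_version via
# put_or_update; the stale batch entry is then skipped.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [
  Job.new(1, "CloudManager refresh",   0, "ready"),
  Job.new(2, "NetworkManager refresh", 0, "ready")
]

# Step 2: the prefetch remembers only id + lock_version (+ priority/role).
batch = queue.map { |j| { :id => j.id, :lock_version => j.lock_version } }

# Steps 4-5: the finished cloud refresh put_or_updates the network job,
# which increments its lock_version.
queue[1].lock_version += 1

# Steps 6-7: the worker walks the prefetched batch; a stale lock_version
# means the job is skipped, so only the cloud refresh gets processed.
batch.each do |entry|
  job = queue.find { |j| j.id == entry[:id] }
  if job.lock_version == entry[:lock_version]
    puts "working on #{job.description}"
  else
    puts "skipping #{job.description} (lock_version changed)"
  end
end
# => working on CloudManager refresh
# => skipping NetworkManager refresh (lock_version changed)
```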

Proposed solution

The RefreshWorker is designed for heavy lifting on one provider, and the time needed to pick up a new job is a low priority compared to fast refresh resolution. Using the :sql dequeue strategy eliminates the race condition, because it searches MiqQueue each time: a fresh record is acquired on every dequeue and the possibility of stalling is lowered.
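Tying this back to the scenario above with the same toy model (still illustrative, not the merged diff): with a fresh scan per dequeue, the bumped :lock_version no longer matters and the network refresh is picked up as soon as it is the best candidate.

```ruby
# Same toy model: the network refresh job already had its lock_version bumped
# by put_or_update, but an :sql-style dequeue never compares against a stale
# prefetched value -- it just claims the first ready job it finds now.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [
  Job.new(1, "CloudManager refresh",   0, "dequeue"),  # currently being worked on
  Job.new(2, "NetworkManager refresh", 1, "ready")     # lock_version already bumped
]

job = queue.select { |j| j.state == "ready" }.min_by(&:id)
job.state = "dequeue"
puts job.description   # => NetworkManager refresh -- not starved
```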

@miq-bot add_label bug, core/workers
cc @Ladas

@miq-bot (Member) commented Mar 22, 2018

Checked commit tumido@73fd5ff with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
0 files checked, 0 offenses detected
Everything looks fine. 👍

@Ladas (Contributor) left a comment

👍 Awesome. It's a small change, but it was tough to isolate the issue.

@Ladas (Contributor) commented Mar 22, 2018

cc @agrare @Fryguy

@Ladas (Contributor) commented Mar 22, 2018

@miq-bot assign @agrare

@tumido (Member, Author) commented Mar 22, 2018

@miq-bot add_label gaprindashvili/yes

@agrare (Member) commented Mar 22, 2018

I'm not familiar enough with why we dequeue via drb vs sql, assigning to @jrafanie
Either way really great find @tumido !

@agrare assigned jrafanie and unassigned agrare on Mar 22, 2018
@jrafanie (Member) commented:

Wow, it sounds like the drb prefetcher isn't handling put_or_update changes to already prefetched rows properly in general. Am I understanding that right? @tumido

Since the drb prefetcher is going away SOON with containers... I'm fine with making this use sql... thoughts @carbonin @gtanzillo

@tumido (Member, Author) commented Mar 22, 2018

@jrafanie yes, you are. The :drb strategy gets the job's lock_version here. But by the time the job is selected, the record may have already been updated. The verification that the lock matches is done here. The time between these two lock_version checks can span multiple jobs and can even be hours (in the case of a legacy refresh, etc.).

However, this is not a problem in the normal scenario, when the job in question is updated just once. It is simply skipped for the time being, and when :drb prefetches a new batch of jobs, it can be processed normally. The problem happens when such a job is constantly being updated: the user must be really lucky for the update to land in a very specific window so that the job can be processed... otherwise it hangs there forever.

@carbonin (Member) commented:

Since the drb prefetcher is going away SOON with containers

I think this is still an open question unfortunately.
We clearly won't be able to use drb, but we may still need some prefetching if we don't move the majority of the queue churn to artemis.

That said, I don't think refresh is what causes the majority of the contention on the queue that we are trying to avoid by prefetching. @Fryguy can check me on that, but I'm okay with this.

Good find @tumido ! 💯

@gtanzillo (Member) left a comment

Wow, this really is an amazing find. Thanks for documenting it so thoroughly and clearly, @tumido.

I'm good with this change 👍

@Ladas (Contributor) commented Mar 23, 2018

@carbonin correct, the maximum number of jobs for refresh workers is very small (at most 1 in dequeue and 1 in the ready state per manager), so prefetch does not provide any real benefit

@jrafanie (Member) commented:

YOLO MERGE. @tumido awesome detective work, and thank you for clearly documenting it!!! 🏆 ❤️ :shipit:

@jrafanie merged commit 4d6b447 into ManageIQ:master on Mar 23, 2018
@jrafanie added this to the Sprint 82 Ending Mar 26, 2018 milestone on Mar 23, 2018
@Fryguy (Member) commented Mar 23, 2018

WOW... I'm concerned this might put pressure on the queue from multiple workers. While prefetch might not have any benefit, this may have now introduced a growing prefetch queue on the server side that never gets reaped, because the caller never takes it from the server. Can we verify that is not the case?

@Fryguy (Member) commented Mar 23, 2018

I'm ok with this as a band-aid, but I'm wondering if it's better to remove the lock_version from the prefetch, or perhaps from the actual query. The idea behind the lock_version being part of the prefetch was to avoid multiple workers trying to work on the same record. However, that can also be done by checking the state of the queue message before working on it. So, instead of the query including lock_version, we may be able to just include state instead.
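A toy sketch of that alternative (illustrative only, not something implemented in this PR): prefetch the :state instead of the :lock_version and skip a job only when it is no longer ready.

```ruby
# If the prefetched batch carries :state rather than :lock_version, a
# put_or_update that merely edits the job's targets (and bumps lock_version)
# no longer causes a skip; only a job already claimed by another worker does.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [Job.new(2, "NetworkManager refresh", 1, "ready")]  # lock_version bumped

batch = [{ :id => 2, :state => "ready" }]                   # prefetched earlier

batch.each do |entry|
  job = queue.find { |j| j.id == entry[:id] }
  if job.state == "ready"          # content updates don't change the state
    job.state = "dequeue"
    puts "working on #{job.description}"
  else
    puts "skipping #{job.description} (already taken by another worker)"
  end
end
# => working on NetworkManager refresh
```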

@tumido (Member, Author) commented Mar 23, 2018

@Fryguy, how do you imagine such a "growing prefetch-queue" would happen? The prefetch in :drb doesn't remove jobs from the queue, so the number of jobs there stays the same. The only load increase is on each RefreshWorker, which would do a MiqQueue.get each time, instead of a MiqQueue.peek once in a while and then a MiqQueue.find for each job.

I don't see how this can affect other workers or the amount of jobs in the queue. Can you please explain?
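For a rough sense of the per-worker load difference being discussed, some back-of-the-envelope arithmetic (the batch size is an assumption for illustration):

```ruby
# Per n jobs handled by one RefreshWorker:
#   :drb -> one heavy peek (queue scan) per prefetched batch plus one find-by-id per job
#   :sql -> one queue scan (get) per job
n          = 100                      # jobs processed
batch_size = 20                       # assumed prefetch batch size, illustration only
drb_scans  = (n / batch_size.to_f).ceil
drb_finds  = n
sql_scans  = n
puts "drb: #{drb_scans} scans + #{drb_finds} finds vs sql: #{sql_scans} scans"
```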

simaishi pushed a commit that referenced this pull request Mar 26, 2018
Fix RefreshWorker dequeue race condition
(cherry picked from commit 4d6b447)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1560699
@simaishi (Contributor) commented:

Gaprindashvili backport details:

$ git log -1
commit 94e2618059218ef6ffb76a11b7fe07ebd8411ef2
Author: Joe Rafaniello <[email protected]>
Date:   Fri Mar 23 11:57:25 2018 -0400

    Merge pull request #17187 from tumido/fix_refresh_race_cond
    
    Fix RefreshWorker dequeue race condition
    (cherry picked from commit 4d6b447f3520fe0ffc5b9cfc6c874c006acaec72)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1560699

@tumido deleted the fix_refresh_race_cond branch on June 26, 2018