Fix RefreshWorker dequeue race condition #17187

Merged: 1 commit merged into ManageIQ:master from tumido/fix_refresh_race_cond on Mar 23, 2018

Conversation

@tumido (Member) commented Mar 22, 2018

A race condition is present in RefreshWorker when the worker is consolidated across related provider managers and one refresh tries to queue another refresh of all related managers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1559442

DRB dequeue

The default dequeue method for any QueueWorker is :drb. This method uses a special Dequeue module as an optimization: it prefetches jobs for a worker in batches by peeking at the MiqQueue and finding a batch of jobs matching the worker's queue_name and priority. That is done as a single select query. Then, when a job from this batch is selected by the worker, the worker fetches the whole record from the database and proceeds to work on it.

This means the database load is lowered, because MiqQueue is scanned and filtered only once for a batch of jobs.
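As a rough illustration of the prefetch idea, here is a minimal, self-contained sketch in plain Ruby; the Job struct, QUEUE array and method names are made up for this example and are not the actual Dequeue module:

```ruby
# Toy model of the :drb-style dequeue: scan the queue once, keep only a light
# batch of (:id, :priority, :lock_version) tuples, and re-read the full record
# later when the worker actually picks a job.
Job = Struct.new(:id, :queue_name, :priority, :lock_version, :payload)

QUEUE = [
  Job.new(1, "ems_1", 100, 0, "EmsRefresh for CloudManager"),
  Job.new(2, "ems_1", 100, 0, "EmsRefresh for NetworkManager")
]

def peek_batch(queue_name, limit)
  # one scan of the queue for the whole batch
  QUEUE.select { |j| j.queue_name == queue_name }
       .sort_by { |j| [j.priority, j.id] }
       .first(limit)
       .map { |j| { :id => j.id, :priority => j.priority, :lock_version => j.lock_version } }
end

def fetch(id)
  # the full record is re-read only when the worker actually picks the job
  QUEUE.find { |j| j.id == id }
end

batch = peek_batch("ems_1", 10)       # one "select" for the whole batch
puts fetch(batch.first[:id]).payload  # => EmsRefresh for CloudManager
```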

SQL dequeue

On the other hand, the :sql dequeue strategy scans and filters the MiqQueue for a viable job and returns the first available record. This can be slower because the queue has to be scanned on every dequeue.
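The :sql-style equivalent in the same toy model (again, illustrative names only, not the real implementation) scans the current queue on every call:

```ruby
# Toy model of the :sql-style dequeue: every call re-scans the queue and claims
# the best viable record as it exists right now.
Job = Struct.new(:id, :queue_name, :priority, :state, :payload)

QUEUE = [
  Job.new(1, "ems_1", 100, "ready", "EmsRefresh for CloudManager"),
  Job.new(2, "ems_1", 100, "ready", "EmsRefresh for NetworkManager")
]

def sql_dequeue(queue_name)
  job = QUEUE.select { |j| j.queue_name == queue_name && j.state == "ready" }
             .min_by { |j| [j.priority, j.id] }
  job.state = "dequeue" if job      # claim it so the next scan skips it
  job
end

puts sql_dequeue("ems_1").payload   # => EmsRefresh for CloudManager
puts sql_dequeue("ems_1").payload   # => EmsRefresh for NetworkManager
```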

Race condition

In RefreshWorkers we're trying to achieve a chain reaction on consolidated workers (for example here on Azure and Amazon). If a user requests a full CloudManager refresh, we want to schedule a NetworkManager (or any other manager) refresh as well. This brings a new challenge and a risk of job starvation when the :drb strategy is used:

  1. Each RefreshWorker, when started, schedules an initial refresh for all its ExtManagementSystems. In other words, a new job is added to MiqQueue for each EMS. This job belongs to the queue of the EMS's RefreshWorker, so all of these jobs will be processed by that worker.
  2. The :drb strategy prefetches a number of jobs for the worker and, for each job, remembers its :id, :lock_version, :priority and :role. When the worker requests a new job to work on, it is selected from this batch.
  3. The first job picked is the CloudManager initial refresh, because the provider addition flow prefers cloud over network managers, so the cloud manager's record naturally comes first by :id in the database and its job is queued first.
  4. When this job is done, it automatically tries to queue jobs for the other managers (as described above). This addition to the queue is done via a MiqQueue.put_or_update() call. The call finds the already existing initial refresh jobs for these managers, so instead of putting new jobs in the queue, it updates the targets of those refreshes.
  5. MiqQueue uses an optimistic locking mechanism, so the update done by put_or_update increments the :lock_version counter.
  6. The RefreshWorker tries to pick a new job. The :drb strategy has already prefetched the list of jobs, which includes the initial refreshes for the other managers as well. But these jobs have a different :lock_version now, so they are skipped and the next job is picked (see the toy reproduction after this list).
  7. If the user keeps scheduling new CloudManager refreshes while a cloud refresh of the same EMS is being processed, the jobs for the other managers (which are queued by the cloud refresh) are starved, because a suitable cloud refresh job is present every time.
  8. The other managers can be refreshed if and only if the queues processed by the worker are free of any other cloud refreshes.
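The whole sequence can be reproduced with a small, self-contained toy model (made-up structures, not the actual ManageIQ code):

```ruby
# Steps 2-7 in miniature: the prefetched batch pins lock_version 0 for both
# jobs; finishing the cloud refresh bumps the network job's lock_version via
# put_or_update; the stale batch entry is then skipped.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [
  Job.new(1, "CloudManager refresh",   0, "ready"),
  Job.new(2, "NetworkManager refresh", 0, "ready")
]

# Step 2: the prefetch remembers only id + lock_version (+ priority/role).
batch = queue.map { |j| { :id => j.id, :lock_version => j.lock_version } }

# Steps 4-5: the finished cloud refresh put_or_updates the network job,
# which increments its lock_version.
queue[1].lock_version += 1

# Steps 6-7: the worker walks the prefetched batch; a stale lock_version
# means the job is skipped, so only the cloud refresh gets processed.
batch.each do |entry|
  job = queue.find { |j| j.id == entry[:id] }
  if job.lock_version == entry[:lock_version]
    puts "working on #{job.description}"
  else
    puts "skipping #{job.description} (lock_version changed)"
  end
end
# => working on CloudManager refresh
# => skipping NetworkManager refresh (lock_version changed)
```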

Proposed solution

The RefreshWorker is designed for heavy lifting on one provider, and the time needed to pick up a new job is a low priority compared to fast refresh resolution. Using the :sql dequeue strategy eliminates the race condition, because it searches MiqQueue each time: a fresh record is acquired on every dequeue and the possibility of stalling is lowered.
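Tying this back to the scenario above with the same toy model (still illustrative, not the merged diff): with a fresh scan per dequeue, the bumped :lock_version no longer matters and the network refresh is picked up as soon as it is the best candidate.

```ruby
# Same toy model: the network refresh job already had its lock_version bumped
# by put_or_update, but an :sql-style dequeue never compares against a stale
# prefetched value -- it just claims the first ready job it finds now.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [
  Job.new(1, "CloudManager refresh",   0, "dequeue"),  # currently being worked on
  Job.new(2, "NetworkManager refresh", 1, "ready")     # lock_version already bumped
]

job = queue.select { |j| j.state == "ready" }.min_by(&:id)
job.state = "dequeue"
puts job.description   # => NetworkManager refresh -- not starved
```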

@miq-bot add_label bug, core/workers
cc @Ladas

@miq-bot (Member) commented Mar 22, 2018

Checked commit tumido@73fd5ff with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
0 files checked, 0 offenses detected
Everything looks fine. 👍

@Ladas (Contributor) left a comment

👍 Awesome. It's a small change, but it was tough to isolate the issue.

@Ladas (Contributor) commented Mar 22, 2018

cc @agrare @Fryguy

@Ladas (Contributor) commented Mar 22, 2018

@miq-bot assign @agrare

@tumido (Member, Author) commented Mar 22, 2018

@miq-bot add_label gaprindashvili/yes

@agrare (Member) commented Mar 22, 2018

I'm not familiar enough with why we dequeue via drb vs sql, assigning to @jrafanie
Either way really great find @tumido !

@agrare assigned jrafanie and unassigned agrare on Mar 22, 2018
@jrafanie (Member) commented:

Wow, it sounds like the drb prefetcher isn't handling put_or_update changes to already prefetched rows properly in general. Am I understanding that right? @tumido

Since the drb prefetcher is going away SOON with containers... I'm fine with making this use sql... thoughts @carbonin @gtanzillo

@tumido (Member, Author) commented Mar 22, 2018

@jrafanie yes, you are. The :drb strategy gets the job's lock_version here. But by the time the job is selected, the record may have already been updated. The verification that the lock matches is done here. The time between these two lock_version checks can span multiple jobs and can even be hours (in the case of a legacy refresh, etc.).

However, this is not a problem in the normal scenario, when the job in question is updated just once. It is simply skipped for the time being, and when :drb prefetches a new batch of jobs, it can be processed normally. The problem happens when such a job is constantly being updated: the user must be really lucky for the update to land in a very specific window so that the job can be processed... otherwise it hangs there forever.

@carbonin (Member) commented:

Since the drb prefetcher is going away SOON with containers

I think this is still an open question unfortunately.
We clearly won't be able to use drb, but we may still need some prefetching if we don't move the majority of the queue churn to artemis.

That said, I don't think refresh is what causes the majority of the contention on the queue that we are trying to avoid by prefetching. @Fryguy can check me on that, but I'm okay with this.

Good find @tumido ! 💯

@gtanzillo (Member) left a comment

Wow, this really is an amazing find. Thanks for documenting it so thoroughly and clearly, @tumido.

I'm good with this change 👍

@Ladas (Contributor) commented Mar 23, 2018

@carbonin correct, the maximum number of jobs for refresh workers is very small (at most 1 in dequeue and 1 in the ready state per manager), so prefetch does not provide any real benefit

@jrafanie (Member) commented:

YOLO MERGE. @tumido awesome detective work, and thank you for clearly documenting it!!! 🏆 ❤️ :shipit:

@jrafanie merged commit 4d6b447 into ManageIQ:master on Mar 23, 2018
@jrafanie added this to the Sprint 82 Ending Mar 26, 2018 milestone on Mar 23, 2018
@Fryguy (Member) commented Mar 23, 2018

WOW... I'm concerned this might put pressure on the queue from multiple workers. While prefetch might not have any benefit, this may have now introduced a growing prefetch queue on the server side that never gets reaped, because the caller never takes it from the server. Can we verify that is not the case?

@Fryguy (Member) commented Mar 23, 2018

I'm ok with this as a band-aid, but I'm wondering if it's better to remove the lock_version from the prefetch, or perhaps from the actual query. The idea behind the lock_version being part of the prefetch was to avoid multiple workers trying to work on the same record. However, that can also be done by checking the state of the queue message before working on it. So, instead of the query including lock_version, we may be able to just include state instead.
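A toy sketch of that alternative (illustrative only, not something implemented in this PR): prefetch the :state instead of the :lock_version and skip a job only when it is no longer ready.

```ruby
# If the prefetched batch carries :state rather than :lock_version, a
# put_or_update that merely edits the job's targets (and bumps lock_version)
# no longer causes a skip; only a job already claimed by another worker does.
Job = Struct.new(:id, :description, :lock_version, :state)

queue = [Job.new(2, "NetworkManager refresh", 1, "ready")]  # lock_version bumped

batch = [{ :id => 2, :state => "ready" }]                   # prefetched earlier

batch.each do |entry|
  job = queue.find { |j| j.id == entry[:id] }
  if job.state == "ready"          # content updates don't change the state
    job.state = "dequeue"
    puts "working on #{job.description}"
  else
    puts "skipping #{job.description} (already taken by another worker)"
  end
end
# => working on NetworkManager refresh
```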

@tumido (Member, Author) commented Mar 23, 2018

@Fryguy, how do you imagine such a "growing prefetch-queue" would happen? The prefetch in :drb doesn't remove jobs from the queue, so the number of jobs there stays the same. The only load increase is on each RefreshWorker, which would do a MiqQueue.get each time, instead of a MiqQueue.peek once in a while and then a MiqQueue.find for each job.

I don't see how this can affect other workers or the amount of jobs in the queue. Can you please explain?
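For a rough sense of the per-worker load difference being discussed, some back-of-the-envelope arithmetic (the batch size is an assumption for illustration):

```ruby
# Per n jobs handled by one RefreshWorker:
#   :drb -> one heavy peek (queue scan) per prefetched batch plus one find-by-id per job
#   :sql -> one queue scan (get) per job
n          = 100                      # jobs processed
batch_size = 20                       # assumed prefetch batch size, illustration only
drb_scans  = (n / batch_size.to_f).ceil
drb_finds  = n
sql_scans  = n
puts "drb: #{drb_scans} scans + #{drb_finds} finds vs sql: #{sql_scans} scans"
```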

simaishi pushed a commit that referenced this pull request Mar 26, 2018
Fix RefreshWorker dequeue race condition
(cherry picked from commit 4d6b447)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1560699
@simaishi (Contributor) commented:

Gaprindashvili backport details:

$ git log -1
commit 94e2618059218ef6ffb76a11b7fe07ebd8411ef2
Author: Joe Rafaniello <[email protected]>
Date:   Fri Mar 23 11:57:25 2018 -0400

    Merge pull request #17187 from tumido/fix_refresh_race_cond
    
    Fix RefreshWorker dequeue race condition
    (cherry picked from commit 4d6b447f3520fe0ffc5b9cfc6c874c006acaec72)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1560699

@tumido deleted the fix_refresh_race_cond branch on June 26, 2018