
Don't queue things that need to run on the same worker container #19956

Merged
merged 5 commits into ManageIQ:master from carbonin:dont_queue_ansible_runner_stuff on Mar 18, 2020

Conversation

@carbonin (Member)

This commit changes every AnsibleRunnerWorkflow#queue_signal call
to a straight #signal call between checking out the repo and
cleaning it up.

Without this change, it's possible for separate generic workers to pick up the individual states, in which case only one of them will have the running ansible_runner process to query.
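
Concretely, the shape of the change inside that window is roughly the following; the state name here is a hypothetical example, not necessarily one of the workflow's actual states:

```ruby
# Before: enqueued, so any generic worker (container) might pick it up
queue_signal(:poll_runner)

# After: dispatched inline in the worker that already holds the checkout
signal(:poll_runner)
```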

@carbonin (Member, Author)

Also, @Fryguy why do we have a bonus embedded ansible label?
[screenshot: the PR's label list showing a duplicate "embedded ansible" label]

@carbonin (Member, Author)

Now, I can make this conditional on being in pods or not, but I'd rather not.

So what is the concern with not queueing these things that would have been queued for the same server anyway? Is it just the timeout? Each check to see if the ansible runner is still working used to have a separate queue timeout. Is there a way to get around that with this change?

@carbonin (Member, Author)

I don't think there's a way to queue something for the same worker, right?

@Fryguy (Member) commented Mar 12, 2020

The queue_signals were intentional, particularly for the first one. After the first queueing, queue_signal pins to the server, so it should be in the same location each time. Is this a podified concern?

@Fryguy (Member) commented Mar 12, 2020

> I don't think there's a way to queue something for the same worker, right?

I don't believe so... right now it's by server, which is where the checkout is (technically any worker can pick it up anyway).

@carbonin (Member, Author)

> The queue_signals were intentional, particularly for the first one.

@Fryguy I left the first one, which is in #start.

> so it should be in the same location each time. Is this a podified concern?

Yup, it's still the same "server" but different containers.

@NickLaMuro (Member) left a comment

I don't see what benefit there is to spreading this out to multiple queue items for a single piece of work like this, so I am good with this change.

That said, I also don't know what the original rationale for splitting it up was either, so I think that renders my +1 as only slightly useful at best.

@carbonin changed the title from "Don't queue things that need to run on the same worker container" to "[WIP] Don't queue things that need to run on the same worker container" on Mar 12, 2020
@miq-bot added the wip label on Mar 12, 2020
@Fryguy (Member) commented Mar 12, 2020

So, the way it works now (and it's admittedly strange) is that the start is signaled synchronously, and the first thing it does is queue for the correct role in order to move the workload to the right location. After that, queue_signal uses the server_guid to pin all work to the server where the git clone as well as the ansible_runner artifacts are located. After the first time, the usage of queue_signal is really to break up potentially long-running calls to avoid timeouts (assume a playbook could run for, say, a whole day). So the really important queue_signal is the one in poll_runner, and possibly the one near the git clone (since a clone could also take a while on a slow link)... the rest are arguably not important; I was mostly being overly cautious.
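
A rough sketch of that dispatch pattern; the state and method names below are hypothetical stand-ins, not the workflow's actual states:

```ruby
# Hypothetical sketch of the routing described above.
def start
  # First hop: route by role so the work lands on a server that
  # runs the right kind of worker.
  queue_signal(:checkout_repository, :role => "embedded_ansible")
end

def checkout_repository
  checkout_git_repository # the clone now lives on this server's disk
  # Every later hop pins to this server's guid so the files (and later
  # the ansible_runner process) remain reachable.
  queue_signal(:run_playbook, :server_guid => MiqServer.my_server.guid)
end
```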


All that being said, there is clearly an issue with running embedded ansible in pods. Since "server" is a virtual concept in pods, it's not tied to the filesystem like it is in appliances, and unfortunately, by using server_guid we are making that a hard assumption. Note that this concept is not limited to ansible runner...it's anywhere we make a server <-> file system assumption, so for example, fleecing will have the same problem. I think we need to discuss and design a more robust solution for that. @carbonin and I discussed this offline and he's going to open an issue for it.

The PR as it is now will break the "long running playbooks on appliances" case, so for now, since a majority of our users are appliance users and not pods users, I'd err on the side of appliance users, and reject this PR.

I guess as a short-term partial fix, we can tweak this PR to be pod-aware and be synchronous in pods, which allows "short-running playbooks in pods" but wouldn't allow "long-running playbooks in pods". It's a little better than "nothing at all in pods", especially since in reality I'd guess most playbooks would be considered short-running.

@Fryguy (Member) left a comment

@carbonin force-pushed the dont_queue_ansible_runner_stuff branch 2 times, most recently from 734ec1c to 4191d52 on March 16, 2020 19:04
@NickLaMuro (Member) left a comment

At least needs some questions answered, but I do actually like the direction you took with this. A question and a request for more documentation is all, though I am sure at least one of those might be contested.

@carbonin force-pushed the dont_queue_ansible_runner_stuff branch from 4191d52 to 76ac969 on March 16, 2020 21:13
@carbonin requested a review from gtanzillo as a code owner on March 16, 2020 21:13
@Fryguy (Member) commented Mar 17, 2020

Should we move route_signal into the base Job class to live alongside queue_signal?

```ruby
def signal(signal, *args)
  signal = :abort_job if signal == :abort
  if transit_state(signal)
    save
    send(signal, *args) if respond_to?(signal)
  else
    raise _("%{signal} is not permitted at state %{state}") % {:signal => signal, :state => state}
  end
end

def queue_signal(*args, priority: MiqQueue::NORMAL_PRIORITY, role: nil, deliver_on: nil, server_guid: nil, queue_name: nil)
  MiqQueue.put(
    :class_name  => self.class.name,
    :method_name => "signal",
    :instance_id => id,
    :priority    => priority,
    :role        => role,
    :zone        => zone,
    :queue_name  => queue_name,
    :task_id     => guid,
    :args        => args,
    :deliver_on  => deliver_on,
    :server_guid => server_guid
  )
end
```

@carbonin (Member, Author)

> Should we move route_signal into the base Job class to live alongside queue_signal?

Maybe? But I would rather do that as a follow-up when we have another use for it.
I assume we'll find another job that needs this kind of treatment, but for now I'd rather not treat the hack like it was a well-thought-out addition to the Job API 😆

@carbonin changed the title from "[WIP] Don't queue things that need to run on the same worker container" to "Don't queue things that need to run on the same worker container" on Mar 17, 2020
@carbonin removed the wip label on Mar 17, 2020
@Fryguy (Member) commented Mar 17, 2020

@carbonin Sounds good... cc @agrare ☝️ #19956 (comment)

@Fryguy (Member) commented Mar 17, 2020

> Also, @Fryguy why do we have a bonus embedded ansible label?

We had a typo in the bot and the bot added it back in...I'll fix that up.

@NickLaMuro (Member) left a comment

I think things would look a bit cleaner if we did the following.

app/models/manageiq/providers/ansible_runner_workflow.rb (review thread outdated, resolved)
This commit adds an intermediate method (#route_signal) to determine
if a call should be queued or not.

When running in containers, each generic worker is a separate container
so we can't queue anything between checking out the playbook
repository and cleaning it up. If we do, it might end up executing
on a container that doesn't have the repo checked out or isn't
running the ansible runner process.

We want to continue queueing these operations on appliances as the
previous reasoning doesn't apply (we will always queue for a worker
on the same server) and we still need to handle ansible playbooks
that might run longer than the timeout for a single queue message.

For now these long-running playbooks won't have a solution on pods,
but shorter ones will work.
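
A minimal sketch of what such a #route_signal switch could look like, assuming ManageIQ's MiqEnvironment::Command.is_podified? check (illustrative, not necessarily the merged code):

```ruby
# Illustrative sketch: dispatch inline in pods, queue on appliances.
def route_signal(*args, deliver_on: nil)
  if MiqEnvironment::Command.is_podified?
    # Each generic worker is its own container in pods; the repo
    # checkout and runner process exist only here, so run inline.
    signal(*args)
  else
    # On an appliance, pin the queued message to this server so a
    # worker with the same filesystem picks it up.
    queue_signal(*args, :deliver_on => deliver_on, :server_guid => MiqServer.my_server.guid)
  end
end
```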
For a sufficiently long-running job, we could exceed the stack
size threshold, so this commit implements a loop in poll_runner
that is only used when we're running in pods and waiting for the
ansible runner process to finish.
This also renames the local variable from response to monitor because this is really the object that is responsible for checking on the runner process.

Also `result = response.response` makes me cringe
This breaks up the #poll_runner method into smaller, more easily
comprehensible parts, and specifically only implements a loop
in the pods-specific method.
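
A rough sketch of that split; the state name (:post_playbook) and the monitor.running? API are assumptions for illustration:

```ruby
# Illustrative sketch only; names below are assumed, not confirmed.
def poll_runner
  MiqEnvironment::Command.is_podified? ? wait_for_runner : requeue_poll_runner
end

def wait_for_runner
  # A plain loop keeps the stack flat, unlike repeated recursive signals.
  sleep(10) while monitor.running?
  signal(:post_playbook)
end

def requeue_poll_runner
  if monitor.running?
    # Each check is its own queue message, so no single message has to
    # outlive a day-long playbook.
    queue_signal(:poll_runner, :deliver_on => 1.minute.from_now.utc,
                               :server_guid => MiqServer.my_server.guid)
  else
    signal(:post_playbook)
  end
end
```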
@carbonin force-pushed the dont_queue_ansible_runner_stuff branch from 0ad07b5 to d0135bb on March 18, 2020 18:16
@miq-bot (Member) commented Mar 18, 2020

Some comments on commits carbonin/manageiq@06b7465~...d0135bb

spec/models/manageiq/providers/ansible_role_workflow_spec.rb

  • ⚠️ line 135: Detected expect_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.

@miq-bot (Member) commented Mar 18, 2020

Checked commits carbonin/manageiq@06b7465~...d0135bb with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
3 files checked, 1 offense detected

app/models/manageiq/providers/ansible_runner_workflow.rb

@NickLaMuro (Member) left a comment

Thanks for making the changes and sorry for being "that guy" with this review. Looks good!

@Fryguy merged commit 3b0c1af into ManageIQ:master on Mar 18, 2020
simaishi pushed a commit that referenced this pull request Mar 20, 2020
Don't queue things that need to run on the same worker container

(cherry picked from commit 3b0c1af)
@simaishi (Contributor)

Jansa backport details:

```
$ git log -1
commit ce6f0c2a297606190a367466bc59403001cd029f
Author: Jason Frey <[email protected]>
Date:   Wed Mar 18 16:00:08 2020 -0400

    Merge pull request #19956 from carbonin/dont_queue_ansible_runner_stuff

    Don't queue things that need to run on the same worker container

    (cherry picked from commit 3b0c1afe659408a2047aad85c1fc85e5280cf671)
```

@carbonin deleted the dont_queue_ansible_runner_stuff branch on March 30, 2020 15:37