
Don't queue things that need to run on the same worker container #19956

Merged
merged 5 commits into ManageIQ:master from carbonin:dont_queue_ansible_runner_stuff on Mar 18, 2020

Conversation

@carbonin (Member)

This commit changes every AnsibleRunnerWorkflow#queue_signal call
to a straight #signal call between checking out the repo and
cleaning it up.

Without this change, it's possible for separate generic workers to pick up the individual states, in which case only one of them will have the running ansible_runner process to query.
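
Concretely, the shape of the change inside that window is roughly the following; the state name here is a hypothetical example, not necessarily one of the workflow's actual states:

```ruby
# Before: enqueued, so any generic worker (container) might pick it up
queue_signal(:poll_runner)

# After: dispatched inline in the worker that already holds the checkout
signal(:poll_runner)
```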

@carbonin (Member, Author)

Also, @Fryguy why do we have a bonus embedded ansible label?
[screenshot: the PR's label list showing a duplicate "embedded ansible" label]

@carbonin (Member, Author)

Now, I can make this conditional on being in pods or not, but I'd rather not.

So what is the concern with not queueing these things that would have been queued for the same server anyway? Is it just the timeout? Each check to see if the ansible runner is still working used to have a separate queue timeout. Is there a way to get around that with this change?

@carbonin (Member, Author)

I don't think there's a way to queue something for the same worker, right?

@Fryguy (Member) commented Mar 12, 2020

The queue_signals were intentional, particularly for the first one. After the first queueing, queue_signal pins to the server, so it should be in the same location each time. Is this a podified concern?

@Fryguy (Member) commented Mar 12, 2020

> I don't think there's a way to queue something for the same worker, right?

I don't believe so... right now it's by server, which is where the checkout is (technically any worker can pick it up anyway).

@carbonin (Member, Author)

> The queue_signals were intentional, particularly for the first one.

@Fryguy I left the first one, which is in #start.

> so it should be in the same location each time. Is this a podified concern?

Yup, it's still the same "server" but different containers.

@NickLaMuro (Member) left a comment

I don't see what benefit there is to spreading this out to multiple queue items for a single piece of work like this, so I am good with this change.

That said, I also don't know what the original rationale for splitting it up was either, so I think that renders my +1 as only slightly useful at best.

@carbonin changed the title from "Don't queue things that need to run on the same worker container" to "[WIP] Don't queue things that need to run on the same worker container" on Mar 12, 2020
@miq-bot added the wip label on Mar 12, 2020
@Fryguy (Member) commented Mar 12, 2020

So, the way it works now (and it's admittedly strange) is that the start is signaled synchronously, and the first thing it does is queue for the correct role in order to move the workload to the right location. After that, queue_signal uses the server_guid to pin all work to the server where the git clone as well as the ansible_runner artifacts are located. After the first time, the usage of queue_signal is really to break up potentially long-running calls to avoid timeouts (assume a playbook could run for, say, a whole day). So the really important queue_signal is the one in poll_runner, and possibly the one near the git clone (since a clone could also take a while on a slow link)... the rest are arguably not important; I was mostly being overly cautious.
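
A rough sketch of that dispatch pattern; the state and method names below are hypothetical stand-ins, not the workflow's actual states:

```ruby
# Hypothetical sketch of the routing described above.
def start
  # First hop: route by role so the work lands on a server that
  # runs the right kind of worker.
  queue_signal(:checkout_repository, :role => "embedded_ansible")
end

def checkout_repository
  checkout_git_repository # the clone now lives on this server's disk
  # Every later hop pins to this server's guid so the files (and later
  # the ansible_runner process) remain reachable.
  queue_signal(:run_playbook, :server_guid => MiqServer.my_server.guid)
end
```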


All that being said, there is clearly an issue with running embedded ansible in pods. Since "server" is a virtual concept in pods, it's not tied to the filesystem like it is in appliances, and unfortunately, by using server_guid we are making that a hard assumption. Note that this concept is not limited to ansible runner...it's anywhere we make a server <-> file system assumption, so for example, fleecing will have the same problem. I think we need to discuss and design a more robust solution for that. @carbonin and I discussed this offline and he's going to open an issue for it.

The PR as it is now will break the "long running playbooks on appliances" case, so for now, since a majority of our users are appliance users and not pods users, I'd err on the side of appliance users, and reject this PR.

I guess as a short-term partial fix, we can tweak this PR to be pod-aware and be synchronous in pods, which allows "short-running playbooks in pods" but wouldn't allow "long-running playbooks in pods". It's a little better than "nothing at all in pods", especially since in reality I'd guess most playbooks would be considered short-running.

@Fryguy (Member) left a comment

@carbonin force-pushed the dont_queue_ansible_runner_stuff branch 2 times, most recently from 734ec1c to 4191d52 on March 16, 2020 19:04
@NickLaMuro (Member) left a comment

At least needs some questions answered, but I do actually like the direction you took with this. A question and a request for more documentation is all, though I am sure at least one of those might be contested.

@carbonin force-pushed the dont_queue_ansible_runner_stuff branch from 4191d52 to 76ac969 on March 16, 2020 21:13
@carbonin requested a review from gtanzillo as a code owner on March 16, 2020 21:13
@Fryguy (Member) commented Mar 17, 2020

Should we move route_signal into the base Job class to live alongside queue_signal?

```ruby
def signal(signal, *args)
  signal = :abort_job if signal == :abort
  if transit_state(signal)
    save
    send(signal, *args) if respond_to?(signal)
  else
    raise _("%{signal} is not permitted at state %{state}") % {:signal => signal, :state => state}
  end
end

def queue_signal(*args, priority: MiqQueue::NORMAL_PRIORITY, role: nil, deliver_on: nil, server_guid: nil, queue_name: nil)
  MiqQueue.put(
    :class_name  => self.class.name,
    :method_name => "signal",
    :instance_id => id,
    :priority    => priority,
    :role        => role,
    :zone        => zone,
    :queue_name  => queue_name,
    :task_id     => guid,
    :args        => args,
    :deliver_on  => deliver_on,
    :server_guid => server_guid
  )
end
```

@carbonin (Member, Author)

> Should we move route_signal into the base Job class to live alongside queue_signal?

Maybe? But I would rather do that as a follow-up when we have another use for it.
I assume we'll find another job that needs this kind of treatment, but for now I'd rather not treat the hack like it was a well-thought-out addition to the Job API 😆

@carbonin changed the title from "[WIP] Don't queue things that need to run on the same worker container" to "Don't queue things that need to run on the same worker container" on Mar 17, 2020
@carbonin removed the wip label on Mar 17, 2020
@Fryguy (Member) commented Mar 17, 2020

@carbonin Sounds good... cc @agrare ☝️ #19956 (comment)

@Fryguy (Member) commented Mar 17, 2020

> Also, @Fryguy why do we have a bonus embedded ansible label?

We had a typo in the bot and the bot added it back in...I'll fix that up.

@NickLaMuro (Member) left a comment

I think things would look a bit cleaner if we did the following.

app/models/manageiq/providers/ansible_runner_workflow.rb (review thread outdated, resolved)
This commit adds an intermediate method (#route_signal) to determine
if a call should be queued or not.

When running in containers, each generic worker is a separate container
so we can't queue anything between checking out the playbook
repository and cleaning it up. If we do, it might end up executing
on a container that doesn't have the repo checked out or isn't
running the ansible runner process.

We want to continue queueing these operations on appliances as the
previous reasoning doesn't apply (we will always queue for a worker
on the same server) and we still need to handle ansible playbooks
that might run longer than the timeout for a single queue message.

For now these long-running playbooks won't have a solution on pods,
but shorter ones will work.
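
A minimal sketch of what such a #route_signal switch could look like, assuming ManageIQ's MiqEnvironment::Command.is_podified? check (illustrative, not necessarily the merged code):

```ruby
# Illustrative sketch: dispatch inline in pods, queue on appliances.
def route_signal(*args, deliver_on: nil)
  if MiqEnvironment::Command.is_podified?
    # Each generic worker is its own container in pods; the repo
    # checkout and runner process exist only here, so run inline.
    signal(*args)
  else
    # On an appliance, pin the queued message to this server so a
    # worker with the same filesystem picks it up.
    queue_signal(*args, :deliver_on => deliver_on, :server_guid => MiqServer.my_server.guid)
  end
end
```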
For a sufficiently long-running job, we could exceed the stack
size threshold, so this commit implements a loop in poll_runner
that is only used when we're running in pods and waiting for the
ansible runner process to finish.
This also renames the local variable from response to monitor because this is really the object that is responsible for checking on the runner process.

Also `result = response.response` makes me cringe
This breaks up the #poll_runner method into smaller, more easily
comprehensible parts, and specifically only implements a loop
in the pods-specific method.
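
A rough sketch of that split; the state name (:post_playbook) and the monitor.running? API are assumptions for illustration:

```ruby
# Illustrative sketch only; names below are assumed, not confirmed.
def poll_runner
  MiqEnvironment::Command.is_podified? ? wait_for_runner : requeue_poll_runner
end

def wait_for_runner
  # A plain loop keeps the stack flat, unlike repeated recursive signals.
  sleep(10) while monitor.running?
  signal(:post_playbook)
end

def requeue_poll_runner
  if monitor.running?
    # Each check is its own queue message, so no single message has to
    # outlive a day-long playbook.
    queue_signal(:poll_runner, :deliver_on => 1.minute.from_now.utc,
                               :server_guid => MiqServer.my_server.guid)
  else
    signal(:post_playbook)
  end
end
```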
@carbonin force-pushed the dont_queue_ansible_runner_stuff branch from 0ad07b5 to d0135bb on March 18, 2020 18:16
@miq-bot (Member) commented Mar 18, 2020

Some comments on commits carbonin/manageiq@06b7465~...d0135bb

spec/models/manageiq/providers/ansible_role_workflow_spec.rb

  • ⚠️ line 135: Detected expect_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.

@miq-bot (Member) commented Mar 18, 2020

Checked commits carbonin/manageiq@06b7465~...d0135bb with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
3 files checked, 1 offense detected

app/models/manageiq/providers/ansible_runner_workflow.rb

@NickLaMuro (Member) left a comment

Thanks for making the changes and sorry for being "that guy" with this review. Looks good!

@Fryguy merged commit 3b0c1af into ManageIQ:master on Mar 18, 2020
simaishi pushed a commit that referenced this pull request Mar 20, 2020
Don't queue things that need to run on the same worker container

(cherry picked from commit 3b0c1af)
@simaishi (Contributor)

Jansa backport details:

```
$ git log -1
commit ce6f0c2a297606190a367466bc59403001cd029f
Author: Jason Frey <[email protected]>
Date:   Wed Mar 18 16:00:08 2020 -0400

    Merge pull request #19956 from carbonin/dont_queue_ansible_runner_stuff

    Don't queue things that need to run on the same worker container

    (cherry picked from commit 3b0c1afe659408a2047aad85c1fc85e5280cf671)
```

@carbonin deleted the dont_queue_ansible_runner_stuff branch on March 30, 2020 15:37