Don't queue things that need to run on the same worker container #19956
Conversation
Also, @Fryguy why do we have a bonus embedded ansible label?

Now, I can make this conditional on being in pods or not, but I'd rather not. So what is the concern with not queueing these things that would have been queued for the same server anyway? Is it just timeout? Each check to see if ansible is still running used to have a separate queue timeout. Is there a way to get around that with this?

I don't think there's a way to queue something for the same worker, right?

I don't believe so... right now it's by server, which is where the checkout is (technically any worker can pick it up anyway).
I don't see what benefit there is to spreading this out to multiple queue items for a single piece of work like this, so I am good with this change.
That said, I also don't know what the original rationale was for splitting it up either, so I think that renders my +1 as only slightly useful at best.
So, the way it works now (and it's admittedly strange), is that the start is

All that being said, there is clearly an issue with running embedded ansible in pods. Since "server" is a virtual concept in pods, it's not tied to the filesystem like it is in appliances, and unfortunately, by using server_guid we are making that a hard assumption. Note that this concept is not limited to ansible runner... it's anywhere we make a server <-> file system assumption, so for example, fleecing will have the same problem. I think we need to discuss and design a more robust solution for that. @carbonin and I discussed this offline and he's going to open an issue for it.

The PR as it is now will break the "long running playbooks on appliances" case, so for now, since a majority of our users are appliance users and not pods users, I'd err on the side of appliance users, and reject this PR.

I guess as a short-term partial fix, we can tweak this PR to be pod-aware and be synchronous in pods, which allows "short-running playbooks in pods", but wouldn't allow "long-running playbooks in pods". It's a little better than "nothing at all in pods", especially since in reality I'd guess most playbooks would be considered short-running.
Force-pushed from 734ec1c to 4191d52.
At least needs some questions answered, but I do actually like the direction you took with this. A question and a request for more documentation is all, though I am sure at least one of those might be contested.
Force-pushed from 4191d52 to 76ac969.
Should we move route_signal into the base Job class to live alongside queue_signal? (See manageiq/app/models/job/state_machine.rb, lines 44 to 68 at 17d4194.)
Maybe? But I would rather do that as a follow-up when we have another use for it.

@carbonin Sounds good... cc @agrare ☝️ #19956 (comment)

We had a typo in the bot and the bot added it back in... I'll fix that up.
I think things would look a bit cleaner if we did the following.
This commit adds an intermediate method (#route_signal) to determine if a call should be queued or not.

When running in containers, each generic worker is a separate container, so we can't queue anything between checking out the playbook repository and cleaning it up. If we do, it might end up executing on a container that doesn't have the repo checked out or isn't running the ansible runner process.

We want to continue queueing these operations on appliances, as the previous reasoning doesn't apply (we will always queue for a worker on the same server), and we still need to handle ansible playbooks that might run longer than the timeout for a single queue message. For now these long-running playbooks won't have a solution on pods, but shorter ones will work.
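The routing idea described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual ManageIQ implementation: the `WorkflowSketch` class, the `in_pods:` flag, and the log bookkeeping are all hypothetical names used only to show how `#route_signal` would pick between an inline call and a queued message.

```ruby
# Hypothetical sketch: in pods, deliver the signal inline so the next
# step runs in the same process; on appliances, keep queueing as before.
class WorkflowSketch
  attr_reader :log

  def initialize(in_pods:)
    @in_pods = in_pods
    @log = []
  end

  # Decide whether a signal should be delivered inline or via the queue.
  def route_signal(state)
    @in_pods ? signal(state) : queue_signal(state)
  end

  def signal(state)
    @log << [:direct, state]
  end

  def queue_signal(state)
    @log << [:queued, state]
  end
end

pod_job       = WorkflowSketch.new(in_pods: true)
appliance_job = WorkflowSketch.new(in_pods: false)
pod_job.route_signal(:poll_runner)
appliance_job.route_signal(:poll_runner)

puts pod_job.log.inspect       # [[:direct, :poll_runner]]
puts appliance_job.log.inspect # [[:queued, :poll_runner]]
```

Callers then use `route_signal` everywhere a transition happens between checkout and cleanup, and the pods-vs-appliances decision lives in one place.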
For a sufficiently long-running job, we could exceed the stack size threshold, so this commit implements a loop in poll_runner that is only used when we're running in pods and waiting for the ansible runner process to finish.
This also renames the local variable from response to monitor because this is really the object that is responsible for checking on the runner process. Also, `result = response.response` makes me cringe.
This breaks up the #poll_runner method into smaller, more easily comprehensible parts, and specifically only implements a loop in the pods-specific method.
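The pods-only waiting loop mentioned above can be sketched like this. `RunnerMonitor` and `wait_for_runner` are hypothetical stand-ins for the object that reports on the ansible-runner process and the pods branch of `#poll_runner`; the point is that waiting happens in a plain loop rather than by re-signalling (which, for a long-running job, could keep growing the stack) or re-queueing (which could land on a different container).

```ruby
# Stand-in for the object that checks on the ansible-runner process.
# Here it reports "running" for a fixed number of checks, then finishes.
RunnerMonitor = Struct.new(:checks_remaining) do
  def running?
    self.checks_remaining -= 1
    checks_remaining.positive?
  end

  def result
    :success
  end
end

# Pods-only sketch: wait in a loop until the runner process finishes,
# instead of queueing a new :poll_runner message for each check.
def wait_for_runner(monitor, interval: 0.01)
  sleep(interval) while monitor.running?
  monitor.result
end

monitor = RunnerMonitor.new(3)
puts wait_for_runner(monitor).inspect # :success
```

On appliances the loop is not used: each check remains a separate queue message so a single long playbook doesn't pin a worker past the queue timeout.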
Force-pushed from 0ad07b5 to d0135bb.
Some comments on commits carbonin/manageiq@06b7465~...d0135bb: spec/models/manageiq/providers/ansible_role_workflow_spec.rb
Checked commits carbonin/manageiq@06b7465~...d0135bb with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint: app/models/manageiq/providers/ansible_runner_workflow.rb
Thanks for making the changes and sorry for being "that guy" with this review. Looks good!
Don't queue things that need to run on the same worker container (cherry picked from commit 3b0c1af)
Jansa backport details:
This commit changes every `AnsibleRunnerWorkflow#queue_signal` call to a straight `#signal` call between checking out the repo and cleaning it up.
Without this change it's possible for separate generic workers to pick up individual states, and only one of them will have the running ansible_runner process to query.
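The failure mode in that last sentence can be illustrated with a toy model. The state names and worker symbols below are hypothetical; the sketch only shows that with direct signals every state between checkout and cleanup runs on the worker that did the checkout, whereas a queued message may be serviced by any worker.

```ruby
# Toy model: a queued message can land on any generic worker, while a
# direct signal stays on the current one. State names are illustrative.
STATES = %i[checkout_repo run_playbook poll_runner cleanup].freeze

def run_with(delivery)
  executed_on = []
  worker = :worker_a # the worker that checked out the repo
  STATES.each do |state|
    # A queued message may be picked up by any worker; a direct
    # signal keeps execution on the same worker.
    worker = %i[worker_a worker_b].sample if delivery == :queue
    executed_on << [state, worker]
  end
  executed_on
end

# With direct signals, every state runs on the checkout worker.
direct = run_with(:signal)
puts (direct.map(&:last).uniq == [:worker_a]).inspect # true
```

With `delivery == :queue`, nothing guarantees that `:poll_runner` or `:cleanup` runs on `:worker_a`, which is exactly the problem on pods where only `:worker_a`'s container has the checked-out repo and the runner process.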