Set default playbook service timeout to 100 minutes #19279
Do we need to update the job_timeout for ansible-runner as well?
https://ansible-runner.readthedocs.io/en/latest/intro.html#runnersettings
Also, not sure where this error in the ticket is coming from.
@NickLaMuro That is a good point. Is there a value so the ansible job can run forever?
From this line. |
Oh butts, I was looking around for this line:
Still don't know where that is from, but what you provided is enough. Thanks! |
@lfu I think the quick fix is we can manually set the But I would have to check if that works as expected myself. Unsure, but I will look into it for the time being. |
@lfu @NickLaMuro so the proposal is to essentially disable the timeout in Runner and rely on Automate's timeout? Is this how things worked with Embedded Tower - we relied solely on Automate's timeout? |
Since the AnsibleRunnerWorkflow can be used outside of an automate context (e.g. for provider operations), I don't think we should use the automate timeout solely. I could see the automate-specific caller setting it to None in order to allow automate's timeout to supersede (however, we still have to call .stop on the thing somehow), but I don't think we should set the default of AnsibleRunnerWorkflow to None.
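One way to read this suggestion: the workflow keeps its own default, and only a caller that explicitly passes nil opts out. A minimal Ruby sketch, assuming a hypothetical `resolve_timeout` helper and a hypothetical `DEFAULT_TIMEOUT` constant (neither is from the PR), using `Hash#fetch` to tell "key absent" apart from "key present but nil":

```ruby
# Hypothetical default in seconds; the value is illustrative, not from the PR.
DEFAULT_TIMEOUT = 60 * 60

# Returns the timeout the workflow should enforce itself.
# A missing :timeout key falls back to the default; an explicit nil
# means "the caller owns the timeout" and is passed through unchanged.
def resolve_timeout(options)
  options.fetch(:timeout, DEFAULT_TIMEOUT)
end

resolve_timeout({})                 # default applies
resolve_timeout({ timeout: nil })   # nil: the caller's own timeout governs
resolve_timeout({ timeout: 6000 })  # explicit value is used as given
```

With this shape the default never becomes nil, so a non-automate caller that passes nothing still gets a bounded run.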
Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job. |
c4f9573 to 57e8844
Are you suggesting that automate is the only one that uses this timeout, because I am pretty sure that services will make use of it too: Everything should be using
I don't think anyone is suggesting it is. |
Yeah, that I agree is a problem with using
|
Does
|
@lfu I think the concern with this is that there is a default option for
So this is already set by default, unless specifically |
Yes. So, yes, if the service chooses to avoid the internal AnsiblePlaybookRuner's timeout by passing nil, then process_abort seems like a good place to do the stop, however,
So, given that, it feels rather complicated. |
What is the actual issue here? Is there a timeout race of some sort? If so, what happens when the service "wins". What is the downside to just using the ansible-runner built-in timeout set to the same value as the service timeout? |
I think it is simply that something in our system (either the "Automate timeout" or the "runner timeout") is causing the playbook that should take an hour or more to not complete because it reaches the timeout. I don't think the timeout conflict is happening yet, but I think we are concerned about this solution not being enough if the underlying timeout in
I am personally fine with this, though it was one more "special snowflake" instance instead of having |
57e8844 to 64f841d
Automate has its own timeout which is set by In the case reported in the BZ, automate timeout is 100 minutes, service timeout is not set and playbook takes around 65 minutes to complete. So with old version, the service was able to finish while with new version, service failed due to the 60 minutes timeout from ansible runner. |
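The numbers in this comment can be checked with a small sketch. The `outcome` helper below is hypothetical and only models "the smallest configured limit fires first"; the old behaviour is modelled, as an assumption, by leaving the runner limit unset:

```ruby
# Returns the name of the limit that ends the run first, or :completed
# if the playbook finishes before every configured limit.
# A nil limit means "not set" and is ignored. All values are in minutes.
def outcome(playbook_minutes, limits)
  name, minutes = limits.compact.min_by { |_name, limit| limit }
  return :completed if name.nil? || playbook_minutes < minutes

  name
end

# Figures from the BZ as quoted above: automate 100, service unset, playbook ~65.
old_limits = { automate: 100, service: nil, runner: nil } # assumption: no runner limit applied
new_limits = { automate: 100, service: nil, runner: 60 }

outcome(65, old_limits) # => :completed
outcome(65, new_limits) # => :runner
```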
@@ -16,6 +16,12 @@ def launch_runner
   Ansible::Runner.run_async(env_vars, extra_vars, playbook_path, kwargs)
 end

+def process_abort(*args)
oh nice! I didn't realize process_abort was part of the ansible_playbook_workflow...I thought it was something on the service side.
@lfu Can you move this into the base class? Role will have to do the exact same thing, so this should be in the base runner class.
So if the service timeout (execution_ttl) is not set, then it's inheriting the default ansible_playbook_workflow timeout. It sounds to me like the caller should pass the automation timeout if the service timeout is not set, so as to avoid inheriting that default. That is, somehow tweak this [1] to pass the automation timeout in addition, or change the method to use
@NickLaMuro Thoughts on my proposal vs what's in this PR? The PR at the moment I think looks good, and I'd be ok with merging, but I'm not sure I like the general idea of AnsiblePlaybookWorkflow not "owning" its own timeout. Feels like we could get in trouble if it's allowed to run forever and expecting something else to clean it up. I'm not sure....thoughts?
This would be the perfect solution! But I doubt it is possible to get the automate timeout from a service or a configuration_script. That is why I have to set it to |
64f841d to fb65d9a
@Fryguy I think it is fine, I just don't know if the expectation from QA is that it is able to run longer than an hour: And I don't know if |
@NickLaMuro According to QE, the AWX / Tower solution in 5.10 supported greater than an hour runs based on https://bugzilla.redhat.com/show_bug.cgi?id=1750370#c3 if I read that correctly |
fb65d9a to 29188fe
@@ -55,7 +55,7 @@ def execute
 def poll_runner
   response = Ansible::Runner::ResponseAsync.load(context[:ansible_runner_response])
   if response.running?
-    if started_on + options[:timeout] < Time.now.utc
+    if options[:timeout].present? && (started_on + options[:timeout]) < Time.now.utc
       response.stop

       queue_signal(:abort, "ansible #{execution_type} has been running longer than timeout", "error")
@lfu in the non-automate case where timeout is passed, then this will try to abort after doing response.stop and then I think it will hit the process_abort method and try to response.stop again. If so, I think you need to remove the response.stop 2 lines up and just let it flow through your new abort method. Can you verify?
EDIT: Also, I'm not sure this path even works...does queue_signal(:abort) even work?
I'm sure queue_signal(:abort) works!
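The nil guard added to poll_runner in this thread's diff can be illustrated in isolation. This is a plain-Ruby sketch, not the PR's code: `timed_out?` is a hypothetical helper, and `nil?` stands in for ActiveSupport's `present?` so the snippet runs without Rails. Without the guard, adding nil to a Time would raise instead of meaning "no timeout".

```ruby
# Hypothetical stand-in for the timeout check in poll_runner.
# `timeout` is in seconds; nil means the workflow enforces no timeout itself.
def timed_out?(started_on, timeout, now = Time.now.utc)
  return false if timeout.nil?

  started_on + timeout < now
end

start = Time.utc(2019, 1, 1, 0, 0, 0)
timed_out?(start, nil, start + 7200)   # no timeout configured, never times out
timed_out?(start, 3600, start + 3900)  # 65 minutes elapsed against a 60-minute limit
timed_out?(start, 3600, start + 1800)  # still within the limit
```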
692d813 to e475d8e
e475d8e to 861cc99
Checked commit lfu@861cc99 with ruby 2.4.6, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0 |
Set default playbook service timeout to 100 minutes (cherry picked from commit e1e730f) https://bugzilla.redhat.com/show_bug.cgi?id=1750370
Ivanchuk backport details:
|
Automate has its own timeout. This Ansible runner timeout does not play nice with automate timeout.
https://bugzilla.redhat.com/show_bug.cgi?id=1750370
@miq-bot assign @tinaafitz
@miq-bot add_label bug, Ivanchuk/yes, changelog/yes, blocker
cc @Fryguy @NickLaMuro