Set default playbook service timeout to 100 minutes #19279

lfu · 2019-09-10T14:04:11Z

Automate has its own timeout. This Ansible runner timeout does not play nice with automate timeout.

https://bugzilla.redhat.com/show_bug.cgi?id=1750370

@miq-bot assign @tinaafitz
@miq-bot add_label bug, Ivanchuk/yes, changelog/yes, blocker

cc @Fryguy @NickLaMuro

NickLaMuro

Do we need to update the job_timeout for ansible-runner as well?

https://ansible-runner.readthedocs.io/en/latest/intro.html#runnersettings

Also, not sure where this error in the ticket is coming from.

lfu · 2019-09-10T15:11:49Z

Do we need to update the job_timeout for ansible-runner as well?

@NickLaMuro That is a good point. Is there a value so the ansible job can run forever?
Ideally job_timeout should be the same as automate timeout. But I'm not sure where to get this automate timeout from.
@tinaafitz @gmcculloug Thoughts?

Also, not sure where this error in the ticket is coming from.

From this line.

NickLaMuro · 2019-09-10T15:15:19Z

Also, not sure where this error in the ticket is coming from.

From this line.

Oh butts, I was looking around for this line:

EVM.log -> "restarting worker after [3600] job aborting ..."

Still don't know where that is from, but what you provided is enough. Thanks!

NickLaMuro · 2019-09-10T15:18:19Z

@NickLaMuro That is a good point. Is there a value so the ansible job can run forever?

@lfu I think the quick fix is we can manually set the job_timeout to None:

https://github.com/ansible/ansible-runner/blob/f6f6a24538817a84d09feeac85a747b5b9238c0e/ansible_runner/runner.py#L193

But I would have to check if that works as expected myself. Unsure, but I will look into it for the time being.

dmetzger57 · 2019-09-10T15:39:09Z

@lfu @NickLaMuro so the proposal is to essentially disable the timeout in Runner and rely on Automate's timeout? Is this how things worked with Embedded Tower - we relied solely on Automate's timeout?

Fryguy · 2019-09-10T15:42:55Z

Since the AnsibleRunnerWorkflow can be used outside of an automate context (e.g. for provider operations), I don't think we should use the automate timeout solely. I could see the automate specific caller setting it to None in order to allow automate's timeout to supercede (however, we still have to call .stop on the thing somehow), but I don't we should set the default of AnsibleRunnerWorkflow to None.

Fryguy · 2019-09-10T15:43:48Z

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

NickLaMuro · 2019-09-10T15:54:50Z

Since the AnsibleRunnerWorkflow can be used outside of an automate context (e.g. for provider operations), I don't think we should use the automate timeout solely...

Are you suggesting that automate is the only one that uses this timeout, because I am pretty sure that services will make use of it too:

https://github.com/ManageIQ/manageiq/blob/575def3be/app/models/manageiq/providers/ansible_runner_workflow.rb#L51

Everything should be using AnsiblePlaybookWorkflow (inherits from the above) to handle running jobs, so I don't see how this wouldn't be used by all of them.

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

I don't think anyone is suggesting it is.

NickLaMuro · 2019-09-10T16:00:17Z

however, we still have to call .stop on the thing somehow

Yeah, that I agree is a probably with using job_timeout: None, we would be probably risking that, so it would be a poor choice long term to go with that option.

lfu · 2019-09-10T16:09:42Z

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

timeout parameter is passed with nil which would bypass the timeout checking.

lfu · 2019-09-10T16:23:51Z

however, we still have to call .stop on the thing somehow

Does .stop actually kill the ansible job? If that is the case, we might want to do it in process_abort when job got timeout from automate.

[----] W, [2019-09-10T10:36:55.287875 #1866:2b17c6bfc5bc]  WARN -- : MIQ(ManageIQ::Providers::AnsiblePlaybookWorkflow#timeout!) Job: guid: [356c7324-14ac-48f8-aa0b-174ae45028a1], job timed out after 305.327338904 seconds of inactivity.  Inactivity threshold [300 seconds], aborting
[----] I, [2019-09-10T10:36:55.297601 #1866:2b17c6bfc5bc]  INFO -- : MIQ(MiqQueue.put) Message id: [22439],  id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [2], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 305.327338904 seconds of inactivity.  Inactivity threshold [300 seconds]", "error"]

NickLaMuro · 2019-09-10T16:24:54Z

@lfu I think the concern with this is that there is a default option for :timeout in the kwargs of this class:

manageiq/app/models/manageiq/providers/ansible_runner_workflow.rb

Line 4 in 57e8844

timeout: 1.hour, poll_interval: 1.second, verbosity: 0, become_enabled: false)

So this is already set by default, unless specifically :timeout => nil is passed.

Fryguy · 2019-09-10T16:33:23Z

Does .stop actually kill the ansible job? If that is the case, we might want to do it in process_abort when job got timeout from automate.

Yes. So, yes, if the service chooses to avoid the internal AnsiblePlaybookRuner's timeout by passing nil, then process_abort seems like a good place to do the stop, however,

you have to call the stop on the Ansible::Runner::AsyncResponse somehow
the stop has to be called on the same appliance that the ansible runner instance, which is not the same box that the service's check_completed loop is running on. That is, process_abort will have to asynchronously call it on a remote appliance via the queue and then, I guess, wait_for_task, or assume it does it's job.

So, given that, it feels rather complicated.

Fryguy · 2019-09-10T16:34:56Z

What is the actual issue here? Is there a timeout race of some sort? If so, what happens when the service "wins". What is the downside to just using the ansible-runner built-in timeout set to the same value as the service timeout?

NickLaMuro · 2019-09-10T16:41:01Z

What is the actual issue here? Is there a timeout race of some sort?

I think it is simply that something in our system (either the "Automate timeout" or the "runner timeout") are causing the playbook that should take an hour or more to not complete because it reaches the timeout. I don't think the timeout conflict is happening yet, but I think we are concern about this solution not being enough if the underlying timeout in ansible-runner/ansible is then triggered.

What is the downside to just using the ansible-runner built-in timeout set to the same value as the service timeout?

I am personally fine with this, though it was one more "special snowflake" instance instead of having ansible-runner conform to our standard. Don't have strong feelings one way or the other, but that is the counter argument I can think of.

lfu · 2019-09-10T16:57:09Z

Automate has its own timeout which is set by max retries and service timeout is set by execution_ttl. Currently service timeout is passed to the job if it is set.

In the case reported in the BZ, automate timeout is 100 minutes, service timeout is not set and playbook takes around 65 minutes to complete. So with old version, the service was able to finish while with new version, service failed due to the 60 minutes timeout from ansible runner.

Fryguy · 2019-09-10T17:18:23Z

app/models/manageiq/providers/ansible_playbook_workflow.rb

@@ -16,6 +16,12 @@ def launch_runner
    Ansible::Runner.run_async(env_vars, extra_vars, playbook_path, kwargs)
  end

+  def process_abort(*args)


oh nice! I didn't realize process_abort was part of the ansible_playbook_workflow...I thought it was something on the service side.

@lfu Can you move this into the base class? Role will have to do the exact same thing, so this should be in the base runner class.

app/models/manageiq/providers/ansible_playbook_workflow.rb

Fryguy · 2019-09-10T17:24:34Z

service timeout is not set
...
service failed due to the 60 minutes timeout from ansible runner.

So if the service timeout (execution_ttl) is not set, then it's inherting the default ansible_playbook_workflow timeout. It sounds to me like the caller should pass the automation timeout if the service timeout is not set, so as to avoid inherting that default.

That is, somehow tweak this [1] to pass the automation timeout in addition, or change the method to use execution_ttl || automation_timeout

[1] https://github.com/ManageIQ/manageiq/blob/master/app/models/service_ansible_playbook.rb#L112-L126

Fryguy · 2019-09-10T17:26:05Z

@NickLaMuro Thoughts on my proposal vs what's in this PR? The PR at the moment I think looks good, and I'd be ok with merging, but I'm not sure I like the general idea of AnsiblePlaybookWorkflow not "owning" it's own timeout. Feels like we could get in trouble if it's allowed to run forever and expecting something else to clean it up. I'm not sure....thoughts?

lfu · 2019-09-10T17:50:29Z

That is, somehow tweak this [1] to pass the automation timeout in addition, or change the method to use execution_ttl || automation_timeout

This would be the perfect solution! But I doubt it is possible to get the automate timeout from a service or a configuration_script. That is why I have to set it to nil to bypass the timeout checking and default to automate timeout.

NickLaMuro · 2019-09-10T17:55:22Z

Thoughts on my proposal vs what's in this PR? The PR at the moment I think looks good, and I'd be ok with merging, but I'm not sure I like the general idea of AnsiblePlaybookWorkflow not "owning" it's own timeout. Feels like we could get in trouble if it's allowed to run forever and expecting something else to clean it up. I'm not sure....thoughts?

@Fryguy I think it is fine, I just don't know if the expectation from QA is that it is able to run longer than an hour:

https://github.com/sbulage/ansible_playbooks/blob/096f0ab9b4b7eb22bb136e8ce350c58ee73ce3aa/timeout_60mins_playbook.yaml#L2

And I don't know if awx/Tower is allowing it to run for longer. Unsure, but this might get pushed back after this is merged if that is the case.

dmetzger57 · 2019-09-10T18:29:08Z

And I don't know if awx/Tower is allowing it to run for longer. Unsure, but this might get pushed back after this is merged if that is the case.

@NickLaMuro According to QE, the AWX / Tower solution in 5.10 supported greater than an hour runs based on https://bugzilla.redhat.com/show_bug.cgi?id=1750370#c3 if I read that correctly

Fryguy · 2019-09-10T20:21:23Z

app/models/manageiq/providers/ansible_runner_workflow.rb

@@ -55,7 +55,7 @@ def execute
  def poll_runner
    response = Ansible::Runner::ResponseAsync.load(context[:ansible_runner_response])
    if response.running?
-      if started_on + options[:timeout] < Time.now.utc
+      if options[:timeout].present? && (started_on + options[:timeout]) < Time.now.utc
        response.stop

        queue_signal(:abort, "ansible #{execution_type} has been running longer than timeout", "error")


@lfu in the non-automate case where timeout is passed, then this will try to abort after doing response.stop and then I think it will hit the process_abort method and try to response.stop again. If so, I think you need to remove the response.stop 2 lines up and just let it flwo through your new abort method. Can you verify?

EDIT: Also, I'm not sure this path even works...does queue_signal(:abort) even work?

I'm sure queue_signal(:abort) works!

app/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script.rb

…method. https://bugzilla.redhat.com/show_bug.cgi?id=1750370

miq-bot · 2019-09-11T13:37:04Z

Checked commit lfu@861cc99 with ruby 2.4.6, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
3 files checked, 0 offenses detected
Everything looks fine. 🍪

Set default playbook service timeout to 100 minutes (cherry picked from commit e1e730f) https://bugzilla.redhat.com/show_bug.cgi?id=1750370

simaishi · 2019-09-11T16:36:14Z

Ivanchuk backport details:

$ git log -1
commit 55bc9be7150764c2fa69037d781308bedc7c8327
Author: Jason Frey <[email protected]>
Date:   Wed Sep 11 12:25:56 2019 -0400

    Merge pull request #19279 from lfu/ansible_runner_timeout_1750370
    
    Set default playbook service timeout to 100 minutes
    
    (cherry picked from commit e1e730fb136f2702a02b04f815eb93f94934a426)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1750370

miq-bot assigned tinaafitz Sep 10, 2019

miq-bot added blocker bug changelog/yes ivanchuk/yes labels Sep 10, 2019

NickLaMuro reviewed Sep 10, 2019

View reviewed changes

lfu force-pushed the ansible_runner_timeout_1750370 branch from c4f9573 to 57e8844 Compare September 10, 2019 15:52

lfu force-pushed the ansible_runner_timeout_1750370 branch from 57e8844 to 64f841d Compare September 10, 2019 16:42

Fryguy reviewed Sep 10, 2019

View reviewed changes

app/models/manageiq/providers/ansible_playbook_workflow.rb Outdated Show resolved Hide resolved

lfu force-pushed the ansible_runner_timeout_1750370 branch from 64f841d to fb65d9a Compare September 10, 2019 17:52

lfu force-pushed the ansible_runner_timeout_1750370 branch from fb65d9a to 29188fe Compare September 10, 2019 20:12

Fryguy reviewed Sep 10, 2019

View reviewed changes

lfu force-pushed the ansible_runner_timeout_1750370 branch 2 times, most recently from 692d813 to e475d8e Compare September 10, 2019 21:58

lfu changed the title ~~Need to turn off Ansible runner timeout for automate.~~ Set default playbook service timeout to 100 minutes Sep 10, 2019

Fryguy reviewed Sep 10, 2019

View reviewed changes

app/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script.rb Outdated Show resolved Hide resolved

Set default timeout to 100 minutes for playbook service and playbook …

861cc99

…method. https://bugzilla.redhat.com/show_bug.cgi?id=1750370

lfu force-pushed the ansible_runner_timeout_1750370 branch from e475d8e to 861cc99 Compare September 11, 2019 13:35

Fryguy merged commit e1e730f into ManageIQ:master Sep 11, 2019

Fryguy added this to the Sprint 120 Ending Sep 16, 2019 milestone Sep 11, 2019

simaishi pushed a commit that referenced this pull request Sep 11, 2019

Merge pull request #19279 from lfu/ansible_runner_timeout_1750370

55bc9be

Set default playbook service timeout to 100 minutes (cherry picked from commit e1e730f) https://bugzilla.redhat.com/show_bug.cgi?id=1750370

simaishi added ivanchuk/backported and removed ivanchuk/yes labels Sep 11, 2019

tinaafitz mentioned this pull request Sep 27, 2019

[HAMMER] Add retry_interval to VMCheckTransformed ManageIQ/manageiq-content#585

Merged

lfu deleted the ansible_runner_timeout_1750370 branch November 4, 2019 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set default playbook service timeout to 100 minutes #19279

Set default playbook service timeout to 100 minutes #19279

lfu commented Sep 10, 2019

NickLaMuro left a comment

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019 •

edited

Loading

NickLaMuro commented Sep 10, 2019

dmetzger57 commented Sep 10, 2019

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

lfu commented Sep 10, 2019

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

lfu commented Sep 10, 2019

Fryguy Sep 10, 2019

Fryguy Sep 10, 2019

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

dmetzger57 commented Sep 10, 2019

Fryguy Sep 10, 2019 •

edited

Loading

lfu Sep 10, 2019

miq-bot commented Sep 11, 2019

simaishi commented Sep 11, 2019

Set default playbook service timeout to 100 minutes #19279

Set default playbook service timeout to 100 minutes #19279

Conversation

lfu commented Sep 10, 2019

NickLaMuro left a comment

Choose a reason for hiding this comment

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019 • edited Loading

NickLaMuro commented Sep 10, 2019

dmetzger57 commented Sep 10, 2019

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

lfu commented Sep 10, 2019

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

lfu commented Sep 10, 2019

Fryguy Sep 10, 2019

Choose a reason for hiding this comment

Fryguy Sep 10, 2019

Choose a reason for hiding this comment

Fryguy commented Sep 10, 2019

Fryguy commented Sep 10, 2019

lfu commented Sep 10, 2019

NickLaMuro commented Sep 10, 2019

dmetzger57 commented Sep 10, 2019

Fryguy Sep 10, 2019 • edited Loading

Choose a reason for hiding this comment

lfu Sep 10, 2019

Choose a reason for hiding this comment

miq-bot commented Sep 11, 2019

simaishi commented Sep 11, 2019

NickLaMuro commented Sep 10, 2019 •

edited

Loading

Fryguy Sep 10, 2019 •

edited

Loading