Add a state machine for long ansible operations #17759

Merged
merged 12 commits into from
Jul 30, 2018

agrare
Member

@agrare agrare commented Jul 25, 2018

This adds a Job-based state machine for long-running asynchronous ansible operations.

Providers can use this state machine directly by passing the same arguments they would pass to Ansible::Runner.run to ManageIQ::Providers::AnsibleOperationWorkflow.create_job instead. The timeout and the poll interval can also be set as options to .create_job.

If there are any pre-playbook setup steps or post-playbook cleanup steps, the state machine has hooks for them that can be implemented by deriving from this class.
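
As a rough illustration of the hook pattern described above, here is a minimal sketch in plain Ruby. It is not the ManageIQ class and does not call Ansible::Runner; the class names `SketchAnsibleWorkflow` and `SketchProviderWorkflow`, the argument order, the default values, and the hook names `pre_playbook` and `post_playbook` are all assumptions made for the example.

```ruby
# Hypothetical stand-in for the Job-based workflow described above; it is not
# the ManageIQ class and does not call Ansible::Runner.
class SketchAnsibleWorkflow
  # Assumed defaults, in seconds.
  DEFAULT_OPTIONS = {:timeout => 3600, :poll_interval => 1}.freeze

  attr_reader :options, :steps

  # Mirrors the idea of `.create_job(runner_args..., options)`: the runner
  # arguments are stored next to the timeout and poll interval.
  def self.create_job(env_vars, extra_vars, playbook_path, options = {})
    runner_args = {:env_vars => env_vars, :extra_vars => extra_vars, :playbook_path => playbook_path}
    new(DEFAULT_OPTIONS.merge(options).merge(runner_args))
  end

  def initialize(options)
    @options = options
    @steps   = []
  end

  # Walk the states in order; subclasses only override the two hooks.
  def run
    pre_playbook
    @steps << :execute
    post_playbook
    @steps
  end

  def pre_playbook; end   # hook: setup before the playbook runs
  def post_playbook; end  # hook: cleanup after the playbook finishes
end

# A provider-specific workflow that implements both hooks.
class SketchProviderWorkflow < SketchAnsibleWorkflow
  def pre_playbook
    steps << :write_inventory
  end

  def post_playbook
    steps << :remove_inventory
  end
end

job = SketchProviderWorkflow.create_job({}, {:vm_name => "test"}, "/tmp/playbook.yml", :timeout => 600)
job.run
```

The base class owns the ordering of the states, so a provider author only has to fill in the setup and cleanup steps.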

@miq-bot miq-bot added the wip label Jul 25, 2018
@agrare agrare assigned agrare and gtanzillo and unassigned agrare Jul 25, 2018
@agrare agrare force-pushed the add_ansible_operations_state_machine branch from 4e51332 to eb82037 Compare July 25, 2018 18:54
if pid.nil?
  queue_signal(:error)
else
  context[:ansible_runner_pid] = pid
Contributor

@agrare
This would require that you come back to the same server on a retry, where the process was started. During the requeue you would have to set the server_guid to be the current MiqServer's guid.
@kbrock Server Affinity

Member Author

Yup I'm aware 😄 thanks though @mkanoor
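
The server affinity point in this thread can be illustrated with a toy in-memory queue. This is only a sketch under assumptions: `SketchQueue`, `QueueMessage`, the `:poll_runner` signal, and the guid strings are hypothetical and do not reflect the real queue schema. The idea is that a requeued message carries the guid of the server that started the process, so only that server picks it up.

```ruby
# Toy in-memory queue; the field names (:signal, :server_guid) are assumptions
# chosen to mirror the comment above, not the real queue schema.
QueueMessage = Struct.new(:signal, :server_guid)

class SketchQueue
  def initialize
    @messages = []
  end

  # Enqueue a signal, optionally pinned to one server.
  def put(signal, server_guid = nil)
    @messages << QueueMessage.new(signal, server_guid)
  end

  # A server may take a message that is unpinned or pinned to its own guid.
  def get(my_guid)
    index = @messages.index { |m| m.server_guid.nil? || m.server_guid == my_guid }
    index && @messages.delete_at(index)
  end
end

queue = SketchQueue.new
# The runner process was started on server "guid-a", so the poll is pinned there.
queue.put(:poll_runner, "guid-a")

other_server = queue.get("guid-b") # a different server finds nothing it may take
same_server  = queue.get("guid-a") # the server that owns the pid gets the message
```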

@agrare agrare removed their assignment Jul 26, 2018
@agrare agrare force-pushed the add_ansible_operations_state_machine branch 5 times, most recently from fb094ac to c36784e Compare July 26, 2018 18:05
@agrare
Member Author

agrare commented Jul 26, 2018

@gtanzillo @Ladas this is ready to review, I just need to update it to match @Ladas's changes for running async playbooks.

Member

@gtanzillo gtanzillo left a comment

Looking good... Still reviewing

@agrare agrare force-pushed the add_ansible_operations_state_machine branch from 9f331cc to 9710fcd Compare July 27, 2018 13:04
Contributor

@Ladas Ladas left a comment

Looks great 👍

@mkanoor
Contributor

mkanoor commented Jul 27, 2018

@agrare
Are you planning on returning a MiqTask that can be monitored for completion? It automatically allows us to display the MiqTask in our UI without any changes, so users can monitor it.
@bzwei has implemented a similar interface in his runner:
https://github.com/ManageIQ/manageiq/blob/master/app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb
That is missing here.

@mkanoor
Contributor

mkanoor commented Jul 27, 2018

@agrare Using MiqTask also allows these commands to be called remotely using the REST API. There is already a whole interface built around the MiqTask.

@gmcculloug gmcculloug requested a review from bzwei July 27, 2018 13:56
@agrare
Member Author

agrare commented Jul 27, 2018

> Are you planning on returning a MiqTask that can be monitored for completion.

@mkanoor A job automatically creates an miq_task: https://github.com/ManageIQ/manageiq/blob/master/app/models/job.rb#L32

> @bzwei has implemented a similar interface in his runner
> https://github.com/ManageIQ/manageiq/blob/master/app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb
> Which is missing here.

I don't see anything extra in that state machine around tasks besides setting the started_on time for the task. What is missing that would prevent it from being monitored for completion?

> Using MiqTask also allows these commands to be called remotely using the REST API.

We aren't intending for this to be called directly from the REST API, but rather by provider operations methods.

> There is already a whole interface built around the MiqTask.

I'm confused, are you suggesting that we use MiqTask instead of Job?

@mkanoor
Contributor

mkanoor commented Jul 27, 2018

@agrare
Jobs are not displayed in our UI but MiqTasks are, and we have a whole framework that lets users monitor these tasks. We have had first-hand experience of these delays when they happen at customer sites. Jobs cannot be monitored from the REST API or Automate either.

So there is a Job as well as a MiqTask in the mix, the Job is the internal implementation and the MiqTask is the external view of it.

@bzwei can you comment where you create the MiqTask, I see you update the attributes here

If we can reuse some of the code between the 2 state machines it would help

@agrare
Member Author

agrare commented Jul 27, 2018

@mkanoor there is a task created for all jobs so I'm wondering what about that isn't enough?

> @bzwei can you comment where you create the MiqTask, I see you update the attributes here

The task is created here in the base Job class: https://github.com/ManageIQ/manageiq/blob/master/app/models/job.rb#L32

@mkanoor
Contributor

mkanoor commented Jul 27, 2018

@agrare That sounds good. Can you update some of the properties on the message so we can monitor it? We might be getting something back from the runner that would make sense to the user. I think we had to update the started_time and updated time.

@agrare
Member Author

agrare commented Jul 27, 2018

Most properties are already updated whenever the job is updated https://github.com/ManageIQ/manageiq/blob/master/app/models/job.rb#L44-L46

started_on isn't one of them, so yes, I can set that.
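
To make the job and task relationship discussed here concrete, the sketch below shows a job that creates its own task and mirrors its fields to it on every update, with started_on set separately. The names `SketchJob` and `SketchTask`, the state strings, and the exact set of mirrored fields (state, status, message) are assumptions for illustration; the real behaviour lives in the linked job.rb.

```ruby
# Hypothetical miniature of a job that owns a task and mirrors its fields.
SketchTask = Struct.new(:state, :status, :message, :started_on)

class SketchJob
  attr_reader :task, :state, :status, :message

  # Creating the job also creates the task that the UI would display.
  def initialize
    @task    = SketchTask.new("Queued", "Ok", "Task created", nil)
    @state   = "waiting_to_start"
    @status  = "Ok"
    @message = "Job created"
  end

  # Every job update pushes state, status and message to the task.
  def update(state:, status: @status, message: @message)
    @state, @status, @message = state, status, message
    task.state   = state
    task.status  = status
    task.message = message
  end

  # started_on is not part of the generic sync, so it is set explicitly.
  def start(now = Time.now.utc)
    task.started_on = now
    update(state: "running", message: "Playbook started")
  end
end

job = SketchJob.new
job.start(Time.utc(2018, 7, 27, 12, 0, 0))
```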

@mkanoor
Contributor

mkanoor commented Jul 27, 2018

@agrare
Automate hooks in here:
https://github.com/ManageIQ/manageiq-automation_engine/blob/06af67f5fb365d411f549bea4d8c7a58b4f76926/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_playbook_method.rb#L41
It expects to get a task_id back that it can monitor, looping around with a retry.

Is there a piece of code around the Job that Automate would be calling? Is there a class similar to
ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Playbook
in the Runner model?
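
The "get a task_id back and retry" pattern mentioned in this comment could look roughly like the following. This is a sketch only: `SketchPolledTask`, `check_task`, the in-memory hash, and the state and status strings are hypothetical and are not the Automate implementation.

```ruby
# Hypothetical task store keyed by id; a real caller would look up a task record.
SketchPolledTask = Struct.new(:id, :state, :status)

# Check the task once: report :retry while it is active, otherwise the outcome.
def check_task(tasks, task_id)
  task = tasks.fetch(task_id)
  return :retry unless task.state == "Finished"

  task.status == "Ok" ? :ok : :error
end

tasks = {42 => SketchPolledTask.new(42, "Active", "Ok")}

first_check = check_task(tasks, 42)  # still running, so the caller should retry later
tasks[42].state = "Finished"         # the job completes in the meantime
second_check = check_task(tasks, 42) # the retry now sees the final outcome
```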

@agrare
Member Author

agrare commented Jul 27, 2018

@mkanoor that's not really the purpose of this PR, this is intended to be used by providers in their ops methods. If we want to expose this to automate I'm okay with doing a follow-up PR but I think that is out of scope of this.

This adds a Job state machine for long running async ansible operations.
@agrare agrare force-pushed the add_ansible_operations_state_machine branch from 367544f to 5c306ae Compare July 27, 2018 15:29
@bzwei
Contributor

bzwei commented Jul 27, 2018

@agrare @mkanoor ManageIQ::Providers::EmbeddedAnsible::AutomationManager::PlaybookRunner, which is used by automate, is subclassed from Job, the same as @agrare's AnsibleOperationWorkflow. Job#miq_task gives you the task handler for monitoring.

Once this is ready we may be able to share it with automate, to take advantage of native Ansible support through the runner instead of being forced to create a job template.

@agrare Does the class name have to be AnsibleOperationWorkflow? Would PlaybookRunner or AnsibleRunner be better, to align with the fact that the native Ansible runner is used? Ansible Tower has workflows, which makes AnsibleOperationWorkflow more confusing.

@agrare
Member Author

agrare commented Jul 27, 2018

> @agrare Does the class name have to be AnsibleOperationWorkflow? Would PlaybookRunner or AnsibleRunner be better, to align with the fact that the native Ansible runner is used? Ansible Tower has workflows, which makes AnsibleOperationWorkflow more confusing.

Nope it can be named anything we like :) I was keeping consistent with ManageIQ::Providers::NativeOperationWorkflow but I'm open to anything.

I'd rather not call it ManageIQ::Providers::AnsibleRunner or ManageIQ::Providers::PlaybookRunner just because we already have Ansible::Runner and the key difference with this is that it is a Job/StateMachine around Ansible::Runner. Maybe ManageIQ::Providers::AnsibleRunnerWorkflow ?

@agrare agrare changed the title [WIP] Add a state machine for long ansible operations Add a state machine for long ansible operations Jul 27, 2018
@agrare
Member Author

agrare commented Jul 27, 2018

Taking out of WIP now that @Ladas's async changes are merged

@agrare agrare removed the wip label Jul 27, 2018
@Ladas
Contributor

Ladas commented Jul 30, 2018

ManageIQ::Providers::AnsibleRunnerWorkflow sounds nice

if started_on + options[:timeout] < Time.now.utc
  response.stop

  queue_signal(:abort, "Playbook has been running longer than timeout", "error")
Contributor

Maybe we could log a warning here?

Member Author

Actually this is already logged by process_abort and looks like this:
ERROR -- : MIQ(ManageIQ::Providers::AnsibleOperationWorkflow#process_abort) job aborting, Playbook has been running longer than timeout
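
The timeout check in the diff excerpt above can be tried out in isolation. The sketch below assumes the timeout is in seconds and uses a hypothetical helper `timed_out?`, a `signals` array in place of the real queue, and a `:poll_runner` signal name that is not taken from the source.

```ruby
# Returns true when the playbook has run longer than the timeout (in seconds).
def timed_out?(started_on, timeout, now = Time.now.utc)
  started_on + timeout < now
end

started_on = Time.utc(2018, 7, 30, 10, 0, 0)
signals    = []

# Poll at two points in time; queue an abort only once the deadline has passed.
[Time.utc(2018, 7, 30, 10, 30, 0), Time.utc(2018, 7, 30, 11, 0, 1)].each do |now|
  if timed_out?(started_on, 3600, now)
    signals << [:abort, "Playbook has been running longer than timeout", "error"]
  else
    signals << [:poll_runner]
  end
end
```

Passing the current time as an argument keeps the check deterministic, which makes it easy to test without waiting for a real timeout.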

context[:ansible_runner_return_code] = result.return_code
context[:ansible_runner_stdout] = result.parsed_stdout

set_status("Playbook failed", "error") if result.return_code != 0
Contributor

We should probably also log it. The thing is, the response object is the only one holding the result (we are deleting the data from the filesystem), so we need to show the failure somewhere.

I would probably start by logging an error with the whole response, and then we can see if I can extract response.error_message and response.traceback/full_error in a generic way.

Member Author

Yeah, you're right, this wouldn't log anything. I'll log the full parsed_stdout if it fails for now. What I'd like to do is set the status to a useful error message if we can get at that in a meaningful way, but this might have to be a post_playbook operation done specifically by the provider author rather than generically.
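
A minimal sketch of the "log the full parsed_stdout on failure" idea follows. It assumes parsed_stdout is an array of event hashes and uses a hypothetical `SketchResult` struct and `record_result` helper with Ruby's standard Logger; the real logging added in the PR may differ.

```ruby
require "logger"
require "stringio"

# Hypothetical result object exposing the two fields used in the diff above.
SketchResult = Struct.new(:return_code, :parsed_stdout)

# Record the runner output in the context and log it when the playbook failed.
def record_result(result, context, logger)
  context[:ansible_runner_return_code] = result.return_code
  context[:ansible_runner_stdout]      = result.parsed_stdout
  return "ok" if result.return_code.zero?

  logger.error("Playbook failed with return code #{result.return_code}")
  result.parsed_stdout.each { |event| logger.error(event.inspect) }
  "error"
end

log_output = StringIO.new
context    = {}
failed     = SketchResult.new(2, [{"event" => "runner_on_failed", "stdout" => "fatal: unreachable"}])
status     = record_result(failed, context, Logger.new(log_output))
```

Logging every event keeps the failure debuggable even after the runner's data has been removed from the filesystem, at the cost of noisy logs for large playbooks.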

Contributor

@Ladas Ladas left a comment

Looks great 👍

Let's take the logs/output to another PR. We'll need to make sure we can debug what went wrong from the log, up to the nice notification in the UI.

@agrare
Member Author

agrare commented Jul 30, 2018

Okay @Ladas @bzwei renamed the Job to ManageIQ::Providers::AnsibleRunnerWorkflow and added some error logging.

Member

@gtanzillo gtanzillo left a comment

Looks awesome, thanks @agrare!

@agrare agrare force-pushed the add_ansible_operations_state_machine branch from 311a0ce to 37acbdd Compare July 30, 2018 20:51
@miq-bot
Member

miq-bot commented Jul 30, 2018

Checked commits agrare/manageiq@f70974f~...37acbdd with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
2 files checked, 0 offenses detected
Everything looks fine. ⭐

@gtanzillo gtanzillo added this to the Sprint 91 Ending Jul 30, 2018 milestone Jul 30, 2018
@gtanzillo gtanzillo merged commit f2aadc3 into ManageIQ:master Jul 30, 2018
@agrare agrare deleted the add_ansible_operations_state_machine branch July 30, 2018 21:37