
Cause the clone task to restart if CheckProvision is in error #43

Closed

Conversation

@mansam (Contributor) commented Feb 9, 2017

Cause the Clone from VM task to restart from the Placement step if CheckProvision enters an error state.

@gmcculloug this would go along with ManageIQ/manageiq#13608 to ensure that the exception that can be raised during cloning results in the state machine restarting.

@gmcculloug requested a review from mkanoor February 10, 2017 13:17
@gmcculloug self-assigned this Feb 10, 2017
@gmcculloug (Member) commented
I can see something like this method existing as an example, but it cannot be wired into the state machine by default. As written, this would put the provision into a loop. It also does not take manual placement into account, which means we would not change the settings but would continually retry the provision.

@mansam (Contributor Author) commented Feb 10, 2017

@gmcculloug Is there a way I should write this to manually take advantage of the max_retries setting, since it sounds like this wouldn't respect it, or is there some other way I should approach this?

$evm.log(:error, "miq_provision object not provided")
exit(MIQ_STOP)
end
status = $evm.inputs['status']
Contributor

@mansam
You would need some condition here to check that Placement really failed; if you had a script error, this would keep running forever.

Contributor Author

@mkanoor Is the exception that resulted in the error state exposed to Automate so that the script can make a decision based on it, or do I have to determine it another way? Additionally, is there a way to have this automatically respect max_retries, or am I going to have to implement my own mechanism?

Contributor

@mansam
@gmcculloug Correct me if I am wrong.
There are two state machines here: one internal, inside app/models, and the other in Automate.

Internal State Machine:
start_clone_task is an asynchronous process that talks to the provider to get the provisioning started.
Then poll_clone_complete waits with retries and signals once the VM is in the VMDB, after it has done an ems_refresh.

https://github.com/ManageIQ/manageiq-providers-vmware/blob/master/app/models/manageiq/providers/vmware/infra_manager/provision/state_machine.rb#L28

The external state machine defined in Automate is in the check_provisioned loop, waiting for the machine to show up in our database by checking the status of the task.

https://github.com/ManageIQ/manageiq-content/blob/master/content/automate/ManageIQ/Cloud/VM/Provisioning/StateMachines/Methods.class/__methods__/check_provisioned.rb#L7

So I am guessing we would have to depend on the provisioning task to provide more errors from the cloning process; I don't think we have an explicit column that stores the error code from the provider. So I think we would have to parse the task's error message and see if we can glean something from there to trigger a retry on placement.
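
As a rough illustration of that approach, the CheckProvisioned error path could match on the task message and ask the state machine to restart. This is only a sketch; the matched string and the step name are placeholders, not the shipped method:

# Sketch (not the shipped method): restart from Placement only when the task
# message looks like a provider placement failure.
prov = $evm.root['miq_provision']
exit(MIQ_STOP) if prov.nil?

if prov.message.to_s.include?("An error occurred while provisioning Instance")  # placeholder match string
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
end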

Contributor Author

@mkanoor Parsing the error message seems like a viable solution for this. Any thoughts on how I can make this work with the max_retries mechanism, or if I should even try?

Member

Max retry is a per-step check so it will not work in this case. You would need to use set_state_var and get_state_var to monitor the number of attempts. https://github.com/pemcg/manageiq-automation-howto-guide/blob/master/chapter12/state_machines.md#saving-variables-between-state-retries
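
In sketch form, the pattern being pointed at looks roughly like this; the counter name and the limit of 3 are illustrative:

# Sketch: persist a retry counter across state retries/restarts with state vars.
retries = $evm.state_var_exist?(:provision_retries) ? $evm.get_state_var(:provision_retries) : 0

if retries < 3
  $evm.set_state_var(:provision_retries, retries + 1)
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
end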

Contributor Author

@gmcculloug Thanks for the clarification.

prov.miq_request.user_message = updated_message
prov.message = status

if $evm.root['ae_result'] == "error"
Contributor

@mansam
In line 13 we set ae_result to 'restart', so this line will not get executed unless you put a condition around setting ae_result to 'restart'.


DEFAULT_RETRIES = 3
max_placement_retries = $evm.inputs['retries'] || DEFAULT_RETRIES
if prov.message.include?("An error occurred while provisioning Instance")
Contributor

@mansam
Is this a good enough message to capture all cloud placement errors?
Do you want to expose the "placement error message" as a passed-in value from $evm.inputs so someone can customize it if they want? Something like:

DEFAULT_PLACEMENT_ERROR_MESSAGE = "An error occurred while provisioning Instance"
placement_error_message = $evm.inputs['placement_error_message'] || DEFAULT_PLACEMENT_ERROR_MESSAGE

Contributor Author

I'm only aware of the one exception, but exposing the error message seems like a good way to go about it.

@mkanoor (Contributor) left a comment

@mansam force-pushed the retry-clone-to-vm-from-placement branch 7 times, most recently from abb8ebc to ab21ccf on February 14, 2017 13:46
@mansam (Contributor Author) commented Feb 14, 2017

@mkanoor I think this is ready for you to re-review. :)

@handle.root['ae_result'] = 'restart'
@handle.root['ae_next_state'] = @restart_from_state
@handle.log("info", "Provisioning #{@prov.get_option(:vm_target_name)} failed, retrying #{@restart_from_state}.")
@handle.set_state_var(:state_retries, @retry_number + 1)
Contributor

@mansam
Would placement_retries be more appropriate here than state_retries?

Contributor Author

No, the entire script has been refactored to be more generic since there was no reason it needed to be placement-specific.

Member

@mansam Not disagreeing with you, but state_retries is too generic and likely confusing to anyone not intimately familiar with this PR. 😉

Maybe provision_retries would be a better name.

updated_message += "Status [#{status}] "
updated_message += "Message [#{@prov.message}] "
@prov.miq_request.user_message = updated_message
@prov.message = status
Contributor

@mansam
Should the message indicate that we are retrying placement?

Contributor Author

Yeah, it's probably a good idea to indicate in the status message what is happening in the event that the state is retried.
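
One possible way to reflect that in the message-building code above; the wording is a placeholder and the @-variables mirror the ones already in the diff, so this is purely a sketch:

# Sketch: note the restart in the user-facing message before retrying.
updated_message += "Retrying #{@restart_from_state} (attempt #{@retry_number + 1} of #{@max_retries}) "
@prov.miq_request.user_message = updated_message
@prov.message = status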

let(:ae_service) { Spec::Support::MiqAeMockService.new(root_object) }

it "retries and increments count because of a matching error" do
allow(ae_service).to receive(:inputs).and_return({})
Contributor

@mansam @gmcculloug
If this PR ManageIQ/manageiq#13912 gets merged, then you don't need this line.
You can set the inputs hash from the spec if you need it.

Contributor Author

Okay, great.

@mansam force-pushed the retry-clone-to-vm-from-placement branch from ab21ccf to 72c1740 on February 15, 2017 15:54
@miq-bot (Member) commented Feb 15, 2017

Checked commits mansam/manageiq-content@8e7e07a~...72c1740 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 1 offense detected

content/automate/ManageIQ/Cloud/VM/Provisioning/StateMachines/VMProvision_VM.class/methods/retry_state.rb

@mansam (Contributor Author) commented Feb 15, 2017

@mkanoor any other changes that need to be made?

@mansam (Contributor Author) commented Feb 20, 2017

Hi @mkanoor, anything else you need me to do on this?

@gmcculloug (Member) left a comment

@mansam There are still a number of changes required. Also, I'm concerned that jumping back to Placement in the current out-of-the-box state machine is going to have minimal effect.

For example, the default placement has a guard clause in place so it will only run the placement method if auto-placement has been selected, which means no changes will be applied and you will re-run the exact same provision again.

And if you do have auto-placement set, the methods check whether values have already been selected (see https://github.com/ManageIQ/manageiq-content/blob/master/content/automate/ManageIQ/Cloud/VM/Provisioning/Placement.class/__methods__/best_fit_openstack.rb#L13 for an example).

I believe I mentioned in a previous PR that once the task is in an error state it likely needs to be reset before you can jump back into the state machine and rerun it.

Not sure if you are aware of these details or are planning on introducing them in separate PRs. Currently I see no reason to introduce this change without some of the other issues being addressed first.

I would suggest testing all of this out on an appliance to flush out some of these details.

cc @tzumainn
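
For context, the guard-clause pattern described above looks roughly like this; it is a sketch only, and the option names are illustrative rather than copied from the shipped placement methods:

prov = $evm.root['miq_provision']

# Hypothetical guard: only run auto-placement when it was requested, so a
# restart with manual placement re-runs the exact same provision unchanged.
exit(MIQ_OK) unless prov.get_option(:placement_auto)

# The shipped methods also return early when a destination has already been
# chosen, so a retry from Placement would likely need to clear those options first.
# ... auto-placement selection logic follows ...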

module Generic
module StateMachines
module VMProvision_VM
class RetryState
Member

This namespace ManageIQ::Automate::Service::Generic::StateMachines::VMProvision_VM does not match the intended placement: ManageIQ::Cloud::VM::Provisioning::StateMachines::VMProvision_VM

Contributor Author

Noted, thanks

end

def retry_state
if @prov.message.include?(@error_to_catch) && (@retry_number < @max_retries)
Member

@prov.message.include?(@error_to_catch) is a fragile way to make this decision. Pretty sure tests will not catch this if the message coming from the backend methods changes. I know you are trying to make this logic generic and reusable, but I'm wondering if there is a better way. Need to review this more.

@mansam (Contributor Author) commented Feb 24, 2017

I'm definitely open to suggestions on a better way to do this. Parsing the exception is what was previously recommended to me, but like you said it's quite fragile.


def main
@handle.log("info", "Starting retry_state")
retry_state
Member

@mansam This currently affects all cloud provisioning; is that intended, or do we need to limit the scope?
cc @tinaafitz @mkanoor

Contributor Author

My intention is just to affect OpenStack provisioning. I might have made a mistake with my scoping.

@@ -10,6 +10,8 @@ object:
fields:
- PreProvision:
value: "/Cloud/VM/Provisioning/StateMachines/Methods/PreProvision_Clone_to_VM#${/#miq_provision.source.vendor}"
- CheckProvisioned:
on_error: retry_state(status => 'Error Creating VM')
Member

The same change would be required in the template.yaml in the same directory. One instance is used when you select the template from Lifecycle -> Provision Instances, and the other when you select the image and use Lifecycle -> Provision Instances using this image.

@miq-bot closed this Dec 11, 2017