
Cause the clone task to restart if CheckProvision is in error #43

Closed

Conversation

@mansam (Contributor) commented Feb 9, 2017

Cause the Clone from VM task to restart from the Placement step if CheckProvision enters an error state.

@gmcculloug this would go along with ManageIQ/manageiq#13608 to ensure that the exception that can be raised during cloning results in the state machine restarting.

@gmcculloug requested a review from mkanoor February 10, 2017 13:17
@gmcculloug self-assigned this Feb 10, 2017
@gmcculloug (Member) commented
I can see something like this method existing as an example, but it cannot be wired into the state machine by default. As written, this would put the provision into a loop. It also does not take manual placement into account, which means we would not change the settings but would continually retry the provision.

@mansam (Contributor Author) commented Feb 10, 2017

@gmcculloug Is there a way I should write this to manually take advantage of the max_retries setting, since it sounds like this wouldn't respect it, or is there some other way I should approach this?

$evm.log(:error, "miq_provision object not provided")
exit(MIQ_STOP)
end
status = $evm.inputs['status']
Contributor

@mansam
You would need some condition here to check that Placement really failed; if you had a script error, this would keep running forever.

Contributor Author

@mkanoor Is the exception that resulted in the error state exposed to Automate so that the script can make a decision based on it, or do I have to determine it another way? Additionally, is there a way to have this automatically respect max_retries, or am I going to have to implement my own mechanism?

Contributor

@mansam
@gmcculloug Correct me if I am wrong.
There are two state machines here: one internal, inside app/models, and the other in Automate.

Internal State Machine:
start_clone_task is an asynchronous process that talks to the provider to get the provisioning started.
Then poll_clone_complete waits with retries and signals once the VM is in the VMDB, after it has done an ems_refresh.

https://github.com/ManageIQ/manageiq-providers-vmware/blob/master/app/models/manageiq/providers/vmware/infra_manager/provision/state_machine.rb#L28

The external state machine defined in Automate is in the check_provisioned loop, waiting for the machine to show up in our database by checking the status of the task.

https://github.com/ManageIQ/manageiq-content/blob/master/content/automate/ManageIQ/Cloud/VM/Provisioning/StateMachines/Methods.class/__methods__/check_provisioned.rb#L7

So I am guessing we would have to depend on the provisioning task to provide more errors from the cloning process; I don't think we have an explicit column that stores the error code from the provider. So I think we would have to parse the task's error message and see if we can glean something from there to trigger a retry on placement.
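
As a rough illustration of that approach, the CheckProvisioned error path could match on the task message and ask the state machine to restart. This is only a sketch; the matched string and the step name are placeholders, not the shipped method:

# Sketch (not the shipped method): restart from Placement only when the task
# message looks like a provider placement failure.
prov = $evm.root['miq_provision']
exit(MIQ_STOP) if prov.nil?

if prov.message.to_s.include?("An error occurred while provisioning Instance")  # placeholder match string
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
end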

Contributor Author

@mkanoor Parsing the error message seems like a viable solution for this. Any thoughts on how I can make this work with the max_retries mechanism, or if I should even try?

Member

Max retry is a per-step check so it will not work in this case. You would need to use set_state_var and get_state_var to monitor the number of attempts. https://github.com/pemcg/manageiq-automation-howto-guide/blob/master/chapter12/state_machines.md#saving-variables-between-state-retries
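
In sketch form, the pattern being pointed at looks roughly like this; the counter name and the limit of 3 are illustrative:

# Sketch: persist a retry counter across state retries/restarts with state vars.
retries = $evm.state_var_exist?(:provision_retries) ? $evm.get_state_var(:provision_retries) : 0

if retries < 3
  $evm.set_state_var(:provision_retries, retries + 1)
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
end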

Contributor Author

@gmcculloug Thanks for the clarification.

prov.miq_request.user_message = updated_message
prov.message = status

if $evm.root['ae_result'] == "error"
Contributor

@mansam
In line 13 we set ae_result to 'restart', so this line will not get executed unless you put a condition around setting ae_result to 'restart'.


DEFAULT_RETRIES = 3
max_placement_retries = $evm.inputs['retries'] || DEFAULT_RETRIES
if prov.message.include?("An error occurred while provisioning Instance")
Contributor

@mansam
Is this a good enough message to capture all cloud placement errors?
Do you want to expose the "placement error message" as a passed-in value from $evm.inputs so someone can customize it if they want? Something like:

DEFAULT_PLACEMENT_ERROR_MESSAGE = "An error occurred while provisioning Instance"
placement_error_message = $evm.inputs['placement_error_message'] || DEFAULT_PLACEMENT_ERROR_MESSAGE

Contributor Author

I'm only aware of the one exception, but exposing the error message seems like a good way to go about it.

@mkanoor (Contributor) left a comment

@mansam force-pushed the retry-clone-to-vm-from-placement branch 7 times, most recently from abb8ebc to ab21ccf on February 14, 2017 13:46
@mansam (Contributor Author) commented Feb 14, 2017

@mkanoor I think this is ready for you to re-review. :)

@handle.root['ae_result'] = 'restart'
@handle.root['ae_next_state'] = @restart_from_state
@handle.log("info", "Provisioning #{@prov.get_option(:vm_target_name)} failed, retrying #{@restart_from_state}.")
@handle.set_state_var(:state_retries, @retry_number + 1)
Contributor

@mansam
Would placement_retries be more appropriate here than state_retries?

Contributor Author

No, the entire script has been refactored to be more generic since there was no reason it needed to be placement-specific.

Member

@mansam Not disagreeing with you, but state_retries is too generic and likely confusing to anyone not intimately familiar with this PR. 😉

Maybe provision_retries would be a better name.

updated_message += "Status [#{status}] "
updated_message += "Message [#{@prov.message}] "
@prov.miq_request.user_message = updated_message
@prov.message = status
Contributor

@mansam
Should the message indicate that we are retrying placement?

Contributor Author

Yeah, it's probably a good idea to indicate in the status message what is happening in the event that the state is retried.
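
One possible way to reflect that in the message-building code above; the wording is a placeholder and the @-variables mirror the ones already in the diff, so this is purely a sketch:

# Sketch: note the restart in the user-facing message before retrying.
updated_message += "Retrying #{@restart_from_state} (attempt #{@retry_number + 1} of #{@max_retries}) "
@prov.miq_request.user_message = updated_message
@prov.message = status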

let(:ae_service) { Spec::Support::MiqAeMockService.new(root_object) }

it "retries and increments count because of a matching error" do
allow(ae_service).to receive(:inputs).and_return({})
Contributor

@mansam @gmcculloug
If this PR ManageIQ/manageiq#13912 gets merged, then you don't need this line.
You can set the inputs hash from the spec if you need it.

Contributor Author

Okay, great.

@mansam force-pushed the retry-clone-to-vm-from-placement branch from ab21ccf to 72c1740 on February 15, 2017 15:54
@miq-bot (Member) commented Feb 15, 2017

Checked commits mansam/manageiq-content@8e7e07a~...72c1740 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 1 offense detected

content/automate/ManageIQ/Cloud/VM/Provisioning/StateMachines/VMProvision_VM.class/methods/retry_state.rb

@mansam (Contributor Author) commented Feb 15, 2017

@mkanoor any other changes that need to be made?

@mansam (Contributor Author) commented Feb 20, 2017

Hi @mkanoor, anything else you need me to do on this?

@gmcculloug (Member) left a comment

@mansam There are still a number of changes required. Also, I'm concerned that jumping back to Placement in the current out-of-the-box state machine is going to have minimal effect.

For example, the default placement has a guard clause in place so it will only run the placement method if auto-placement has been selected, which means no changes will be applied and you will re-run the exact same provision again.

And if you do have auto-placement set, the methods check whether values have already been selected (see https://github.com/ManageIQ/manageiq-content/blob/master/content/automate/ManageIQ/Cloud/VM/Provisioning/Placement.class/__methods__/best_fit_openstack.rb#L13 for an example).

I believe I mentioned in a previous PR that once the task is in an error state it likely needs to be reset before you can jump back into the state machine and rerun it.

Not sure if you are aware of these details or are planning on introducing them in separate PRs. Currently I see no reason to introduce this change without some of the other issues being addressed first.

I would suggest testing all of this out on an appliance to flush out some of these details.

cc @tzumainn
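
For context, the guard-clause pattern described above looks roughly like this; it is a sketch only, and the option names are illustrative rather than copied from the shipped placement methods:

prov = $evm.root['miq_provision']

# Hypothetical guard: only run auto-placement when it was requested, so a
# restart with manual placement re-runs the exact same provision unchanged.
exit(MIQ_OK) unless prov.get_option(:placement_auto)

# The shipped methods also return early when a destination has already been
# chosen, so a retry from Placement would likely need to clear those options first.
# ... auto-placement selection logic follows ...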

module Generic
module StateMachines
module VMProvision_VM
class RetryState
Member

This namespace ManageIQ::Automate::Service::Generic::StateMachines::VMProvision_VM does not match the intended placement: ManageIQ::Cloud::VM::Provisioning::StateMachines::VMProvision_VM

Contributor Author

Noted, thanks

end

def retry_state
if @prov.message.include?(@error_to_catch) && (@retry_number < @max_retries)
Member

@prov.message.include?(@error_to_catch) is a fragile way to make this decision. Pretty sure tests will not catch this if the message coming from the backend methods changes. I know you are trying to make this logic generic and reusable, but I'm wondering if there is a better way. Need to review this more.

@mansam (Contributor Author) commented Feb 24, 2017

I'm definitely open to suggestions on a better way to do this. Parsing the exception is what was previously recommended to me, but like you said it's quite fragile.


def main
@handle.log("info", "Starting retry_state")
retry_state
Member

@mansam This currently affects all cloud provisioning; is that intended, or do we need to limit the scope?
cc @tinaafitz @mkanoor

Contributor Author

My intention is just to affect OpenStack provisioning. I might have made a mistake with my scoping.

@@ -10,6 +10,8 @@ object:
fields:
- PreProvision:
value: "/Cloud/VM/Provisioning/StateMachines/Methods/PreProvision_Clone_to_VM#${/#miq_provision.source.vendor}"
- CheckProvisioned:
on_error: retry_state(status => 'Error Creating VM')
Member

The same change would be required in the template.yaml in the same directory. One instance is used when you select the template from Lifecycle -> Provision Instances, and the other when you select the image and use Lifecycle -> Provision Instances using this image.

@miq-bot closed this Dec 11, 2017