Cause the clone task to restart if CheckProvision is in error #43
Conversation
I can see something like this method existing as an example but it cannot be wired into the state machine by default. As written this would put the provision into a loop. It also does not take into account manual placement, which means we would not change the settings but continually retry the provision.
@gmcculloug Is there a way I should be writing this to manually take advantage of the max_retries setting since it sounds like this wouldn't respect it, or is there some other way I should be approaching this?
$evm.log(:error, "miq_provision object not provided")
exit(MIQ_STOP)
end
status = $evm.inputs['status']
@mansam
You would need some condition here to check that Placement really failed; if you had a script error this would keep running forever.
@mkanoor Is the exception that resulted in the error state exposed to automate so that the script can make a decision based on it, or do I have to determine another way? Additionally, is there a way to have this automatically respect max_retries, or am I going to have to implement my own mechanism?
@mansam
@gmcculloug Correct me if I am wrong.
There are 2 state machines here: one internal inside app/models and the other in Automate.
Internal State Machine:
The start_clone_task is an asynchronous process that talks to the provider to get the provisioning started.
Then poll_clone_complete waits with a retry and signals once the VM shows up in the VMDB after it has done an ems_refresh.
The external state machine defined in Automate is in the check_provisioned loop, waiting for the machine to show up in our database by checking the status of the task.
So I am guessing we would have to depend on the provisioning task to provide more errors from the cloning process; I don't think we have an explicit column that stores the error code from the provider. So I think we would have to parse the task's error message and see if we can glean something from there to trigger a retry on placement.
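A minimal sketch of that message-gleaning approach (the marker string and method name here are assumptions for illustration; the real message comes from the provider via the provisioning task):

```ruby
# Assumed marker text; the actual wording is whatever the provider's
# cloning error puts into the task message.
PLACEMENT_ERROR_HINT = "An error occurred while provisioning Instance".freeze

# Glean a placement failure from the task's error message, since there is
# no dedicated column storing the provider's error code.
def placement_failure?(task_message)
  task_message.to_s.include?(PLACEMENT_ERROR_HINT)
end
```

Anything matching the marker would then trigger a retry on placement; anything else (including a nil message) would not.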
@mkanoor Parsing the error message seems like a viable solution for this. Any thoughts on how I can make this work with the max_retries mechanism, or if I should even try?
Max retry is a per-step check so it will not work in this case. You would need to use set_state_var and get_state_var to monitor the number of attempts. https://github.com/pemcg/manageiq-automation-howto-guide/blob/master/chapter12/state_machines.md#saving-variables-between-state-retries
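The counting pattern might look roughly like this (a minimal sketch; the hash-backed FakeEvm below stands in for the real $evm, whose set_state_var/get_state_var/state_var_exist? persist values across state retries):

```ruby
# Hash-backed stand-in for the Automate service object; the real $evm
# persists state vars across retries of the same state machine run.
class FakeEvm
  def initialize
    @state_vars = {}
  end

  def state_var_exist?(name)
    @state_vars.key?(name)
  end

  def get_state_var(name)
    @state_vars[name]
  end

  def set_state_var(name, value)
    @state_vars[name] = value
  end
end

MAX_PLACEMENT_RETRIES = 3

# Allow a retry only while the stored attempt count is below the cap,
# incrementing the count on each allowed attempt.
def placement_retry_allowed?(evm)
  attempts = evm.state_var_exist?(:placement_retries) ? evm.get_state_var(:placement_retries) : 0
  return false if attempts >= MAX_PLACEMENT_RETRIES
  evm.set_state_var(:placement_retries, attempts + 1)
  true
end
```

With the cap at 3, the first three calls return true and the fourth returns false, which is the per-run limit max_retries cannot provide here.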
@gmcculloug Thanks for the clarification.
prov.miq_request.user_message = updated_message
prov.message = status

if $evm.root['ae_result'] == "error"
@mansam
In line 13 we set ae_result to 'restart', so this line will not get executed unless you put a condition around setting ae_result to restart.

DEFAULT_RETRIES = 3
max_placement_retries = $evm.inputs['retries'] || DEFAULT_RETRIES
if prov.message.include?("An error occurred while provisioning Instance")
@mansam
Is this a good enough message to capture all cloud placement errors?
Do you want to expose the "placement error message" as a passed-in value from $evm.inputs so someone can customize it if they want:
DEFAULT_PLACEMENT_ERROR_MESSAGE = "An error occurred while provisioning Instance"
placement_error_message = $evm.inputs['placement_error_message'] || DEFAULT_PLACEMENT_ERROR_MESSAGE
I'm only aware of the one exception, but I think exposing the error message seems like a good way to go about things.
@mansam
Could you also add a spec for this method.
For the newer methods we are following a module approach so that we can test them outside of the Automate Engine. A good example to follow would be a method and spec like
Method
Spec
@mkanoor I think this is ready for you to rereview. :)
@handle.root['ae_result'] = 'restart'
@handle.root['ae_next_state'] = @restart_from_state
@handle.log("info", "Provisioning #{@prov.get_option(:vm_target_name)} failed, retrying #{@restart_from_state}.")
@handle.set_state_var(:state_retries, @retry_number + 1)
@mansam
Would placement_retries be more appropriate here instead of state_retries?
No, the entire script has been refactored to be more generic since there was no reason it needed to be placement-specific.
@mansam Not disagreeing with you, but state_retries is too generic and likely confusing to anyone not intimately familiar with this PR. 😉 Maybe provision_retries would be a better name.
updated_message += "Status [#{status}] "
updated_message += "Message [#{@prov.message}] "
@prov.miq_request.user_message = updated_message
@prov.message = status
@mansam
Should the message indicate that we are retrying placement?
Yeah, probably a good idea to indicate in the status message what is happening in the event that the state is retried.
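That status update might be sketched like this (a hypothetical helper; the field layout follows the snippet quoted above, and the attempt-count wording is an assumption):

```ruby
# Hypothetical helper: builds a user_message that also records that a
# retry is in progress, per the review suggestion above.
def retry_user_message(status, prov_message, attempt, max_retries)
  msg = "Status [#{status}] "
  msg << "Message [#{prov_message}] "
  msg << "Retrying provision (attempt #{attempt} of #{max_retries})"
  msg
end
```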
let(:ae_service) { Spec::Support::MiqAeMockService.new(root_object) }

it "retries and increments count because of a matching error" do
allow(ae_service).to receive(:inputs).and_return({})
@mansam @gmcculloug
If this PR ManageIQ/manageiq#13912 gets merged then you don't need to have this line.
You can set inputs to be a hash from the spec if you need it.
Okay great
Checked commits mansam/manageiq-content@8e7e07a~...72c1740 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0 content/automate/ManageIQ/Cloud/VM/Provisioning/StateMachines/VMProvision_VM.class/methods/retry_state.rb
@mkanoor any other changes that need to be made?
Hi @mkanoor, anything else you need me to do on this?
@mansam Still a number of changes required. Also, I'm concerned that jumping back to the placement in the current out-of-the-box state-machine is going to have minimal effect.
For example, the default placement has a guard clause in place so it will only run the placement method if auto-placement has been selected, which means no changes will be applied and you will re-run the exact same provision again.
And if you do have auto-placement set, the methods check whether values have already been selected (see https://github.com/ManageIQ/manageiq-content/blob/master/content/automate/ManageIQ/Cloud/VM/Provisioning/Placement.class/__methods__/best_fit_openstack.rb#L13 for an example).
I believe I mentioned before in a previous PR that once the task is in an error state it likely needs to be reset before you can jump back in the state-machine and rerun it.
Not sure if you are aware of these details or planning on introducing them in separate PRs. Currently I see no reason to introduce this change without some of the other issues addressed first.
I would suggest testing all this out on an appliance to flush out some of these details.
cc @tzumainn
module Generic
module StateMachines
module VMProvision_VM
class RetryState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This namespace ManageIQ::Automate::Service::Generic::StateMachines::VMProvision_VM does not match the intended placement: ManageIQ::Cloud::VM::Provisioning::StateMachines::VMProvision_VM
Noted, thanks
end

def retry_state
if @prov.message.include?(@error_to_catch) && (@retry_number < @max_retries)
@prov.message.include?(@error_to_catch) is a fragile way to make this decision. Pretty sure tests will not catch this if the message coming from the backend methods changes. I know you are trying to make this logic generic and re-usable, but wondering if there is a better way. Need to review this more.
I'm definitely open to suggestions on a better way to do this. Parsing the exception is what was previously recommended to me, but like you said it's quite fragile.
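For reference, the branch under discussion can be exercised with plain stand-ins (a sketch only; the real @handle is the Automate service object and the instance variables come from the method's setup, so names and defaults here are assumptions):

```ruby
# Sketch of the retry_state decision quoted above, with @handle replaced
# by a plain hash and the instance variables by keyword arguments.
def retry_state(handle, message:, error_to_catch:, retry_number:, max_retries:,
                restart_from_state: "Placement")
  if message.include?(error_to_catch) && retry_number < max_retries
    handle['ae_result'] = 'restart'
    handle['ae_next_state'] = restart_from_state
  else
    handle['ae_result'] = 'error'
  end
  handle
end
```

Pulling the decision out this way is what makes the fragile include? check easy to unit test, even though it does not make the matching itself any more robust.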

def main
@handle.log("info", "Starting retry_state")
retry_state
@mansam This currently affects all cloud provisioning, is that intended or do we need to limit the scope?
cc @tinaafitz @mkanoor
My intention is to affect just OpenStack provisioning. I might have made a mistake with my scoping.
@@ -10,6 +10,8 @@ object:
fields:
- PreProvision:
value: "/Cloud/VM/Provisioning/StateMachines/Methods/PreProvision_Clone_to_VM#${/#miq_provision.source.vendor}"
- CheckProvisioned:
on_error: retry_state(status => 'Error Creating VM')
Same change would be required to the template.yaml in the same directory. One instance is used when you select the template from Lifecycle -> Provision Instances and the other when you select the image and use Lifecycle -> Provision Instances using this image.
Cause the Clone from VM task to restart from the Placement step if CheckProvision enters an error state.
@gmcculloug this would go along with ManageIQ/manageiq#13608 to ensure that the exception that can be raised during cloning results in the state machine restarting.