
Wait for ems workers to finish before destroying the ems #14848

Merged: 2 commits merged into ManageIQ:master from zeari:orchestrate_destroy2 on Jun 7, 2017

Conversation


@zeari zeari commented Apr 23, 2017

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1437549
Continues work from #14675

  • migration and toggling of the disable flag
  • don't start disabled workers
  • orchestrate deletion (this PR)

@durandom @Fryguy @jrafanie @kbrock I have no idea what's the correct way to do this, so let's discuss that here. The code added here outlines the basic functionality. orchestrate_destroy should replace destroy_queue in https://github.com/manageiq/manageiq-ui-classic/blob/master/app/controllers/ems_common.rb#L898.
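Roughly the flow I have in mind (just a sketch stitched together from the diff below, not final code; the re-check interval and mechanism are exactly what I'd like to discuss):

def orchestrate_destroy
  disable! if enabled?

  workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}")
  if workers.count.zero?
    destroy
  else
    # workers still around: check again in a little while instead of blocking
    MiqQueue.put(
      :deliver_on  => 15.seconds.from_now,
      :class_name  => self.class.name,
      :instance_id => id,
      :method_name => "orchestrate_destroy"
    )
  end
end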

cc @moolitayer @cben

Update (@blomquisg)
BZs:

@zeari zeari changed the title wait for an emss workers to finish before destroying it wait for a ems workers to finish before destroying the ems Apr 23, 2017
@zeari zeari changed the title wait for a ems workers to finish before destroying the ems [WIP] Wait for a ems workers to finish before destroying the ems Apr 23, 2017
@zeari
Author

zeari commented Apr 23, 2017

@miq-bot add_label wip

@miq-bot miq-bot added the wip label Apr 23, 2017
@moolitayer

Can a worker be killed mid-refresh? If so, this might not solve the BZ.

@zeari zeari changed the title [WIP] Wait for a ems workers to finish before destroying the ems [WIP] Wait for ems workers to finish before destroying the ems Apr 23, 2017
Member

@durandom durandom left a comment


How can an orchestrate_destroy call be cancelled?

I also see some queue items rescheduled via MiqException::MiqQueueRetryLater

How do the UI and REST API currently destroy an ems?

# Something like 15 seconds before trying again?
MiqQueue.put(
:deliver_on => 15.seconds.from_now,
:class_name => name,
Member

Shouldn't this be self.class.name?

@@ -426,6 +426,24 @@ def enable!
update!(:enabled => true)
end

# Wait until all associated workers are dead to destroy this ems
def orchestrate_destroy
disable_ems if enabled?
Member

this was renamed to disable!

disable_ems if enabled?

workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}") #better way to get 'queue_name'?
if workers.count == 0
Member

Maybe these checks can be put into a Rails before_destroy callback, and the destroy cancelled via normal Rails methods?
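Something like this, perhaps (sketch only; assert_no_active_workers is a made-up name, and in Rails 5 the callback halts the chain with throw(:abort)):

before_destroy :assert_no_active_workers

def assert_no_active_workers
  # halt the destroy while any worker for this ems is still current or starting
  throw(:abort) unless MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}").count.zero?
end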

@zeari
Author

zeari commented Apr 24, 2017

How can an orchestrate_destroy call be cancelled?

Do we want to enable that? I don't think we currently allow cancelling an ems delete.

@zeari zeari force-pushed the orchestrate_destroy2 branch from 2575061 to aa7c523 Compare April 24, 2017 11:26
@durandom
Member

Do we want to enable that? I don't think we currently allow cancelling an ems delete.

Not sure, but in this loop case it can result in an endless loop. So maybe the queue retry logic is better suited for this...

def orchestrate_destroy
disable! if enabled?

workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}") #better way to get 'queue_name'?
Member

As a follow-up PR, can you do this?

def queue_name_for_ems(ems)
# Host objects do not have dedicated refresh workers so request a generic worker which will
# be used to make a web-service call to a SmartProxy to initiate inventory collection.
return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?
return ems unless ems.kind_of?(ExtManagementSystem)
"ems_#{ems.id}"
end

Change:

def queue_name_for_ems(ems)
  # Host objects do not have dedicated refresh workers so request a generic worker which will
  # be used to make a web-service call to a SmartProxy to initiate inventory collection.
  return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?
  
  return ems unless ems.kind_of?(ExtManagementSystem)
  "ems_#{ems.id}"
end

To:

def queue_name_for_ems(ems)
  # Host objects do not have dedicated refresh workers so request a generic worker which will
  # be used to make a web-service call to a SmartProxy to initiate inventory collection.
  return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?

  return ems unless ems.kind_of?(ExtManagementSystem)
  ems.queue_name
end

And in ExtManagementSystem:

def queue_name
  "ems_#{id}"
end

Finally... then this line becomes:

workers = MiqWorker.find_current_or_starting.where(:queue_name => queue_name)

Author

definitely

Author

:deliver_on => 15.seconds.from_now,
:class_name => self.class.name,
:instance_id => id,
:method_name => "orchestrate_destroy"
Member

Hmm, per-ems workers can be running a single queue item for a long time... we could go through lots of these queue messages before the worker exits and we can destroy the ems.

How often does this happen in practice? Can we just wait 5 minutes between checks?

Author

How often does this happen in practice? Can we just wait 5 minutes between checks?

@jrafanie I think users expect the deletion process to be fairly quick. Can we instead cancel running jobs? Can we do that safely?

Member

I have a silly question: since we recently added the ability to disable a provider, can we prevent deletion of enabled providers? The disable could be responsible for making sure any ems workers are also stopped before it's marked as disabled. In other words, force them to first disable the provider, then delete it?

@cben
Contributor

cben commented Apr 24, 2017

How often does this happen in practice? Can we just wait 5 minutes between checks?

I think users expect the deletion process to be fairly quick.

Or at least do exponential backoff...

@zeari
Author

zeari commented Apr 24, 2017

Or at least do exponential backoff...

@cben Possibly, but I don't know if we can 'remember' the previous wait time between iterations.
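One possible way (an untested sketch; the delay argument and the :args passing are my assumption about how to carry state between deliveries): pass the current delay into the re-queued call and double it each time.

def orchestrate_destroy(delay = 15.seconds)
  disable! if enabled?
  return destroy if MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}").count.zero?

  MiqQueue.put(
    :deliver_on  => delay.from_now,
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy",
    :args        => [[delay * 2, 5.minutes].min] # back off exponentially, capped at 5 minutes
  )
end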

@zeari
Author

zeari commented Apr 24, 2017

Not sure, but in this loop case it can result in an endless loop. So maybe the queue retry logic is better suited for this...

Seems correct. I'll make the change after we discuss this a while.

@cben
Contributor

cben commented Apr 25, 2017 via email

@jrafanie
Member

Wouldn't that be a race condition, with sync_workers trying to bring them up?

I don't remember how the enable flag works, but I would assume that sync_workers would not start workers for disabled EMSes.

@durandom
Member

I would

  • add a before_destroy callback to check that all preconditions are met, and otherwise cancel the destroy - the API even has that use case
  • let destroy_queue tear down the queues and just call destroy - if that fails, raise MiqQueueRetryLater
  • leave disable! to just toggle the flag (because of the race condition mentioned)

I also assume destroy_queue is the way the UI and API trigger a destroy - roughly the shape sketched below.
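i.e. roughly this shape (a sketch of how I read these points, not tested; the method names follow this PR):

# UI/API keep calling destroy_queue, which just queues the orchestration
def destroy_queue
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy"
  )
end

def orchestrate_destroy
  disable! if enabled?
  # before_destroy aborts while workers are still around, so destroy returns false
  raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now) if destroy == false
end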

@zeari
Author

zeari commented Apr 27, 2017

cc @simon3z

@zeari
Author

zeari commented Apr 30, 2017

add a before_destroy callback to check that all preconditions are met, and otherwise cancel the destroy - the API even has that use case

So what should be in the before_destroy? Just a check that active_workers.count is zero?

let destroy_queue tear down the queues and just call destroy - if that fails, raise MiqQueueRetryLater

So should destroy_queue run disable! for the ems, or should before_destroy?

leave disable! to just toggle the flag (because of the race condition mentioned)
I also assume destroy_queue is the way the UI and API trigger a destroy

At least the UI does it this way.

@durandom
Member

durandom commented May 2, 2017

@zeari yes to all :)

So should destroy_queue run disable! for the ems, or should before_destroy?

It calls disable!, because before_destroy is always called. That's just the safety check to block a destroy.

disable! if enabled?

if self.destroy == nil
destroy_queue(15.seconds.from_now)
Member

Doesn't raising MiqQueueRetryLater work?

Author

@zeari zeari May 15, 2017

I tried with raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now),
but it would just queue the job immediately. It did write a fancy message from https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_queue.rb#L328 though.
I can't find the correct way to utilize that.

Author

I think it's because I'm testing with simulate_queue_worker, which doesn't take deliver_on into account.

@zeari zeari force-pushed the orchestrate_destroy2 branch 2 times, most recently from 0a95e41 to 589fa35 Compare May 15, 2017 13:58
@@ -426,6 +426,32 @@ def enable!
update!(:enabled => true)
end

# override destroy_queue from AsyncDeleteMixin
def destroy_queue(deliver_on = Time.now)
Member

Can you leave :deliver_on nil?

Member

@durandom durandom left a comment

That looks great 👍
And indeed it looks like a minimally invasive solution.

One thing I see getting in our way is how to cancel a destroy. In this case we have no way to enable the ems again, because it's disabled every time orchestrate_destroy is called.
But I think that's ok for now...


before_destroy :before_destroy

def before_destroy
Member

I would opt for a descriptive name, like assert_no_queues_present, or wrap the assertion in a block:

before_destroy do |ems|
  throw(:abort) if !MiqWorker.find_current_or_starting.where(:queue_name => ems.queue_name).count.zero?
end

disable! if enabled?

if self.destroy == false
raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now)
Member

If you skip deliver_on, what's the default?
And is there any throttling in place?
Will it retry until it succeeds, or does it give up eventually?

Author

@zeari zeari May 18, 2017

If you skip deliver_on, what's the default?

Immediately.

And is there any throttling in place?

I don't think so...

Will it retry until it succeeds, or does it give up eventually?

I found that we can add an expires_on setting on it. Maybe cancel after 5 minutes?
Actually that won't work, since we don't 'carry' that attribute over to the next time we queue the destroy.

@zeari
Author

zeari commented May 29, 2017

@miq-bot remove_label wip

@miq-bot miq-bot changed the title [WIP] Wait for ems workers to finish before destroying the ems Wait for ems workers to finish before destroying the ems May 29, 2017
@miq-bot miq-bot removed the wip label May 29, 2017
@zeari zeari force-pushed the orchestrate_destroy2 branch 3 times, most recently from 2833708 to 36bd868 Compare May 29, 2017 14:04
end

# override destroy_queue from AsyncDeleteMixin
def destroy_queue(deliver_on = nil)
Member

This method signature is confusing to me. The instance method takes a time, and the class method takes an array of ids. In addition, they're very similar, with the deliver_on being the main difference.

Can this method be called destroy_queue_in_future, schedule_destroy_queue, or something else...
and have it call the class method directly, and make the class method take an optional deliver_on along with the ids?
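Something like this (sketch only; the queue args mirror the orchestrate_destroy call used elsewhere in this PR, and schedule_destroy_queue is just one of the name options above):

# class method does the queueing and takes an optional deliver_on along with the ids
def self.destroy_queue(ids, deliver_on = nil)
  Array.wrap(ids).each do |id|
    MiqQueue.put(
      :class_name  => name,
      :instance_id => id,
      :method_name => "orchestrate_destroy",
      :deliver_on  => deliver_on
    )
  end
end

# instance method just delegates to the class method
def schedule_destroy_queue(deliver_on = nil)
  self.class.destroy_queue(id, deliver_on)
end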

Member

@kbrock kbrock left a comment

Looking nice.

Just the one change to remove to_miq_a, and a few questions.

LGTM

@@ -422,6 +422,47 @@ def enable!
update!(:enabled => true)
end

def self.destroy_queue(ids)
ids = ids.to_miq_a
Member

Please don't use to_miq_a; use Array.wrap(ids) if you need it.
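(For reference, Array.wrap from ActiveSupport covers the same cases:)

Array.wrap(5)      # => [5]
Array.wrap([1, 2]) # => [1, 2]
Array.wrap(nil)    # => []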

def self.destroy_queue(ids)
ids = ids.to_miq_a
_log.info("Queuing destroy of #{name} with the following ids: #{ids.inspect}")
ids.each do |id|
Member

Do we really have to send a different message for every destroy?
Better yet, could we just send in the where clause and not be id-centric?

Author

Well, we would have to manage that where clause, since in each iteration some ids (providers) would be deleted and some would have to try again later, so I'm not sure it's worth it.


if self.destroy == false
_log.info("Cant #{self.class.name} with id: #{id}, workers still in progress. Requeuing destroy...")
raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now)
Member

We rarely use MiqQueueRetryLater and are moving away from it, since it is queue-system specific. Can we just queue another destroy?
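i.e. something like this instead of raising (a sketch; it reuses the MiqQueue.put args from earlier in the thread):

def orchestrate_destroy
  disable! if enabled?
  return unless destroy == false

  _log.info("Cannot destroy #{self.class.name} with id: #{id}, workers still in progress. Requeuing destroy...")
  # queue another destroy attempt instead of relying on MiqQueueRetryLater
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy",
    :deliver_on  => 15.seconds.from_now
  )
end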

@@ -97,7 +97,7 @@ def queue_name_for_ems(ems)
return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?

return ems unless ems.kind_of?(ExtManagementSystem)
"ems_#{ems.id}"
ems.queue_name
Member

thanks. good change

Member

It's also in #14864.

@zeari zeari force-pushed the orchestrate_destroy2 branch 2 times, most recently from 1d767a3 to b3339f9 Compare June 4, 2017 11:12
@zeari zeari force-pushed the orchestrate_destroy2 branch from b3339f9 to f6d04e8 Compare June 4, 2017 12:31
@zeari
Author

zeari commented Jun 4, 2017

@jrafanie @kbrock Comments fixed (as best I could).
Testing this takes time since I have to set up the race condition from the BZ, so if you're satisfied with the changes don't merge right away; tell me and let me test one last time.

@miq-bot
Member

miq-bot commented Jun 4, 2017

Checked commits zeari/manageiq@423b115~...f6d04e8 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
4 files checked, 1 offense detected

app/models/ext_management_system.rb

def orchestrate_destroy
disable! if enabled?

if self.destroy == false
Member

remove self to make rubocop happy

@durandom
Member

durandom commented Jun 7, 2017

@jrafanie I'd say merge #14864 first, then rebase this one more time.

Code-wise I'm 👍

@jrafanie jrafanie merged commit b19f1c4 into ManageIQ:master Jun 7, 2017
@jrafanie jrafanie added this to the Sprint 63 Ending Jun 19, 2017 milestone Jun 7, 2017
@durandom
Member

durandom commented Jun 7, 2017

awesome addition @zeari 👏

@moolitayer

🎉 👏

@zeari
Author

zeari commented Jun 8, 2017

@jrafanie @kbrock Comments fixed (as best I could).
Testing this takes time since I have to set up the race condition from the BZ, so if you're satisfied with the changes don't merge right away; tell me and let me test one last time.

Testing.....

@zeari
Author

zeari commented Jun 8, 2017

So this yields a couple of errors from the previous restructuring. I'd rather we unmerge; I'll fix those and merge again...

@zeari zeari mentioned this pull request Jun 8, 2017
@zeari
Author

zeari commented Jun 8, 2017

I'd rather we unmerge; I'll fix those and merge again...

Never mind, the fix is here: #14848
