
Wait for ems workers to finish before destroying the ems #14848

Merged: 2 commits merged into ManageIQ:master from zeari:orchestrate_destroy2 on Jun 7, 2017

Conversation


@zeari zeari commented Apr 23, 2017

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1437549
Continues work from #14675

  • migration and toggling of the disable flag
  • don't start disabled workers
  • orchestrate deletion (this PR)

@durandom @Fryguy @jrafanie @kbrock I have no idea what's the correct way to do this, so let's discuss that here. The code added here outlines the basic functionality. orchestrate_destroy should replace destroy_queue in https://github.com/manageiq/manageiq-ui-classic/blob/master/app/controllers/ems_common.rb#L898.
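Roughly the flow I have in mind (just a sketch stitched together from the diff below, not final code; the re-check interval and mechanism are exactly what I'd like to discuss):

def orchestrate_destroy
  disable! if enabled?

  workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}")
  if workers.count.zero?
    destroy
  else
    # workers still around: check again in a little while instead of blocking
    MiqQueue.put(
      :deliver_on  => 15.seconds.from_now,
      :class_name  => self.class.name,
      :instance_id => id,
      :method_name => "orchestrate_destroy"
    )
  end
end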

cc @moolitayer @cben

Update (@blomquisg)
BZs:

@zeari zeari changed the title wait for an emss workers to finish before destroying it wait for a ems workers to finish before destroying the ems Apr 23, 2017
@zeari zeari changed the title wait for a ems workers to finish before destroying the ems [WIP] Wait for a ems workers to finish before destroying the ems Apr 23, 2017
@zeari
Author

zeari commented Apr 23, 2017

@miq-bot add_label wip

@miq-bot miq-bot added the wip label Apr 23, 2017
@moolitayer

Can a worker be killed mid-refresh? If so, this might not solve the BZ.

@zeari zeari changed the title [WIP] Wait for a ems workers to finish before destroying the ems [WIP] Wait for ems workers to finish before destroying the ems Apr 23, 2017
Member

@durandom durandom left a comment


How can an orchestrate_destroy call be cancelled?

I also see some queue items rescheduled via MiqException::MiqQueueRetryLater

How do the UI and REST API currently destroy an ems?

# Something like 15 seconds before trying again?
MiqQueue.put(
:deliver_on => 15.seconds.from_now,
:class_name => name,
Member

Shouldn't this be self.class.name?

@@ -426,6 +426,24 @@ def enable!
update!(:enabled => true)
end

# Wait until all associated workers are dead to destroy this ems
def orchestrate_destroy
disable_ems if enabled?
Member

this was renamed to disable!

disable_ems if enabled?

workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}") #better way to get 'queue_name'?
if workers.count == 0
Member

Maybe these checks can be put into a Rails before_destroy callback, and the destroy cancelled via normal Rails methods?
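Something like this, perhaps (sketch only; assert_no_active_workers is a made-up name, and in Rails 5 the callback halts the chain with throw(:abort)):

before_destroy :assert_no_active_workers

def assert_no_active_workers
  # halt the destroy while any worker for this ems is still current or starting
  throw(:abort) unless MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}").count.zero?
end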

@zeari
Author

zeari commented Apr 24, 2017

How can an orchestrate_destroy call be cancelled?

Do we want to enable that? I don't think we currently allow cancelling an ems delete.

@zeari zeari force-pushed the orchestrate_destroy2 branch from 2575061 to aa7c523 Compare April 24, 2017 11:26
@durandom
Member

Do we want to enable that? I don't think we currently allow cancelling an ems delete.

Not sure, but in this loop case it can result in an endless loop. So maybe the queue retry logic is better suited for this...

def orchestrate_destroy
disable! if enabled?

workers = MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}") #better way to get 'queue_name'?
Member

As a follow-up PR, can you do this?

def queue_name_for_ems(ems)
# Host objects do not have dedicated refresh workers so request a generic worker which will
# be used to make a web-service call to a SmartProxy to initiate inventory collection.
return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?
return ems unless ems.kind_of?(ExtManagementSystem)
"ems_#{ems.id}"
end

Change:

def queue_name_for_ems(ems)
  # Host objects do not have dedicated refresh workers so request a generic worker which will
  # be used to make a web-service call to a SmartProxy to initiate inventory collection.
  return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?
  
  return ems unless ems.kind_of?(ExtManagementSystem)
  "ems_#{ems.id}"
end

To:

def queue_name_for_ems(ems)
  # Host objects do not have dedicated refresh workers so request a generic worker which will
  # be used to make a web-service call to a SmartProxy to initiate inventory collection.
  return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?

  return ems unless ems.kind_of?(ExtManagementSystem)
  ems.queue_name
end

And in ExtManagementSystem:

def queue_name
  "ems_#{id}"
end

Finally... then this line becomes:

workers = MiqWorker.find_current_or_starting.where(:queue_name => queue_name)

Author

definitely

Author

:deliver_on => 15.seconds.from_now,
:class_name => self.class.name,
:instance_id => id,
:method_name => "orchestrate_destroy"
Member

Hmm, per-ems workers can be running a single queue item for a long time... we could go through lots of these queue messages before the worker exits and we can destroy the ems.

How often does this happen in practice? Can we just wait 5 minutes between checks?

Author

How often does this happen in practice? Can we just wait 5 minutes between checks?

@jrafanie I think users expect the deletion process to be fairly quick. Can we instead cancel running jobs? Can we do that safely?

Member

I have a silly question: since we recently added the ability to disable a provider, can we prevent deletion of enabled providers? The disable could be responsible for making sure any ems workers are also stopped before it's marked as disabled. In other words, force them to first disable the provider, then delete it?

@cben
Contributor

cben commented Apr 24, 2017

How often does this happen in practice? Can we just wait 5 minutes between checks?

I think users expect the deletion process to be fairly quick.

Or at least do exponential backoff...

@zeari
Author

zeari commented Apr 24, 2017

Or at least do exponential backoff...

@cben Possibly, but I don't know if we can 'remember' the previous wait time between iterations.
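One possible way (an untested sketch; the delay argument and the :args passing are my assumption about how to carry state between deliveries): pass the current delay into the re-queued call and double it each time.

def orchestrate_destroy(delay = 15.seconds)
  disable! if enabled?
  return destroy if MiqWorker.find_current_or_starting.where(:queue_name => "ems_#{id}").count.zero?

  MiqQueue.put(
    :deliver_on  => delay.from_now,
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy",
    :args        => [[delay * 2, 5.minutes].min] # back off exponentially, capped at 5 minutes
  )
end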

@zeari
Author

zeari commented Apr 24, 2017

Not sure, but in this loop case it can result in an endless loop. So maybe the queue retry logic is better suited for this...

Seems correct. I'll make the change after we discuss this a while.

@cben
Contributor

cben commented Apr 25, 2017 via email

@jrafanie
Member

Wouldn't that be a race condition, with sync_workers trying to bring them up?

I don't remember how the enable flag works, but I would assume that sync_workers would not start workers for disabled EMSes.

@durandom
Member

I would

  • add a before_destroy callback to check that all preconditions are met, and otherwise cancel the destroy - the API even has that use case
  • let destroy_queue tear down the queues and just call destroy - if that fails, raise MiqQueueRetryLater
  • leave disable! to just toggle the flag (because of the race condition mentioned)

I also assume destroy_queue is the way the UI and API trigger a destroy - roughly the shape sketched below.
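i.e. roughly this shape (a sketch of how I read these points, not tested; the method names follow this PR):

# UI/API keep calling destroy_queue, which just queues the orchestration
def destroy_queue
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy"
  )
end

def orchestrate_destroy
  disable! if enabled?
  # before_destroy aborts while workers are still around, so destroy returns false
  raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now) if destroy == false
end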

@zeari
Author

zeari commented Apr 27, 2017

cc @simon3z

@zeari
Author

zeari commented Apr 30, 2017

add a before_destroy callback to check that all preconditions are met, and otherwise cancel the destroy - the API even has that use case

So what should be in the before_destroy? Just a check that active_workers.count is zero?

let destroy_queue tear down the queues and just call destroy - if that fails, raise MiqQueueRetryLater

So should destroy_queue run disable! for the ems, or should before_destroy?

leave disable! to just toggle the flag (because of the race condition mentioned)
I also assume destroy_queue is the way the UI and API trigger a destroy

At least the UI does it this way.

@durandom
Member

durandom commented May 2, 2017

@zeari yes to all :)

So should destroy_queue run disable! for the ems, or should before_destroy?

It calls disable!, because before_destroy is always called. That's just the safety check to block a destroy.

disable! if enabled?

if self.destroy == nil
destroy_queue(15.seconds.from_now)
Member

Doesn't raising MiqQueueRetryLater work?

Author

@zeari zeari May 15, 2017

I tried with raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now),
but it would just queue the job immediately. It did write a fancy message from https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_queue.rb#L328 though.
I can't find the correct way to utilize that.

Author

I think it's because I'm testing with simulate_queue_worker, which doesn't take deliver_on into account.

@zeari zeari force-pushed the orchestrate_destroy2 branch 2 times, most recently from 0a95e41 to 589fa35 Compare May 15, 2017 13:58
@@ -426,6 +426,32 @@ def enable!
update!(:enabled => true)
end

# override destroy_queue from AsyncDeleteMixin
def destroy_queue(deliver_on = Time.now)
Member

Can you leave :deliver_on nil?

Member

@durandom durandom left a comment

That looks great 👍
And indeed it looks like a minimally invasive solution.

One thing I see getting in our way is how to cancel a destroy. In this case we have no way to enable the ems again, because it's disabled every time orchestrate_destroy is called.
But I think that's ok for now...


before_destroy :before_destroy

def before_destroy
Member

I would opt for a descriptive name, like assert_no_queues_present, or wrap the assertion in a block:

before_destroy do |ems|
  throw(:abort) if !MiqWorker.find_current_or_starting.where(:queue_name => ems.queue_name).count.zero?
end

disable! if enabled?

if self.destroy == false
raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now)
Member

If you skip deliver_on, what's the default?
And is there any throttling in place?
Will it retry until it succeeds, or does it give up eventually?

Author

@zeari zeari May 18, 2017

If you skip deliver_on, what's the default?

Immediately.

And is there any throttling in place?

I don't think so...

Will it retry until it succeeds, or does it give up eventually?

I found that we can add an expires_on setting on it. Maybe cancel after 5 minutes?
Actually that won't work, since we don't 'carry' that attribute over to the next time we queue the destroy.

@zeari
Author

zeari commented May 29, 2017

@miq-bot remove_label wip

@miq-bot miq-bot changed the title [WIP] Wait for ems workers to finish before destroying the ems Wait for ems workers to finish before destroying the ems May 29, 2017
@miq-bot miq-bot removed the wip label May 29, 2017
@zeari zeari force-pushed the orchestrate_destroy2 branch 3 times, most recently from 2833708 to 36bd868 Compare May 29, 2017 14:04
end

# override destroy_queue from AsyncDeleteMixin
def destroy_queue(deliver_on = nil)
Member

This method signature is confusing to me. The instance method takes a time, and the class method takes an array of ids. In addition, they're very similar, with the deliver_on being the main difference.

Can this method be called destroy_queue_in_future, schedule_destroy_queue, or something else...
and have it call the class method directly, and make the class method take an optional deliver_on along with the ids?
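Something like this (sketch only; the queue args mirror the orchestrate_destroy call used elsewhere in this PR, and schedule_destroy_queue is just one of the name options above):

# class method does the queueing and takes an optional deliver_on along with the ids
def self.destroy_queue(ids, deliver_on = nil)
  Array.wrap(ids).each do |id|
    MiqQueue.put(
      :class_name  => name,
      :instance_id => id,
      :method_name => "orchestrate_destroy",
      :deliver_on  => deliver_on
    )
  end
end

# instance method just delegates to the class method
def schedule_destroy_queue(deliver_on = nil)
  self.class.destroy_queue(id, deliver_on)
end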

Member

@kbrock kbrock left a comment

Looking nice.

Just the one change to remove to_miq_a, and a few questions.

LGTM

@@ -422,6 +422,47 @@ def enable!
update!(:enabled => true)
end

def self.destroy_queue(ids)
ids = ids.to_miq_a
Member

Please don't use to_miq_a; use Array.wrap(ids) if you need it.
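(For reference, Array.wrap from ActiveSupport covers the same cases:)

Array.wrap(5)      # => [5]
Array.wrap([1, 2]) # => [1, 2]
Array.wrap(nil)    # => []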

def self.destroy_queue(ids)
ids = ids.to_miq_a
_log.info("Queuing destroy of #{name} with the following ids: #{ids.inspect}")
ids.each do |id|
Member

Do we really have to send a different message for every destroy?
Better yet, could we just send in the where clause and not be id-centric?

Author

Well, we would have to manage that where clause, since in each iteration some ids (providers) would be deleted and some would have to try again later, so I'm not sure it's worth it.


if self.destroy == false
_log.info("Cant #{self.class.name} with id: #{id}, workers still in progress. Requeuing destroy...")
raise MiqException::MiqQueueRetryLater.new(:deliver_on => 15.seconds.from_now)
Member

We rarely use MiqQueueRetryLater and are moving away from it, since it is queue-system specific. Can we just queue another destroy?
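i.e. something like this instead of raising (a sketch; it reuses the MiqQueue.put args from earlier in the thread):

def orchestrate_destroy
  disable! if enabled?
  return unless destroy == false

  _log.info("Cannot destroy #{self.class.name} with id: #{id}, workers still in progress. Requeuing destroy...")
  # queue another destroy attempt instead of relying on MiqQueueRetryLater
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "orchestrate_destroy",
    :deliver_on  => 15.seconds.from_now
  )
end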

@@ -97,7 +97,7 @@ def queue_name_for_ems(ems)
return "generic" if ems.kind_of?(Host) && ems.acts_as_ems?

return ems unless ems.kind_of?(ExtManagementSystem)
"ems_#{ems.id}"
ems.queue_name
Member

thanks. good change

Member

It's also in #14864.

@zeari zeari force-pushed the orchestrate_destroy2 branch 2 times, most recently from 1d767a3 to b3339f9 Compare June 4, 2017 11:12
@zeari zeari force-pushed the orchestrate_destroy2 branch from b3339f9 to f6d04e8 Compare June 4, 2017 12:31
@zeari
Author

zeari commented Jun 4, 2017

@jrafanie @kbrock Comments fixed (as best I could).
Testing this takes time since I have to set up the race condition from the BZ, so if you're satisfied with the changes don't merge right away; tell me and let me test one last time.

@miq-bot
Member

miq-bot commented Jun 4, 2017

Checked commits zeari/manageiq@423b115~...f6d04e8 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
4 files checked, 1 offense detected

app/models/ext_management_system.rb

def orchestrate_destroy
disable! if enabled?

if self.destroy == false
Member

remove self to make rubocop happy

@durandom
Member

durandom commented Jun 7, 2017

@jrafanie I'd say merge #14864 first, then rebase this one more time.

Code-wise I'm 👍

@jrafanie jrafanie merged commit b19f1c4 into ManageIQ:master Jun 7, 2017
@jrafanie jrafanie added this to the Sprint 63 Ending Jun 19, 2017 milestone Jun 7, 2017
@durandom
Member

durandom commented Jun 7, 2017

awesome addition @zeari 👏

@moolitayer

🎉 👏

@zeari
Author

zeari commented Jun 8, 2017

@jrafanie @kbrock Comments fixed (as best I could).
Testing this takes time since I have to set up the race condition from the BZ, so if you're satisfied with the changes don't merge right away; tell me and let me test one last time.

Testing.....

@zeari
Author

zeari commented Jun 8, 2017

So this yields a couple of errors from the previous restructuring. I'd rather we unmerge; I'll fix those and merge again...

@zeari zeari mentioned this pull request Jun 8, 2017
@zeari
Author

zeari commented Jun 8, 2017

I'd rather we unmerge; I'll fix those and merge again...

Never mind, the fix is here: #14848
