Add Heartbeat Thread to SmartProxy Worker #16685

jerryk55 · 2017-12-19T15:19:36Z

To fix an issue where long-running SmartState jobs are killed
on the mistaken assumption that they are unresponsive when they
are actually quite busy, this PR adds a separate thread to the SmartProxy Worker
whose only job is to heartbeat every 30 seconds.

Please note - this PR does NOT address the issue whereby SSA jobs run for extremely long
periods of time - over an hour in some cases. It simply allows those jobs to complete successfully.
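For readers who want the idea in isolation, here is a minimal sketch of a heartbeat thread running beside a busy main thread. It is not the code in this PR: the class name HeartbeatSketch, the beats counter (a stand-in for the real heartbeat call), and the short interval are assumptions made for illustration.

```ruby
# Hypothetical stand-in for the worker; not the ManageIQ runner class.
class HeartbeatSketch
  attr_reader :beats

  def initialize(interval)
    @interval       = interval # seconds between heartbeats (30 in the PR)
    @exit_requested = false
    @beats          = 0
    @tid            = nil
  end

  # Spawn the heartbeat thread; the caller keeps working on its own thread.
  def start
    @tid = Thread.new { heartbeat_loop }
  end

  # Ask the loop to finish, then wait up to `timeout` seconds for the thread.
  def stop(timeout = 1)
    @exit_requested = true
    thread = @tid
    thread.join(timeout) if thread
  end

  private

  def heartbeat_loop
    until @exit_requested
      @beats += 1 # stand-in for the real heartbeat call
      sleep @interval
    end
  end
end

def run_heartbeat_sketch
  worker = HeartbeatSketch.new(0.01)
  worker.start
  sleep 0.2 # simulated long-running work on the main thread
  worker.stop
  worker.beats
end
```

Calling run_heartbeat_sketch returns the number of heartbeats recorded while the main thread was "busy", which is the property the PR relies on.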

@roliveri @hsong-rh Please review.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538

Link

https://bugzilla.redhat.com/show_bug.cgi?id=1519538

Steps for Testing/QA

Find a VM / Instance on which SSA typically runs for a very long time. Run SSA on it.
I currently have access to an Azure Windows Instance in the Australia East region; when SSA is run against it from the US East Coast it takes more than an hour. This PR allows the job to run to completion.

roliveri · 2017-12-19T16:25:22Z

app/models/miq_smart_proxy_worker/runner.rb

+  def heartbeat_thread
+    @heartbeat_started.set
+    until @exit_requested do
+      heartbeat


Should this also check to see that the main thread is still alive?

@roliveri that would be beneficial. I'd have to think about how to do that. Note that this code was mostly adapted from the Event Monitor Thread in the EventCatcher runner. I don't see that thread checking for life of the main thread either - should it?

@jerryk55 Currently, the work thread is the main thread, and it spawns the heartbeat thread. So, the heartbeat thread may die if the worker thread terminates, but I'm not sure. If the heartbeat thread continues to run when the worker thread dies, then the heartbeat thread should check the worker thread. You probably have to do some testing to determine the behavior.

@roliveri The docs state that the peer threads exit if the main thread does, but just to be sure I tested the theory by adding a Thread.exit to the main thread after the Heartbeat thread was started, and the entire process shut down. In my opinion the heartbeat thread does not need to check whether the main thread is alive. I think this is ready to be merged unless you have any other questions or issues. Thanks.
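The experiment described above can be approximated outside the worker. The sketch below is an assumption-laden reproduction, not the test that was actually run: it starts a child Ruby process whose main thread calls Thread.exit while a peer thread is still sleeping, then captures whatever the child printed. CHILD_SCRIPT and main_thread_exit_output are names invented for this example.

```ruby
require "open3"
require "rbconfig"

# Script run in a child process so that exiting its main thread
# does not take down the process running this example.
CHILD_SCRIPT = <<~'RUBY'
  Thread.new do
    sleep 0.5
    puts "peer thread survived"
  end
  Thread.exit # called on the main thread
  puts "main thread continued"
RUBY

# Returns everything the child wrote to stdout.
def main_thread_exit_output
  output, _status = Open3.capture2(RbConfig.ruby, "-e", CHILD_SCRIPT)
  output
end
```

If the peer thread outlived the main thread, the captured output would contain the "survived" line; an output without it is consistent with the documented behavior that exiting the main thread ends the process.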

jerryk55 · 2017-12-19T16:51:21Z

@miq-bot add_label bug

jerryk55 · 2017-12-19T16:51:34Z

@miq-bot add_label gaprindashvili/yes

jerryk55 · 2017-12-19T16:51:44Z

@miq-bot add_label fine/yes

Clean up two RuboCop warnings: one for a do in an until clause, and one for a 'rescue nil' modifier, which was changed to a begin-rescue-end block for the NoMethodError that would result if the thread id was nil.

bdunne · 2017-12-21T21:06:14Z

app/models/miq_smart_proxy_worker/runner.rb

+    #
+    # Wait for the Heartbeat Thread to stop
+    #
+    unless @tid.nil?


Prefer the positive case: if @tid

That's fine. Please note that a lot of this was glommed from the monitor_thread in the event_catcher. I note that this comment and that at line 61 below were direct copies from there. I will change it here but it still lives there.

bdunne · 2017-12-21T21:08:12Z

app/models/miq_smart_proxy_worker/runner.rb

+      safe_log("#{message} Waiting for Heartbeat Thread to Stop.")
+      begin
+        @tid.join(worker_settings[:heartbeat_thread_shutdown_timeout])
+      rescue NoMethodError => join_err


Under what circumstances would you receive a NoMethodError?

In the event_catcher, this block is written as

@tid.join(worker_settings[:timeout_value]) rescue nil

RuboCop complained about the rescue clause written that way, so I changed it to what was more appropriate: if @tid is nil here, you get a NoMethodError on the join (there is no "join" method on nil).
Now, I realize we just checked for @tid.nil in line 18 above, but my understanding is that the value could be reset between checking and joining. If that's not correct (in either worker) then we can remove this rescue.
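One possible way to close the gap between the nil check and the join without rescuing NoMethodError is to copy the instance variable into a local first, so that a later reassignment of @tid cannot affect the value being joined. This is a sketch of that idea only; ShutdownSketch and wait_for_thread are hypothetical names, and nothing here is taken from the merged code.

```ruby
# Hypothetical holder for a worker thread; not the ManageIQ runner class.
class ShutdownSketch
  def initialize(tid = nil)
    @tid = tid
  end

  # Returns :no_thread, :stopped, or :timed_out.
  def wait_for_thread(timeout)
    thread = @tid # snapshot; later changes to @tid do not affect `thread`
    return :no_thread unless thread

    # Thread#join(limit) returns the thread if it finished, nil on timeout.
    thread.join(timeout) ? :stopped : :timed_out
  end
end
```

With this shape the nil case is handled by an ordinary branch, and the timeout case stays visible to the caller instead of being swallowed by a rescue.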

bdunne · 2017-12-21T21:16:06Z

app/models/miq_smart_proxy_worker/runner.rb

+
+  def do_work
+    if @tid.nil? || !@tid.alive?
+      if !@tid.nil? && @tid.status.nil?


unless @tid.try(:status)?

Does this condition need to live inside the condition above? It looks like it could be moved above the outer conditional and improve readability.
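When weighing the two spellings of this condition, it may help to recall what Thread#status returns in each state. The sketch below collects the values for a sleeping thread, a thread that finished normally, and a thread that died with an exception; thread_status_samples is a name made up for this example. Note that a bare truthiness test treats false and nil alike, whereas status.nil? singles out the thread that died with an exception.

```ruby
# Collects Thread#status for three threads in different states.
def thread_status_samples
  sleeping = Thread.new { sleep }
  finished = Thread.new { :done }
  failed   = Thread.new do
    # Silence the default stderr report where the setter exists (Ruby 2.4+).
    Thread.current.report_on_exception = false if Thread.current.respond_to?(:report_on_exception=)
    raise "boom"
  end

  Thread.pass until sleeping.status == "sleep"
  finished.join
  begin
    failed.join # join re-raises the exception the thread died with
  rescue RuntimeError
    nil
  end

  samples = {
    :sleeping => sleeping.status, # "sleep": alive and waiting
    :finished => finished.status, # false: terminated normally
    :failed   => failed.status    # nil: terminated by an exception
  }
  sleeping.kill
  samples
end
```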

bdunne · 2017-12-21T21:19:25Z

config/settings.yml

@@ -972,7 +972,7 @@
  :prefetch_stale_threshold: 30.seconds
  :rails_server: puma
  :remote_console_type: VMRC
-  :role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate
+  :role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate,smartproxy


Is this turning the role on by default?

That line shouldn't be in there. Only the timeout below was meant to be changed. Thanks for noticing.

The settings file had an extra modified role that was not intended as part of this PR. Style comments in runner.rb

miq-bot · 2017-12-21T22:13:04Z

Checked commits jerryk55/manageiq@906ed99~...5d0ce26 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
1 file checked, 0 offenses detected
Everything looks fine. 🍰

jerryk55 · 2018-01-03T20:07:54Z

@bdunne changes have been made as requested.

Add Heartbeat Thread to SmartProxy Worker (cherry picked from commit a34f5e0) Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1531299

simaishi · 2018-01-04T23:05:07Z

Gaprindashvili backport details:

$ git log -1
commit 77f897bd62cc30b89e31c7c0dddad882fd0a104f
Author: Richard Oliveri <[email protected]>
Date:   Thu Jan 4 16:43:31 2018 -0500

    Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread
    
    Add Heartbeat Thread to SmartProxy Worker
    (cherry picked from commit a34f5e00f80da4e1fbfad17a5f7607b236a97c7d)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1531299

Add Heartbeat Thread to SmartProxy Worker (cherry picked from commit a34f5e0) Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854

simaishi · 2018-01-09T22:09:33Z

Fine backport details:

$ git log -1
commit 9c8fe1965657a809b0e0f525de2ce5913ba7a448
Author: Richard Oliveri <[email protected]>
Date:   Thu Jan 4 16:43:31 2018 -0500

    Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread
    
    Add Heartbeat Thread to SmartProxy Worker
    (cherry picked from commit a34f5e00f80da4e1fbfad17a5f7607b236a97c7d)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854

roliveri reviewed Dec 19, 2017

miq-bot added bug gaprindashvili/yes labels Dec 19, 2017

jerryk55 force-pushed the smart_proxy_heartbeat_thread branch 2 times, most recently from db97c6d to e9398dc Compare December 19, 2017 17:19

jerryk55 added 2 commits December 21, 2017 15:48

RuboCop Warning Cleanup

46f9a9a

bdunne requested changes Dec 21, 2017

Review Comments

5d0ce26

jerryk55 force-pushed the smart_proxy_heartbeat_thread branch from e9398dc to 5d0ce26 Compare December 21, 2017 22:08

roliveri merged commit a34f5e0 into ManageIQ:master Jan 4, 2018

roliveri added this to the Sprint 77 Ending Jan 15, 2018 milestone Jan 4, 2018

roliveri added the core/smart state label Jan 4, 2018

simaishi pushed a commit that referenced this pull request Jan 4, 2018

Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread

77f897b

simaishi added gaprindashvili/backported and removed gaprindashvili/yes labels Jan 4, 2018

simaishi pushed a commit that referenced this pull request Jan 9, 2018

Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread

9c8fe19

simaishi added fine/backported and removed fine/yes labels Jan 9, 2018
