Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Heartbeat Thread to SmartProxy Worker #16685

Merged
merged 3 commits into from
Jan 4, 2018

Conversation

jerryk55
Copy link
Member

In order to fix an issue where long-running Smartstate jobs get killed
under the mistaken assumption that they are being unresponsive when they
are actually quite busy, a separate thread is being added to the SmartProxy Worker
which just heartbeats every 30 seconds.

Please note - this PR does NOT address the issue whereby SSA jobs run for extremely long
periods of time - over one hour in cases. It simply allows those jobs to complete successfully.

@roliveri @hsong-rh Please review.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538

Link

Steps for Testing/QA

Find a VM / Instance that typically is very long-running for SSA. Run SSA on it.
I currently have access to an Azure Windows Instance in the Australia East region - when running from the US East Coast it takes more than an hour to run. This PR allows the job to run to completion.

def heartbeat_thread
@heartbeat_started.set
until @exit_requested do
heartbeat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check to see that the main thread is still alive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roliveri that would be beneficial. I'd have to think about how to do that. Note that this code was mostly adapted from the Event Monitor Thread in the EventCatcher runner. I don't see that thread checking for life of the main thread either - should it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryk55 Currently, the work thread is the main thread, and it spawns the heartbeat thread. So, the heartbeat thread may die if the worker thread terminates, but I'm not sure. If the heartbeat thread continues to run when the worker thread dies, then the heartbeat thread should check the worker thread. You probably have to do some testing to determine the behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roliveri docs state that the peer threads exit if the main thread does but just to be sure I tested the theory by adding a Thread.exit to the main thread after the Heartbeat thread was started and the entire process shut down. In my opinion the heartbeat thread does not need to check if the main thread is alive. I think this is ready to be merged unless you have any other questions or issues. Thanks.

@jerryk55
Copy link
Member Author

@miq-bot add_label bug

@jerryk55
Copy link
Member Author

@miq-bot add_label gaprindashvili/yes

@jerryk55
Copy link
Member Author

@miq-bot add_label fine/yes

@jerryk55 jerryk55 force-pushed the smart_proxy_heartbeat_thread branch 2 times, most recently from db97c6d to e9398dc Compare December 19, 2017 17:19
In order to fix an issue where long-running Smartstate jobs get killed
under the mistaken assumption that they are being unresponsive when they
are actually quite busy, a separate thread is being added to the SmartProxy Worker
which just heartbeats every 30 seconds.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538
Clean up two RuboCop warnings -
One for a do in an until clause
One for a 'rescue nil' modifier - changed to
a begin-rescue-end block for the NoMethodError
that would result if the thread id was nil.
#
# Wait for the Heartbeat Thread to stop
#
unless @tid.nil?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer the positive case: if @tid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Please note that a lot of this was glommed from the monitor_thread in the event_catcher. I note that this comment and that at line 61 below were direct copies from there. I will change it here but it still lives there.

safe_log("#{message} Waiting for Heartbeat Thread to Stop.")
begin
@tid.join(worker_settings[:heartbeat_thread_shutdown_timeout])
rescue NoMethodError => join_err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would you receive a NoMethodError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event_catcher, this block is written as

@tid.join(worker_settings[:timeout_value]) rescue nil

Rubocop complained about the rescue clause written that way so I changed it to what was more appropriate - that is if @tid is nil here, you get a NoMethodError on the join (there is no "join" method on a nil).
Now, I realize we just checked for @tid.nil in line 18 above, but my understanding is that the value could be reset between checking and joining. If that's not correct (in either worker) then we can remove this rescue.


def do_work
if @tid.nil? || [email protected]?
if [email protected]? && @tid.status.nil?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless @tid.try(:status)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this condition need to live inside the condition above? It looks like it could be moved above the outer conditional and improve readability.

@@ -972,7 +972,7 @@
:prefetch_stale_threshold: 30.seconds
:rails_server: puma
:remote_console_type: VMRC
:role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate
:role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate,smartproxy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this turning the role on by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That line shouldn't be in there. Only the timeout below was meant to be changed. Thanks for noticing.

The settings file had an extra modified role that was not intended as part of this PR.
Style comments in runner.rb
@jerryk55 jerryk55 force-pushed the smart_proxy_heartbeat_thread branch from e9398dc to 5d0ce26 Compare December 21, 2017 22:08
@miq-bot
Copy link
Member

miq-bot commented Dec 21, 2017

Checked commits jerryk55/manageiq@906ed99~...5d0ce26 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
1 file checked, 0 offenses detected
Everything looks fine. 🍰

@jerryk55
Copy link
Member Author

jerryk55 commented Jan 3, 2018

@bdunne changes have been made as requested.

@roliveri roliveri merged commit a34f5e0 into ManageIQ:master Jan 4, 2018
@roliveri roliveri added this to the Sprint 77 Ending Jan 15, 2018 milestone Jan 4, 2018
simaishi pushed a commit that referenced this pull request Jan 4, 2018
@simaishi
Copy link
Contributor

simaishi commented Jan 4, 2018

Gaprindashvili backport details:

$ git log -1
commit 77f897bd62cc30b89e31c7c0dddad882fd0a104f
Author: Richard Oliveri <[email protected]>
Date:   Thu Jan 4 16:43:31 2018 -0500

    Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread
    
    Add Heartbeat Thread to SmartProxy Worker
    (cherry picked from commit a34f5e00f80da4e1fbfad17a5f7607b236a97c7d)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1531299

simaishi pushed a commit that referenced this pull request Jan 9, 2018
@simaishi
Copy link
Contributor

simaishi commented Jan 9, 2018

Fine backport details:

$ git log -1
commit 9c8fe1965657a809b0e0f525de2ce5913ba7a448
Author: Richard Oliveri <[email protected]>
Date:   Thu Jan 4 16:43:31 2018 -0500

    Merge pull request #16685 from jerryk55/smart_proxy_heartbeat_thread
    
    Add Heartbeat Thread to SmartProxy Worker
    (cherry picked from commit a34f5e00f80da4e1fbfad17a5f7607b236a97c7d)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854

d-m-u pushed a commit to d-m-u/manageiq that referenced this pull request Jun 6, 2018
…_thread

Add Heartbeat Thread to SmartProxy Worker
(cherry picked from commit a34f5e0)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants