Add Heartbeat Thread to SmartProxy Worker #16685
Conversation
def heartbeat_thread
  @heartbeat_started.set
  until @exit_requested do
    heartbeat
Should this also check to see that the main thread is still alive?
@roliveri that would be beneficial. I'd have to think about how to do that. Note that this code was mostly adapted from the Event Monitor Thread in the EventCatcher runner. I don't see that thread checking for life of the main thread either - should it?
@jerryk55 Currently, the work thread is the main thread, and it spawns the heartbeat thread. So, the heartbeat thread may die if the worker thread terminates, but I'm not sure. If the heartbeat thread continues to run when the worker thread dies, then the heartbeat thread should check the worker thread. You probably have to do some testing to determine the behavior.
@roliveri The docs state that peer threads exit if the main thread does, but just to be sure I tested the theory by adding a Thread.exit to the main thread after the heartbeat thread was started, and the entire process shut down. In my opinion the heartbeat thread does not need to check whether the main thread is alive. I think this is ready to be merged unless you have any other questions or issues. Thanks.
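If the belt-and-braces check discussed above were ever wanted, the loop could consult Thread.main directly. A hypothetical, runnable sketch (not the PR's code, and the PR concluded the check is unnecessary since peer threads die with the main thread anyway):

```ruby
# Hypothetical sketch: a heartbeat-style loop that also stops if the
# main thread is gone. All names here are placeholders, not the worker's.
beats = []
stop  = false

tid = Thread.new do
  while !stop && Thread.main.alive?
    beats << :beat   # stands in for the real heartbeat call
    sleep 0.01
  end
end

sleep 0.05           # let the loop run a few iterations
stop = true
tid.join(1)

puts beats.size >= 2 # the loop ran while the main thread was alive
```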
@miq-bot add_label bug
@miq-bot add_label gaprindashvili/yes
@miq-bot add_label fine/yes
Force-pushed from db97c6d to e9398dc
In order to fix an issue where long-running Smartstate jobs get killed under the mistaken assumption that they are being unresponsive when they are actually quite busy, a separate thread is being added to the SmartProxy Worker which just heartbeats every 30 seconds. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538
Clean up two RuboCop warnings: one for a do in an until clause, and one for a 'rescue nil' modifier, which was changed to a begin-rescue-end block handling the NoMethodError that would result if the thread id was nil.
#
# Wait for the Heartbeat Thread to stop
#
unless @tid.nil?
Prefer the positive case: if @tid
That's fine. Please note that a lot of this was glommed from the monitor_thread in the event_catcher; this comment and the one at line 61 below were direct copies from there. I will change it here, but it still lives there.
safe_log("#{message} Waiting for Heartbeat Thread to Stop.")
begin
  @tid.join(worker_settings[:heartbeat_thread_shutdown_timeout])
rescue NoMethodError => join_err
Under what circumstances would you receive a NoMethodError?
In the event_catcher, this block is written as
@tid.join(worker_settings[:timeout_value]) rescue nil
Rubocop complained about the rescue clause written that way, so I changed it to something more appropriate: if @tid is nil here, you get a NoMethodError on the join (there is no join method on nil).
Now, I realize we just checked @tid.nil? in line 18 above, but my understanding is that the value could be reset between checking and joining. If that's not correct (in either worker) then we can remove this rescue.
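An alternative that avoids the rescue entirely would be to read @tid into a local once, so the value cannot change between the nil check and the join. A hypothetical sketch (class and method names are placeholders, not this PR's code):

```ruby
# Hypothetical alternative to rescuing NoMethodError: a single read of
# @tid into a local means a concurrent writer cannot nil it out between
# the check and the join.
class HeartbeatWaiter
  def initialize
    @tid = Thread.new { sleep 0.01 }  # stands in for the heartbeat thread
  end

  def wait_for_stop(timeout)
    tid = @tid                # one read; later writes to @tid can't race this
    tid.join(timeout) if tid  # Thread#join returns the thread, or nil on timeout
  end
end

puts !HeartbeatWaiter.new.wait_for_stop(1).nil?  # thread joined within timeout
```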
def do_work
  if @tid.nil? || !@tid.alive?
    if !@tid.nil? && @tid.status.nil?
unless @tid.try(:status) ?
Does this condition need to live inside the condition above? It looks like it could be moved above the outer conditional and improve readability.
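A sketch of the restructuring being suggested, hoisting the dead-thread check out of the outer conditional (names are hypothetical; the real worker logs and spawns via its own helpers):

```ruby
# Hypothetical restructuring: detect an abnormally-dead thread up front
# (Thread#status is nil after an exception, false after a normal exit),
# then decide whether a new thread is needed.
def ensure_heartbeat(tid)
  warn "heartbeat thread exited abnormally" if tid && tid.status.nil?
  return tid if tid && tid.alive?
  Thread.new { sleep }  # placeholder for the real heartbeat loop
end

t = ensure_heartbeat(nil)  # no thread yet: one is spawned
puts t.alive?
```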
config/settings.yml (outdated)
@@ -972,7 +972,7 @@
 :prefetch_stale_threshold: 30.seconds
 :rails_server: puma
 :remote_console_type: VMRC
-:role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate
+:role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,websocket,web_services,automate,smartproxy
Is this turning the role on by default?
That line shouldn't be in there. Only the timeout below was meant to be changed. Thanks for noticing.
The settings file had an extra modified role that was not intended as part of this PR. Addressed style comments in runner.rb.
Force-pushed from e9398dc to 5d0ce26
Checked commits jerryk55/manageiq@906ed99~...5d0ce26 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
@bdunne changes have been made as requested.
Add Heartbeat Thread to SmartProxy Worker (cherry picked from commit a34f5e0) Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1531299
Gaprindashvili backport details:
Add Heartbeat Thread to SmartProxy Worker (cherry picked from commit a34f5e0) Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854
Fine backport details:
…_thread Add Heartbeat Thread to SmartProxy Worker (cherry picked from commit a34f5e0) Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1532854
In order to fix an issue where long-running Smartstate jobs get killed
under the mistaken assumption that they are being unresponsive when they
are actually quite busy, a separate thread is being added to the SmartProxy Worker
which just heartbeats every 30 seconds.
Please note: this PR does NOT address the issue whereby SSA jobs run for extremely long periods of time (over one hour in some cases). It simply allows those jobs to complete successfully.
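The mechanism described above reduces to a small, runnable pattern, sketched here with placeholder names and a much shorter interval than the worker's 30 seconds:

```ruby
# Runnable sketch of the heartbeat-thread pattern (placeholders only;
# the real worker calls its own heartbeat method every 30 seconds).
beats          = Queue.new
exit_requested = false

tid = Thread.new do
  until exit_requested
    beats << :beat  # stands in for the worker's heartbeat call
    sleep 0.02      # stands in for the 30-second interval
  end
end

sleep 0.1           # simulate a long-running SSA scan on the main thread
exit_requested = true
tid.join(1)

puts beats.size >= 2  # the worker kept heartbeating while "busy"
```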
@roliveri @hsong-rh Please review.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538
Steps for Testing/QA
Find a VM or instance for which SSA typically runs very long, and run SSA on it.
I currently have access to an Azure Windows instance in the Australia East region; when scanned from the US East Coast, SSA takes more than an hour. This PR allows the job to run to completion.