-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add heartbeat_check script for file-based worker process heartbeating #15494
Conversation
@miq-bot assign @gtanzillo Feel free to assign others as you see fit. cc @jrafanie @carbonin @bdunne |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few questions.
Also, may want to remove the blank
file.
class MiqDefaults | ||
HEARTBEAT_TIMEOUT = 2.minutes.freeze | ||
STARTING_TIMEOUT = 10.minutes.freeze | ||
STOPPING_TIMEOUT = 10.minutes.freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a new class for these values? Shouldn't we be using the ones in Settings
here.
I'm not sure why they were in two different places to begin with...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smells like thin workers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were removing hardcoded values that existed in the code base of just 10.minutes
(see the diff). There wasn't any reference to Settings with what I replaced these with. If these shouldn't be hardcoded in the first place, that is an entirely different question that I don't have the answer to.
As far as why it is it's own class, since we want to have the heartbeat_check
execute as simply as possibly, without needing external dependencies, so making it it's own class allows it to be reused in both the codebase and in this script.
lib/workers/bin/heartbeat_check.rb
Outdated
timeout = if contents.empty? | ||
(mtime + Workers::MiqDefaults.heartbeat_timeout).utc | ||
else | ||
Time.parse(contents).utc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we adding this?
I feel like the mtime should do the job for now.
Are we going to use this for queue message timeouts that would exceed the worker's timeout setting? (looking at you reports ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is the "more accurate" way of determining the timeout, but I guess I failed to mention this in my commit messages and the PR description. I will try and explain:
So, if you look where Workers::MiqDefaults.heartbeat_timeout
is actually called from, you will see if is called from here:
This method takes a hearbeat timeout value from the settings (or defaults it to Workers::MiqDefaults.heartbeat_timeout
if that does not exist), and then compares it against the w.last_heartbeat
value. Action is then taken if the heartbeat is not responding, but it is based on the w.last_heartbeat
and the timeout value in the settings.
To avoid having to do a DB lookup in this script, I put the calculation burden of determining when this timeout time was on the worker writing the file, avoiding the heartbeat script having to deal with that. Then, the only thing that the script needs to do is read that value and determine if that time has already passed or not.
The mtime
is now just as a fallback incase the worker didn't write to the file properly, and needs to make a "best guess". That said, we maybe should do a rescue
and use the mtime
and not concern ourselves with checking the contents.empty?
, or just let the script fail.
Yeah... woops... that was me doing a bit of QA on this myself, and clearly did a #yoloGitAddDashA without a care in the world... I will fix tomorrow. |
901b62b
to
e6fc419
Compare
lib/workers/bin/heartbeat_check.rb
Outdated
end | ||
|
||
if current_time < timeout | ||
puts 0 if options[:verbose] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carbonin Will this mess up the common logging stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, it will just get parsed as the message value. If something isn't json formatted then it will still get recorded it just won't have all the other values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fryguy you would also have to pass -v
or --verbose
to this script to get that output, so it is more of a debug flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or... if you are to lazy to follow up with a echo $?
afterwords... #whoHasTwoThumbsAndWouldDoThat #thisGuy
10b768a
to
7a16f34
Compare
app/models/miq_worker/runner.rb
Outdated
@@ -386,7 +385,8 @@ def heartbeat_to_drb | |||
end | |||
|
|||
def heartbeat_to_file | |||
FileUtils.touch(@worker.heartbeat_file) | |||
timeout = worker_settings[:heartbeat_timeout] || Workers::MiqDefaults.heartbeat_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're going to have the worker write his timeout to the heartbeat file, what do you think about including this existing logic where the timeout of the an active queue message (for queue workers only) is the timeout if the worker is processing the message? I think we'll still need to support that in some form in the new architecture.
With that said, I'm not sure how that can be done because the worker heartbeats before taking work from the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed reply on this, but yeah... hrmmm... not sure how we would be able to do this... would have to ponder for a bit...
I assume you are referring to this bit in particular: https://github.com/ManageIQ/manageiq/blob/8b04e133/app/models/miq_server/worker_management/monitor/settings.rb#L38-L41
That said, depending on the queue changes, there might not be available if the current_timeout
stops being a thing in the new architecture: https://github.com/ManageIQ/manageiq/blob/8b04e133/app/models/miq_worker.rb#L451-L454
lib/workers/bin/heartbeat_check.rb
Outdated
heartbeat_file = options[:heartbeat_file] || ARGV[0] | ||
heartbeat_file ||= Workers::MiqDefaults.heartbeat_file(options[:guid]) | ||
|
||
exit_status = Workers::Heartbeat.file_check heartbeat_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing parens around method params
spec/lib/workers/heartbeat_spec.rb
Outdated
|
||
example.run | ||
|
||
FileUtils.rm_f calculated_hb_file.to_s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
context "when passing in a filepath as an argument" do | ||
other_heartbeat_file = ManageIQ.root.join("tmp", "spec", "other.hb").to_s | ||
|
||
it_should_behave_like "heartbeat file checker", other_heartbeat_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I not only disagree with this one, since this is how rspec itself says how to do it, but also would meant that we should be writing specs like this:
describe(".some_method") do
context("with some changed variable") do
expect(subject.some_method).to(eq(true))
end
end
Which... well isn't the case in... oh... 101% of our specs #fakeMath
@bdunne 😑 |
The "@bdunne style check bot" strikes again |
7a16f34
to
c19a274
Compare
Creates a set of worker defaults instead of hardcoded values in methods that can't be reused.
This method moves the function for determining the heartbeat filename into the Workers::MiqDefaults, and sets a default to `tmp/miq_worker.hb` if neither a ENV for the filename is provided, or a guid of the worker.
Use the running worker process to write it's own timeout time in it, which then won't require a healthchecker script to determine it itself by querying the DB/worker settings.
Currently just a single, testable class method that is meant to be a reusable form of the logic needed to perform a heartbeat check. The logic flows as follows (Fail is a false result, and Pass is true): - Fail if the heartbeat file does not exist - Fail if the heartbeat file exists, there is no content, and the mtime of the file is stale (older than 2 minutes is the default). - Pass if it is currently within that timeout - Fail if the heartbeat file exists, there is content writen, and it is parsed into a time and is currently passed the timeout written into that file - Pass if it is currently within that timeout A file can be passed into that method as an argument, or it can be determined via Workers::MiqDefaults.heartbeat_file, which will look at the current ENV of WORKER_HEARTBEAT_FILE before determining a reasonable default.
c19a274
to
8ca25a7
Compare
@gtanzillo @carbonin @bdunne I think I have solved the queue worker stuff in a pretty small amount of changes, but haven't tested manually to confirm that it works. Let me know if this looks reasonable to address Gregg's previous concern. |
This script basically works with the docker's HEALTHCHECK interface: https://docs.docker.com/engine/reference/builder/#healthcheck So it will return 0 if the healthcheck passes, and 1 otherwise. How it evaluates the healthcheck: * Return 1 if no healthcheck file exists. * Return 1 if the contents of the file are empty, and the mtime of the file + the default timeout for the worker have passed * Return 1 if contents of the file are present, and the time parsed from the file has already passed (the time in the file should be the time at which the worker wrote that it's heartbeat timeout has since passed) * Return 0 if either the contents of the file are present with a parsed timeout value, or a timeout value of the mtime of the file + the default worker heartbeat_timeout, has still not passed. The healthcheck file can be passed in as the first ARG, as a cli option, ENV variable (same as it would be set in the worker process), or by passing in the guid, and it will evaluate where file should exist from that. Finally, if none of those are provided, it will then assume it is in `tmp/miq_worker.hb`. Again, if the file doesn't exist, it will still return a 1 exit status.
8ca25a7
to
838c766
Compare
|
||
# Only for file based heartbeating | ||
def heartbeat_message_timeout(message) | ||
if ENV["WORKER_HEARTBEAT_METHOD"] == "file" && msg.msg_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/msg/message/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carbonin thanks... copy-pasta from where I pulled that from.
Currently, it is MiqServer's role to determine if the current worker is taking too long to process a message by introspecting the current job they are working on. When using a file based heartbeating system, switch that responsibility to the worker runner itself, and make sure that it is in charge of determining when it's next timeout is, and write that just before starting to do the message work.
838c766
to
3dd93ab
Compare
Some comments on commits NickLaMuro/manageiq@487b2ce~...3dd93ab lib/workers/bin/heartbeat_check.rb
|
Checked commits NickLaMuro/manageiq@487b2ce~...3dd93ab with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0 lib/workers/bin/heartbeat_check.rb
|
# Only for file based heartbeating | ||
def heartbeat_message_timeout(message) | ||
if ENV["WORKER_HEARTBEAT_METHOD"] == "file" && message.msg_timeout | ||
timeout = worker_settings[:poll] + message.msg_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is worker_settings[:poll]
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fryguy @gtanzillo @carbonin @bdunne is there anything else that you wanted to see done with this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me 👍 Will merge tomorrow unless someone objects.
Adds heartbeat_check script which basically works with the docker's
HEALTHCHECK
interface:https://docs.docker.com/engine/reference/builder/#healthcheck
So it will return 0 if the healthcheck passes, and 1 otherwise.
How it evaluates the healthcheck:
The healthcheck file can be passed in as the first ARG, as a cli option, ENV variable (same as it would be set in the worker process), or by passing in the guid, and it will evaluate where file should exist from that. Finally, if none of those are provided, it will then assume it is in
tmp/miq_worker.hb
. Again, if the file doesn't exist, it will still return a 1 exit status.Links
Steps for Testing/QA
Try a few different scenarios for testing the script
$ ruby lib/workers/bin/healthbeat_check.rb -v
$ touch fake.hb
$ ruby lib/workers/bin/healthbeat_check.rb -v fake.hb
$ ruby lib/workers/bin/healthbeat_check.rb -v fake.hb
WORKER_HEARTBEAT_FILE
ENV var set:WORKER_HEARTBEAT_FILE
:$ export WORKER_HEARTBEAT_FILE=real.hb
$ ruby libworkers/bin/run_single_worker.rb MiqUiWorker
$ ruby lib/workers/bin/healthbeat_check.rb -v
$ unset WORKER_HEARTBEAT_FILE
$ ruby libworkers/bin/run_single_worker.rb MiqUiWorker
tmp/
dir for the.hb
file (something like2345-23ref-as32s-23rfsd.hb
$ ruby lib/workers/bin/healthbeat_check.rb -v 2345-23ref-as32s-23rfsd.hb
$ ruby lib/workers/bin/healthbeat_check.rb -v --guid "2345-23ref-as32s-23rfsd"