Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add heartbeat_check script for file-based worker process heartbeating #15494

Merged
merged 6 commits into from
Jul 20, 2017

Conversation

NickLaMuro
Copy link
Member

@NickLaMuro NickLaMuro commented Jun 30, 2017

Adds heartbeat_check script which basically works with the docker's HEALTHCHECK interface:

https://docs.docker.com/engine/reference/builder/#healthcheck

So it will return 0 if the healthcheck passes, and 1 otherwise.

How it evaluates the healthcheck:

  • Return 1 if no healthcheck file exists.
  • Return 1 if the contents of the file are empty, and the mtime of the file + the default timeout for the worker have passed
  • Return 1 if contents of the file are present, and the time parsed from the file has already passed (the time in the file should be the time at which the worker wrote that it's heartbeat timeout has since passed)
  • Return 0 if either the contents of the file are present with a parsed timeout value, or a timeout value of the mtime of the file + the default worker heartbeat_timeout, has still not passed.

The healthcheck file can be passed in as the first ARG, as a cli option, ENV variable (same as it would be set in the worker process), or by passing in the guid, and it will evaluate where file should exist from that. Finally, if none of those are provided, it will then assume it is in tmp/miq_worker.hb. Again, if the file doesn't exist, it will still return a 1 exit status.

Links

Steps for Testing/QA

Try a few different scenarios for testing the script

  1. With no worker running
    • run the script: $ ruby lib/workers/bin/healthbeat_check.rb -v
    • it should return 1
  2. With a random file
    • create a fake heartbeat script: $ touch fake.hb
    • run the script: $ ruby lib/workers/bin/healthbeat_check.rb -v fake.hb
    • it should return 0
    • wait 2 minutes or more
    • run the script: $ ruby lib/workers/bin/healthbeat_check.rb -v fake.hb
    • it should return 1
  3. With a running worker with the WORKER_HEARTBEAT_FILE ENV var set:
    • set WORKER_HEARTBEAT_FILE: $ export WORKER_HEARTBEAT_FILE=real.hb
    • run a worker: $ ruby libworkers/bin/run_single_worker.rb MiqUiWorker
    • wait for it to start
    • run the script: $ ruby lib/workers/bin/healthbeat_check.rb -v
    • it should return 0
    • try this a few more times after a bit, and it should always return 0
  4. With a running worker, without using the ENV variable
    • unset the above ENV variable (if needed): $ unset WORKER_HEARTBEAT_FILE
    • run a worker: $ ruby libworkers/bin/run_single_worker.rb MiqUiWorker
    • wait for it to start
    • look in your tmp/ dir for the .hb file (something like 2345-23ref-as32s-23rfsd.hb
    • run the script: $ ruby lib/workers/bin/healthbeat_check.rb -v 2345-23ref-as32s-23rfsd.hb
    • it should return 0
    • run the script (using the guid): $ ruby lib/workers/bin/healthbeat_check.rb -v --guid "2345-23ref-as32s-23rfsd"
    • it should return 0

@NickLaMuro
Copy link
Member Author

@miq-bot assign @gtanzillo

Feel free to assign others as you see fit. cc @jrafanie @carbonin @bdunne

Copy link
Member

@carbonin carbonin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few questions.

Also, may want to remove the blank file.

class MiqDefaults
HEARTBEAT_TIMEOUT = 2.minutes.freeze
STARTING_TIMEOUT = 10.minutes.freeze
STOPPING_TIMEOUT = 10.minutes.freeze
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a new class for these values? Shouldn't we be using the ones in Settings here.

I'm not sure why they were in two different places to begin with...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smells like thin workers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were removing hardcoded values that existed in the code base of just 10.minutes (see the diff). There wasn't any reference to Settings with what I replaced these with. If these shouldn't be hardcoded in the first place, that is an entirely different question that I don't have the answer to.

As far as why it is it's own class, since we want to have the heartbeat_check execute as simply as possibly, without needing external dependencies, so making it it's own class allows it to be reused in both the codebase and in this script.

timeout = if contents.empty?
(mtime + Workers::MiqDefaults.heartbeat_timeout).utc
else
Time.parse(contents).utc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding this?

I feel like the mtime should do the job for now.
Are we going to use this for queue message timeouts that would exceed the worker's timeout setting? (looking at you reports ...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is the "more accurate" way of determining the timeout, but I guess I failed to mention this in my commit messages and the PR description. I will try and explain:

So, if you look where Workers::MiqDefaults.heartbeat_timeout is actually called from, you will see if is called from here:

https://github.com/ManageIQ/manageiq/blob/3ea55fa/app/models/miq_server/worker_management/monitor/validation.rb#L4-L19

This method takes a hearbeat timeout value from the settings (or defaults it to Workers::MiqDefaults.heartbeat_timeout if that does not exist), and then compares it against the w.last_heartbeat value. Action is then taken if the heartbeat is not responding, but it is based on the w.last_heartbeat and the timeout value in the settings.

To avoid having to do a DB lookup in this script, I put the calculation burden of determining when this timeout time was on the worker writing the file, avoiding the heartbeat script having to deal with that. Then, the only thing that the script needs to do is read that value and determine if that time has already passed or not.

The mtime is now just as a fallback incase the worker didn't write to the file properly, and needs to make a "best guess". That said, we maybe should do a rescue and use the mtime and not concern ourselves with checking the contents.empty?, or just let the script fail.

@NickLaMuro
Copy link
Member Author

Also, may want to remove the blank file.

Yeah... woops... that was me doing a bit of QA on this myself, and clearly did a #yoloGitAddDashA without a care in the world... I will fix tomorrow.

end

if current_time < timeout
puts 0 if options[:verbose]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carbonin Will this mess up the common logging stuff?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, it will just get parsed as the message value. If something isn't json formatted then it will still get recorded it just won't have all the other values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy you would also have to pass -v or --verbose to this script to get that output, so it is more of a debug flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or... if you are to lazy to follow up with a echo $? afterwords... #whoHasTwoThumbsAndWouldDoThat #thisGuy

@NickLaMuro NickLaMuro force-pushed the heartbeat_check branch 3 times, most recently from 10b768a to 7a16f34 Compare July 7, 2017 21:53
@NickLaMuro
Copy link
Member Author

@Fryguy @carbonin Moved a core amount of the logic into a simple ruby class to allow this logic to be testable. Should remove the need for a lot of the above QA steps in the future (since they are automated), but some manual testing of this probably wouldn't hurt since it doesn't cover everything.

@@ -386,7 +385,8 @@ def heartbeat_to_drb
end

def heartbeat_to_file
FileUtils.touch(@worker.heartbeat_file)
timeout = worker_settings[:heartbeat_timeout] || Workers::MiqDefaults.heartbeat_timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're going to have the worker write his timeout to the heartbeat file, what do you think about including this existing logic where the timeout of the an active queue message (for queue workers only) is the timeout if the worker is processing the message? I think we'll still need to support that in some form in the new architecture.

With that said, I'm not sure how that can be done because the worker heartbeats before taking work from the queue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed reply on this, but yeah... hrmmm... not sure how we would be able to do this... would have to ponder for a bit...

I assume you are referring to this bit in particular: https://github.com/ManageIQ/manageiq/blob/8b04e133/app/models/miq_server/worker_management/monitor/settings.rb#L38-L41

That said, depending on the queue changes, there might not be available if the current_timeout stops being a thing in the new architecture: https://github.com/ManageIQ/manageiq/blob/8b04e133/app/models/miq_worker.rb#L451-L454

heartbeat_file = options[:heartbeat_file] || ARGV[0]
heartbeat_file ||= Workers::MiqDefaults.heartbeat_file(options[:guid])

exit_status = Workers::Heartbeat.file_check heartbeat_file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing parens around method params


example.run

FileUtils.rm_f calculated_hb_file.to_s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

context "when passing in a filepath as an argument" do
other_heartbeat_file = ManageIQ.root.join("tmp", "spec", "other.hb").to_s

it_should_behave_like "heartbeat file checker", other_heartbeat_file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I not only disagree with this one, since this is how rspec itself says how to do it, but also would meant that we should be writing specs like this:

describe(".some_method") do
  context("with some changed variable") do
    expect(subject.some_method).to(eq(true))
  end
end

Which... well isn't the case in... oh... 101% of our specs #fakeMath

@NickLaMuro
Copy link
Member Author

@bdunne 😑

@NickLaMuro
Copy link
Member Author

The "@bdunne style check bot" strikes again

Creates a set of worker defaults instead of hardcoded values in methods
that can't be reused.
This method moves the function for determining the heartbeat filename
into the Workers::MiqDefaults, and sets a default to `tmp/miq_worker.hb`
if neither a ENV for the filename is provided, or a guid of the worker.
Use the running worker process to write it's own timeout time in it,
which then won't require a healthchecker script to determine it itself
by querying the DB/worker settings.
Currently just a single, testable class method that is meant to be a
reusable form of the logic needed to perform a heartbeat check.

The logic flows as follows (Fail is a false result, and Pass is true):

- Fail if the heartbeat file does not exist
- Fail if the heartbeat file exists, there is no content, and the mtime
  of the file is stale (older than 2 minutes is the default).
  - Pass if it is currently within that timeout
- Fail if the heartbeat file exists, there is content writen, and it is
  parsed into a time and is currently passed the timeout written into
  that file
  - Pass if it is currently within that timeout

A file can be passed into that method as an argument, or it can be
determined via Workers::MiqDefaults.heartbeat_file, which will look at
the current ENV of WORKER_HEARTBEAT_FILE before determining a reasonable
default.
@NickLaMuro
Copy link
Member Author

@gtanzillo @carbonin @bdunne I think I have solved the queue worker stuff in a pretty small amount of changes, but haven't tested manually to confirm that it works. Let me know if this looks reasonable to address Gregg's previous concern.

This script basically works with the docker's HEALTHCHECK interface:

https://docs.docker.com/engine/reference/builder/#healthcheck

So it will return 0 if the healthcheck passes, and 1 otherwise.

How it evaluates the healthcheck:

* Return 1 if no healthcheck file exists.
* Return 1 if the contents of the file are empty, and the mtime of the
  file + the default timeout for the worker have passed
* Return 1 if contents of the file are present, and the time parsed from
  the file has already passed (the time in the file should be the time
  at which the worker wrote that it's heartbeat timeout has since
  passed)
* Return 0 if either the contents of the file are present with a
  parsed timeout value, or a timeout value of the mtime of the file +
  the default worker heartbeat_timeout, has still not passed.

The healthcheck file can be passed in as the first ARG, as a cli option,
ENV variable (same as it would be set in the worker process), or by
passing in the guid, and it will evaluate where file should exist from
that.  Finally, if none of those are provided, it will then assume it is
in `tmp/miq_worker.hb`.  Again, if the file doesn't exist, it will still
return a 1 exit status.

# Only for file based heartbeating
def heartbeat_message_timeout(message)
if ENV["WORKER_HEARTBEAT_METHOD"] == "file" && msg.msg_timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/msg/message/ ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carbonin thanks... copy-pasta from where I pulled that from.

Currently, it is MiqServer's role to determine if the current worker is
taking too long to process a message by introspecting the current job
they are working on.

When using a file based heartbeating system, switch that responsibility
to the worker runner itself, and make sure that it is in charge of
determining when it's next timeout is, and write that just before
starting to do the message work.
@miq-bot
Copy link
Member

miq-bot commented Jul 14, 2017

Some comments on commits NickLaMuro/manageiq@487b2ce~...3dd93ab

lib/workers/bin/heartbeat_check.rb

  • ⚠️ - 20 - Detected puts. Remove all debugging statements.
  • ⚠️ - 33 - Detected puts. Remove all debugging statements.

@miq-bot
Copy link
Member

miq-bot commented Jul 14, 2017

Checked commits NickLaMuro/manageiq@487b2ce~...3dd93ab with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
8 files checked, 4 offenses detected

lib/workers/bin/heartbeat_check.rb

# Only for file based heartbeating
def heartbeat_message_timeout(message)
if ENV["WORKER_HEARTBEAT_METHOD"] == "file" && message.msg_timeout
timeout = worker_settings[:poll] + message.msg_timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is worker_settings[:poll] here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carbonin carbonin dismissed their stale review July 14, 2017 19:18

addressed

@NickLaMuro
Copy link
Member Author

@Fryguy @gtanzillo @carbonin @bdunne is there anything else that you wanted to see done with this PR?

Copy link
Member

@gtanzillo gtanzillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me 👍 Will merge tomorrow unless someone objects.

@gtanzillo gtanzillo added this to the Sprint 65 Ending Jul 24, 2017 milestone Jul 20, 2017
@gtanzillo gtanzillo merged commit 99a3672 into ManageIQ:master Jul 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants