client: add support for checks in nomad services #13715
Conversation
This PR adds support for specifying checks in services registered to the built-in nomad service provider. Currently only HTTP and TCP checks are supported, though more types could be added later.
This looks awesome! No blockers and mostly just questions for my education.
result := o.checker.Do(o.ctx, o.qc, query)

// and put the results into the store
_ = o.checkStore.Set(o.allocID, result)
Would it be worth adding a comment on why it's safe to ignore this error? shim.Set calls db.PutCheckResult, which depending on the implementation has the potential to return an error.
Good catch! Added a missing error log statement for when the shim is unable to set the check status in the persistent store.
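A rough sketch of that change, using simplified stand-in types rather than the PR's actual ones (the store interface, field names, and log message here are illustrative):

package checks

import "github.com/hashicorp/go-hclog"

// checkResult and resultStore are simplified stand-ins for the PR's types.
type checkResult struct{ Status string }

type resultStore interface {
	Set(allocID string, result checkResult) error
}

// observer loosely mirrors the shape implied by the snippet above.
type observer struct {
	log     hclog.Logger
	store   resultStore
	allocID string
}

// record persists a check result, logging (rather than silently discarding)
// any failure to write it to the store.
func (o *observer) record(result checkResult) {
	if err := o.store.Set(o.allocID, result); err != nil {
		o.log.Error("failed to set check result", "alloc_id", o.allocID, "error", err)
	}
}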
if err := h.shim.Purge(h.allocID); err != nil {
	h.logger.Error("failed to purge check results", "alloc_id", h.allocID, "error", err)
}
Is there anything the operators can do if this log line is seen?
Not really, no. PreKill doesn't report an error either so it's not like we can prevent the client from continuing with purging the alloc - though presumably if the state store can't remove a check, it can't remove anything else, either
results, err := s.db.GetCheckResults()
if err != nil {
	s.log.Error("failed to restore health check results", "error", err)
	return
}
Am I right in thinking that we log the error rather than return it so that the client doesn't fail to start due to a problem restoring the check results, and that results will be repopulated on the next trigger?
It might be useful to have a comment describing the behaviour regardless of whether my statement is correct.
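For example, a doc comment along these lines would capture that behaviour (a sketch with simplified stand-in types, not the PR's actual signatures):

package checks

import "github.com/hashicorp/go-hclog"

// resultDB is a simplified stand-in for the client's persistent state store.
type resultDB interface {
	GetCheckResults() (map[string]string, error)
}

type store struct {
	log     hclog.Logger
	db      resultDB
	current map[string]string
}

// Restore loads check results persisted by a previous client run. Errors are
// logged rather than returned so that a problem reading old results never
// prevents the client from starting; the store is repopulated on the next
// check interval regardless.
func (s *store) Restore() {
	results, err := s.db.GetCheckResults()
	if err != nil {
		s.log.Error("failed to restore health check results", "error", err)
		return
	}
	s.current = results
}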
m, exists := s.current[allocID]
if !exists {
	return nil
}

return helper.CopyMap(m)
Noting that helper.CopyMap handles nil maps, so we could do away with the exists check; however, this does read better.
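For reference, the reason that would work is that a copy helper of this shape returns nil for a nil or empty input (a generic sketch of the pattern, not necessarily Nomad's exact helper.CopyMap implementation):

package helper

// copyMap returns nil for a nil or empty input, so callers can index a map
// with a possibly missing key and copy the result without an explicit
// existence check.
func copyMap[K comparable, V any](m map[K]V) map[K]V {
	if len(m) == 0 {
		return nil
	}
	out := make(map[K]V, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

With that behaviour, return helper.CopyMap(s.current[allocID]) alone would be correct, though as noted the explicit exists check reads more clearly.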
const (
	// maxTimeoutHTTP is a fail-safe value for the HTTP client, ensuring a Nomad
	// Client does not leak goroutines hanging on to unresponsive endpoints.
	maxTimeoutHTTP = 10 * time.Minute
This seems somewhat high, but I don't have any idea how to choose a better value. Is there a particular reason it is set to 10 mins?
Yeah, it's basically "much larger than a reasonable HC timeout" ... and "less than infinity". In my mind the slowest of checks should be on the order of a few seconds, e.g. one incurring some database query.
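To illustrate the fail-safe role, a hypothetical sketch of clamping a configured check timeout at maxTimeoutHTTP before building the client (the function and clamping behaviour are illustrative, not the PR's actual code):

package checks

import (
	"net/http"
	"time"
)

// maxTimeoutHTTP caps the timeout of any HTTP check client.
const maxTimeoutHTTP = 10 * time.Minute

// httpClientFor builds an HTTP client for a check, clamping the configured
// timeout so an unresponsive endpoint can never pin a goroutine indefinitely.
func httpClientFor(configured time.Duration) *http.Client {
	timeout := configured
	if timeout <= 0 || timeout > maxTimeoutHTTP {
		timeout = maxTimeoutHTTP
	}
	return &http.Client{Timeout: timeout}
}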
type checker struct {
	log   hclog.Logger
	clock libtime.Clock
Am I correct that this wrapper around the standard lib is mostly used for testing capabilities? I just want to understand when it is better to use this compared to calling the standard lib directly.
Yup, it's just for testing! At $prevJob we had excellent control over time in our code, making testing of time-based logic not just possible, but easy. I'd like to start trying to bring some of those patterns into Nomad. (And really, using indirection over time.Now is 90% of the solution.)
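A minimal sketch of that pattern: production code depends on a small clock interface rather than calling time.Now directly, and tests swap in a controllable fake (this is the general shape, not the exact libtime.Clock API):

package checks

import "time"

// Clock abstracts the one piece of package time the checker needs, so tests
// can control the apparent current time.
type Clock interface {
	Now() time.Time
}

// realClock is the production implementation backed by the standard library.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock is a test double whose time only moves when the test says so.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

// timestamp shows production code depending on the indirection instead of
// calling time.Now directly.
func timestamp(c Clock) int64 { return c.Now().Unix() }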
nomad/structs/checks.go
Outdated
// will not move forward while the check is failing.
Healthiness CheckMode = "healthiness"

// A Readiness check is useful in the context of ensuring a service is
Suggested change:
- // A Readiness check is useful in the context of ensuring a service is
+ // A Readiness check is useful in the context of ensuring a service
Sorry for the partial review. Looking good so far: no functional issues. I'll pick it back up ASAP.
- // l is used to lock shared fields listed below
- l sync.Mutex
+ // lock is used to lock shared fields listed below
+ lock sync.Mutex
Hm, I'm a mu person myself, let's see...
$ rg -I '\W[a-z]+\W+sync.Mutex' | sed -e 's/var//' | awk '{print $1}' | sort | uniq -c | sort -nr
26 mu
16 lock
8 l
1 m
1 errmu
1 acquire
Seems like I'm still winning, but you're catching up!
But seriously, anything is better than l, so 👍
We'll let the people decide!
@@ -262,7 +308,7 @@ func (t *Tracker) watchTaskEvents() {
	}

	// Store the task states
-	t.l.Lock()
+	t.lock.Lock()
	for task, state := range alloc.TaskStates {
		//TODO(schmichael) for now skip unknown tasks as
//TODO(schmichael)
The scariest words I can see in a PR. I don't even think this comment is accurate. It seems to have been copied and pasted from another place, but here we're iterating over alloc.TaskStates and updating a map that was originally populated from alloc.TaskStates... I really don't see how group services could factor into this?
If you have 30 seconds to give this comment a think and remove it if you think it's nonsensical as well, I'd appreciate you cleaning up past schmichael's messes.
I don't see how taskHealth[task] could ever be !ok, but we don't have to worry about changing the actual code here either.
Heh yeah I was wondering about this 😅
client/allochealth/tracker.go
Outdated
@@ -321,17 +367,12 @@ func (t *Tracker) watchTaskEvents() {
	t.setTaskHealth(false, false)

	// Avoid the timer from firing at the old start time
Suggested change:
- // Avoid the timer from firing at the old start time
+ // Prevent the timer from firing at the old start time
@@ -381,8 +455,12 @@ func (t *Tracker) watchConsulEvents() {
OUTER:
	for {
		select {

		// we are shutting down
This is dangerous phrasing. Is "we" the agent shutting down? I believe it just means this tracker is no longer needed and you don't need to know why. It could be canceled due to the alloc being stopped or another event causing the health to be set, but I don't think actually shutting down the agent closes it! So maybe:
- // we are shutting down
+ // tracker has been canceled, no need to keep waiting
Looks fantastic! Love all of the use-case-specific types.
Changelog entry in this PR maybe?
case <-checkTicker.C:
	results = t.checkStore.List(allocID)
It would be nice to add "blocking queries"/watching to checkStore so we could avoid polling here. Not a big deal here: the cardinality will be low relative to the 500ms timer, so the CPU savings would be meaningless. It might simplify testing, though, by removing one timing-dependent component.
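For context, the watch-based alternative would look roughly like the loop below, reacting to pushed updates instead of ticking; the Watch method is hypothetical, since the PR's checkStore only exposes List:

package allochealth

import "context"

// CheckResults is a placeholder for the per-alloc check result type.
type CheckResults map[string]string

// watchableStore is a hypothetical extension of the check store that pushes
// updates instead of being polled.
type watchableStore interface {
	Watch(ctx context.Context, allocID string) <-chan CheckResults
}

// watchChecks reacts to check-result changes as they arrive, removing the
// 500ms ticker and its timing dependence from the health tracker.
func watchChecks(ctx context.Context, store watchableStore, allocID string, onChange func(CheckResults)) {
	updates := store.Watch(ctx, allocID)
	for {
		select {
		case <-ctx.Done():
			return
		case results := <-updates:
			onChange(results)
		}
	}
}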
	return alloc
}

var checkHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
This is a fantastic testing approach!
case "http": | ||
qr = c.checkHTTP(timeout, qc, q) | ||
default: | ||
qr = c.checkTCP(timeout, qc, q) |
It's a bit surprising to me that we use an interface for Checkers when there's only 1 concrete implementation that just switches between internal logic. No need to change it. The interface is still useful for testing, and we can always split this up in the future if we have so many check types the single struct becomes unwieldy.
In my head this was going to be super elegant with implementations per type (expanding in the future)... reality didn't quite get there yet 😞
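For what it's worth, the per-type version would look something like one small implementation per check type behind the same interface (a sketch of the idea with stand-in types, not code from the PR):

package checks

import "context"

// Query and QueryResult stand in for the PR's check query types.
type Query struct{ Type string }
type QueryResult struct{ Status string }

// Checker is the shared interface; today a single struct switches on the
// check type internally, but each type could get its own implementation.
type Checker interface {
	Do(ctx context.Context, q Query) QueryResult
}

type httpChecker struct{}

func (httpChecker) Do(ctx context.Context, q Query) QueryResult {
	// issue the HTTP request for the check here
	return QueryResult{Status: "success"}
}

type tcpChecker struct{}

func (tcpChecker) Do(ctx context.Context, q Query) QueryResult {
	// dial the check's TCP address here
	return QueryResult{Status: "success"}
}

// checkerFor picks an implementation per check type, replacing the internal
// switch in the single concrete checker.
func checkerFor(q Query) Checker {
	if q.Type == "http" {
		return httpChecker{}
	}
	return tcpChecker{}
}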
}

// nomad checks do not have warnings
if sc.OnUpdate == "ignore_warnings" {
I wish we used more string consts... The s on the end would be easy to forget. Nothing we need to block this PR for, though.
I'll do this in a follow-up PR.
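The follow-up would presumably look something like the constants below, so comparisons can't drift from the canonical spelling (the constant names are illustrative; the values match Nomad's documented on_update options):

package structs

// OnUpdate values accepted for a service check. Comparing against these
// constants avoids typos such as dropping the trailing "s" from
// "ignore_warnings".
const (
	OnUpdateRequireHealthy = "require_healthy"
	OnUpdateIgnoreWarnings = "ignore_warnings"
	OnUpdateIgnore         = "ignore"
)

The check in the snippet above then becomes if sc.OnUpdate == OnUpdateIgnoreWarnings.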
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Closes #13717
Future Work https://github.com/hashicorp/team-nomad/issues/354
docs & e2e in a follow up PR
An example job file to play around with
Example query to the API