-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Client driver health checks for Docker #3856
Conversation
ee0f21f
to
9e06dd9
Compare
client/client.go
Outdated
return c.config.Node | ||
} | ||
|
||
func (c *Client) updateNodeFromHealthCheck(response *cstructs.HealthCheckResponse) *structs.Node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What synchronizes this with the servers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning to add this in a separate PR that will address synchronization and scheduling.
nomad/structs/structs.go
Outdated
@@ -1139,6 +1148,9 @@ type Node struct { | |||
// Raft Indexes | |||
CreateIndex uint64 | |||
ModifyIndex uint64 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Raft indexes are kept at the bottom of the structs so put this above them. I would put it above attributes.
nomad/structs/structs.go
Outdated
@@ -1062,6 +1062,15 @@ func ValidNodeStatus(status string) bool { | |||
} | |||
} | |||
|
|||
// DriverInfo is the current state of a single driver. This is updated | |||
// regularly as driver health changes on the node. | |||
type DriverInfo struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have to make a duplicate of this in the api/node.go and update the node
client/fingerprint/fingerprint.go
Outdated
// it should be run periodically. The return value is a boolean indicating | ||
// whether it should be done periodically, and the time interval at which | ||
// this check should happen. | ||
CheckHealthPeriodic() (bool, time.Duration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be request response as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of either name.
Check -> HealthCheck
CheckHealthPeriodic -> GetHealthCheckInterval
client/structs/structs.go
Outdated
type HealthCheckRequest struct{} | ||
|
||
type HealthCheckResponse struct { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line
client/fingerprint_manager.go
Outdated
|
||
fm.nodeLock.Lock() | ||
if node := fm.updateHealthCheck(&response); node != nil { | ||
fm.node = node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock only needs to be held in this if
client/driver/docker.go
Outdated
HealthDescription: "Docker driver is available but unresponsive", | ||
UpdateTime: time.Now(), | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use d.dockerClients()
client/driver/docker.go
Outdated
return err | ||
} | ||
|
||
d.logger.Printf("[DEBUG] driver.docker: docker driver is available and is responsive to `docker ps`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TRACE
client/fingerprint_manager.go
Outdated
if err := hc.Check(request, &response); err != nil { | ||
return err | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the value of the health check hasn't changed avoid calling updateHealthCheck
. Otherwise every minute we will be causing a the client to push changes to the server and cause unnecessary Raft transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about handling that in updateHealthCheck
itself, as #3873 moves this logic to updateNodeFromFingeprint
client/client.go
Outdated
for name, val := range response.Drivers { | ||
c.config.Node.Drivers[name] = val | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The driver health seems like a nice metric to emit. Healthy and unhealthy with driver as a label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. This seemed like the best option but let me know if you had something else in mind. https://github.com/armon/go-metrics/blob/2e4f2be0fe4f6b7096471aa85f2c342bff3b8f4f/metrics.go#L105
2831093
to
29c49cf
Compare
b8695fd
to
4f1ba6b
Compare
nomad/structs/node.go
Outdated
return false | ||
} | ||
|
||
if strings.Compare(di.HealthDescription, other.HealthDescription) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use ==
scheduler/feasible.go
Outdated
option.ID, driverStr, value) | ||
return false | ||
} | ||
// TODO this is a compatibility mechanism- as of Nomad 0.8, nodes have a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use this syntax for compatibility code:
// COMPAT: Remove in 0.X: As of Nomad 0.8....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets us search for compatibility code
client/client.go
Outdated
oldVal := c.config.Node.Drivers[name] | ||
if oldVal == nil && newVal != nil { | ||
c.config.Node.Drivers[name] = newVal | ||
} else if newVal != nil && newVal.Detected != oldVal.Detected { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the attributes change? This will skip merging
client/driver/docker.go
Outdated
@@ -552,6 +551,48 @@ func (d *DockerDriver) Fingerprint(req *cstructs.FingerprintRequest, resp *cstru | |||
return nil | |||
} | |||
|
|||
// HealthChck implements the interface for the HealthCheck interface. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling
client/driver/docker.go
Outdated
|
||
client, _, err := d.dockerClients() | ||
if err != nil { | ||
d.logger.Printf("[WARN] driver.docker: docker driver is available but is unresponsive to `docker ps`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always log the error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if you see this in logs how will you know which line caused the log? Add distinguishing information in the log lines so they are useful for diagnosing. An example:
d.logger.Printf("[WARN] driver.docker: failed to retrieve Docker client docker health check: %v", err)
client/fingerprint_manager.go
Outdated
case <-time.After(period): | ||
err := fm.healthCheck(name, hc) | ||
if err != nil { | ||
fm.logger.Printf("[DEBUG] client.fingerprint_manager: health checking for %v failed: %+v", name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use %v for errors
client/fingerprint_manager.go
Outdated
logger *log.Logger | ||
updateNodeAttributes func(*cstructs.FingerprintResponse) *structs.Node | ||
|
||
// UpdateHealthCheck is a callback to the client to update the state of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lower case
// away from annotating a node's attributes to demonstrate support for a | ||
// particular driver. Takes the FingerprintResponse and converts it to the | ||
// proper DriverInfo update and then sets the prefix attributes as well | ||
func (fm *FingerprintManager) fingerprintDriver(name string, f fingerprint.Fingerprint) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only being called the first time. Periodic, health checking drivers should always call this. Where "this" is the merging logic into the driverinfo
client/fingerprint_manager.go
Outdated
go fm.run(d, period, name) | ||
go fm.runFingerprint(d, period, name) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what this should do is not use runFingerprint
or runHealthCheck
but instead gather the fact that the Driver is health checking or periodic and pass all that to a new method that runs the drivers fingerprint/healthcheck periodically and then take the returned info and properly construct the DriverInfo object (merged health check and attributes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky as the fingerprint/healthcheck methods run on different intervals, so if we merge them in lockstep then we only update the node as often has the slowest interval. Thinking through ways to do this more cleanly.
client/fingerprint_manager.go
Outdated
// in a backwards compatible way, where node attributes for drivers will | ||
// eventually be phased out. | ||
di := &structs.DriverInfo{ | ||
Attributes: response.Attributes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stripping:
DriverInfo.Attributes: map[string]string{
"version": "1.12.3",
}
Node.Attributes: {
"driver.docker.version": "1.12.3"
}
client/client.go
Outdated
// update the node with the latest driver health information | ||
for name, newVal := range response.Drivers { | ||
if newVal == nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about when newVal is nil but oldVal is not here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a safety precaution, every driver should have a corresponding driver info value for a node.
scheduler/feasible_test.go
Outdated
} | ||
checker := NewDriverChecker(ctx, drivers) | ||
if act := checker.Feasible(c.Node); act != c.Result { | ||
t.Fatalf("case(%d) failed: got %v; want %v", i, act, c.Result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rmaybe use require rather than t.Fatalf here?
scheduler/feasible.go
Outdated
return false | ||
} | ||
|
||
if !enabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplify by changing this to return enabled
// If the Drivers field has not yet been initialized, it does so here. | ||
func (h *HealthCheckResponse) AddDriverInfo(name string, driverInfo *structs.DriverInfo) { | ||
// initialize Drivers if it has not been already | ||
if h.Drivers == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make a constructor for HealthCheckResponse
and create an empty map there? Doing a lazy init is a bit more efficient because it avoids creating the map till you actually need it, but it trades off some readability and I am not sure the small efficiency gains are worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAde some comments
bd262da
to
244389c
Compare
client/client.go
Outdated
|
||
if fingerprint != nil { | ||
if c.config.Node.Drivers[name] == nil { | ||
// if the driver info has not yet been set, do that here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize your comments. This is Go style: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
client/client.go
Outdated
events := []*structs.NodeEvent{ | ||
{ | ||
Subsystem: "Driver", | ||
Message: fmt.Sprintf("Driver status changed: %s", health.HealthDescription), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Driver status changed:" can probably be removed.
client/driver/docker.go
Outdated
|
||
_, err = client.ListContainers(docker.ListContainersOptions{All: false}) | ||
if err != nil { | ||
d.logger.Printf("[WARN] driver.docker: failed to list Docker containers in the process of a docker health check: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize Docker
scheduler/feasible.go
Outdated
option.ID, driverStr, value) | ||
return false | ||
} | ||
// COMPAT: Remove in 0.X: As of Nomad 0.8, nodes have a DriverInfo that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.X -> 0.10
client/fingerprint_manager.go
Outdated
logger *log.Logger | ||
updateNodeAttributes func(*cstructs.FingerprintResponse) *structs.Node | ||
|
||
updateNodeFromDriver func(string, *structs.DriverInfo, *structs.DriverInfo) *structs.Node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
client/fingerprint_manager.go
Outdated
case <-t2.C: | ||
if isHealthCheck { | ||
t2.Reset(healthCheckPeriod) | ||
err := fm.runHealthCheck(name, hc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err := fm.runHealthCheck(name, hc); err != nil {
client/fingerprint_manager.go
Outdated
@@ -118,6 +234,21 @@ func (fm *FingerprintManager) fingerprint(name string, f fingerprint.Fingerprint | |||
return response.Detected, nil | |||
} | |||
|
|||
// healthcheck checks the health of the specified resource. | |||
func (fm *FingerprintManager) runHealthCheck(name string, hc fingerprint.HealthCheck) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to runDriverHealthCheck
and fix comment
client/fingerprint_manager.go
Outdated
@@ -2,6 +2,7 @@ package client | |||
|
|||
import ( | |||
"log" | |||
"strings" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try to arrange the methods such that drivers methods are colocated and fingerprint methods are too. I think it would be simpler to follow if it went:
- Struct
- NewFingerprintManager
- Run.
- SetupFingerprinter
- SetupDriver
6-X: Fingerprinter stuff
7-X: Driver stuff
client/fingerprint_manager.go
Outdated
fm.logger.Printf("[DEBUG] client.fingerprint_manager: periodic fingerprinting for driver %v failed: %+v", name, err) | ||
} | ||
case <-t2.C: | ||
if isHealthCheck { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we shouldn't run the health check if detected isn't true. Otherwise you will log errors continuously, right?
I think you also skip the initial health check here and just rely on this loop: https://github.com/hashicorp/nomad/pull/3856/files#diff-8cf4b6aaa496508270af1cd14fdc1527R92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with this, but the second point would entail that all drivers that allow health checking will also be periodic. If that is always true, then we can do as you said and skip the initial check and only update on a periodic interval.
|
||
client, _, err := d.dockerClients() | ||
if err != nil { | ||
d.logger.Printf("[WARN] driver.docker: failed to retrieve Docker client in the process of a docker health check: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the reason I feel like we skip the health check unless it is detected, otherwise you will mark it as unhealthy and log even if docker just isn't there.
7ee98af
to
2adea66
Compare
2adea66
to
e17343d
Compare
fix up feedback from code review add driver info for all drivers to node
… check Slight updates for go style
bump log level
e17343d
to
89ffc96
Compare
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
No description provided.