client: add support for checks in nomad services #13715

Merged: 2 commits, Jul 21, 2022

Conversation

@shoenig shoenig commented Jul 12, 2022

This PR adds support for specifying checks in services registered to
the built-in nomad service provider.

Currently only HTTP and TCP checks are supported, though more types
could be added later.

Closes #13717

Future Work https://github.com/hashicorp/team-nomad/issues/354

docs & e2e in a follow up PR

An example job file to play around with
job "fake" {
  datacenters = ["dc1"]

  group "fake" {

    network {
      mode = "bridge"
      port "http" { to = 9090 }
    }

    service {
      provider = "nomad"
      name     = "fake1"
      port     = "http"
      check {
        type     = "http"
        path     = "/"
        interval = "5s"
        timeout  = "1s"
      }
    }

    task "faketask" {
      driver = "docker"

      config {
        image = "nicholasjackson/fake-service:v0.23.1"
        ports = ["http"]
      }

      env {
        LISTEN_ADDR = "0.0.0.0:9090"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }
  }

  group "caching" {
    network {
      mode = "bridge"
      port "db" { to = 6379 }
    }

    service {
      provider = "nomad"
      name     = "redis"
      port     = "db"
      check {
        name     = "redis_tcp"
        type     = "tcp"
        interval = "10s"
        timeout  = "1s"
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Example query to the API

➜ nomad operator api /v1/allocation/fb261663-507f-95f1-ae02-7eaf14ea8998/checks  | jq .
{
  "3a6ceb54262f99cf87598b05bcb91af8": {
    "Check": "service: \"fake1\" check",
    "Group": "fake.fake[0]",
    "ID": "3a6ceb54262f99cf87598b05bcb91af8",
    "Mode": "healthiness",
    "Output": "nomad: http ok",
    "Service": "fake1",
    "Status": "success",
    "StatusCode": 200,
    "Timestamp": 1657663765
  }
}
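
For anyone scripting against this endpoint, here is a minimal sketch of decoding the response in Go. The `checkResult` type, its Go field types, and the `parseChecks` and `failing` helpers are assumptions made for illustration, inferred only from the sample output above; they are not the types used in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkResult mirrors the fields visible in the sample response.
// The type name and field types are assumptions, not Nomad's own structs.
type checkResult struct {
	Check      string
	Group      string
	ID         string
	Mode       string
	Output     string
	Service    string
	Status     string
	StatusCode int
	Timestamp  int64
}

// parseChecks decodes the response body, a JSON object keyed by check ID.
func parseChecks(body []byte) (map[string]checkResult, error) {
	var out map[string]checkResult
	if err := json.Unmarshal(body, &out); err != nil {
		return nil, err
	}
	return out, nil
}

// failing returns the IDs of checks whose Status is not "success".
func failing(results map[string]checkResult) []string {
	var ids []string
	for id, r := range results {
		if r.Status != "success" {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	body := []byte(`{
	  "3a6ceb54262f99cf87598b05bcb91af8": {
	    "Check": "service: \"fake1\" check",
	    "Group": "fake.fake[0]",
	    "ID": "3a6ceb54262f99cf87598b05bcb91af8",
	    "Mode": "healthiness",
	    "Output": "nomad: http ok",
	    "Service": "fake1",
	    "Status": "success",
	    "StatusCode": 200,
	    "Timestamp": 1657663765
	  }
	}`)
	results, err := parseChecks(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("checks:", len(results), "failing:", len(failing(results)))
}
```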

Member

@jrasell jrasell left a comment

This looks awesome! No blockers and mostly just questions for my education.

result := o.checker.Do(o.ctx, o.qc, query)

// and put the results into the store
_ = o.checkStore.Set(o.allocID, result)
Member

Would it be worth adding a comment why it's safe to ignore this error? shim.Set calls db.PutCheckResult, which depending on the implementation has potential to return an error.

Contributor Author

Good catch! added a missing error log statement if the shim is unable to set the check status in the persistent store
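
A minimal sketch of the pattern discussed in this thread: log the store error instead of discarding it with `_ =`. The `resultStore` interface, the `memStore` and `failingStore` types, and `recordResult` are hypothetical names for illustration; the actual shim and logger in the PR differ.

```go
package main

import (
	"errors"
	"log"
)

// resultStore is a hypothetical stand-in for the check store shim.
type resultStore interface {
	Set(allocID, result string) error
}

// memStore always succeeds; failingStore always returns an error.
type memStore struct{ data map[string]string }

func (m *memStore) Set(allocID, result string) error {
	m.data[allocID] = result
	return nil
}

type failingStore struct{}

func (failingStore) Set(string, string) error {
	return errors.New("persistent store unavailable")
}

// recordResult stores a check result and logs any failure rather than
// silently ignoring it. It reports whether the result was stored.
func recordResult(s resultStore, logf func(string, ...any), allocID, result string) bool {
	if err := s.Set(allocID, result); err != nil {
		logf("failed to set check status: alloc_id=%s error=%v", allocID, err)
		return false
	}
	return true
}

func main() {
	recordResult(&memStore{data: map[string]string{}}, log.Printf, "alloc-1", "success")
	recordResult(failingStore{}, log.Printf, "alloc-1", "success")
}
```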

Comment on lines +267 to +269
if err := h.shim.Purge(h.allocID); err != nil {
	h.logger.Error("failed to purge check results", "alloc_id", h.allocID, "error", err)
}
Member

Is there anything the operators can do if this log line is seen?

Contributor Author

Not really, no. PreKill doesn't report an error either so it's not like we can prevent the client from continuing with purging the alloc - though presumably if the state store can't remove a check, it can't remove anything else, either

Comment on lines 57 to 61
results, err := s.db.GetCheckResults()
if err != nil {
	s.log.Error("failed to restore health check results", "error", err)
	return
}
Member

Am I right in thinking that we log the error rather than return it so that the client doesn't fail to start due to a problem in restoring the check results and that results will be re-populated after the next subsequent trigger?

It might be useful to have a comment describing the behaviour regardless of whether my statement is correct.

Comment on lines +103 to +108
m, exists := s.current[allocID]
if !exists {
	return nil
}

return helper.CopyMap(m)
Member

Noting that helper.CopyMap handles nil maps so we could do away with the exists check, however, this does read better.
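
To illustrate the point about nil maps, here is a sketch using a hypothetical generic `copyMap`. It is not Nomad's `helper.CopyMap`, whose exact behaviour may differ; it only shows why a lookup of a missing key followed by a nil-tolerant copy gives the same result as the explicit `exists` check.

```go
package main

import "fmt"

// copyMap returns a shallow copy of m. A nil input yields a nil output,
// so callers do not need to guard against missing entries first.
// Hypothetical helper for illustration only.
func copyMap[K comparable, V any](m map[K]V) map[K]V {
	if m == nil {
		return nil
	}
	c := make(map[K]V, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

func main() {
	current := map[string]map[string]string{
		"alloc-1": {"check-1": "success"},
	}

	// Indexing a missing key yields the zero value, which is a nil map.
	missing := copyMap(current["alloc-2"])
	fmt.Println(missing == nil)

	// The copy is independent of the original.
	cp := copyMap(current["alloc-1"])
	cp["check-1"] = "changed"
	fmt.Println(current["alloc-1"]["check-1"])
}
```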

const (
	// maxTimeoutHTTP is a fail-safe value for the HTTP client, ensuring a Nomad
	// Client does not leak goroutines hanging on to unresponsive endpoints.
	maxTimeoutHTTP = 10 * time.Minute
Member

This seems somewhat high, but I don't have any idea how to choose a better value. Is there a particular reason it is set to 10 mins?

Contributor Author

@shoenig shoenig Jul 21, 2022

Yeah it's basically "much larger than a reasonable HC timeout" ... and "less than infinity". In my mind the slowest of checks should be on the order of a few seconds, e.g. incurring some database query


type checker struct {
	log   hclog.Logger
	clock libtime.Clock
Member

Am I correct that this wrapper around the standard lib is mostly used for testing capabilities? I just want to understand when it is better to use this compared to calling the standard lib directly.

Contributor Author

@shoenig shoenig Jul 21, 2022

Yup, it's just for testing! At $prevJob we had excellent control over time in our code - making testing of time-based logic not just possible, but easy. I'd like to start bringing some of those patterns into Nomad. (and really, using indirection over time.Now is 90% of the solution)

// will not move forward while the check is failing.
Healthiness CheckMode = "healthiness"

// A Readiness check is useful in the context of ensuring a service is
Member

Suggested change
- // A Readiness check is useful in the context of ensuring a service is
+ // A Readiness check is useful in the context of ensuring a service

Member

@schmichael schmichael left a comment

Sorry for the partial review. Looking good so far: no functional issues. I'll pick it back up ASAP.

- // l is used to lock shared fields listed below
- l sync.Mutex
+ // lock is used to lock shared fields listed below
+ lock sync.Mutex
Member

Hm, I'm a mu person myself, let's see..

$ rg -I '\W[a-z]+\W+sync.Mutex' | sed -e 's/var//' | awk '{print $1}' | sort | uniq -c | sort -nr
     26 mu
     16 lock
      8 l
      1 m
      1 errmu
      1 acquire

Seems like I'm still winning, but you're catching up!

But seriously anything is better than l so 👍

Contributor Author

We'll let the people decide!

@@ -262,7 +308,7 @@ func (t *Tracker) watchTaskEvents() {
}

// Store the task states
- t.l.Lock()
+ t.lock.Lock()
for task, state := range alloc.TaskStates {
//TODO(schmichael) for now skip unknown tasks as
Member

			//TODO(schmichael)

The scariest words I can see in a PR. I don't even think this comment is accurate. It seems to be copied and pasted from another place, but here we're iterating over alloc.TaskStates and updating a map that was originally populated from alloc.TaskStates ... I really don't see how group services could factor into this?

If you have 30 seconds of time to give this comment a think and remove it if you think it's nonsensical as well, I'd appreciate you cleaning up past schmichael's messes.

I don't see how taskHealth[task] could ever be !ok, but we don't have to worry about changing the actual code too.

Contributor Author

Heh yeah I was wondering about this 😅

@@ -321,17 +367,12 @@ func (t *Tracker) watchTaskEvents() {
t.setTaskHealth(false, false)

// Avoid the timer from firing at the old start time
Member

Suggested change
- // Avoid the timer from firing at the old start time
+ // Prevent the timer from firing at the old start time

@@ -381,8 +455,12 @@ func (t *Tracker) watchConsulEvents() {
OUTER:
for {
select {

// we are shutting down
Member

This is dangerous phrasing. Is "we" the agent shutting down? I believe it just means this tracker is no longer needed and you don't need to know why. It could be canceled due to the alloc being stopped or another event causing the health to be set, but I don't think actually shutting down the agent closes it! So maybe:

Suggested change
- // we are shutting down
+ // tracker has been canceled, no need to keep waiting

Member

@schmichael schmichael left a comment

Looks fantastic! Love all of the use-case-specific types.

Changelog entry in this PR maybe?

Comment on lines +579 to +580
case <-checkTicker.C:
	results = t.checkStore.List(allocID)
Member

Would be nice to add "blocking queries"/watching to checkStore so we could avoid polling here. Not a big deal here, the cardinality will be low relative to the 500ms timer so the CPU savings would be meaningless. Might simplify testing by removing one timing-dependent component.
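
One possible shape for the "watching" idea, sketched under assumptions: a store that hands out a channel which is closed on the next write, so a consumer can block on it instead of polling with a ticker. The `watchStore` type and its methods are hypothetical and do not exist in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// watchStore keeps check statuses and notifies watchers on every write.
type watchStore struct {
	mu      sync.Mutex
	results map[string]string
	changed chan struct{} // closed and replaced on each Set
}

func newWatchStore() *watchStore {
	return &watchStore{
		results: make(map[string]string),
		changed: make(chan struct{}),
	}
}

// Set records a status and wakes every goroutine blocked on Watch.
func (s *watchStore) Set(checkID, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.results[checkID] = status
	close(s.changed)
	s.changed = make(chan struct{})
}

// Watch returns a channel that is closed on the next Set.
func (s *watchStore) Watch() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.changed
}

// List returns a copy of the current statuses.
func (s *watchStore) List() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(s.results))
	for k, v := range s.results {
		out[k] = v
	}
	return out
}

func main() {
	s := newWatchStore()
	w := s.Watch()

	go s.Set("check-1", "success")

	<-w // blocks until the write happens, no ticker needed
	fmt.Println(s.List()["check-1"])
}
```

A consumer that needs continuous updates would call `Watch` again after each wake-up, before reading with `List`, so that no write is missed between iterations.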

return alloc
}

var checkHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Member

This is a fantastic testing approach!

Comment on lines +62 to +65
case "http":
	qr = c.checkHTTP(timeout, qc, q)
default:
	qr = c.checkTCP(timeout, qc, q)
Member

It's a bit surprising to me that we use an interface for Checkers when there's only 1 concrete implementation that just switches between internal logic. No need to change it. The interface is still useful for testing, and we can always split this up in the future if we have so many check types the single struct becomes unwieldy.

Contributor Author

In my head this was going to be super elegant with implementations per type (expanding in the future)... reality didn't quite get there yet 😞

}

// nomad checks do not have warnings
if sc.OnUpdate == "ignore_warnings" {
Member

I wish we used more string consts... That s on the end would be easy to forget. Nothing we need to block this PR for though.

Contributor Author

I'll do this in a followup PR
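
A small sketch of the string-constant suggestion. The constant name `onUpdateIgnoreWarnings` and the `ignoresWarnings` helper are hypothetical; only the `"ignore_warnings"` value comes from the diff.

```go
package main

import "fmt"

// onUpdateIgnoreWarnings names the value once, so a misspelling such as
// "ignore_warning" becomes a compile error at the use site instead of a
// silent mismatch. The constant name is an assumption for illustration.
const onUpdateIgnoreWarnings = "ignore_warnings"

// ignoresWarnings reports whether the on_update setting asks for warnings
// to be ignored.
func ignoresWarnings(onUpdate string) bool {
	return onUpdate == onUpdateIgnoreWarnings
}

func main() {
	fmt.Println(ignoresWarnings("ignore_warnings"))
	fmt.Println(ignoresWarnings("ignore_warning"))
}
```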

@shoenig shoenig merged commit 4508af8 into main Jul 21, 2022
@shoenig shoenig deleted the dev-nsd-checks branch July 21, 2022 15:22
@shoenig shoenig added this to the 1.4.0 milestone Aug 24, 2022
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
Development

Successfully merging this pull request may close these issues.

feature: support for specifying checks in native nomad services
3 participants