Promoting a canary causes its service check to immediately become critical #4566
For fun, I fired up an old test environment that's running an older version (Nomad v0.7.1 (0b295d3)). This problem does not occur on 0.7.1, since Consul service definitions are not replaced there. Using the same workflow as above, I can check during the deploy:

And then after promoting:
Wow, I observed the same behavior in a test setup with Docker containers and blue/green promotion. I had chalked it up to something I had done wrong among the myriad of options, but it seems others are facing the same issue. I haven't tried a previous version to see whether blue/green works there. For what it's worth, the usual rolling-upgrade setting works, i.e. the service stays online when I upgrade only a few allocations at a time. At the time of my observation I was using Nomad 0.8.4 and Consul 1.2.1.
I've tested this on three versions of Nomad and two versions of Consul now. The behaviour seems to be Consul version-agnostic, but FWIW I've performed my testing with Consul 1.0.2 and 1.2.0. As for Nomad:

- Nomad v0.7.1 (0b295d3) - not affected
- Nomad v0.8.3 - not affected
- Nomad v0.8.4 (dbee1d7) - affected

My best guess is that this new behaviour was introduced via #4259, specifically these lines: https://github.com/hashicorp/nomad/pull/4259/files#diff-8612d5e61f84976db572b3f38eb380c0R1056

The Consul service ID is hashed from the allocation ID + task name to avoid collisions, and, as of that PR, from the canary bool as well. Since the canary stops being a canary when the job is promoted, the hash changes and a new Consul service registration is essentially forced. The new service consequently starts out critical until its checks pass.

I'm not familiar enough with Nomad's codebase to know whether including the canary status in the hash is necessary, but if it isn't, then excluding it from the hashing function is probably the easiest/best solution. Perhaps it's not as straightforward though (plus, the canary tags are also hashed in this function, which will cause this behaviour for anyone who uses them).
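For readers who haven't used the feature, here is a rough sketch of the kind of service stanza this refers to; the service name, port label, tags, and check endpoint are invented for illustration. While an allocation is a canary, Nomad registers it with canary_tags; after promotion the regular tags apply, and on 0.8.4 that canary-specific state feeds the hashed service ID.

```hcl
service {
  # Hypothetical names and tags, for illustration only.
  name = "web-app"
  port = "http"

  # Tags used once the allocation is promoted.
  tags = ["blue-green", "live"]

  # Tags used while the allocation is still a canary.
  canary_tags = ["blue-green", "canary"]

  check {
    type     = "http"
    path     = "/healthz"
    interval = "10s"
    timeout  = "2s"
  }
}
```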
@byronwolfman just a thought, could the following help (to mark the status as passing when the service is registered): setting initial_status = "passing" in the service definition?
Saw this too, with Nomad 0.8.4 and Consul 1.2.0, and it happened when I was trying out canary tags. I thought it was a misconfiguration on my part and didn't get a chance to really look into it.
Yep, but I've described why this is almost worse than the service check flapping to critical. By setting initial_status = "passing" we would be advertising the task before it is ready. We don't want to advertise the task until it is ready to receive traffic, which is why it makes sense for Consul to start all registered services as "critical" until their healthchecks are passing. Unfortunately this has unintended consequences for the canary.

Having thought about this, what we will probably do is to set … We're also considering rolling back to 0.8.3, but 0.8.4 is so stable otherwise that I'm not too stoked about the idea of a rollback.
@byronwolfman apologies, I didn't see that you had already mentioned setting the initial status. I too had thought about the side effects of marking something healthy when it is not.

Regards,
Hi all, any traction on this?
An alternative would be to update one Consul service at a time, so as not to leave all of them in a critical state at the same time. What do you think?
I appreciate this and the other well-meaning workarounds, but this is not our desired workflow, and it doubles the time it takes to deploy an app. I opened this issue because the change breaks the documented workflow for blue/green deployments here: https://www.nomadproject.io/guides/operating-a-job/update-strategies/blue-green-and-canary-deployments.html

The breaking change here isn't to an undocumented edge case that happened to work as a side effect of how Nomad and Consul interact; it's a breaking change that invalidates the entire document above.
And that's a problem. How can you use blue/green and have traffic redirected only to canaries, or only to promoted allocs, if you have no way to differentiate between the two?

I don't see how it would be a problem if that is handled by Nomad when it replaces the Consul service.
@schmichael and @preetapan, you are the ones who improved deployments in the commits referenced in #4566 (comment) (more specifically 17c6eb8#diff-8612d5e61f84976db572b3f38eb380c0L1043). Do you know why the Consul service ID is a hash that includes the canary flag (and the canary tags)? Why was it done that way? Could we consider removing the canary flag and tags from the hash to fix this issue?
@mildred I addressed this in a comment on the mailing list: https://groups.google.com/forum/#!topic/nomad-tool/GatYXal9evg

TL;DR - using the service ID's hash to encapsulate all of its internal state (like canary status and tag values) was, in retrospect, not the right design decision and has been a source of brittleness. We are working on redesigning this for 0.9.
@preetapan
As you are redesigning this, perhaps it would be a good time to implement #3636: an option to enable automatic promotion of canaries when they are all healthy.
@preetapan do you have any news on the plan for this issue for Nomad 0.9? Is it planned for 0.9.0 or for a later release in the 0.9.x branch?
Any update on this? The bug has existed for 8 months now (reported 6 months ago). I know it's planned to be fixed in 0.9.x, but that release seems to keep getting delayed, and the massive scope of 0.9.0 frankly makes us wary of upgrading. The discussion in this thread indicates that the bug will not be fixed in 0.9.0 but in a later point release (so 0.9.1 at the earliest), and then we're going to need to wait for at least the next point release after that to make sure there were no show-stoppers. I feel like by the time we can actually start using the blue/green pattern again, a full year will have passed.

In the meantime, https://www.nomadproject.io/guides/operating-a-job/update-strategies/blue-green-and-canary-deployments.html continues to advertise an update strategy that does not actually work.

Is it not possible to backport something -- anything -- into 0.8.x? Is it possible to just have a configuration option to disable canary tagging in Consul, since this is the source of the bug?
So, I probably should have done this a lot sooner. I've put together a quick and dirty patch against v0.8.4 that disables canary tagging. I've erred on the side of making as few code changes as possible, since the PR that introduced canary tags (#4259) is massive. This means that rather than a comprehensive cleanup effort that includes refactors and test coverage (aka the correct approach), you're getting the equivalent of a handful of commented-out lines.

This modifies the service.Hash() function; it still accepts the canary argument but no longer folds it into the hash:

```diff
diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 969f11338..49efad7fe 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -3930,11 +3930,6 @@ func (s *Service) Hash(allocID, taskName string, canary bool) string {
 		io.WriteString(h, tag)
 	}
 
-	// Vary ID on whether or not CanaryTags will be used
-	if canary {
-		h.Write([]byte("Canary"))
-	}
-
 	// Base32 is used for encoding the hash as sha1 hashes can always be
 	// encoded without padding, only 4 bytes larger than base64, and saves
 	// 8 bytes vs hex. Since these hashes are used in Consul URLs it's nice
```

The next diff modifies ServiceClient.serviceRegs() to no longer add the canary tags to the registration. The change is meant to revert the commit that introduced them, comments included.

```diff
diff --git a/command/agent/consul/client.go b/command/agent/consul/client.go
index 7ca1fc845..3a0d6c7f3 100644
--- a/command/agent/consul/client.go
+++ b/command/agent/consul/client.go
@@ -641,24 +641,17 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, t
 		return nil, fmt.Errorf("unable to get address for service %q: %v", service.Name, err)
 	}
 
-	// Determine whether to use tags or canary_tags
-	var tags []string
-	if task.Canary && len(service.CanaryTags) > 0 {
-		tags = make([]string, len(service.CanaryTags))
-		copy(tags, service.CanaryTags)
-	} else {
-		tags = make([]string, len(service.Tags))
-		copy(tags, service.Tags)
-	}
-
 	// Build the Consul Service registration request
 	serviceReg := &api.AgentServiceRegistration{
 		ID:      id,
 		Name:    service.Name,
-		Tags:    tags,
+		Tags:    make([]string, len(service.Tags)),
 		Address: ip,
 		Port:    port,
 	}
+	// copy isn't strictly necessary but can avoid bugs especially
+	// with tests that may reuse Tasks
+	copy(serviceReg.Tags, service.Tags)
 
 	ops.regServices = append(ops.regServices, serviceReg)
 
 	// Build the check registrations
```

I've been playing around with this in a sandbox environment and so far it's behaving as expected, so we might fire it into our test cluster later this week. If you want to try it out in your own test environment, this repo provides all the tools you need to build your own binary:
A few other things in no particular order:
Hey friends and fellow Nomad operators, good(ish) news: after soaking the patch in our test environments for a week, we've rolled it to prod and have our safe blue/greens back. I can't recommend this as a general fix for everyone running a >= 0.8.4 cluster, but if you have a very strong understanding of how your cluster operates and what kinds of jobs are sent to it, you're not using the canary tags feature, and you have a very robust way of testing and validating this kind of patch, well... it works. Use at your own risk, of course, since the "fix" is that it disables code it hopes you won't try to use. If you're not in a hurry, though, it's probably still best to wait for an official (and correct) fix in 0.9.x.
bump
Sorry for the lack of communication on this bug. It will not make it into 0.9.0, but it is one of my highest priorities after release.
🎉 🎉 🎉 🎉
As far as I can tell, this issue has not been fixed in Nomad 0.9.1 (Consul 1.3.1). Consul health status with 2 containers + 1 canary running, before promote:
Consul health status after promote, all previous service definitions have been deregistered:
Hey! It seems that 0.9.1 was put out early as a bug fix release, but the fix should be in 0.9.2, along with some other really neat lifecycle stuff: 4ddb4bb#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR26
Nomad version
Nomad v0.8.4 (dbee1d7)
Consul v1.2.0
Operating system and Environment details
CentOS Linux release 7.5.1804 (Core) (EC2)
Issue
Our engineering teams have noticed that their web apps sometimes brown out whenever we promote canaries. Many of these apps are very small, and we only run two containers. Therefore, in our update stanza we set canary = 2 (sketched below) to do a blue-green update, wait for the canaries to become healthy, and then promote them. We use Consul and consul-template to add services to our load balancers, so that as new services become healthy they're entered into the catalog and consequently into the load balancer config.

What we've noticed is that at the moment of promotion, the service suddenly seems to disappear from the load balancer. The 2 old containers are removed from the Consul catalog because we've just promoted the job, and that's expected. But the 2 new containers also seem to disappear, even though they were healthy a moment ago. They then re-appear a few seconds later, but not before we drop a bunch of traffic on the floor.
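For reference, the update stanza in question looks roughly like the following; this is a sketch with placeholder timings rather than our exact job, but it matches the blue/green pattern described in the guide.

```hcl
update {
  # Run a full second set of allocations as canaries (the group also
  # runs count = 2), wait for them to pass health checks, then promote.
  canary           = 2
  max_parallel     = 2
  health_check     = "checks"
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = true
}
```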
As far as I can tell, this is because when a Nomad task is promoted, its service definition in Consul is replaced. This is kind of bad, since all Consul services start out as critical by default.
Reproduction steps
You can watch this in real time if Nomad and Consul are both running. Deploy a job to Nomad that registers a service with Consul, and then deploy an update with a canary. You can look at the service definitions in Consul with:
So far so good. As soon as the job is promoted, however:
This is a new service definition in Consul, and therefore its healthcheck starts out critical. We have a couple of ways we could work around this, but neither is appealing. One is to set initial_status = "passing" in the service definition (a sketch follows below); that is potentially worse, since it means unhealthy services will immediately start accepting traffic ("unhealthy" meaning an unready service, but also a completely broken service).

In other words, we're in a pretty bad place right now.
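For concreteness, this is roughly what that rejected workaround would look like; the service name, port label, and check endpoint are placeholders.

```hcl
service {
  name = "web-app"   # placeholder
  port = "http"

  check {
    type     = "http"
    path     = "/healthz"   # placeholder endpoint
    interval = "10s"
    timeout  = "2s"

    # Register the check as passing rather than critical, so the
    # replacement registration is routable immediately -- including
    # when the task behind it is not actually healthy yet.
    initial_status = "passing"
  }
}
```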
Job file (if appropriate)
A playground app of ours: