Autopromote fails when an allocation goes unhealthy but is properly replaced #8150
Comments
Hi, any thoughts on this bug? We still see it today with v0.12.8. |
Currently investigating this along with #7058, which appears to be at least somewhat related. |
Update on that #7058 investigation: that fix has shipped, but it specifically had to do with the progress deadline not being set, so I kind of doubt it's related at this point. I'm going to try to put together a minimal reproduction before putting this on the roadmap for development, but if you have one I'd be happy to validate it. |
Thanks for looking into this, tgross. We definitely do see this issue still and have just been kicking the deploy by changing the count on a taskgroup when it happens. I'll refresh my original investigation and see if there was anything I did to try to reproduce it or if I had just submitted evidence when the situation occurred. |
Any update on this? We are still seeing it |
I'm seeing something similar fairly frequently:
1. Start a deployment; the first alloc fails.
2. After the first alloc fails, a replacement alloc attempt is made.
3. The second alloc attempt succeeds, and yet the deployment is stuck.
4. Need to manually promote.
|
Looking at this:
If my understanding is correct, PlacedCanaries holds a list of all canary IDs placed at any point in time, including those now unhealthy/replaced. This would mean that if you're ever placing more canaries than desired, this check always returns the wrong result. Using that, I'd propose the following change:
Let me know if this makes sense! |
Hi @chuckyz 👋 Great investigation! This looks good to me. I think the only change is that you would want to check for the inverse condition.

I don't know if you have an easy repro, but just for the record, this is the job I used:

```hcl
job "canary" {
  datacenters = ["dc1"]

  meta {
    uuid = uuidv4()
  }

  group "canary" {
    count = 3

    restart {
      attempts = 1
    }

    update {
      max_parallel     = 3
      canary           = 3
      auto_promote     = true
      min_healthy_time = "2s"
    }

    task "canary" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash
if [[ $NOMAD_ALLOC_ID =~ ^[a-fA-F] ]]; then
  echo "alloc ID starts with letter, bye"
  exit 1
fi
echo "alloc ID doesn't start with letter"
while true; do
  sleep 5
done
EOF
        destination = "local/script.sh"
      }
    }
  }
}
```

It takes a bit of luck to trigger a failure, but you can just run the job multiple times; the `uuidv4()` meta value forces a new deployment on each run.

Feel free to open a PR with this patch 🙂 |
@lgfa29 open! I've opened this ahead of testing it locally so that I can get some eyeballs on the test, because that exact test is... complicated, and I'd be lying if I said I truly understood what I just read/wrote. edit: |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Operating system and Environment details
Amazon Linux 2
Issue
A job set to autopromote ended up requiring manual promotion to complete the deployment. The only thing different in this case was that one of the allocations was unhealthy, so Nomad replaced it. The new allocation became healthy and the healthy threshold was met. However, rather than auto-promoting, the deployment stalled.
The deployment status was reporting
Deployment is running pending automatic promotion
when queried, but only when we clicked "Promote" in the UI did the deployment complete successfully. In this deployment, I clicked "Promote" at 1:07 PM PT, which is when the deployment completed. Notice the
progress_deadline
values were about half an hour before this. So this tells me the canary allocations satisfied the healthy threshold (enough not to fail the deployment) but, for some reason, still required a manual promotion. Unfortunately, I'm not sure how to replicate this: we see the issue in a low percentage of deployments, but given how many deployments we do, it shows up a few times a week.