panic in 1.2.0 and 1.2.1 in scheduler for system jobs with class constraints #11563
If it helps, I've been experiencing an issue very similar to #7743. Over a year ago I experimented with the driver-host-path CSI plugin in an attempt to familiarize myself with using CSI plugins with Nomad. This did not yield useful results, so I tried to delete the plugin. That left me in a very unfortunate scenario where my servers now spew several hundred fsm error lines upon each boot, as well as when, I assume, some GC routine attempts to reap this now entirely stuck plugin job. The panic always happens after a few hundred of these have been emitted.

Additionally, downgrading back to v1.1.8 allows the servers to function once again. The fsm errors are still there, but the panic is gone.
Hi @dcarbone! Sorry to hear about your trouble. It looks like the panic bug was introduced in 41b853b, which shipped in 1.2.0. This won't just impact ARM64; you were simply the unlucky first reporter because your cluster has node classes to filter on. We'll get a patch up ASAP.
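For context, the general Go failure mode behind this class of panic is writing to a map that was declared but never initialized. The sketch below is hypothetical; the names are illustrative and are not Nomad's actual scheduler identifiers:

```go
package main

import "fmt"

// countEligibleByClass tallies nodes per node class. This is a toy
// illustration of the bug class described above, not Nomad's real code.
func countEligibleByClass(nodeClasses []string) map[string]int {
	// Bug variant: `var eligible map[string]int` leaves the map nil, and
	// the first write below would panic with
	// "assignment to entry in nil map".
	// The fix is to allocate the map before writing to it:
	eligible := make(map[string]int)
	for _, class := range nodeClasses {
		eligible[class]++ // safe only because the map was initialized above
	}
	return eligible
}

func main() {
	fmt.Println(countEligibleByClass([]string{"foo", "foo", ""}))
}
```

Reads from a nil map are legal in Go (they return the zero value), which is why this kind of bug can hide until the first code path that actually writes.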
Ok, I was able to reproduce this on Nomad 1.2.0 in the following circumstances:
If the system job is rejected for all nodes or accepted for all nodes, we don't hit this code path, which probably explains why testing unfortunately didn't catch it. (One more reason to resurrect the prop testing PR #8832.) There's another map right after this point in the code that can probably be hit as well, so patching just this bug would undoubtedly reveal another panic there, so I'll fix them both.

To reproduce on a Vagrant box, run two Nomad processes. One server + client config without a node class:

```hcl
log_level  = "debug"
data_dir   = "/var/nomad/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled          = true
  bootstrap_expect = 1
  raft_protocol    = 3
}

client {
  enabled = true
  # node_class = # not enabled!
}
```

And one client with a node class:

```hcl
log_level  = "debug"
data_dir   = "/var/nomad-client01/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled = false
}

client {
  enabled    = true
  node_class = "foo"
  servers    = ["10.0.2.15:4647"]
}

ports {
  http = 5646
  rpc  = 5647
  serf = 5648
}
```

Then run the following jobspec:

```hcl
job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "web" {
    constraint {
      attribute = "${node.class}"
      value     = "fuzz"
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
      }
    }
  }
}
```
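To make the trigger condition above concrete: per the earlier comment, the bad code path only runs when the class filter splits the node set, i.e. some nodes pass the constraint and some don't. A hypothetical sketch (not Nomad's actual scheduler code):

```go
package main

import "fmt"

// partitionByClass splits nodes into those whose class matches the job's
// constraint and those filtered out. The panic-prone path described above
// is only reached when the result is mixed: both slices non-empty.
func partitionByClass(nodeClasses []string, required string) (accepted, rejected []string) {
	for _, class := range nodeClasses {
		if class == required {
			accepted = append(accepted, class)
		} else {
			rejected = append(rejected, class)
		}
	}
	return accepted, rejected
}

func main() {
	// One client with node_class = "foo", one with node_class = "bar",
	// and a job constrained to class "foo": a mixed accept/reject result.
	acc, rej := partitionByClass([]string{"foo", "bar"}, "foo")
	fmt.Printf("accepted=%d rejected=%d mixed=%v\n",
		len(acc), len(rej), len(acc) > 0 && len(rej) > 0)
}
```

This also suggests why a unit test that runs the job against an all-matching or all-rejecting node set would never exercise the buggy branch.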
Looks like we're on track to get this fixed a bit later today. Thanks again for the report, @dcarbone
awesome, thanks for the lightning fast fix!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.2.1 (719c53ac0ebee95d902faafe59a30422a091bc31)
Operating system and Environment details
Linux 5.11.0-1022-raspi #24-Ubuntu aarch64
Issue
Server nodes continuously panic on boot after a time
Reproduction steps
Unsure exactly; I've been experiencing random instability since upgrading to v1.2.1, and now we're here. The server never gets beyond the boot stage.
Nomad Server logs (if appropriate)