Nomad v1.5.2 panics with a segmentation violation #16623
Hi @henrikjohansen! I just took a pass over this code and the nil pointer when we dereference the …
@henrikjohansen it might help me narrow down the behavior to hear about the timing of this. Is this crash happening as the client is coming back up and restoring tasks, or is it happening when placing new tasks? If you can expand the window of the client logs to include when the allocation is placed (or restored), that could be helpful.
@tgross It happens as the new client comes back up. I have attached a larger section of the log file ... client.log
That's very interesting! I'm seeing a whole lot of log lines like this:
These messages indicate that the client's state is missing expected data. There's "alloc runner state" (state for the allocation) and "task runner state" (state for the tasks within an allocation), and the only way you'd see these messages is if the task runner state was missing. Has this client been running for a very long time, such that these allocation IDs might be referring to pre-0.9 allocations?
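(For readers who aren't familiar with Nomad's client state, here is a purely illustrative Go sketch of the shape being described; the type and field names are hypothetical, not Nomad's actual types. The point is only that task runner state is stored per task name, so an allocation can be present in the state DB while one of its tasks has no stored state at all.)

```go
package main

import "fmt"

// Hypothetical shapes, for illustration only.
type taskRunnerState struct {
	LocalState  any // state the task runner persisted for its hooks
	DriverState any // the task driver's handle, if any
}

type allocRunnerState struct {
	AllocID string
	// Keyed by task name; an entry can be missing even though the
	// allocation itself was restored from the state DB.
	Tasks map[string]*taskRunnerState
}

func main() {
	ar := allocRunnerState{AllocID: "415e7f9c", Tasks: map[string]*taskRunnerState{}}
	if ts, ok := ar.Tasks["main"]; !ok || ts == nil {
		fmt.Println("no task runner state for", ar.AllocID, "(main)")
	}
}
```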
I still haven't been able to reproduce this, but another thing I'm noticing from the log file you provided is that Nomad was started (pid 2449124) and then stopped 4s later and started again (pid 2449377) about 4s after that. Was there a problem during startup that isn't showing in the logs here?
@tgross I had already removed the … I'll send that to the specified email in a few moments 👍
Ok @henrikjohansen thanks so much for that state file. I don't have a conclusion yet but I wanted to give you an update. I used the `nomad operator client-state` command, with the patch below applied so that it warns about missing task runner state:

```diff
diff --git a/command/operator_client_state.go b/command/operator_client_state.go
index ce8da4e8e..d398f9394 100644
--- a/command/operator_client_state.go
+++ b/command/operator_client_state.go
@@ -71,6 +71,16 @@ func (c *OperatorClientStateCommand) Run(args []string) int {
 		tg := alloc.Job.LookupTaskGroup(alloc.TaskGroup)
 		for _, jt := range tg.Tasks {
 			ls, rs, err := db.GetTaskRunnerState(allocID, jt.Name)
+			if ls == nil {
+				c.Ui.Warn(fmt.Sprintf("no task runner state for %s (%s)", allocID, jt.Name))
+				tasks[jt.Name] = &taskState{
+					LocalState:  ls,
+					RemoteState: rs,
+					DriverState: nil,
+				}
+				continue
+
+			}
 			if err != nil {
 				c.Ui.Error(fmt.Sprintf("failed to get task runner state %s: %v", allocID, err))
 				return 1
```

So then let's look at what's in the state file (with the task group names redacted):

```sh
# dump the state to JSON
$ nomad operator client-state . | jq . > client-state.json
no task runner state for 415e7f9c-4407-608d-27c8-f5476c07d279 (<redacted>)
no task runner state for 791e9a62-69f7-20fe-ff6f-5d3e447997b3 (<redacted>)
no task runner state for 7ad798aa-cf25-d88e-5276-1bba3a9e40fc (<redacted>)
no task runner state for 88b03ee9-0476-3946-ca4b-ba343ec73055 (<redacted>)
no task runner state for d6ed2325-ebb2-3385-f1cb-9507b4f1a205 (<redacted>)
no task runner state for e004690f-530a-38de-fc94-e7d83a6d56bc (<redacted>)
# how many allocations are there
$ cat client-state.json | jq '.Allocations | length'
24
# get the oldest allocation
$ for ts in $(cat client-state.json | jq '.Allocations[].Alloc.CreateTime')
do timestampprint $ts; done | sort | head -1
2022-05-20T05:10:50-04:00
```

So none of the allocations in the state are ancient pre-0.9 allocations. Of the allocations missing their task state, they're all of various ages, but none of them have a client status of "running":

```sh
$ cat client-state.json | jq -r '.Allocations[] | select(.Tasks[].LocalState == null) |
"ID=" + .Alloc.ID[:8]
+ " CreateTime="+ (.Alloc.CreateTime|tostring)
+ " DesiredStatus=" + .Alloc.DesiredStatus
+ " ClientStatus=" + .Alloc.ClientStatus
' | sort -k2
ID=415e7f9c CreateTime=1653037850359321900 DesiredStatus=stop ClientStatus=failed
ID=d6ed2325 CreateTime=1666777833070312000 DesiredStatus=stop ClientStatus=failed
ID=7ad798aa CreateTime=1678730131962171600 DesiredStatus=run ClientStatus=pending
ID=e004690f CreateTime=1678730194928676400 DesiredStatus=run ClientStatus=pending
ID=791e9a62 CreateTime=1678730391069487600 DesiredStatus=run ClientStatus=pending
ID=88b03ee9 CreateTime=1679589255182233600 DesiredStatus=run ClientStatus=pending
# oldest pending is from last week
$ timestampprint 1678730131962171600
2023-03-13T13:55:31-04:00
```

If I look at the oldest one of the "pending" allocations, … Putting that aside, I tried restoring the …
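(An aside for anyone reproducing the analysis above: `timestampprint` is evidently a local helper rather than a Nomad or standard command. A minimal Go sketch of an equivalent, assuming the input is one of Nomad's nanosecond-precision Unix `CreateTime` values:)

```go
// timestampprint: print a nanosecond Unix timestamp as RFC3339.
// Usage: timestampprint 1678730131962171600
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	ns, err := strconv.ParseInt(os.Args[1], 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid timestamp:", err)
		os.Exit(1)
	}
	fmt.Println(time.Unix(0, ns).Format(time.RFC3339))
}
```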
Wanted to give a quick update on this. I've got a unit test that can replicate the stack trace, but I don't yet have a root cause for how we get to that situation. There are two places where the nil object that's causing the crash should be set to a non-nil value: when we initialize all the alloc runner's hooks at …
@tgross One quick observation: we have seen this on 4 different nodes now and it seems to only occur on nodes where CSI is being utilized ...
Yeah, that's pretty much what I'd expect from the test I've got here: the stack trace shows we're getting the error because the …
I've got a patch up in #16722 that fixes something that's definitely a bug around this struct, but I haven't quite figured out the sequence of events that can actually lead to a crash. I've been trying to deploy jobs with CSI volumes mocked up from the client state you sent me, but so far no joy, because I've been regularly hitting #13028 in the process. I'm noting from your logs that you don't hit #13028 though, @henrikjohansen, which is interesting. Can you share what CSI plugin you're using?
The allocrunner has a facility for passing data written by allocrunner hooks to taskrunner hooks. Currently the only consumers of this facility are the allocrunner CSI hook (which writes data) and the taskrunner volume hook (which reads that same data). The allocrunner hook for CSI volumes doesn't set the alloc hook resources atomically. Instead, it gets the current resources and then writes a new version back. Because the CSI hook is currently the only writer and all readers happen long afterwards, this should be safe but #16623 shows there's some sequence of events during restore where this breaks down. Refactor hook resources so that hook data is accessed via setters and getters that hold the mutex.
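(To make the refactor described above concrete, here is a minimal sketch of the pattern: shared hook data reached only through setters and getters that hold a mutex, with the getter never returning nil. The type and method names are illustrative, not the actual code in #16722.)

```go
package hookresources

import "sync"

// CSIMountInfo stands in for whatever per-volume data the allocrunner
// CSI hook publishes for the taskrunner volume hook to consume later.
type CSIMountInfo struct {
	Volume    string
	MountPath string
}

// AllocHookResources is shared between allocrunner hooks (the writer)
// and taskrunner hooks (the readers).
type AllocHookResources struct {
	mu        sync.RWMutex
	csiMounts map[string]CSIMountInfo
}

// SetCSIMounts replaces the CSI mount data while holding the lock, so
// readers never observe a partially written value.
func (a *AllocHookResources) SetCSIMounts(m map[string]CSIMountInfo) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.csiMounts = m
}

// GetCSIMounts returns the CSI mount data while holding the lock and
// always hands back a non-nil (possibly empty) map.
func (a *AllocHookResources) GetCSIMounts() map[string]CSIMountInfo {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if a.csiMounts == nil {
		return map[string]CSIMountInfo{}
	}
	return a.csiMounts
}
```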
@tgross The nodes where we have seen this crash all run our DNA/RNA sequencing workloads and some ML/AI stuff ... those jobs either use the CIFS SMB driver or the NFS CSI driver. The SMB driver is the most prominent one IIRC.
Hi @henrikjohansen I think I've almost figured out how we're getting into this state.
At that point, we end up at … I'm at the end of my day here and will be able to test this out tomorrow. The client state db you sent me does have allocations that are in the condition we need here. So if that hypothesis proves to be accurate, then #16722 will fix the bug by ensuring that the nil pointer that the taskrunner is hitting is always populated, even if by an empty object. This probably means that …
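(The "always populated, even if by an empty object" idea in miniature; a generic Go illustration rather than the Nomad code. Dereferencing a nil struct pointer is what panics; a zero-value struct lets the caller fail with an ordinary error instead.)

```go
package main

import "fmt"

type hookResources struct {
	mounts map[string]string
}

// lookupMount assumes res is non-nil: with a nil *hookResources, the
// field access res.mounts panics with a nil pointer dereference, which
// is the class of crash in this issue. Reading from a nil map is fine.
func lookupMount(res *hookResources, vol string) (string, error) {
	m, ok := res.mounts[vol]
	if !ok {
		return "", fmt.Errorf("no mount for volume %q", vol)
	}
	return m, nil
}

func main() {
	// Handing callers an initialized, possibly empty object turns the
	// crash into an ordinary "not found" error they can handle.
	empty := &hookResources{}
	if _, err := lookupMount(empty, "volume0"); err != nil {
		fmt.Println("graceful failure:", err)
	}
}
```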
Ok, I've managed to reproduce the crash and then tested the same scenario against the patch in #16722, and I'm confident that'll fix the problem. The bug is timing dependent between when the allocrunner goroutines are started after client state restore (ref …) and …

Then I ran a single server and a single client. I ran the following plugin, created a volume via the spec below, and then launched a job with a poststop task that consumes that volume.

plugin jobspec

```hcl
job "plugin" {
  type = "system"

  group "csi" {
    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/k8scsi/hostpathplugin:v1.2.0"
        args = [
          "--drivername=csi-hostpath",
          "--v=5",
          "--endpoint=${CSI_ENDPOINT}",
          "--nodeid=node-${NOMAD_ALLOC_INDEX}",
        ]
        privileged = true
      }

      csi_plugin {
        id        = "hostpath-plugin0"
        type      = "monolith" # "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }
  }
}
```

volume spec

```hcl
id = "volume0"
name = "volume0"
type = "csi"
plugin_id = "hostpath-plugin0"
capacity_min = "1MB"
capacity_max = "1GB"
capability {
access_mode = "multi-node-single-writer"
attachment_mode = "file-system"
}
```

volume-consuming jobspec

```hcl
job "example" {
  group "group" {
    volume "volume0" {
      type            = "csi"
      source          = "volume0"
      attachment_mode = "file-system"
      access_mode     = "multi-node-single-writer"
      read_only       = true
    }

    task "main" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }

    task "post" {
      driver = "docker"

      lifecycle {
        hook = "poststop"
      }

      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["5"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }
  }
}
```

From there, I stopped the client, ran …
With the patch from #16722 we instead fail at the time the volume hook is called, which results in the allocrunner getting cleaned up but the …
I'm going to get #16722 reviewed, and I've opened a new issue #16746 for solving the …
#16722 has been merged and will ship in the next regular patch release.
Nomad version
Nomad v1.5.2+ent
Operating system and Environment details
Ubuntu 20.04 LTS
Issue
After installing Nomad v1.5.2 over v1.5.1, the Nomad agent keeps crashing and never comes online.