Nomad v1.5.2 panics with a segmentation violation #16623
Hi @henrikjohansen! I just took a pass over this code and the nil pointer when we dereference the …
@henrikjohansen it might help me narrow down the behavior to hear about the timing of this. Is this crash happening as the client is coming back up and restoring tasks, or is it happening when placing new tasks? If you can expand the window of the client logs to include when the allocation is placed (or restored), that could be helpful.
@tgross It happens as the new client comes back up. I have attached a larger section of the log file ... client.log
That's very interesting! I'm seeing a whole lot of log lines like this:
These messages indicate that the client's state is missing expected data. There's "alloc runner state" (state for the allocation) and "task runner state" (state for the tasks within an allocation), and the only way you'd see these messages is if the task runner state was missing. Has this client been running for a very long time, such that these allocation IDs might be referring to pre-0.9 allocations?
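(For readers who aren't familiar with Nomad's client state, here is a purely illustrative Go sketch of the shape being described; the type and field names are hypothetical, not Nomad's actual types. The point is only that task runner state is stored per task name, so an allocation can be present in the state DB while one of its tasks has no stored state at all.)

```go
package main

import "fmt"

// Hypothetical shapes, for illustration only.
type taskRunnerState struct {
	LocalState  any // state the task runner persisted for its hooks
	DriverState any // the task driver's handle, if any
}

type allocRunnerState struct {
	AllocID string
	// Keyed by task name; an entry can be missing even though the
	// allocation itself was restored from the state DB.
	Tasks map[string]*taskRunnerState
}

func main() {
	ar := allocRunnerState{AllocID: "415e7f9c", Tasks: map[string]*taskRunnerState{}}
	if ts, ok := ar.Tasks["main"]; !ok || ts == nil {
		fmt.Println("no task runner state for", ar.AllocID, "(main)")
	}
}
```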
I still haven't been able to reproduce this, but another thing I'm noticing from the log file you provided is that Nomad was started (pid 2449124) and then stopped 4s later and started again (pid 2449377) about 4s after that. Was there a problem during startup that isn't showing in the logs here?
@tgross I had already removed the … I'll send that to the specified email in a few moments 👍
Ok @henrikjohansen thanks so much for that state file. I don't have a conclusion yet but I wanted to give you an update. I used the `nomad operator client-state` command, with the patch below applied so that it warns about missing task runner state:

```diff
diff --git a/command/operator_client_state.go b/command/operator_client_state.go
index ce8da4e8e..d398f9394 100644
--- a/command/operator_client_state.go
+++ b/command/operator_client_state.go
@@ -71,6 +71,16 @@ func (c *OperatorClientStateCommand) Run(args []string) int {
 		tg := alloc.Job.LookupTaskGroup(alloc.TaskGroup)
 		for _, jt := range tg.Tasks {
 			ls, rs, err := db.GetTaskRunnerState(allocID, jt.Name)
+			if ls == nil {
+				c.Ui.Warn(fmt.Sprintf("no task runner state for %s (%s)", allocID, jt.Name))
+				tasks[jt.Name] = &taskState{
+					LocalState:  ls,
+					RemoteState: rs,
+					DriverState: nil,
+				}
+				continue
+
+			}
 			if err != nil {
 				c.Ui.Error(fmt.Sprintf("failed to get task runner state %s: %v", allocID, err))
 				return 1
```

So then let's look at what's in the state file (with the task group names redacted):

```sh
# dump the state to JSON
$ nomad operator client-state . | jq . > client-state.json
no task runner state for 415e7f9c-4407-608d-27c8-f5476c07d279 (<redacted>)
no task runner state for 791e9a62-69f7-20fe-ff6f-5d3e447997b3 (<redacted>)
no task runner state for 7ad798aa-cf25-d88e-5276-1bba3a9e40fc (<redacted>)
no task runner state for 88b03ee9-0476-3946-ca4b-ba343ec73055 (<redacted>)
no task runner state for d6ed2325-ebb2-3385-f1cb-9507b4f1a205 (<redacted>)
no task runner state for e004690f-530a-38de-fc94-e7d83a6d56bc (<redacted>)
# how many allocations are there
$ cat client-state.json | jq '.Allocations | length'
24
# get the oldest allocation
$ for ts in $(cat client-state.json | jq '.Allocations[].Alloc.CreateTime')
do timestampprint $ts; done | sort | head -1
2022-05-20T05:10:50-04:00
```

So none of the allocations in the state are ancient pre-0.9 allocations. Of the allocations missing their task state, they're all of various ages, but none of them have a client status of "running":

```sh
$ cat client-state.json | jq -r '.Allocations[] | select(.Tasks[].LocalState == null) |
"ID=" + .Alloc.ID[:8]
+ " CreateTime="+ (.Alloc.CreateTime|tostring)
+ " DesiredStatus=" + .Alloc.DesiredStatus
+ " ClientStatus=" + .Alloc.ClientStatus
' | sort -k2
ID=415e7f9c CreateTime=1653037850359321900 DesiredStatus=stop ClientStatus=failed
ID=d6ed2325 CreateTime=1666777833070312000 DesiredStatus=stop ClientStatus=failed
ID=7ad798aa CreateTime=1678730131962171600 DesiredStatus=run ClientStatus=pending
ID=e004690f CreateTime=1678730194928676400 DesiredStatus=run ClientStatus=pending
ID=791e9a62 CreateTime=1678730391069487600 DesiredStatus=run ClientStatus=pending
ID=88b03ee9 CreateTime=1679589255182233600 DesiredStatus=run ClientStatus=pending
# oldest pending is from last week
$ timestampprint 1678730131962171600
2023-03-13T13:55:31-04:00
```

If I look at the oldest one of the "pending" allocations, … Putting that aside, I tried restoring the …
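(An aside for anyone reproducing the analysis above: `timestampprint` is evidently a local helper rather than a Nomad or standard command. A minimal Go sketch of an equivalent, assuming the input is one of Nomad's nanosecond-precision Unix `CreateTime` values:)

```go
// timestampprint: print a nanosecond Unix timestamp as RFC3339.
// Usage: timestampprint 1678730131962171600
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	ns, err := strconv.ParseInt(os.Args[1], 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid timestamp:", err)
		os.Exit(1)
	}
	fmt.Println(time.Unix(0, ns).Format(time.RFC3339))
}
```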
Wanted to give a quick update on this. I've got a unit test that can replicate the stack trace, but I don't yet have a root cause for how we get to that situation. There are two places where the nil object that's causing the crash should be set to a non-nil value: when we initialize all the alloc runner's hooks at …
@tgross One quick observation: we have seen this on 4 different nodes now and it seems to only occur on nodes where CSI is being utilized ...
Yeah, that's pretty much what I'd expect from the test I've got here: the stack trace shows we're getting the error because the …
I've got a patch up in #16722 that fixes something that's definitely a bug around this struct, but I haven't quite figured out the sequence of events that can actually lead to a crash. I've been trying to deploy jobs with CSI volumes mocked up from the client state you sent me, but so far no joy, because I've been regularly hitting #13028 in the process. I'm noting from your logs that you don't hit #13028 though, @henrikjohansen, which is interesting. Can you share what CSI plugin you're using?
The allocrunner has a facility for passing data written by allocrunner hooks to taskrunner hooks. Currently the only consumers of this facility are the allocrunner CSI hook (which writes data) and the taskrunner volume hook (which reads that same data). The allocrunner hook for CSI volumes doesn't set the alloc hook resources atomically. Instead, it gets the current resources and then writes a new version back. Because the CSI hook is currently the only writer and all readers happen long afterwards, this should be safe but #16623 shows there's some sequence of events during restore where this breaks down. Refactor hook resources so that hook data is accessed via setters and getters that hold the mutex.
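(To make the refactor described above concrete, here is a minimal sketch of the pattern: shared hook data reached only through setters and getters that hold a mutex, with the getter never returning nil. The type and method names are illustrative, not the actual code in #16722.)

```go
package hookresources

import "sync"

// CSIMountInfo stands in for whatever per-volume data the allocrunner
// CSI hook publishes for the taskrunner volume hook to consume later.
type CSIMountInfo struct {
	Volume    string
	MountPath string
}

// AllocHookResources is shared between allocrunner hooks (the writer)
// and taskrunner hooks (the readers).
type AllocHookResources struct {
	mu        sync.RWMutex
	csiMounts map[string]CSIMountInfo
}

// SetCSIMounts replaces the CSI mount data while holding the lock, so
// readers never observe a partially written value.
func (a *AllocHookResources) SetCSIMounts(m map[string]CSIMountInfo) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.csiMounts = m
}

// GetCSIMounts returns the CSI mount data while holding the lock and
// always hands back a non-nil (possibly empty) map.
func (a *AllocHookResources) GetCSIMounts() map[string]CSIMountInfo {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if a.csiMounts == nil {
		return map[string]CSIMountInfo{}
	}
	return a.csiMounts
}
```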
@tgross The nodes where we have seen this crash all run our DNA/RNA sequencing workloads and some ML/AI stuff ... those jobs either use the CIFS SMB driver or the NFS CSI driver. The SMB driver is the most prominent one IIRC.
Hi @henrikjohansen I think I've almost figured out how we're getting into this state.
At that point, we end up at … I'm at the end of my day here and will be able to test this out tomorrow. The client state db you sent me does have allocations that are in the condition we need here. So if that hypothesis proves to be accurate, then #16722 will fix the bug by ensuring that the nil pointer that the taskrunner is hitting is always populated, even if by an empty object. This probably means that …
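(The "always populated, even if by an empty object" idea in miniature; a generic Go illustration rather than the Nomad code. Dereferencing a nil struct pointer is what panics; a zero-value struct lets the caller fail with an ordinary error instead.)

```go
package main

import "fmt"

type hookResources struct {
	mounts map[string]string
}

// lookupMount assumes res is non-nil: with a nil *hookResources, the
// field access res.mounts panics with a nil pointer dereference, which
// is the class of crash in this issue. Reading from a nil map is fine.
func lookupMount(res *hookResources, vol string) (string, error) {
	m, ok := res.mounts[vol]
	if !ok {
		return "", fmt.Errorf("no mount for volume %q", vol)
	}
	return m, nil
}

func main() {
	// Handing callers an initialized, possibly empty object turns the
	// crash into an ordinary "not found" error they can handle.
	empty := &hookResources{}
	if _, err := lookupMount(empty, "volume0"); err != nil {
		fmt.Println("graceful failure:", err)
	}
}
```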
Ok, I've managed to reproduce the crash and then tested the same scenario against the patch in #16722, and I'm confident that'll fix the problem. The bug is timing dependent between when the allocrunner goroutines are started after client state restore (ref …) and …

Then I ran a single server and a single client. I ran the following plugin, created a volume via the spec below, and then launched a job with a poststop task that consumes that volume.

plugin jobspec

```hcl
job "plugin" {
  type = "system"

  group "csi" {
    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/k8scsi/hostpathplugin:v1.2.0"
        args = [
          "--drivername=csi-hostpath",
          "--v=5",
          "--endpoint=${CSI_ENDPOINT}",
          "--nodeid=node-${NOMAD_ALLOC_INDEX}",
        ]
        privileged = true
      }

      csi_plugin {
        id        = "hostpath-plugin0"
        type      = "monolith" # "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }
  }
}
```

volume spec

```hcl
id = "volume0"
name = "volume0"
type = "csi"
plugin_id = "hostpath-plugin0"
capacity_min = "1MB"
capacity_max = "1GB"
capability {
access_mode = "multi-node-single-writer"
attachment_mode = "file-system"
}
```

volume-consuming jobspec

```hcl
job "example" {
  group "group" {
    volume "volume0" {
      type            = "csi"
      source          = "volume0"
      attachment_mode = "file-system"
      access_mode     = "multi-node-single-writer"
      read_only       = true
    }

    task "main" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }

    task "post" {
      driver = "docker"

      lifecycle {
        hook = "poststop"
      }

      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["5"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }
  }
}
```

From there, I stopped the client, ran …
With the patch from #16722 we instead fail at the time the volume hook is called, which results in the allocrunner getting cleaned up but the …
I'm going to get #16722 reviewed, and I've opened a new issue #16746 for solving the …
#16722 has been merged and will ship in the next regular patch release.
Nomad version
Nomad v1.5.2+ent
Operating system and Environment details
Ubuntu 20.04 LTS
Issue
After installing Nomad v1.5.2 over v1.5.1, the Nomad agent keeps crashing and never comes online.