
Nomad v1.5.2 panics with a segmentation violation #16623

Closed
henrikjohansen opened this issue Mar 23, 2023 · 15 comments · Fixed by #16722

@henrikjohansen

Nomad version

Nomad v1.5.2+ent

Operating system and Environment details

Ubuntu 20.04 LTS

Issue

After installing Nomad v1.5.2 over v1.5.1, the Nomad agent keeps crashing and never comes online.

Mar 23 16:46:31 prappai03 nomad[14294]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 23 16:46:31 prappai03 nomad[14294]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x18d8e37]
Mar 23 16:46:31 prappai03 nomad[14294]: goroutine 625 [running]:
Mar 23 16:46:31 prappai03 nomad[14294]: github.com/hashicorp/nomad/client/structs.(*AllocHookResources).GetCSIMounts(0x0?)
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/structs/allochook.go:18 +0x37
Mar 23 16:46:31 prappai03 nomad[14294]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).prepareCSIVolumes(0xc001807560, 0xc000d719b0?, >
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:169 +0x1ef
Mar 23 16:46:31 prappai03 nomad[14294]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).Prestart(0xc001807560, {0x3165e73?, 0xc00180756>
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:219 +0x38e
Mar 23 16:46:31 prappai03 nomad[14294]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).prestart(0xc001796c00)
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:262 +0x7c5
Mar 23 16:46:31 prappai03 nomad[14294]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc001796c00)
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:588 +0x706
Mar 23 16:46:31 prappai03 nomad[14294]: created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
Mar 23 16:46:31 prappai03 nomad[14294]:         github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:395 +0x6c
Mar 23 16:46:31 prappai03 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 23 16:46:31 prappai03 systemd[1]: nomad.service: Failed with result 'exit-code'.
Mar 23 16:46:33 prappai03 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 10.
Mar 23 16:46:33 prappai03 systemd[1]: Stopped Nomad.
Mar 23 16:46:49 prappai03 nomad[14738]:  client.gc: marking allocation for GC: alloc_id=40c1510b-a518-f624-2729-20bd465c5625
Mar 23 16:46:49 prappai03 nomad[14738]:  client.gc: marking allocation for GC: alloc_id=72baf548-6ada-771f-8af9-ba054e3fcea4
Mar 23 16:46:49 prappai03 nomad[14738]:  client.gc: marking allocation for GC: alloc_id=d214b01a-8601-7f4e-0625-915f0474e8f8
Mar 23 16:46:49 prappai03 nomad[14738]:  client.alloc_runner.task_runner.task_hook: failed to reattach to logmon process: alloc_id=56e263a5-a647-cdac-21b3-bb3>
Mar 23 16:46:49 prappai03 nomad[14738]:  client.gc: marking allocation for GC: alloc_id=56e263a5-a647-cdac-21b3-bb32f07fc520
Mar 23 16:46:49 prappai03 nomad[14738]:  client.gc: garbage collecting allocation: alloc_id=56e263a5-a647-cdac-21b3-bb32f07fc520 reason="forced collection"
Mar 23 16:46:49 prappai03 nomad[14738]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 23 16:46:49 prappai03 nomad[14738]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x18c4577]
Mar 23 16:46:49 prappai03 nomad[14738]: goroutine 711 [running]:
Mar 23 16:46:49 prappai03 nomad[14738]: github.com/hashicorp/nomad/client/structs.(*AllocHookResources).GetCSIMounts(0x0?)
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/structs/allochook.go:18 +0x37
Mar 23 16:46:49 prappai03 nomad[14738]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).prepareCSIVolumes(0xc001b53830, 0xc000ab39b0?, >
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:169 +0x1ef
Mar 23 16:46:49 prappai03 nomad[14738]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).Prestart(0xc001b53830, {0x3144b35?, 0xc001b5383>
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:219 +0x38e
Mar 23 16:46:49 prappai03 nomad[14738]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).prestart(0xc001b2f800)
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:262 +0x7c5
Mar 23 16:46:49 prappai03 nomad[14738]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc001b2f800)
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:588 +0x706
Mar 23 16:46:49 prappai03 nomad[14738]: created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
Mar 23 16:46:49 prappai03 nomad[14738]:         github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:395 +0x6c
Mar 23 16:46:49 prappai03 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 23 16:46:49 prappai03 systemd[1]: nomad.service: Failed with result 'exit-code'.
Mar 23 16:46:51 prappai03 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.
Mar 23 16:46:51 prappai03 systemd[1]: Stopped Nomad.

@tgross
Member

tgross commented Mar 23, 2023

Hi @henrikjohansen! I just took a pass over this code; the nil pointer occurs when we dereference the AllocHookResources struct at allochook.go#L18. This struct should be created the very first time we initialize the hooks in alloc_runner_hooks.go#L122, so something very funky is going on here. We'll dig into this and get back to you ASAP.
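To see why a nil AllocHookResources crashes inside the getter rather than at the call site, here is a minimal Go sketch (hypothetical types, not Nomad's actual code): calling a method on a nil pointer is legal in Go, and the panic only fires at the first field access inside the method, which matches the stack trace pointing into allochook.go.

```go
package main

import (
	"fmt"
	"sync"
)

// hookResources mimics (but is not) Nomad's AllocHookResources:
// a mutex-guarded map written by one hook and read by another.
type hookResources struct {
	mu        sync.RWMutex
	csiMounts map[string]string
}

// getCSIMounts dereferences the receiver to lock the mutex, so a
// nil receiver panics inside the method, as in the stack trace above.
func (r *hookResources) getCSIMounts() map[string]string {
	r.mu.RLock() // nil pointer dereference happens here when r == nil
	defer r.mu.RUnlock()
	return r.csiMounts
}

func main() {
	var r *hookResources // never initialized, as in the crash
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("recovered:", err)
		}
	}()
	r.getCSIMounts() // panics: invalid memory address or nil pointer dereference
}
```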

@tgross
Member

tgross commented Mar 23, 2023

@henrikjohansen it might help me narrow down the behavior to hear about the timing of this. Is this crash happening as the client is coming back up and restoring tasks, or is it happening when placing new tasks? If you can expand the window of the client logs to include when the allocation is placed (or restored), that could be helpful.

@tgross tgross self-assigned this Mar 23, 2023
@henrikjohansen
Author

henrikjohansen commented Mar 23, 2023

@tgross It happens as the new client comes back up.

I have attached a larger section of the log file ... client.log

@tgross
Member

tgross commented Mar 23, 2023

That's very interesting! I'm seeing a whole lot of log lines like this:

client: found an alloc without any local state, skipping restore: alloc_id=f6f6270d-4c7e-620f-2f8a-3500ab3e9cef

These messages indicate that the client's state is missing expected data. There's "alloc runner state" (state for the allocation) and "task runner state" (state for the tasks within an allocation), and the only way you'd see these messages is if the task runner state was missing. Has this client been running for a very long time, such that these allocation IDs might be referring to pre-0.9 allocations?

  • If not, the client state store is most likely corrupted. You can find that at $datadir/client/state.db. If you'd be willing to zip it up and email it to [email protected] we'd be hugely grateful (that's a write-only list that only Nomad Engineering has access to). Once you've done that, you can most likely delete the client's data directory and everything will be back to normal.
  • If those allocs do likely belong to a really old job, then that's probably not the underlying problem here (although I'd still be interested in looking at that state db anyways, because that'd be interesting for us to look at for purposes of migrating those old schemas!).

@tgross
Member

tgross commented Mar 24, 2023

I still haven't been able to reproduce this, but another thing I'm noticing from the log file you provided is that Nomad was started (pid 2449124) and then stopped 4s later and started again (pid 2449377) about 4s after that. Was there a problem during startup that isn't showing in the logs here?

@henrikjohansen
Author

@tgross I had already removed the client/ and alloc/ folders in order to get the node back online - however we have another node exhibiting the exact same issue. The fix was the same but this time I made a copy of the client/state.db file.

I'll send that to the specified email in a few moments 👍

@tgross
Member

tgross commented Mar 24, 2023

Ok @henrikjohansen, thanks so much for that state file. I don't have a conclusion yet but I wanted to give you an update. I used the nomad operator client-state command (new in 1.5.0) to take a look at what's in there. But first I ran into a bug in the command: if the task runner state is missing, the command blows up. This patch fixed that:

diff --git a/command/operator_client_state.go b/command/operator_client_state.go
index ce8da4e8e..d398f9394 100644
--- a/command/operator_client_state.go
+++ b/command/operator_client_state.go
@@ -71,6 +71,16 @@ func (c *OperatorClientStateCommand) Run(args []string) int {
                tg := alloc.Job.LookupTaskGroup(alloc.TaskGroup)
                for _, jt := range tg.Tasks {
                        ls, rs, err := db.GetTaskRunnerState(allocID, jt.Name)
+                       if ls == nil {
+                               c.Ui.Warn(fmt.Sprintf("no task runner state for %s (%s)", allocID, jt.Name))
+                               tasks[jt.Name] = &taskState{
+                                       LocalState:  ls,
+                                       RemoteState: rs,
+                                       DriverState: nil,
+                               }
+                               continue
+
+                       }
                        if err != nil {
                                c.Ui.Error(fmt.Sprintf("failed to get task runner state %s: %v", allocID, err))
                                return 1

So then let's look at what's in the state file (with the task group names redacted):

# dump the state to JSON
$ nomad operator client-state . | jq . > client-state.json
no task runner state for 415e7f9c-4407-608d-27c8-f5476c07d279 (<redacted>)
no task runner state for 791e9a62-69f7-20fe-ff6f-5d3e447997b3 (<redacted>)
no task runner state for 7ad798aa-cf25-d88e-5276-1bba3a9e40fc (<redacted>)
no task runner state for 88b03ee9-0476-3946-ca4b-ba343ec73055 (<redacted>)
no task runner state for d6ed2325-ebb2-3385-f1cb-9507b4f1a205 (<redacted>)
no task runner state for e004690f-530a-38de-fc94-e7d83a6d56bc (<redacted>)

# how many allocations are there
$ cat client-state.json | jq '.Allocations | length'
24

# get the oldest allocation
$ for ts in $(cat client-state.json | jq '.Allocations[].Alloc.CreateTime')
    do timestampprint $ts; done | sort | head -1
2022-05-20T05:10:50-04:00

So none of the allocations in the state are ancient pre-0.9 allocations. The allocations missing their task state are of various ages, but none of them has a client status of "running":

$ cat client-state.json | jq -r '.Allocations[] | select(.Tasks[].LocalState == null) |
    "ID=" + .Alloc.ID[:8]
    + " CreateTime="+ (.Alloc.CreateTime|tostring)
    + " DesiredStatus=" + .Alloc.DesiredStatus
    + " ClientStatus=" + .Alloc.ClientStatus
    ' | sort -k2

ID=415e7f9c CreateTime=1653037850359321900 DesiredStatus=stop ClientStatus=failed
ID=d6ed2325 CreateTime=1666777833070312000 DesiredStatus=stop ClientStatus=failed
ID=7ad798aa CreateTime=1678730131962171600 DesiredStatus=run ClientStatus=pending
ID=e004690f CreateTime=1678730194928676400 DesiredStatus=run ClientStatus=pending
ID=791e9a62 CreateTime=1678730391069487600 DesiredStatus=run ClientStatus=pending
ID=88b03ee9 CreateTime=1679589255182233600 DesiredStatus=run ClientStatus=pending

# oldest pending is from last week
$ timestampprint 1678730131962171600
2023-03-13T13:55:31-04:00
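timestampprint appears to be a local helper script; assuming CreateTime is stored as Unix nanoseconds (which is how Nomad serializes it), a rough Go equivalent would be:

```go
package main

import (
	"fmt"
	"time"
)

// nanoToRFC3339 converts an allocation CreateTime (Unix nanoseconds,
// as seen in the jq output above) to a readable RFC 3339 timestamp.
// It prints UTC, whereas the session above printed the -04:00 local
// offset; both represent the same instant.
func nanoToRFC3339(ns int64) string {
	return time.Unix(0, ns).UTC().Format(time.RFC3339)
}

func main() {
	// The "oldest pending" CreateTime from the session above.
	fmt.Println(nanoToRFC3339(1678730131962171600)) // 2023-03-13T17:55:31Z
}
```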

If I look at the oldest one of the "pending" allocations, 7ad798aa, I see that it has no task state, which implies that its tasks never got started. Perhaps the Nomad client was shut down while the allocation was being spun up. This is frankly a little weird -- we would not expect pending allocations that never got a running task to be preserved in the client store past a single restart of the client agent. It looks like there's some old backwards-compatibility code in place that's preventing us from cleaning it up. I've got a draft PR up to fix this state weirdness: #16638

Putting that aside, I tried restoring the state.db to a new client (knowing that no allocation would be successfully restored), and the client wiped all the failed restores as I'd expect but left the 6 bad allocations in the state store. However, the restore process did not crash because of those 6 bad allocations, which I think means we can conclusively eliminate that condition as a source of the problem. I'm still digging into that.

@tgross tgross added this to the 1.5.x milestone Mar 24, 2023
@tgross
Member

tgross commented Mar 28, 2023

Wanted to give a quick update on this. I've got a unit test that can replicate the stack trace, but I don't yet have a root cause for how we get to that situation.

There are two places where the nil object that's causing the crash should be set to a non-nil value: when we initialize all the alloc runner's hooks at alloc_runner_hooks.go#L122 and then again at the end of setting up a CSI volume in csi_hook.go#L111-L113. By the time we get to the CSI hook, the value should already be non-nil, and we'd be crashing there if it weren't. The CSI hook will have run to completion synchronously long before we run either of the volume hooks where we're crashing. There's a mutex that looks wrong to me around the setter, so that's the next avenue of investigation, which I'll be picking up tomorrow.
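The suspect pattern -- a read-modify-write that drops the lock in the middle -- can be sketched next to the safer shape. This is hypothetical code illustrating the hazard, not Nomad's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// resourceHolder sketches the suspect pattern, not Nomad's real code:
// a mutex-guarded map that one hook writes and another reads.
type resourceHolder struct {
	mu        sync.Mutex
	csiMounts map[string]string
}

// racySet reads the current map, mutates it, and writes it back without
// holding the lock across the whole read-modify-write, so two concurrent
// writers can interleave and one update can be lost.
func (h *resourceHolder) racySet(k, v string) {
	h.mu.Lock()
	m := h.csiMounts
	h.mu.Unlock()
	if m == nil {
		m = map[string]string{}
	}
	m[k] = v // mutation happens outside the lock
	h.mu.Lock()
	h.csiMounts = m
	h.mu.Unlock()
}

// safeSet holds the mutex for the entire operation: the shape of
// "access only via locked setters and getters".
func (h *resourceHolder) safeSet(k, v string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.csiMounts == nil {
		h.csiMounts = map[string]string{}
	}
	h.csiMounts[k] = v
}

func main() {
	h := &resourceHolder{}
	h.safeSet("volume0", "/csi/volume0")
	fmt.Println(h.csiMounts["volume0"])
}
```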

@henrikjohansen
Author

@tgross One quick observation - we have seen this on 4 different nodes now and it seems to only occur on nodes where CSI is being utilized ...

@tgross
Member

tgross commented Mar 29, 2023

@tgross One quick observation - we have seen this on 4 different nodes now and it seems to only occur on nodes where CSI is being utilized ...

Yeah, that's pretty much what I'd expect from the test I've got here: the stack trace shows we're getting the error because the AllocHookResources is nil, but the only feature that uses that resource is CSI. I feel like I'm getting pretty close here though; I'll report back here with an update soon.

@tgross
Member

tgross commented Mar 29, 2023

I've got a patch up in #16722 that fixes something that's definitely a bug around this struct, but I haven't quite figured out the sequence of events that can actually lead to a crash. I've been trying to deploy jobs with CSI volumes mocked up from the client state I got from your client state but so far no joy, because I've been regularly hitting #13028 in the process. I'm noting from your logs that you don't hit #13028 though, @henrikjohansen, which is interesting. Can you share what CSI plugin you're using?

tgross added a commit that referenced this issue Mar 29, 2023
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).

The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.

Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
@henrikjohansen
Author

@tgross The nodes where we have seen this crash all run our DNA/RNA sequencing workloads and some ML/AI stuff ... those jobs either use the CIFS SMB driver or the NFS CSI driver. The SMB driver is the most prominent one IIRC.

@tgross
Member

tgross commented Mar 30, 2023

Hi @henrikjohansen I think I've almost figured out how we're getting into this state.

  • An allocation has a poststop task
  • That poststop task wants a CSI volume
  • The allocation's main task has started, but not the poststop yet.
  • The client stops
  • The server marks the allocation lost or the allocation is otherwise marked server-terminal (but not yet client-terminal)
  • The client restarts and gets an update from the server indicating the alloc is terminal before it completes restore (this is one fuzzy timing bit I haven't quite pinned down yet).

At that point, we end up at alloc_runner.go#L336-L353. The shouldRun call returns false because we're server-terminal (ref alloc_runner.go#L367-L377). We then run the poststop task, which is operating on data that was never written by the allocrunner's CSI hook, because that hook didn't run at all.

I'm at the end of my day here and will be able to test this out tomorrow. The client state db you sent me does have allocations that are in the condition we need here. So if that hypothesis proves to be accurate, then #16722 will fix the bug by ensuring that the nil pointer that the taskrunner is hitting is always populated, even if by an empty object.

This probably means that poststop tasks that need CSI volumes are going to end up failing to run for lost allocations if the client happens to come back. That's not ideal, but if that proves to be the case I'll probably split it out to a separate issue so we can triage it separately from this crashing bug.
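The "always populated, even if by an empty object" fix is a common defensive pattern in Go. A hedged sketch of what that shape might look like -- an assumption about #16722, not its actual diff -- turns the nil-pointer crash into an ordinary lookup failure:

```go
package main

import (
	"fmt"
	"sync"
)

// allocResources sketches the post-fix shape (an assumption, not the
// real Nomad code): the struct is always constructed non-nil, and a
// missing volume is a lookup error, not a nil-pointer crash.
type allocResources struct {
	mu        sync.RWMutex
	csiMounts map[string]string
}

// newAllocResources guarantees callers always hold a usable,
// non-nil value, even if the CSI hook never ran to populate it.
func newAllocResources() *allocResources {
	return &allocResources{csiMounts: map[string]string{}}
}

func (r *allocResources) getCSIMount(vol string) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	m, ok := r.csiMounts[vol]
	if !ok {
		return "", fmt.Errorf("No CSI Mount Point found for volume: %s", vol)
	}
	return m, nil
}

func main() {
	r := newAllocResources() // never nil, even when no hook wrote to it
	if _, err := r.getCSIMount("volume0"); err != nil {
		fmt.Println("prestart hook would fail with:", err)
	}
}
```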

@tgross
Member

tgross commented Mar 31, 2023

Ok, I've managed to reproduce the crash and then tested the same scenario against the patch in #16722 and I'm confident that'll fix the problem.

The bug is timing dependent between when the allocrunner goroutines are started after client state restore (ref client.go#L1286) and the client's goroutine that watches for allocation updates from the server (ref client.go#L1849). You'll never hit it in a zillion years if you have a small client with only one allocation to worry about. So to make it a reliable interleaving I introduced a short pause at alloc_runner.go#L335.
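The trick of injecting a pause so a rare interleaving becomes deterministic can be shown in miniature. This is a sketch of the technique, not Nomad's test harness, and the channel stands in for the pause that was added at alloc_runner.go#L335:

```go
package main

import (
	"fmt"
	"sync"
)

// forceOrdering blocks the "restore" goroutine on a channel until the
// "server update" goroutine has fired, so the problematic interleaving
// happens on every run instead of once in a zillion.
func forceOrdering() []string {
	updated := make(chan struct{})
	var (
		mu    sync.Mutex
		order []string
		wg    sync.WaitGroup
	)
	record := func(step string) {
		mu.Lock()
		defer mu.Unlock()
		order = append(order, step)
	}

	wg.Add(2)
	go func() { // stands in for the watcher applying a server update
		defer wg.Done()
		record("update")
		close(updated) // release the paused restore path
	}()
	go func() { // stands in for the allocrunner resuming after restore
		defer wg.Done()
		<-updated // the injected pause: restore always loses the race
		record("restore")
	}()
	wg.Wait()
	return order
}

func main() {
	fmt.Println(forceOrdering()) // [update restore]
}
```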

Then I ran a single server and a single client. I ran the following plugin, created a volume via the spec below, and then launched a job with a poststop task that consumes that volume.

plugin jobspec
job "plugin" {
  type        = "system"

  group "csi" {

    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/k8scsi/hostpathplugin:v1.2.0"

        args = [
          "--drivername=csi-hostpath",
          "--v=5",
          "--endpoint=${CSI_ENDPOINT}",
          "--nodeid=node-${NOMAD_ALLOC_INDEX}",
        ]

        privileged = true
      }

      csi_plugin {
        id        = "hostpath-plugin0"
        type      = "monolith" # "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }
  }
}
volume spec
id        = "volume0"
name      = "volume0"
type      = "csi"
plugin_id = "hostpath-plugin0"

capacity_min = "1MB"
capacity_max = "1GB"

capability {
  access_mode     = "multi-node-single-writer"
  attachment_mode = "file-system"
}
volume-consuming jobspec
job "example" {

  group "group" {

    volume "volume0" {
      type            = "csi"
      source          = "volume0"
      attachment_mode = "file-system"
      access_mode     = "multi-node-single-writer"
      read_only       = true
    }

    task "main" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }

    task "post" {
      driver = "docker"

      lifecycle {
        hook = "poststop"
      }

      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["5"]
      }

      volume_mount {
        volume      = "volume0"
        destination = "${NOMAD_TASK_DIR}/volume0"
      }
    }

  }
}

From there, I stopped the client, ran nomad job stop example, and restarted the client. A few moments later, the client crashed just as reported (the line number in alloc_runner.go is off by two because of the two lines I added to pause):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x13a3db7]

goroutine 407 [running]:
github.com/hashicorp/nomad/client/structs.(*AllocHookResources).GetCSIMounts(0x0?)
        github.com/hashicorp/nomad/client/structs/allochook.go:18 +0x37
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).prepareCSIVolumes(0xc0005e9020, 0xc0001619b0?, 0xc000772b40)
        github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:169 +0x1ef
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*volumeHook).Prestart(0xc0005e9020, {0x2a1b37b?, 0xc0005e9020?}, 0xc0006f8960, 0xc0006ad630?)
        github.com/hashicorp/nomad/client/allocrunner/taskrunner/volume_hook.go:219 +0x38e
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).prestart(0xc001136600)
        github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:262 +0x7c5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc001136600)
        github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:588 +0x706
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
        github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:397 +0x6c

With the patch from #16722 we instead fail at the time the volume hook is called: the allocrunner gets cleaned up, but the poststop task fails:

    2023-03-31T14:11:46.900-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c2191ec3-2455-744a-a163-66cab3e558f7 task=post type="Task hook failed" msg="volumes: No CSI Mount Point found for volume: volume0" failed=false
    2023-03-31T14:11:46.900-0400 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=c2191ec3-2455-744a-a163-66cab3e558f7 task=post error="prestart hook \"volumes\" failed: No CSI Mount Point found for volume: volume0"
    2023-03-31T14:11:46.900-0400 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=c2191ec3-2455-744a-a163-66cab3e558f7 task=post reason="Error was unrecoverable"
    2023-03-31T14:11:46.900-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c2191ec3-2455-744a-a163-66cab3e558f7 task=post type="Not Restarting" msg="Error was unrecoverable" failed=true

I'm going to get #16722 reviewed, and I've opened a new issue #16746 for solving the poststop volumes issue.

tgross added a commit that referenced this issue Apr 3, 2023
@tgross
Member

tgross commented Apr 3, 2023

#16722 has been merged and will ship in the next regular patch release.

tgross added a commit that referenced this issue Apr 3, 2023
tgross added a commit that referenced this issue Apr 3, 2023