panic in 1.2.0 and 1.2.1 in scheduler for system jobs with class constraints #11563
If it helps, I've been experiencing an issue very similar to #7743. Over a year ago I experimented with the driver-host-path CSI plugin in an attempt to familiarize myself with using CSI plugins with Nomad. This did not yield useful results, so I tried to delete the plugin. That left me in a very unfortunate scenario where my servers now spew several hundred fsm error lines upon each boot, as well as when, I assume, some GC routine attempts to reap this now entirely stuck plugin job. The panic always happens after a few hundred of these have been emitted.

Additionally, downgrading back to v1.1.8 allows the servers to function once again. The fsm errors are still there, but the panic is gone.
Hi @dcarbone! Sorry to hear about your trouble. It looks like the panic bug was introduced in 41b853b, which shipped in 1.2.0. This won't just impact ARM64; you were simply the unlucky first reporter because your cluster has node classes to filter on. We'll get a patch up ASAP.
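For context, the general Go failure mode behind this class of panic is writing to a map that was declared but never initialized. The sketch below is hypothetical; the names are illustrative and are not Nomad's actual scheduler identifiers:

```go
package main

import "fmt"

// countEligibleByClass tallies nodes per node class. This is a toy
// illustration of the bug class described above, not Nomad's real code.
func countEligibleByClass(nodeClasses []string) map[string]int {
	// Bug variant: `var eligible map[string]int` leaves the map nil, and
	// the first write below would panic with
	// "assignment to entry in nil map".
	// The fix is to allocate the map before writing to it:
	eligible := make(map[string]int)
	for _, class := range nodeClasses {
		eligible[class]++ // safe only because the map was initialized above
	}
	return eligible
}

func main() {
	fmt.Println(countEligibleByClass([]string{"foo", "foo", ""}))
}
```

Reads from a nil map are legal in Go (they return the zero value), which is why this kind of bug can hide until the first code path that actually writes.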
Ok, I was able to reproduce this on Nomad 1.2.0 in the following circumstances:
If the system job is rejected for all nodes or accepted for all nodes, we don't hit this code path, which probably explains why testing unfortunately didn't catch it. (One more reason to resurrect the prop testing PR #8832.) There's another map right after this point in the code that can probably be hit as well, so patching just this bug would undoubtedly reveal another panic there, so I'll fix them both.

To reproduce on a Vagrant box, run two Nomad processes. One server + client config without a node class:

```hcl
log_level  = "debug"
data_dir   = "/var/nomad/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled          = true
  bootstrap_expect = 1
  raft_protocol    = 3
}

client {
  enabled = true
  # node_class = # not enabled!
}
```

And one client with a node class:

```hcl
log_level  = "debug"
data_dir   = "/var/nomad-client01/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled = false
}

client {
  enabled    = true
  node_class = "foo"
  servers    = ["10.0.2.15:4647"]
}

ports {
  http = 5646
  rpc  = 5647
  serf = 5648
}
```

Then run the following jobspec:

```hcl
job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "web" {
    constraint {
      attribute = "${node.class}"
      value     = "fuzz"
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
      }
    }
  }
}
```
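To make the trigger condition above concrete: per the earlier comment, the bad code path only runs when the class filter splits the node set, i.e. some nodes pass the constraint and some don't. A hypothetical sketch (not Nomad's actual scheduler code):

```go
package main

import "fmt"

// partitionByClass splits nodes into those whose class matches the job's
// constraint and those filtered out. The panic-prone path described above
// is only reached when the result is mixed: both slices non-empty.
func partitionByClass(nodeClasses []string, required string) (accepted, rejected []string) {
	for _, class := range nodeClasses {
		if class == required {
			accepted = append(accepted, class)
		} else {
			rejected = append(rejected, class)
		}
	}
	return accepted, rejected
}

func main() {
	// One client with node_class = "foo", one with node_class = "bar",
	// and a job constrained to class "foo": a mixed accept/reject result.
	acc, rej := partitionByClass([]string{"foo", "bar"}, "foo")
	fmt.Printf("accepted=%d rejected=%d mixed=%v\n",
		len(acc), len(rej), len(acc) > 0 && len(rej) > 0)
}
```

This also suggests why a unit test that runs the job against an all-matching or all-rejecting node set would never exercise the buggy branch.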
Looks like we're on track to get this fixed a bit later today. Thanks again for the report, @dcarbone
awesome, thanks for the lightning fast fix!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.2.1 (719c53ac0ebee95d902faafe59a30422a091bc31)
Operating system and Environment details
Linux 5.11.0-1022-raspi #24-Ubuntu aarch64
Issue
Server nodes continuously panic on boot after a time
Reproduction steps
Unsure exactly; I've been experiencing random instability since upgrading to v1.2.1, and now we're here. The server never gets beyond the boot stage.
Nomad Server logs (if appropriate)