Runaway Nomad process after Nomad client reboot #5984
Comments
Martin, there are a couple of fixes around client rebooting that will be included in 0.9.4, and we're planning to push a release candidate today. The issue where allocations come back as lost may be related to #5890, and you may also be encountering the GC failure addressed in #5905. Please try the RC when it's available! What does your client configuration look like, especially the […]
Hi Lang, OK, I'll wait for the 0.9.4 RC, try it out, and report back. As for my node config, all I have in the […]

I've gone with the default GC values, as I don't usually tweak knobs I don't fully understand. Should I have overridden some of the defaults?
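For reference, my understanding of the GC-related knobs in the `client` stanza, with what I believe are the defaults (worth double-checking against the client configuration docs for your Nomad version):

```hcl
client {
  enabled = true

  # GC tuning knobs -- values shown are the documented defaults as I
  # understand them; I have not overridden any of these.
  gc_interval              = "1m" # how often the client runs alloc GC
  gc_max_allocs            = 50   # max allocs kept before GC kicks in
  gc_parallel_destroys     = 2    # allocs destroyed concurrently
  gc_disk_usage_threshold  = 80   # percent disk usage that triggers GC
  gc_inode_usage_threshold = 70   # percent inode usage that triggers GC
}
```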
@notnoop I installed 0.9.4-rc1 on a single one of my clients and then restarted that client without first draining it. Upon reboot the client started many allocations until memory (2 GB RAM + 2 GB swap) ran out. I managed to stop the process; I'll have to try another reboot once allocs build up in the data_dir again.
@notnoop Some more details... I let allocs build up overnight on the 0.9.4-rc1 client. When I checked this morning, the […] I then proceeded to […] One of the first Nomad errors I saw when reviewing the systemd journal was this one: […]

I saw 50 of these errors (one for each alloc in the data_dir). Two lines below, I saw this: […]

I don't know where that high number (3172) comes from. I then have 796 events in this style: […]

Mixed in with 625 lines in this style: […]

Then it's 273 lines with: […]

Which finally seems to all come crashing down with: […]

and it continues for a while.
@notnoop Am I the only one to have reported such an issue?
@radcool Thank you so much for the detailed messages and debugging; that is very helpful. I'm sorry I got sidetracked with other work. This is one of our high-priority issues to address in 0.10, and I'll investigate it and follow up with questions as they come.
No worries @notnoop. I thought perhaps this was an issue on my side, not seen anywhere else. Thanks.
This fixes a bug where allocs that have been GCed get re-run after the client is restarted; a heavily used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner destroyed by GC remains in the client's alloc runner set, and it periodically gets persisted until the alloc is GCed by the server. During that time, the client state DB contains the alloc but neither its individual task statuses nor its completed state, so on restart the client assumes the alloc is in pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix; we should consider revamping client state management, since storing alloc and task information non-transactionally and non-atomically, concurrently while the alloc runner is running and potentially changing state, is a recipe for bugs. Fixes #5984. Related to #5890.
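To make the mechanism concrete, here is a toy Go sketch, with invented types and names rather than Nomad's actual source, of the stale-persist problem described above and the guard the fix adds:

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative only -- not Nomad's real types. The point is the shape of
// the fix: once a runner has been destroyed by GC, periodic persistence
// becomes a no-op, so the state DB can never be repopulated with an
// alloc the server has already GCed.
type allocRunner struct {
	mu        sync.Mutex
	destroyed bool
	allocID   string
	stateDB   map[string]string // stand-in for the client's state DB
}

// Destroy marks the runner as GCed and removes its persisted state.
func (ar *allocRunner) Destroy() {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	ar.destroyed = true
	delete(ar.stateDB, ar.allocID)
}

// persistState runs periodically while the runner lives. Before the fix,
// it could run after Destroy and write the alloc back without its task
// or completed state -- which a restarting client treats as "pending".
func (ar *allocRunner) persistState(snapshot string) {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	if ar.destroyed {
		return // the fix: never persist a destroyed runner
	}
	ar.stateDB[ar.allocID] = snapshot
}

func main() {
	db := map[string]string{}
	ar := &allocRunner{allocID: "alloc-1", stateDB: db}
	ar.persistState("running")
	ar.Destroy()
	ar.persistState("stale write after GC") // dropped, not persisted
	fmt.Println(db)                         // map[] -- nothing to re-run on restart
}
```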
@notnoop I compiled a new Nomad binary from #6207 as you requested: […] installed it on a Nomad client, and rebooted that client. Unfortunately, once it came back up, both vCPUs shot up to 100% and memory rapidly started filling up. Perhaps I did something wrong, but whatever I did, I got the same behavior as before. It's a bit late here, but tomorrow I'll sift through the journal logs and report back.
@notnoop I did another reboot of the client running 0.10-dev this morning, and the runaway behavior did not occur this time. Upon re-reading the following text of #6207: "Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc" I guess this is actually expected. Can you please confirm that going from 0.9.x to v0.10.0-dev will not immediately fix the issue after a process restart, since 0.10.0 will still try to process the allocs from the 0.9.x-created state DB, and that only once 0.10.0-dev is the running version will the issue no longer occur at subsequent reboots, since the allocs will no longer be persisted to the state DB under 0.10.0?
@radcool This reading is correct. The change doesn't recover from the already "corrupt" persisted state left by the earlier buggy client process; it ensures that state is stored correctly for future client restarts. I'll research options for recovering without locking up or slowing restores unnecessarily.
@notnoop Thanks for your help, by the way!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Hello,
I've been experiencing an issue with Nomad and I'm not sure if it's a bug, or if I'm just abusing Nomad.
I have a Nomad cluster (0.9.3 everywhere, on CentOS 7) with 3 servers and 6 clients (3 running exec and 3 running raw_exec). All are VMs with 2 CPUs and 2 GB of memory. That might not seem like a lot (and maybe that's my problem), but I'm not running CPU-intensive workloads, so I assumed this would be OK.
I run a few micro web services, a system task (Fabio), and some periodic batch jobs, including a few parameterized periodic batch jobs that run every minute using `raw_exec`. So although I don't consider the clients taxed by any means, the list of completed jobs reported by `nomad status` is quite high.
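For concreteness, one of these jobs looks roughly like the sketch below (the job name, schedule, and command are made up, and I've left out the parameterized part):

```hcl
job "minutely-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  # Launch every minute; skip a launch if the previous one is still running.
  periodic {
    cron             = "* * * * *"
    prohibit_overlap = true
  }

  group "runner" {
    task "work" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/do-work.sh"
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```

Each minutely launch leaves behind a completed allocation, which is why the completed list grows so quickly.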
The problem is that if I restart a client node with a simple `reboot` (no draining beforehand), when it boots up again Nomad starts working like crazy, bringing both CPUs to 100% and eventually running out of memory, essentially taking that client out of commission. The only way I've then found to recover the client is to forcefully reset the VM, stop Nomad as soon as the VM has booted (and before Nomad goes out of control again), delete the contents of the `data_dir`, and restart Nomad again, thereby creating a new Nomad node.
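Spelled out, that recovery dance is roughly the following (the systemd unit name and `data_dir` path are assumptions from my setup; adjust to match yours):

```shell
# After forcefully resetting the VM, race to stop Nomad before it melts down
sudo systemctl stop nomad

# Wipe the client state so it comes back as a brand-new node
sudo rm -rf /opt/nomad/data/*

sudo systemctl start nomad
```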
Something weird I've also noticed is that before the reboot, `nomad status <job_id>` shows all periodic jobs as having one allocation with status `complete`, but after the reboot I sometimes spot jobs that now have two allocations: one with status `complete` and one with status `lost`, even though I'm pretty sure the job originally had only a single allocation with status `complete`.
So, have I run into a bug, hit a Nomad operational limit (abusing it), or am I simply doing things wrong and shooting myself in the foot?
Thanks,
-Martin