Runaway nomad process after Nomad client reboot #5984

Closed
radcool opened this issue Jul 19, 2019 · 13 comments · Fixed by #6207 or #6216
Comments

@radcool

radcool commented Jul 19, 2019

Hello,

I've been experiencing an issue with Nomad and I'm not sure if it's a bug, or if I'm just abusing Nomad.

I have a Nomad cluster (0.9.3 everywhere, on CentOS 7) with 3 servers and 6 clients (3 running exec and 3 running raw_exec). All are VMs with 2 CPUs and 2 GB of memory. That might not seem like a lot (and maybe that's my problem), but I'm not running CPU-intensive workloads, so I assumed this would be OK.

I run a few small web services, a system task (Fabio), and some periodic batch jobs, including a few parameterized periodic batch jobs that run every minute using raw_exec.

So although I don't consider the clients taxed by any means, the number of completed jobs reported by nomad status is quite high:

[root@nomad01 ~]# nomad status | grep dead | wc
   2695   13475  369215

The problem is that if I restart a client node with a simple reboot (no draining beforehand), Nomad starts working like crazy when the node boots up again, driving both CPUs to 100% and eventually running out of memory, essentially taking that client out of commission. The only way I've found to recover the client is to forcefully reset the VM, stop Nomad as soon as the VM is booted (before Nomad goes out of control again), delete the contents of the data_dir, and restart Nomad, thereby creating a new Nomad node.
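Roughly, that recovery looks like this (just a sketch assuming a systemd unit named nomad and my data_dir of /var/db/nomad; adjust paths and unit name to your own setup):

systemctl stop nomad      # stop the agent before it spins out of control again
rm -rf /var/db/nomad/*    # wipe the client state; the node comes back with a new node ID
systemctl start nomad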

Something weird I've also noticed: before the reboot, nomad status <job_id> shows all periodic jobs as having one allocation with status complete. After the reboot, however, I sometimes spot jobs that now have two allocations, one with status complete and one with status lost, even though I'm pretty sure the job originally had only a single allocation with status complete:

ID            = frame.sh/dispatch-1563485106-254731d2/periodic-1563516540
Name          = frame.sh/dispatch-1563485106-254731d2/periodic-1563516540
Submit Date   = 2019-07-19T02:09:00-04:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
frame.sh    0       0         0        0       2         1

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
f4cf37fa  1f0723e2  frame.sh    0        run      complete  1m34s ago  1m34s ago
0ae2a209  dd0947d3  frame.sh    0        stop     lost      5h26m ago  1m34s ago

So, have I run into a bug, hit a Nomad operational limit (abusing it), or am I simply doing things wrong and shooting myself in the foot?

Thanks,
-Martin

@langmartin
Contributor

Martin,

There are a couple of fixes around client rebooting that will be included in 0.9.4, and we're planning to push a release candidate today. The issue where allocations come back as lost may be related to #5890, and you may also be encountering the GC failure addressed in #5905. Please try the RC when it's available!

What does your client configuration look like, especially the gc_* family of options? With those parameters set appropriately, the clients should avoid keeping too many of the completed allocs.

@radcool
Author

radcool commented Jul 19, 2019

Hi Lang,

OK, I'll wait for the 0.9.4 RC, try it out, and report back.

As for my node config, all I have in the client stanza is:

client {
  enabled = true
  network_speed = 10000
}

I've gone with the default GC values, as I don't usually tweak knobs I don't fully understand. Should I have overridden some of the defaults?

@radcool
Author

radcool commented Jul 23, 2019

@notnoop I installed 0.9.4-rc1 on one of my clients and then restarted that client without first draining it. Upon reboot the client started many allocations until memory (2 GB RAM + 2 GB swap) ran out. I managed to stop the process, rm -rf /var/db/nomad, and restart.

I'll have to try a reboot again once allocs build up in the data_dir.

@radcool
Author

radcool commented Jul 24, 2019

@notnoop Some more details...

I let allocs build up overnight on the 0.9.4-rc1 client. When I checked this morning, the <data_dir>/alloc folder contained 50 alloc subdirectories, which I guess is expected because I left gc_max_allocs at its default value of 50.

I then proceeded to reboot the client. Once the system had rebooted I used htop and ps to check the number of nomad processes and saw 569 /usr/local/bin/nomad logmon processes running, consuming all CPU, RAM and swap space.
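(For reference, I counted them with something like the following; the bracketed pattern just keeps grep from matching its own process.)

ps -ef | grep '[l]ogmon' | wc -l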

One of the first Nomad errors I saw when reviewing the systemd journal was this one:

[ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=06aa9a19-b3de-742d-33f9-98d6a200c8d2/frame.sh/317370ac

I saw 50 of these errors (one for each alloc in the data_dir).
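(Counted with something along these lines, again assuming the systemd unit is named nomad:)

journalctl -u nomad -b | grep -c 'failed to reattach to executor'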

Two lines below I saw this:

[INFO ] client.gc: garbage collecting allocation: alloc_id=5a47798d-71c1-c671-ac58-b6ea69782771 reason="number of allocations (3172) is over the limit (50)"

I don't know where that high number (3172) comes from.

I then have 796 events of the form:

[INFO ] client.gc: marking allocation for GC: alloc_id=xxx-xxx-xxx-xxx

Mixed in with 625 lines of the form:

[INFO ] client.gc: garbage collecting allocation: alloc_id=xxx-xxx-xxx-xxx reason="forced collection"

Then come 273 lines with:

[INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=xxx-xxx-xxx-xxx task=<task_name> @module=logmon path=/var/db/nomad/alloc/xxx-xxx-xxx-xxx/alloc/logs/.<task_name>.stdout.fifo timestamp=<timestamp>

All of which finally comes crashing down with:

Jul 24 10:25:49 <nomad_hostname> nomad[955]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Jul 24 10:25:49 <nomad_hostname> nomad[955]: SIGABRT: abort
Jul 24 10:25:49 <nomad_hostname> nomad[955]: PC=0x7fe3a9de12c7 m=2 sigcode=18446744073709551610
Jul 24 10:25:49 <nomad_hostname> nomad[955]: goroutine 0 [idle]:
Jul 24 10:25:49 <nomad_hostname> nomad[955]: runtime: unknown pc 0x7fe3a9de12c7
Jul 24 10:25:49 <nomad_hostname> nomad[955]: stack: frame={sp:0x7fe3a7b998f8, fp:0x0} stack=[0x7fe3a739a2a8,0x7fe3a7b99ea8)
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b997f8:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99808:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99818:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99828:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99838:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99848:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99858:  0000000000000000  0000000000000000
Jul 24 10:25:49 <nomad_hostname> nomad[955]: 00007fe3a7b99868:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99878:  0000000000000000  41969ea420000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99888:  0000000000000000  000000000043d8ce <runtime.sysmon+558>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99898:  0000000000435d46 <runtime.mstart1+230>  00000000004365f9 <runtime.allocm+361>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998a8:  0000000000436ce9 <runtime.newm+57>  00000000004373c9 <runtime.startm+313>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998b8:  0000000000437515 <runtime.handoffp+85>  0000000000000002
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998c8:  00007fe3aa172868  00000000024accbd
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998d8:  00007fe38c0008c0  0000000000000011
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998e8:  0000000002420cb8  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998f8: <00007fe3a9de29b8  0000000000000020
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99908:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99918:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99928:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99938:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99948:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99958:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99968:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99978:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99988:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99998:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999a8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999b8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999c8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999d8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999e8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: runtime: unknown pc 0x7fe3a9de12c7
Jul 24 10:25:50 <nomad_hostname> nomad[955]: stack: frame={sp:0x7fe3a7b998f8, fp:0x0} stack=[0x7fe3a739a2a8,0x7fe3a7b99ea8)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b997f8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99808:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99818:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99828:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99838:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99848:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99858:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99868:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99878:  0000000000000000  41969ea420000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99888:  0000000000000000  000000000043d8ce <runtime.sysmon+558>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99898:  0000000000435d46 <runtime.mstart1+230>  00000000004365f9 <runtime.allocm+361>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998a8:  0000000000436ce9 <runtime.newm+57>  00000000004373c9 <runtime.startm+313>
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998b8:  0000000000437515 <runtime.handoffp+85>  0000000000000002
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998c8:  00007fe3aa172868  00000000024accbd
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998d8:  00007fe38c0008c0  0000000000000011
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998e8:  0000000002420cb8  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b998f8: <00007fe3a9de29b8  0000000000000020
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99908:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99918:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99928:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99938:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99948:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99958:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99968:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99978:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99988:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b99998:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999a8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999b8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999c8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999d8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: 00007fe3a7b999e8:  0000000000000000  0000000000000000
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 1 [select]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/command/agent.(*Command).handleSignals(0xc00024cfc0, 0xc0003f9080)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:738 +0x1de
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc00024cfc0, 0xc00003a0e0, 0x2, 0x2, 0x0)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:653 +0xf11
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc00010ab40, 0xc00010ab40, 0xc0003c3fc0, 0x3b)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli/cli.go:255 +0x207
Jul 24 10:25:50 <nomad_hostname> nomad[955]: main.RunCustom(0xc00003a0d0, 0x3, 0x3, 0x0)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/main.go:131 +0x48e
Jul 24 10:25:50 <nomad_hostname> nomad[955]: main.Run(0xc00003a0d0, 0x3, 0x3, 0xc000088058)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/main.go:76 +0x3f
Jul 24 10:25:50 <nomad_hostname> nomad[955]: main.main()
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/main.go:72 +0x62
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 5 [syscall]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: os/signal.signal_recv(0x0)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /usr/local/go/src/runtime/sigqueue.go:139 +0x9c
Jul 24 10:25:50 <nomad_hostname> nomad[955]: os/signal.loop()
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /usr/local/go/src/os/signal/signal_unix.go:23 +0x22
Jul 24 10:25:50 <nomad_hostname> nomad[955]: created by os/signal.init.0
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /usr/local/go/src/os/signal/signal_unix.go:29 +0x41
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 27 [select]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSignal).run(0xc0001bc340)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:64 +0xb3
Jul 24 10:25:50 <nomad_hostname> nomad[955]: created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.NewInmemSignal
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:38 +0x13e
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 28 [sleep]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: time.Sleep(0x3b9aca00)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /usr/local/go/src/runtime/time.go:105 +0x14f
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).collectStats(0xc0001622d0)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/metrics.go:230 +0x2f
Jul 24 10:25:50 <nomad_hostname> nomad[955]: created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.New
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/start.go:79 +0x170
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 29 [select]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/command/agent/consul.(*ServiceClient).Run(0xc000162690)
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/client.go:352 +0x44d
Jul 24 10:25:50 <nomad_hostname> nomad[955]: created by github.com/hashicorp/nomad/command/agent.(*Agent).setupConsul
Jul 24 10:25:50 <nomad_hostname> nomad[955]: /opt/gopath/src/github.com/hashicorp/nomad/command/agent/agent.go:1034 +0x170
Jul 24 10:25:50 <nomad_hostname> nomad[955]: goroutine 30 [select]:
Jul 24 10:25:50 <nomad_hostname> nomad[955]: github.com/hashicorp/nomad/drivers/shared/eventer.(*Eventer).eventLoop(0xc001125c80)

and it continues for a while.

@radcool
Author

radcool commented Aug 19, 2019

@notnoop Am I the only one to have reported such an issue?

@notnoop
Contributor

notnoop commented Aug 24, 2019

@radcool Thank you so much for the detailed messages and debugging. That is very helpful.

I'm sorry that I got sidetracked with other work. This is one of our high-priority issues to address in 0.10, and I'll investigate it and follow up with questions as they come up.

@radcool
Author

radcool commented Aug 24, 2019

No worries @notnoop. I thought perhaps this was an issue on my side, not seen anywhere else.

Thanks.

notnoop pushed a commit that referenced this issue Aug 25, 2019
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set. Periodically, these runners get persisted until the
alloc is GCed by the server. During that time, the client DB will contain the
alloc but not its individual task statuses or completed state. On client
restart, the client assumes the alloc is in pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, concurrently while the alloc runner is running and potentially
changing state, is a recipe for bugs.

Fixes #5984
Related to #5890
@notnoop
Contributor

notnoop commented Aug 25, 2019

@radcool Thank you so much again for your detailed observations and notes! It helped me get to the bottom of the issue quickly. Would it be possible for you to test a client with #6207 changes?

@radcool
Author

radcool commented Aug 26, 2019

@notnoop I compiled a new Nomad binary from #6207 as you requested:

[root@nomad07 ~]# nomad version
Nomad v0.10.0-dev (a80643e46d154c620c7c64709d1a7ba4cb7c288f)

installed it on a Nomad client, and rebooted that client.

Unfortunately, once it came back up, both vCPUs shot up to 100% and memory rapidly started filling up. Perhaps I did something wrong, but whatever I did, I got the same behavior as before. It's a bit late here, but tomorrow I'll sift through the journal logs and report back.

@radcool
Author

radcool commented Aug 26, 2019

@notnoop I did another reboot of the client running 0.10.0-dev this morning and the runaway behavior did not occur this time. Upon re-reading the following text from #6207:

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

I guess this is actually expected.

Can you please confirm that going from 0.9.x to v0.10.0-dev will not immediately fix the issue after a process restart, since 0.10.0 will still try to process the allocs from the 0.9.x-created state DB, and that only once 0.10.0-dev is the running version will the issue stop occurring on subsequent reboots, because allocs will no longer be persisted to the state DB under 0.10.0?

@notnoop
Contributor

notnoop commented Aug 26, 2019

Can you please confirm that going from 0.9.x --> v0.10.0-dev will initially not fix the issue...

@radcool This reading is correct. The change doesn't recover from already "corrupt" persisted state left by the earlier buggy client process, but it ensures that state is stored correctly for future client restarts.

I'll look into options for recovering without locking up or slowing restores unnecessarily.

@radcool
Author

radcool commented Aug 27, 2019

@notnoop Thanks for your help by the way!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2022