Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pending allocations #6461

Open
jorgemarey opened this issue Oct 10, 2019 · 11 comments
Open

Pending allocations #6461

jorgemarey opened this issue Oct 10, 2019 · 11 comments

Comments

@jorgemarey
Copy link
Contributor

jorgemarey commented Oct 10, 2019

Nomad version

Nomad v0.9.3 (c5e8b66)

Operating system and Environment details

RHEL 7.5

Issue

At some point all allocations placed on a node remain in pending state.

Reproduction steps

I'm not able to reproduce this behaviour, but it happend several times on different nodes on our cluster once we updated from 0.8.4 to 0.9.3

I found out that this happens after the GC runs when allocations reach gc_max_allocs. But it doesn't happens every time.

Nomad Client logs (if appropriate)

Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T21:05:00.155479Z","filtered":103,"index":29757357,"pulled":15,"total":118}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T21:05:00.155734Z","added":1,"ignored":103,"removed":0,"updated":14}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-10-09T21:05:00.186512Z","alloc_id":"ea5f6df0-0140-45e5-958a-13f389d95e50","reason":"new allocations and over max (50)"}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"alloc garbage collected","@module":"client.gc","@timestamp":"2019-10-09T21:05:00.186576Z","alloc_id":"ea5f6df0-0140-45e5-958a-13f389d95e50"}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-10-09T21:05:00.186614Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d","reason":"new allocations and over max (50)"}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-10-09T21:05:00.188295Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d"}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2019-10-09T21:05:00.466065Z","duration":277547156441,"method":"GET","path":"/v1/node/dcce6698-f6a0-8f41-7341-2949da2a0a8c/allocations?index=29756881\u0026region=es\u0026wait=300000ms"}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T21:05:00.480566Z","filtered":103,"index":29757363,"pulled":15,"total":118}
Oct  9 21:05:00 w-008a19cc-0017 nomad: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2019-10-09T21:05:00.830025Z","duration":270978657,"method":"GET","path":"/v1/node/dcce6698-f6a0-8f41-7341-2949da2a0a8c/allocations?index=29757357\u0026region=es\u0026wait=300000ms"}
Oct  9 21:10:02 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T21:10:02.289686Z","filtered":103,"index":29757363,"pulled":15,"total":118}
Oct  9 21:10:12 w-008a19cc-0017 nomad: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2019-10-09T21:10:12.115422Z","duration":311206778941,"method":"GET","path":"/v1/node/dcce6698-f6a0-8f41-7341-2949da2a0a8c/allocations?index=29757363\u0026region=es\u0026wait=300000ms"}

As you can see in the logs, the allocation with ID: e6ee3260-a1c9-c283-ad17-d1cb6950774d doesn't get garbage collected.

That allocation shows the following (note that times in the log are UTC and in my PC are GMT+2):

e6ee3260-a1c9-c283-ad17-d1cb6950774d-events

The last event is when I restarted the nomad service in the instance.
Seems like nomad is trying to garbage collect an already dead allocation.

As this happened to us several times before I enabled the profiler on the node to get some data. I saw the following (taken at Thu Oct 10 09:00:00 CEST 2019):

goroutine 1892 [select, 595 minutes]:
github.com/hashicorp/nomad/client.(*AllocGarbageCollector).destroyAllocRunner(0xc00048b0e0, 0xc00dc3f620, 0x24, 0x2499a00, 0xc00890cf00, 0xc0066550b0, 0x21)
	/opt/gopath/src/github.com/hashicorp/nomad/client/gc.go:185 +0x2d9
github.com/hashicorp/nomad/client.(*AllocGarbageCollector).MakeRoomFor(0xc00048b0e0, 0xc001c8db70, 0x1, 0x1, 0x4, 0x4)
	/opt/gopath/src/github.com/hashicorp/nomad/client/gc.go:257 +0x11a
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc00001c240, 0xc0074b0880)
	/opt/gopath/src/github.com/hashicorp/nomad/client/client.go:2028 +0x6b7
github.com/hashicorp/nomad/client.(*Client).run(0xc00001c240)
	/opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1485 +0x91
created by github.com/hashicorp/nomad/client.NewClient
	/opt/gopath/src/github.com/hashicorp/nomad/client/client.go:476 +0x16d7

goroutine 850951 [select, 17038 minutes]:
github.com/hashicorp/nomad/client.(*AllocGarbageCollector).destroyAllocRunner(0xc00048b0e0, 0xc007a421b0, 0x24, 0x2499a00, 0xc00cd63200, 0x21d83d5, 0x11)
	/opt/gopath/src/github.com/hashicorp/nomad/client/gc.go:185 +0x2d9
github.com/hashicorp/nomad/client.(*AllocGarbageCollector).Collect(0xc00048b0e0, 0xc007a421b0, 0x24, 0x872c01)
	/opt/gopath/src/github.com/hashicorp/nomad/client/gc.go:209 +0x196
created by github.com/hashicorp/nomad/client.(*Client).removeAlloc
	/opt/gopath/src/github.com/hashicorp/nomad/client/client.go:2133 +0x39c

The time of first one (595 minutes) is the same that the first pending allocation had when placed on the node (Oct 9 21:05:00). The other tracks back to this moment:

Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-09-28T11:01:51.614967Z","added":0,"ignored":76,"removed":1,"updated":0}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-09-28T11:01:51.615038Z","added":0,"errors":0,"ignored":76,"removed":1,"updated":0}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-09-28T11:01:51.615055Z","alloc_id":"049b6b85-f1d6-efb3-78ee-681f89c9ad63","reason":"forced collection"}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"alloc garbage collected","@module":"client.gc","@timestamp":"2019-09-28T11:01:51.615099Z","alloc_id":"049b6b85-f1d6-efb3-78ee-681f89c9ad63"}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-09-28T11:01:51.713433Z","filtered":74,"index":28028356,"pulled":0,"total":74}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-09-28T11:01:51.713598Z","added":0,"ignored":74,"removed":2,"updated":0}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-09-28T11:01:51.713644Z","added":0,"errors":0,"ignored":74,"removed":2,"updated":0}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-09-28T11:01:51.713650Z","alloc_id":"ec8a075b-7946-7fee-303f-a02fdf53ae1a","reason":"forced collection"}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-09-28T11:01:51.713660Z","alloc_id":"743b7e55-54b4-4187-5a61-494a262e969b","reason":"forced collection"}
Sep 28 11:01:51 w-008a19cc-0017 nomad: {"@level":"debug","@message":"alloc garbage collected","@module":"client.gc","@timestamp":"2019-09-28T11:01:51.724669Z","alloc_id":"743b7e55-54b4-4187-5a61-494a262e969b"}
Sep 28 11:02:49 w-008a19cc-0017 nomad: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2019-09-28T11:02:49.646780Z","duration":311948611113,"method":"GET","path":"/v1/node/dcce6698-f6a0-8f41-7341-2949da2a0a8c/allocations?index=28023851\u0026region=es\u0026wait=300000ms"}

Seems like the allocation ec8a075b-7946-7fee-303f-a02fdf53ae1a doesnt't get correctly garbage collected either.

It seems that when the node reach gc_max_allocs and gc_parallel_destroys are blocked all allocations that are placed on that node remain pending state until the nomad service is rebooted. At least that is what I feel is happening here.

Weirdly, I thought that pending allocations would be blocking here: https://github.com/hashicorp/nomad/blob/v0.9.3/client/gc.go#L180

But no goroutines appeared there, so they must be waiting someplace else...

@cgbaker
Copy link
Contributor

cgbaker commented Oct 10, 2019

hi @jorgemarey , there were a number of changes in 0.9.6 (released yesterday) around client and allocations, but I don't have any good reason to believe that they will address the issue that you're seeing. I'm going to look at your report and see if I can reproduce it, but in the mean time, if you are able to try out 0.9.6, I would be curious to see whether it also exhibits this behavior.

@jorgemarey
Copy link
Contributor Author

Hi @cgbaker , thanks for the reply. I'll change the nomad version and see if this happens again.
Just to add more info. I took a look at the logs at the moment when the allocation in the image ( e6ee3260-a1c9-c283-ad17-d1cb6950774d) gets the first kill signal.

Oct  9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:37.433289Z","added":0,"ignored":73,"removed":0,"updated":13}
Oct  9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:37.479736Z","added":0,"errors":0,"ignored":73,"removed":0,"updated":13}
Oct  9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T16:00:37.703560Z","filtered":74,"index":29722784,"pulled":12,"total":86}
Oct  9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:37.703838Z","added":0,"ignored":74,"removed":0,"updated":12}
Oct  9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:37.756186Z","added":0,"errors":0,"ignored":74,"removed":0,"updated":12}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"error fetching stats of task","@module":"client.alloc_runner.task_runner.task_hook.stats_hook","@timestamp":"2019-10-09T16:00:39.442363Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d","error":"container stopped","task":"pdp"}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2019-10-09T16:00:39.446006Z","driver":"docker","path":"/opt/nomad/bin/nomad","pid":22377}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2019-10-09T16:00:39.446093Z","driver":"docker"}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2019-10-09T16:00:39.495588Z","driver":"docker","image_id":"sha256:cb101cb1872bdda9104a0e39c3822af6dd939c44ab18ed4996dcf5250668ed59","references":0}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-10-09T16:00:39.501933Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d"}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T16:00:39.663174Z","filtered":74,"index":29722786,"pulled":12,"total":86}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:39.663424Z","added":0,"ignored":74,"removed":0,"updated":12}
Oct  9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:39.694808Z","added":0,"errors":0,"ignored":74,"removed":0,"updated":12}

I don't see anything weird, but maybe you do. Maybe the allocation was killed but internally it wasn't removed from the GC heap and it was tried to be garbage collected again and that produced a lock on the destroyAllocRunner method? And that lead to when gc_max_allocs is reached the method MakeRoomFor (that calls destroyAllocRunner) blocks and no allocations gets placed on the node after that.

@cgbaker
Copy link
Contributor

cgbaker commented Oct 11, 2019

@jorgemarey , yeah, that's sort of what i was theorizing as well.

let me know how 0.9.6 works out.

@tzahimizrahi
Copy link

#5363 it looks the same issue and says it still reproduce in 0.10.0

@shantanugadgil
Copy link
Contributor

I too often notice this unexplained behavior where the cluster seems to be waiting for something, dunno what.

For my experiments I have suspected the default backoff timings which cause it to go into some sort of wait.

Usually to get over it quickly, I use nomad-helper with reevaluate-all. I know it is no solution, but a workaround.

@jorgemarey
Copy link
Contributor Author

Hi @cgbaker . This happened to me again on v0.9.6

Here is some information about the node:

ID          = 1e3b2610-2de1-f3d3-6278-10f9442c5859
Name        = w-8af25c22-0011
Class       = default
DC          = my-dc
Drain       = true; 2019-11-27T08:50:54Z deadline; ignoring system jobs
Eligibility = ineligible
Status      = ready
Uptime      = 311h9m59s

Drivers
Driver    Detected  Healthy  Message                                                                         Time
docker    true      true     Healthy                                                                         2019-11-14T08:56:50Z
exec      true      true     Healthy                                                                         2019-11-14T08:56:50Z
java      false     false    <none>                                                                          2019-11-14T08:56:50Z
qemu      false     false    <none>                                                                          2019-11-14T08:56:50Z
raw_exec  false     false    disabled                                                                        2019-11-14T08:56:50Z
rkt       false     false    Failed to execute rkt version: exec: "rkt": executable file not found in $PATH  2019-11-14T08:56:50Z

Node Events
Time                  Subsystem  Message                  Details
2019-11-27T07:50:54Z  Drain      Node drain strategy set  <none>
2019-11-14T08:42:55Z  Cluster    Node re-registered       <none>
2019-11-14T08:42:09Z  Cluster    Node heartbeat missed    <none>
2019-10-30T14:07:12Z  Cluster    Node registered          <none>

Allocated Resources
CPU              Memory         Disk
19766/24746 MHz  24 GiB/30 GiB  8.6 GiB/27 GiB

Allocation Resource Utilization
CPU             Memory
2157/24746 MHz  5.9 GiB/30 GiB

Host Resource Utilization
CPU             Memory         Disk
6822/25496 MHz  17 GiB/31 GiB  4.4 GiB/30 GiB

Allocations
ID        Eval ID   Node ID   Task Group                Version       Desired  Status   Created                    Modified
ee940ff5  016b0c20  1e3b2610  my-awesome-task-group-01  824638234992  run      pending  2019-11-26T11:42:59+01:00  2019-11-26T11:42:59+01:00
7df0280f  97934eb2  1e3b2610  my-awesome-task-group-02  824637944160  stop     pending  2019-11-26T10:48:22+01:00  2019-11-26T10:57:08+01:00
50a645e9  75cb8b8e  1e3b2610  my-awesome-task-group-03  824635333456  stop     pending  2019-11-26T10:34:34+01:00  2019-11-26T10:41:16+01:00
59170ea3  bb7b8702  1e3b2610  my-awesome-task-group-04  824638153008  run      pending  2019-11-26T10:29:55+01:00  2019-11-26T10:29:55+01:00
486653be  8aeeebee  1e3b2610  my-awesome-task-group-05  824635331440  stop     pending  2019-11-25T13:00:00+01:00  2019-11-27T07:16:59+01:00
47478c55  79e7bbda  1e3b2610  my-awesome-task-group-05  824635329888  stop     pending  2019-11-23T16:53:03+01:00  2019-11-26T10:27:55+01:00
4585de6b  55da26ce  1e3b2610  my-awesome-task-group-06  824636173440  run      pending  2019-11-23T16:05:32+01:00  2019-11-23T16:05:32+01:00
8c2c8616  31559a46  1e3b2610  my-awesome-task-group-07  824638154928  run      pending  2019-11-23T01:00:00+01:00  2019-11-23T01:00:00+01:00
1de288e4  399f5d60  1e3b2610  my-awesome-task-group-08  824636578784  stop     pending  2019-11-23T00:14:27+01:00  2019-11-26T10:28:00+01:00
6013f2b3  9895fcb5  1e3b2610  my-awesome-task-group-09  824639410272  stop     running  2019-11-22T13:54:48+01:00  2019-11-27T08:50:56+01:00
5226161d  3f9cbc28  1e3b2610  my-awesome-task-group-10  824635312832  stop     running  2019-11-22T13:49:37+01:00  2019-11-27T08:50:55+01:00
8da1b5f1  33c5685a  1e3b2610  my-awesome-task-group-11  824641962080  stop     running  2019-11-22T13:10:40+01:00  2019-11-27T08:50:55+01:00
ddafdeb2  4d16c207  1e3b2610  my-awesome-task-group-12  824639033344  stop     running  2019-11-20T11:18:49+01:00  2019-11-27T08:50:55+01:00
cdb29e98  74991fdc  1e3b2610  my-awesome-task-group-13  824641502776  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:56+01:00
02c47c6a  166dafe1  1e3b2610  my-awesome-task-group-14  824640186736  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
dd9e1cae  3fe37dd9  1e3b2610  my-awesome-task-group-15  824639807072  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
0bd892eb  658e3bd9  1e3b2610  my-awesome-task-group-16  824640188816  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
813155e1  805a515e  1e3b2610  my-awesome-task-group-17  824637945552  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
22a86288  b747ec55  1e3b2610  my-awesome-task-group-18  824635953776  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
85ad09a4  43c4a47e  1e3b2610  my-awesome-task-group-19  824637947632  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
0d0a66f6  e4fe735b  1e3b2610  my-awesome-task-group-20  824640190880  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
3c4b2b41  bdc4e0fd  1e3b2610  my-awesome-task-group-21  824635327488  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
85e1283b  61c62a1e  1e3b2610  my-awesome-task-group-22  824640026720  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
0eadac30  a9bf5d48  1e3b2610  my-awesome-task-group-23  824638155584  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
5e24e0d8  71013171  1e3b2610  my-awesome-task-group-24  824639407736  stop     running  2019-11-19T18:34:24+01:00  2019-11-27T08:50:55+01:00
18f76384  8075b3ca  1e3b2610  my-awesome-task-group-25  824636572880  stop     running  2019-11-19T18:29:57+01:00  2019-11-27T08:50:55+01:00
aa93b0ec  1912e991  1e3b2610  my-awesome-task-group-26  824641499936  stop     running  2019-11-19T18:29:05+01:00  2019-11-27T08:50:55+01:00
9f81fd0f  1912e991  1e3b2610  my-awesome-task-group-27  824641967168  stop     running  2019-11-19T18:29:05+01:00  2019-11-27T08:50:55+01:00
a395e909  ee381d2c  1e3b2610  my-awesome-task-group-28  824641969840  stop     running  2019-11-19T18:29:05+01:00  2019-11-27T08:50:55+01:00
d63ee7a7  4a7b9dbd  1e3b2610  my-awesome-task-group-29  824639803520  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
57a6c5c0  82b3dfe6  1e3b2610  my-awesome-task-group-30  824635317792  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
e663560f  b745ec07  1e3b2610  my-awesome-task-group-31  824642736560  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
85b14ee2  b422baae  1e3b2610  my-awesome-task-group-32  824640022512  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
e62b96b5  82b3dfe6  1e3b2610  my-awesome-task-group-33  824642732288  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
62fe2923  663196a3  1e3b2610  my-awesome-task-group-34  824643776240  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
65a363c8  26c4f2e9  1e3b2610  my-awesome-task-group-35  824643779616  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
5c1114c1  4ccdda9c  1e3b2610  my-awesome-task-group-36  824633981688  stop     running  2019-11-19T14:36:30+01:00  2019-11-27T08:50:55+01:00
c4c25cbf  74b2b79d  1e3b2610  my-awesome-task-group-37  824638158584  run      running  2019-11-14T09:42:55+01:00  2019-11-24T03:21:30+01:00
4432dfb2  aab128e3  1e3b2610  my-awesome-task-group-38  824636171160  run      running  2019-11-14T09:42:55+01:00  2019-11-14T09:43:18+01:00

As you can see, the node is draining. I tried to move the allocaitions to another node, but they stayed running. I had to restart nomad to move the allocations.

In the moment of the first allocation that get the pending state the logs shows again:

Nov 22 23:14:27 w-8af25c22-0011 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-11-22T23:14:27.767394Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","reason":"new allocations and over max (50)"}
Nov 22 23:14:27 w-8af25c22-0011 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-22T23:14:27.768906Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7"}

That allocation doesn't appear in the list of allocations of the node (I guess it was garbage collected as the log says, but maybe not correctly).
Searching the logs for that allocation, I saw a few logs referencing that allocations after this point in time (which is weird).

Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error renewing token","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.755652Z","error":"failed to renew the vault token: Error making API request.\n\nURL: PUT https://vault.service.consul:8200/v1/auth/token/renew-self\nCode: 403. Errors:\n\n* permission denied"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to renew Vault token","@module":"client.alloc_runner.task_runner.task_hook.vault","@timestamp":"2019-11-23T04:42:15.755704Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"failed to renew the vault token: Error making API request.\n\nURL: PUT https://vault.service.consul:8200/v1/auth/token/renew-self\nCode: 403. Errors:\n\n* permission denied","task":"my-awesome-task-50"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error deriving vault tokens","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.757211Z","error":"Can't request Vault token for terminal allocation"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error deriving token","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.757264Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"Can't request Vault token for terminal allocation","task_names":["my-awesome-task-50"]}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to derive Vault token","@module":"client.alloc_runner.task_runner.task_hook.vault","@timestamp":"2019-11-23T04:42:15.757298Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"Can't request Vault token for terminal allocation","server_side":true,"task":"my-awesome-task-50"}

Following are the other logs that I found about that allocation.

When it started:

Nov 21 16:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"update hook failed","@module":"client.alloc_runner.task_runner","@timestamp":"2019-11-21T16:42:15.746594Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"unable to get address for service \"my-awesome-task-50\": cannot use address_mode=\"driver\": no driver network exists","name":"consul_services","task":"my-awesome-task-50"}
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.746836 [INFO] (runner) creating new runner (dry: false, once: false)
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.747233 [INFO] (runner) creating watcher
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.747464 [INFO] (runner) starting
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.846437 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/235243f0-f863-5600-bd2b-cf97a9c255b7/openidm-work-aws/secrets/file.env"
Nov 21 16:44:16 w-8af25c22-0011 nomad: {"@level":"info","@message":"created container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-21T16:44:16.914962Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 21 16:44:18 w-8af25c22-0011 nomad: {"@level":"info","@message":"started container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-21T16:44:18.530035Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 21 16:44:18 w-8af25c22-0011 consul: 2019/11/21 16:44:18 [INFO] agent: Synced service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-1"
Nov 21 16:44:18 w-8af25c22-0011 consul: 2019/11/21 16:44:18 [INFO] agent: Synced service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-2"

Another moment.

Nov 21 17:05:37 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-21T17:05:37.051190Z","access_token_id":"","access_token_name":"","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","command":["bash"],"exec_id":"fb5eaf11-ec7e-cdda-326d-42e1430e75b0","task":"my-awesome-task-50","tty":true}
Nov 21 17:10:08 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-21T17:10:08.602232Z","access_token_id":"","access_token_name":"","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","command":["bash"],"exec_id":"e94c44d0-05fb-5ae9-5535-283cc5effbd3","task":"my-awesome-task-50","tty":true}



Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-1"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered check "_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-2"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered check "_nomad-check-0312618a6f394f78168db7261452fe0ad71d915c"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [ERR] http: Request PUT /v1/agent/check/deregister/_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b, error: Unknown check "_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b" from=127.0.0.1:49318
Nov 22 09:55:52 w-8af25c22-0011 nomad: {"@level":"warn","@message":"failed to update services in Consul","@module":"consul.sync","@timestamp":"2019-11-22T09:55:52.918465Z","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b\")"}
Nov 22 09:55:52 w-8af25c22-0011 nomad: {"@level":"info","@message":"successfully updated services in Consul","@module":"consul.sync","@timestamp":"2019-11-22T09:55:52.925661Z"}
Nov 22 09:55:57 w-8af25c22-0011 dockerd: time="2019-11-22T09:55:57.880753963Z" level=info msg="Container 43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120 failed to exit within 0 seconds of signal 15 - using the force"
Nov 22 09:55:57 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session ended with an error","@module":"client","@timestamp":"2019-11-22T09:55:57.947800Z","code":500,"error":"stream closed"}
Nov 22 09:55:57 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session ended with an error","@module":"client","@timestamp":"2019-11-22T09:55:57.948122Z","code":500,"error":"stream closed"}
Nov 22 09:55:58 w-8af25c22-0011 nomad: {"@level":"info","@message":"stopped container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-22T09:55:58.961653Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 22 09:55:58 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to start stats collection for task with unrecoverable error","@module":"client.alloc_runner.task_runner.task_hook.stats_hook","@timestamp":"2019-11-22T09:55:58.963746Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"container stopped","task":"my-awesome-task-50"}
Nov 22 09:55:59 w-8af25c22-0011 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-22T09:55:59.126528Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7"}

As you can see, here there is another message about garbage collecting this allocation, which I guess shouln't happen.

Thats all I could find, I'll keep the logs anyway if they can help in a future.
I'll be happy to provide any other info if needed.

@jorgemarey
Copy link
Contributor Author

Hi again,
I think I found the root cause for our problems (at least the ones reported here). I revisited the goroute traces and found the following:

goroutine 4793402 [semacquire, 611 minutes]:
sync.runtime_Semacquire(0xc0031afc08)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc0031afc00)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc00890cf00, 0xc008d65020)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:525 +0x4b2
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).destroyImpl(0xc00890cf00)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:738 +0x2f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Destroy
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:806 +0xe2

goroutine 4793403 [select, 611 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc006798b40, 0x24828a0, 0xc0000c6038, 0xc006ca3400, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:80 +0x225
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1(0xc0031afc00, 0xc00890cf00, 0xc0031afbf0, 0xc008d3d410, 0xc0065f3e03, 0x3, 0xc006798b40)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:514 +0xa4
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:512 +0x49c
goroutine 850953 [semacquire, 17054 minutes]:
sync.runtime_Semacquire(0xc005714558)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc005714550)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc00cd63200, 0xc007ae5da0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:525 +0x4b2
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).destroyImpl(0xc00cd63200)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:738 +0x2f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Destroy
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:806 +0xe2

goroutine 850954 [select, 17054 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc002354d80, 0x24828a0, 0xc0000c6038, 0xc00cb9e000, 0xc0001aa590, 0xc002cbf7b0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:80 +0x225
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1(0xc005714550, 0xc00cd63200, 0xc005714540, 0xc00601b2c0, 0xc00066df5b, 0x3, 0xc002354d80)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:514 +0xa4
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:512 +0x49c

Two goroutes were blocking when waiting to finish task killing. Thus reaching gc_parallel_destroys, which made new allocations block when the node has more than gc_max_allocs.

The livecycle manager calls killCtxCancel and wait for it for the task to end o the context to be done. The problem is that none of those things happend and made that block.

Inspecting the goroutines I found 2 that started on task_runner and were blocking:

goroutine 387964 [select, 18503 minutes]:
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc001484c00, 0xc006ad7180, 0x1)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:242 +0xcc
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(0xc001484c00, 0xc000051900, 0x963ce1)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:253 +0x2b
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000d34dc0, 0x1fd93e0, 0xc00d226080, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:741 +0x5c0
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000d34dc0, 0x0, 0x10)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:648 +0x3d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc004a47d40, 0xc00c701838, 0xc00c701828, 0x9699c0, 0xc004a47d40)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:534 +0x7f
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc004a47d40, 0x1fd93e0, 0xc00d226080, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:647 +0x8d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.invoke(0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00d226060, 0x1fd93e0, 0xc00d226080, 0xc002043200, 0x0, ...)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:73 +0x141
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc002043200, 0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00d226060, 0x1fd93e0, 0xc00d226080, 0x0, ...)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:37 +0x1b3
github.com/hashicorp/nomad/client/logmon/proto.(*logMonClient).Stop(0xc00020ea08, 0x24828a0, 0xc0000c6030, 0xc00d226060, 0x0, 0x0, 0x0, 0x0, 0x0, 0x22635e8)
	/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/proto/logmon.pb.go:268 +0xd2
github.com/hashicorp/nomad/client/logmon.(*logmonClient).Stop(0xc006803a40, 0x113f467, 0xc00c701ac8)
	/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/client.go:40 +0x7a
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*logmonHook).Stop(0xc0014e9080, 0x2482860, 0xc008881ec0, 0xc000a44400, 0x3af5418, 0xc0031c2901, 0x5)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/logmon_hook.go:195 +0x6f
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).stop(0xc002354d80, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:366 +0x2a5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc002354d80)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:529 +0x5f1
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:297 +0xa6
goroutine 424573 [select, 915 minutes]:
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00d3b5600, 0xc00cb98af0, 0x1)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:242 +0xcc
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(0xc00d3b5600, 0xc000046000, 0x963ce1)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:253 +0x2b
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc00b064160, 0x1fd93e0, 0xc00b58d820, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:741 +0x5c0
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc00b064160, 0x0, 0x10)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:648 +0x3d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000d857a0, 0xc004ac9838, 0xc004ac9828, 0x9699c0, 0xc000d857a0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:534 +0x7f
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc000d857a0, 0x1fd93e0, 0xc00b58d820, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:647 +0x8d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.invoke(0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00b58d800, 0x1fd93e0, 0xc00b58d820, 0xc00a60a900, 0x0, ...)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:73 +0x141
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc00a60a900, 0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00b58d800, 0x1fd93e0, 0xc00b58d820, 0x0, ...)
	/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:37 +0x1b3
github.com/hashicorp/nomad/client/logmon/proto.(*logMonClient).Stop(0xc0061d80a8, 0x24828a0, 0xc0000c6030, 0xc00b58d800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x22635e8)
	/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/proto/logmon.pb.go:268 +0xd2
github.com/hashicorp/nomad/client/logmon.(*logmonClient).Stop(0xc003122ea0, 0x113f467, 0xc004ac9ac8)
	/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/client.go:40 +0x7a
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*logmonHook).Stop(0xc002732300, 0x2482860, 0xc003965a40, 0xc00000fbd8, 0x3af5418, 0xc00212f001, 0x5)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/logmon_hook.go:195 +0x6f
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).stop(0xc006798b40, 0x0, 0x0)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:366 +0x2a5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc006798b40)
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:529 +0x5f1
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:297 +0xa6

Looking at the code I saw that when livecycle calls killCtxCancel that cancels the context, but the only posthook that does not use that context is logmon (there the context is ignored). Thus leading to all the blocking explained earlier.

The problem (beside the context being ingored) seems to be on the grpc library. I track the goroutine to this, but I don't really know if thats what's happening here. Maybe upgrading the grpc dependencies will fix this.

Hope that this heepls to identify the problem completely.

@notnoop
Copy link
Contributor

notnoop commented Nov 29, 2019

@jorgemarey That's some great sleuthing! I'll dig into it next week and follow up. Thank you very much!

notnoop pushed a commit that referenced this issue Dec 2, 2019
Add an RPC timeout for logmon.  In
#6461 (comment) ,
`logmonClient.Stop` locked up and indefinitely blocked the task runner
destroy operation.

This is an incremental improvement.  We still need to follow up to
understand how we got to that state, and the full impact of locked-up
Stop and its link to pending allocations on restart.
@stale
Copy link

stale bot commented Feb 27, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale
Copy link

stale bot commented Mar 28, 2020

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed Mar 28, 2020
@notnoop notnoop reopened this Mar 28, 2020
@Legogris
Copy link

Legogris commented Jun 18, 2020

I think hit this too now.
I think this is what happened: Had a job that may or may not fail due to certain conditions (still in development, this is not in a production environment).

Added job.update.auto_revert = true.
Job was failing, just like the previous allocation.
Nomad UI now says Failed due to progress deadline - no stable job version to auto revert to. Attempting to start the job again does nothing, but the folloing shows up in logs:

http: request failed: method=GET path=/v1/job/testjob/deployment?index=226130 error="rpc error: EOF" code=500

I am also seeing this in the log (searching for this error is what led me here):

nomad.client: DeriveVaultToken failed: recoverable=false error="Can't request Vault token for terminal allocation"

There is a Vault integration on the nomad cluster, but not for this job.

Attempting to trigger gc through the API does remove all the allocations, but does not remove the job, and no new attempt at running new revisions have any effect.

When restarting the leader, there is a batch of logs coming on another node:

http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob?index=226202 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob?index=226202 error="rpc error: EOF" code=500

During the failing attempts, there was no log output from the job though I would expect it, so it looks like it could be related to the loggingm as you mentioned @notnoop

Nomad version 0.11.3.

jorgemarey pushed a commit to jorgemarey/nomad that referenced this issue Nov 1, 2023
Add an RPC timeout for logmon.  In
hashicorp#6461 (comment) ,
`logmonClient.Stop` locked up and indefinitely blocked the task runner
destroy operation.

This is an incremental improvement.  We still need to follow up to
understand how we got to that state, and the full impact of locked-up
Stop and its link to pending allocations on restart.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants