Orphaned Health Check #1524
Comments
Hmm, sorry, I am having a hard time following. What you are saying is that Consul has a health check registered that does not correspond to any Nomad service running on that machine?
@dadgar Correct, I'm seeing the above health check still firing even after the Nomad job/task group has gone away. Here is another example from another Nomad client in the same region:
@jshaw86 Is the check registered in Consul? And are those logs from Nomad or Consul?
What type of check is this? Is it a script check?
@jshaw86 Is this reproducible? When you say you "tore the jobs down serially", how did you do that? If you let the host run for a while, does the check deregister eventually after 4 hours?
@sean- I am able to reproduce it consistently on our staging environment by scheduling and tearing down a Nomad job every 10s, one at a time (serially), while also running nomad gc once a minute from system cron on a Nomad server.
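For context, a minimal sketch of that reproduction loop, assuming a placeholder job file example.nomad with a Consul-registered HTTP check (the exact job and commands are not shown in the thread):

```bash
#!/usr/bin/env bash
# Rough sketch of the reproduction described above: register and then stop a
# job roughly every 10 seconds, one at a time. "example.nomad" and the job
# name "example" are placeholders, not taken from the thread.
while true; do
  nomad run example.nomad
  sleep 10
  nomad stop example
  sleep 10
done
```

The once-a-minute garbage collection from cron could be done against the HTTP API, e.g. `* * * * * curl -s -X PUT http://localhost:4646/v1/system/gc` (the port and invocation are assumptions; the thread only says "nomad gc" on system cron).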
@jshaw86 When this happens next time, can you share the result of curl localhost:8500/v1/agent/checks?
So I just ran: …
@jshaw86 It would be great, next time you get into this condition, to grab the Consul agent logs, the Nomad client logs, and the checks diptanu mentioned. If they are sensitive you can email them to me.
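A sketch of what that capture might look like, using the endpoints and commands named elsewhere in the thread (the default Consul port and the output file names are assumptions):

```bash
# Snapshot the checks and services the local Consul agent currently has registered.
curl -s localhost:8500/v1/agent/checks   > agent-checks.json
curl -s localhost:8500/v1/agent/services > agent-services.json

# Record whether any Nomad executor processes are still hanging around.
ps aux | grep -i '[e]xecutor' > executors.txt
```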
We are still running into this intermittently; here are the client logs.
We just had another one; these are the most concise logs I can provide:
I have seen similar behavior on 0.3.x and 0.4.x. I am not sure if it is a Consul or Nomad issue, but something like …
Getting the output of curl localhost:8500/v1/agent/checks would be useful here.
@ketzacoatl Did you by chance change the tags on the service?
@dadgar It might be possible, but it's highly unlikely (I rarely change them once set). Also, if I ask Consul to deregister the service, Consul should remove the entry, no?
If anyone is hitting this, could you report whether the executor process is still running (e.g. via ps aux) when the orphaned check appears?
@dadgar I did check this a while back; the process was cleaned up, so it seems like it was just on the Consul end, but I will try to repro. It's possible that I raced something, because after 5 minutes it does seem to get cleaned up.
@jshaw86 Hmm, we run a periodic reaping, every 30 seconds, of all checks/services that shouldn't be there. There was an issue with the executor crashing, fixed in master, that may have been adding an artificial delay.
@jshaw86 It honestly may just be Consul's own reconcile with the servers. It may be gone from the agent, but the remote servers still have it because of a short outage, etc.
@dadgar OK, I have some updated information. Another engineer has seen the executor stick around even when the scheduler says the job has been killed, and the health check still exists because we can still route to the job in fabio, so you'll have to take our word for it until I'm able to replicate this again. According to other engineers, we have only seen this in production lately. These are the logs from our latest experience with the issue, from last week (no ps aux or /v1/agent/checks though), sorry.
@dadgar, I wasn't able to get to a CLI in time to get a ps or curl the agent checks, but I can confirm the executor and process are still running, as the transformation I was doing in the request was still being returned; then after 5 minutes it goes away. Will try to catch it again when I see it; it's just very hard to react in time. :)
Just an update: we are going to upgrade to Consul 0.7, as there seemed to be some health check fixes in there, and monitor to see if that fixes the issue. I'm not 100% sure this will do it, as some of the Nomad jobs were still running after being told to stop.
Changed the milestone, not to indicate that we won't continue to work on this, but because we have larger plans for refactoring the Consul code that should make it more robust. This is a larger piece of work, though, so I want the milestone to reflect that.
@dadgar Does this mean y'all have root-caused this down to the Nomad implementation of the Consul integration, or is it just speculation?
The root cause hasn't been determined, but I wouldn't chalk it up to speculation either. The current implementation is complex and hard to debug, as Consul syncing occurs in various processes, and if they ever get out of sync they are prone to flap. The new design will be centralized, simpler, and much easier to debug.
@dadgar Oh, OK. So when we upgrade to 0.7 you wouldn't expect much difference in behavior? Is there anything we should look for in the logs after we upgrade, besides the ps and v1 health check call you recommended?
@jshaw86 Yeah, I would try to capture the logs, ps aux, and the agent checks output.
@dadgar Once this is delivered in 0.6.0, is the idea that it will be fixed, or that we will be able to gather meaningful logs to then address the issue? I noticed some changelog entries on the Consul side that may have helped stability as well, so were you thinking it could be a Consul issue, or a combination of Consul and Nomad?
@jshaw86 I am pretty sure it is a Nomad issue, but as mentioned, the lack of debuggability makes it difficult to be certain. It is likely that 0.5.4/0.5.5 will bring the fix, and it will be much easier to provide useful logs, since the Nomad client itself will be managing registration/deregistration rather than the executor.
Just an update: I created a generic Nomad service with a count of 1000 and an HTTP health check, on Consul 0.7.2 with Nomad 0.4.2, and I am seeing the same connection-refused orphans in syslog:
Will repeat with 0.5.3, and hopefully 0.6.0 fixes this.
@jshaw86 Can you break down the steps: one Nomad agent? Bring a single job with a health check on a dynamic port up to a count of 1000, then stop the job, and then there are orphans?
In production it's hundreds of jobs with a count < 3, three Nomad servers, many Nomad agents, federated across six regions. In this particular case it was one job with a count > 1000, a single Nomad server, and four Nomad clients. Both should reproduce the issue. Steps: …
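The steps themselves are not captured above; as a rough sketch, a job matching that description (high count, dynamic port, HTTP health check registered in Consul) might look like the following, where the driver, command, and paths are placeholders rather than anything taken from this thread:

```hcl
# Hypothetical job: many instances of a small HTTP service, each registering
# a Consul service with an HTTP health check on a dynamic port.
job "example" {
  datacenters = ["dc1"]

  group "web" {
    count = 1000

    task "server" {
      driver = "exec"                    # placeholder driver
      config {
        command = "/usr/local/bin/web"   # placeholder binary
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 1
          port "http" {}                 # dynamic port
        }
      }

      service {
        name = "example"
        port = "http"
        check {
          type     = "http"
          path     = "/health"           # placeholder path
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

Stopping such a job and then listing the agent's checks should show whether any check outlives its allocation.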
(Should have closed with my last comment; sorry for the delay)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
nomad 0.4.0
Operating system and Environment details
14.04
Issue
This may be a Consul issue that is just surfacing through the Nomad agent, but it seems that a health check can get into a state where the health check endpoint is checked indefinitely even though the check is no longer in the registered set.
Reproduction steps
Not completely sure, but we scheduled many jobs (~100) in serial, tore them down serially, and eventually got to this state.
Nomad Client logs (if appropriate)
I can share the result of
curl localhost:8500/v1/health/state/any
offline, to show there is no health check registered under this IP/port, as there is some sensitive information in there.
UPDATE: I got it out of this state by restarting Consul.
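For anyone wanting a quick view of the orphans, a small sketch that lists the checks the local Consul agent still reports as critical (which is how the connection-refused orphans show up), assuming jq and the default Consul port:

```bash
# List critical checks known to the local Consul agent, with their service name.
curl -s localhost:8500/v1/agent/checks \
  | jq -r 'to_entries[]
           | select(.value.Status == "critical")
           | "\(.key)\t\(.value.ServiceName)"'
```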