Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nodes sporadically marks status as down, causing allocation loss #3595

Closed
lovwal opened this issue Nov 28, 2017 · 15 comments
Closed

Nodes sporadically marks status as down, causing allocation loss #3595

lovwal opened this issue Nov 28, 2017 · 15 comments

Comments

@lovwal
Copy link

lovwal commented Nov 28, 2017

Nomad version

Nomad v0.7.0

Operating system and Environment details

Centos 7.4 3.10.0-693.el7.x86_64

Issue

Nodes sporadically marks status as down, causing allocations to be lost:

nomad node-status -self
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = ready
Drivers = docker,exec,java
Uptime = 191h9m11s

Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0

Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB

Host Resource Utilization
CPU Memory Disk
13419/15594 MHz 18 GiB/20 GiB 4.7 GiB/100 GiB

Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 run pending 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop pending 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...

Moments later:
nomad node-status -self

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): node down
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = down
Drivers = docker,exec,java

Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0

Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present

Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...

Worth mentioning, we saw CPU utilization reaching ~90% (this might be the actual causation) on the affected node and memory utilization steadily going up until OOM killed. We've seen this occur 2-3 times since upgrade to 0.7.0.

Reproduction steps

Unclear.

Nomad Client logs (if appropriate)

Reloaded with DEBUG level while the issue was ongoing, so unfortunately I don't think it will provide much.
nomad-logs.txt

Please let me know what further information to collect and I'll grab it next time we see this occur.

@chelseakomlo
Copy link
Contributor

Hi, thanks for reporting this issue. Have you seen this issue in the past only when clients are at high CPU utilization? Any more logs that you have for this issue would be helpful.

@lovwal
Copy link
Author

lovwal commented Nov 29, 2017

I looked through old metrics during the last occurrence, looks like this issue wasn't related to CPU utilization during that occurrence. Note however, nomad was restarted before being OOM killed that time.
Unfortunately, those were all the logs I had from the node. One of the server members reports heartbeats being missed from the affected node
Nov 28 15:56:43 thu-ctrl5 nomad[637]: 2017/11/28 15:56:43.924767 [WARN] nomad.heartbeat: node '25e16dd3-8013-9436-0d3b-732ecb1
af322' TTL expired
...
Nov 28 16:04:25 thu-ctrl5 nomad[637]: 2017/11/28 16:04:25.897615 [WARN] nomad.heartbeat: node '25e16dd3-8013-9436-0d3b-732ecb1af322' TTL expired

But that's all I got unfortunately.

@TheSpbra1n
Copy link

Caught the same error.
Some logs at this time:

Nov 29 10:49:20 docker-01 nomad[9241]:     2017/11/29 10:49:20.828705 [DEBUG] http: Request /v1/agent/health?type=client (346.809µs)
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.049446 [DEBUG] client: RPC failed to server [xxx]:4647: rpc error: EOF
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22 [ERR] yamux: keepalive failed: session shutdown
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.049486 [DEBUG] client: RPC failed to server [xxx]:4647: rpc error: EOF
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.065911 [DEBUG] client: updated allocations at index 2655224 (total 98) (pulled 1) (filtered 97)
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.066198 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 97)

@dadgar
Copy link
Contributor

dadgar commented Dec 7, 2017

Hey all,

The heartbeat loop is done in a contention freeway with its own go-routine so the fact that it missed its heartbeat may mean that the server is so overloaded that it failed to send its response for quite a while! I added some logging that will be in 0.7.1 that will help us dig further. So please update to that when you get a chance and report any other failures with those logs!

@dadgar
Copy link
Contributor

dadgar commented Mar 12, 2018

Closing due to #3890

@dadgar dadgar closed this as completed Mar 12, 2018
@capone212
Copy link
Contributor

capone212 commented May 22, 2018

Hi all, I have this issue on cluster with windows boxes.
Managed to reproduce it stable. Simplest steps are:

  1. Run one stand-alone node (server+client, bootstrap expect =1)
  2. Run some tasks
  3. Hibernate (sleep mode) windows for 30 minutes
  4. wake up windows.

At this point nomad node status constantly flapping from ready to down. Issue goes away after nomad restart.

@capone212
Copy link
Contributor

Please find nomad logs
https://gist.github.com/capone212/4a36e98aca87bfadff3b4399363ff691

Problem starts at
2018/05/22 16:06:32.751737 [WARN] nomad.heartbeat: node '5c645ff5-528a-9e2b-744d-6e6165eedef4' TTL expired

@capone212
Copy link
Contributor

@dadgar please consider reopening this issue

@capone212
Copy link
Contributor

2018/05/22 16:06:42.288382 [WARN] client: heartbeat missed (request took 35.3634ms). Heartbeat TTL was 9.500984734s and heartbeated after 9.5366451s
2018/05/22 16:06:52.616929 [WARN] client: heartbeat missed (request took 31.0827ms). Heartbeat TTL was 10.296911237s and heartbeated after 10.3285475s
2018/05/22 16:07:04.891458 [WARN] client: heartbeat missed (request took 18.0339ms). Heartbeat TTL was 12.256493582s and heartbeated after 12.274529s
2018/05/22 16:07:15.195204 [WARN] client: heartbeat missed (request took 30.3297ms). Heartbeat TTL was 10.273167754s and heartbeated after 10.3037461s
2018/05/22 16:07:29.220006 [WARN] client: heartbeat missed (request took 34.3677ms). Heartbeat TTL was 13.989477386s and heartbeated after 14.0248014s
2018/05/22 16:07:46.411138 [WARN] client: heartbeat missed (request took 17.0588ms). Heartbeat TTL was 8.415857593s and heartbeated after 8.4330008s
2018/05/22 16:08:07.818020 [WARN] client: heartbeat missed (request took 24.6247ms). Heartbeat TTL was 12.407020047s and heartbeated after 12.4326358s
2018/05/22 16:08:21.051683 [WARN] client: heartbeat missed (request took 33.9317ms). Heartbeat TTL was 13.199205098s and heartbeated after 13.2336625s
2018/05/22 16:08:43.145138 [WARN] client: heartbeat missed (request took 23.296ms). Heartbeat TTL was 13.123152495s and heartbeated after 13.1473043s
2018/05/22 16:08:59.969811 [WARN] client: heartbeat missed (request took 36.6344ms). Heartbeat TTL was 8.36411564s and heartbeated after 8.4007989s
2018/05/22 16:09:22.946321 [WARN] client: heartbeat missed (request took 31.0424ms). Heartbeat TTL was 13.500516305s and heartbeated after 13.5317883s
2018/05/22 16:09:44.635384 [WARN] client: heartbeat missed (request took 27.8115ms). Heartbeat TTL was 13.884580018s and heartbeated after 13.9123942s
2018/05/22 16:09:57.672985 [WARN] client: heartbeat missed (request took 35.7857ms). Heartbeat TTL was 13.001369943s and heartbeated after 13.0376008s
2018/05/22 16:10:16.330154 [WARN] client: heartbeat missed (request took 21.0317ms). Heartbeat TTL was 11.262666605s and heartbeated after 11.2845332s

@dadgar
Copy link
Contributor

dadgar commented May 22, 2018

@capone212 Would you mind trying on 0.8.3. There have been quite a few improvements to the heartbeating system. For more details on them you can look at the changelog.

@capone212
Copy link
Contributor

Opened pull request #4331

@capone212 capone212 mentioned this issue May 24, 2018
@dadgar
Copy link
Contributor

dadgar commented May 24, 2018

@capone212 With that patch applied you do not see those logs with your reproduction steps?

@capone212
Copy link
Contributor

capone212 commented May 24, 2018

@dadgar I can reproduce the problem on vanilla v0.8.3, but with my patch I can't reproduce it any more.
I am sure it fixes the problem.

@capone212
Copy link
Contributor

capone212 commented May 24, 2018

@dadgar I see a clear race there
Consider the following scenario:

  1. Timer expires and timer handler (invalidateHeartbeat) scheduled for execution
  2. The Timer is Reset-ed in resetHeartbeatTimerLocked after client heartbeat
  3. invalidateHeartbeat removes active timer from heartbeatTimers maps
  4. client hearbeat inserts new active timer in resetHeartbeatTimerLocked
  5. after leaked timer from step 3 expires, it calls invalidateHeartbeat which again leaks active timer from step 4, and essentially we are in circle loop from steps 3 to 5.

The net effect is client status flaps from down to ready constantly.

dadgar pushed a commit that referenced this issue May 25, 2018
Stopping heartbeat timer before remove
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants