Nodes sporadically marked as down, causing allocation loss #3595
Comments
Hi, thanks for reporting this. Have you only seen this issue in the past when clients are at high CPU utilization? Any additional logs you have would be helpful.
I looked through old metrics from the last occurrence; it doesn't look like the issue was related to CPU utilization that time. Note, however, that Nomad was restarted before being OOM-killed on that occasion. That's all I have, unfortunately.
Caught the same error.
Hey all, the heartbeat loop runs in its own goroutine, in a contention-free way, so the fact that it missed its heartbeat may mean the server was so overloaded that it failed to send its response for quite a while! I added some logging that will be in 0.7.1 and will help us dig further. Please update to that release when you get a chance and report any other failures with those logs!
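For readers unfamiliar with the mechanism being described, below is a minimal, illustrative Go sketch of a client heartbeat loop running in its own goroutine. It is not Nomad's actual code; the names (registerAndHeartbeat, updateNodeStatus) and TTL values are hypothetical stand-ins.

// Illustrative only: a minimal model of a client heartbeat loop running in its
// own goroutine, as described above. Names and TTL values are hypothetical,
// not Nomad's actual implementation.
package main

import (
	"log"
	"time"
)

// updateNodeStatus stands in for the client-to-server status RPC. The server
// would normally return the TTL within which the next heartbeat is expected.
func updateNodeStatus() (time.Duration, error) {
	return 10 * time.Second, nil
}

// registerAndHeartbeat heartbeats at half the server-provided TTL until stopped.
// If the server is overloaded and its response is delayed past the TTL, the
// server may mark the node down even though the client is healthy.
func registerAndHeartbeat(stop <-chan struct{}) {
	ttl := 10 * time.Second
	for {
		select {
		case <-stop:
			return
		case <-time.After(ttl / 2):
			newTTL, err := updateNodeStatus()
			if err != nil {
				log.Printf("heartbeat failed: %v", err)
				continue
			}
			ttl = newTTL
		}
	}
}

func main() {
	stop := make(chan struct{})
	go registerAndHeartbeat(stop) // the heartbeat loop gets its own goroutine
	time.Sleep(30 * time.Second)
	close(stop)
}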
Closing due to #3890
Hi all, I have this issue on a cluster with Windows boxes.
At this point the node status is constantly flapping between ready and down. The issue goes away after a Nomad restart.
Please find the Nomad logs attached. The problem starts at
@dadgar please consider reopening this issue
@capone212 Would you mind trying 0.8.3? There have been quite a few improvements to the heartbeating system; for more details on them you can look at the changelog.
Opened pull request #4331 |
@capone212 With that patch applied, do you no longer see those logs with your reproduction steps?
@dadgar I can reproduce the problem on vanilla v0.8.3, but with my patch I can't reproduce it anymore.
@dadgar I see a clear race there.
The net effect is that the client status constantly flaps between down and ready.
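As context for the flapping symptom, below is a simplified, illustrative Go model of the kind of race being described. It is not Nomad's server code: when the handler that marks a node ready on a heartbeat and the timer that marks it down on a missed heartbeat are not ordered against each other, the last writer wins and the observed status flaps.

// Illustrative only: a simplified model of unordered status writers, not
// Nomad's actual server code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type node struct {
	mu     sync.Mutex
	status string
}

func (n *node) set(status string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.status = status
}

func (n *node) get() string {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.status
}

func main() {
	n := &node{status: "ready"}

	// A missed-heartbeat timer firing repeatedly (e.g. acting on stale TTL state).
	go func() {
		for i := 0; i < 10; i++ {
			n.set("down")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Late heartbeats arriving and flipping the node back to ready.
	go func() {
		for i := 0; i < 10; i++ {
			n.set("ready")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// An observer (e.g. repeated node-status checks) sees the status flap.
	for i := 0; i < 10; i++ {
		fmt.Println("node status:", n.get())
		time.Sleep(15 * time.Millisecond)
	}
}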
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.7.0
Operating system and Environment details
Centos 7.4 3.10.0-693.el7.x86_64
Issue
Nodes are sporadically marked as down, causing their allocations to be lost:
nomad node-status -self
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = ready
Drivers = docker,exec,java
Uptime = 191h9m11s
Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB
Host Resource Utilization
CPU Memory Disk
13419/15594 MHz 18 GiB/20 GiB 4.7 GiB/100 GiB
Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 run pending 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop pending 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...
Moments later:
nomad node-status -self
error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): node down
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = down
Drivers = docker,exec,java
Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB
error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present
Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...
Worth mentioning: we saw CPU utilization reach ~90% on the affected node (this might be the actual cause), and memory utilization climbed steadily until the process was OOM-killed. We've seen this occur 2-3 times since upgrading to 0.7.0.
Reproduction steps
Unclear.
Nomad Client logs (if appropriate)
I only reloaded with DEBUG log level while the issue was already ongoing, so unfortunately I don't think the logs will provide much.
nomad-logs.txt
Please let me know what further information to collect and I'll grab it next time we see this occur.