Memory leak #3420
Comments
I suspect that this is due to GC issues I'm working on as we speak! Is there any chance you could bump the log level to debug on a node and paste a similar subset?
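For reference, a minimal sketch of how the log level can be raised in the agent's HCL configuration; the file path and the surrounding client block are illustrative, not taken from this issue:

# /etc/nomad.d/client.hcl -- path is illustrative
# Raise log verbosity so client GC activity shows up in the output.
log_level = "DEBUG"

client {
  enabled = true
}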
Here it is. I am now also running Nomad as the root user, which solved the permission denied errors.
Excellent, thanks for the updated logs. This in particular is definitely a bug I'll look into:
Regarding the permissions issues: had you initially run Nomad as root and then restarted it as a non-root user? I'm not sure how else you would get permission issues. Generally we recommend running Nomad as root. Are you still getting OOM killed after running as root again? If so, do you have any logs from when that happens?
I initially ran it as the nomad user. The problem was that Nomad wasn't able to GC downloaded artifacts due to permission issues. I solved this by switching to the root user. Note that I am still constantly getting the following warnings:
If you have over 50 running allocations on that node, that warning can be ignored. One way to check, if you have curl and jq installed:

$ curl -s localhost:4646/v1/node/e16562df/allocations | jq '. | length'
26

To get rid of the warning you can bump the max_allocs setting to squelch it and GC less aggressively. However, if you do not have 50 running allocations on this node, you were probably bitten by the bug fixed in #3445. It will be released in 0.7.1, but I've attached a build if you'd like to test.
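For reference, a minimal sketch of what raising that threshold could look like in the client agent configuration; the option name gc_max_allocs and the value are assumptions inferred from the default of 50 mentioned above, not quoted from this thread:

# Client agent configuration (sketch; option name and value are assumptions)
client {
  enabled = true

  # Allow more allocations per node before the client GCs aggressively
  # and logs the warning above.
  gc_max_allocs = 100
}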
I've also submitted a follow-up PR to lower that log level in most situations: #3490
I switched to the build that you provided, but the client just won't start:
Fixes the panic mentioned in #3420 (comment). While a leader task dying serially stops all follower tasks, the synchronizing of state is asynchronous. Nomad can shut down before all follower tasks have updated their state to dead, thus saving the state necessary to hit this panic: *have a non-terminal alloc with a dead leader.* The actual fix is a simple nil check so we don't assume a non-terminal alloc's leader has a TaskRunner.
It appears to be a bug unrelated to your previous issues: Nomad panics when restoring an alloc whose leader task failed before the previous shutdown. Am I correct in assuming you have an allocation with a failed leader task? If so, please test the binary attached to #3502 if you're able.
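For context, a leader task is one marked with leader = true in the job file; when it dies, Nomad stops the group's remaining tasks. A minimal hypothetical group illustrating the shape, not the reporter's actual job:

group "app" {
  # When this leader task dies, Nomad stops the other tasks in the group.
  task "main" {
    driver = "docker"
    leader = true

    config {
      image = "example/app:latest" # placeholder image
    }
  }

  task "sidecar" {
    driver = "docker"

    config {
      image = "example/sidecar:latest" # placeholder image
    }
  }
}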
I tested the following build, linux_amd64.zip, and I can confirm that our Nomad clients are now stable. I wasn't able to test #3502, but I believe it will solve the issues with failed leader tasks.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.6.3
Operating system and Environment details
Ubuntu 16.04.03 LTS (GNU/Linux 4.4.0-1038-aws x86_64)
Issue
We are running 3 nomad servers and 5 nomad clients.
On a daily basis we experience issues with Nomad clients crashing or consuming all available host memory. The issues started at Oct 19 07:45:28. I suspect this is related to permissions and GC when running Nomad as a non-root user (failed to remove alloc dir - permission denied).

The artifact was downloaded and extracted to local/data/app/cache. This folder is then mounted to /data/app/cache. This probably causes permission issues since the uid and gid are wrong:

/var/lib/nomad/alloc/6a536fb2-e1b0-606c-5146-ff3ccd5023ca/php-fpm/local/data/app/cache/articles/twig/b8/b86ff870da894b5cdcf66b38c2e920b8e124d7fe7471327e2062929dcc2a6d16.php
-rw-r--r-- 1 82 82
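A rough sketch of the kind of task stanza being described; only the php-fpm task name and the two paths come from the listing above, while the artifact source and Docker image are placeholders:

task "php-fpm" {
  driver = "docker"

  # The artifact is fetched and extracted into the task's local/ directory.
  artifact {
    source      = "https://example.com/app-cache.tar.gz" # placeholder URL
    destination = "local/data/app/cache"
  }

  config {
    image = "php:7.1-fpm-alpine" # placeholder image

    # Bind-mount the extracted directory into the container. Files the
    # container writes come back owned by its uid/gid (82 is www-data on
    # Alpine), which a non-root Nomad client then cannot garbage collect.
    volumes = [
      "local/data/app/cache:/data/app/cache",
    ]
  }
}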
Nomad Client logs