nomad > 0.9.x causing locking contention on servers with > 100 containers and journald #6759
Comments
Hi @42wim! Thanks for submitting this issue! I've seen this exact sort of behavior with [...]
Seems like PR #6820 will fix this issue (haven't tested it yet).
In case it's useful for anyone else, we avoid this problem in our cluster with three pieces. Nomad's Docker log collection is turned off. All of our Docker containers log via journald, in non-blocking mode, since the data is not critical and 99% of people wouldn't want a logging failure to cause a service interruption! Finally, a simple system job continuously streams filtered data off journald and stores it in the exact same files and locations Nomad expects. This works quite well, makes both [...]
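The streaming job described in this comment is not shown in the thread. As a rough illustration only, the routing step could look like the sketch below. It assumes entries arrive as lines from `journalctl -f -o json`, that the Docker journald log driver sets `CONTAINER_NAME`, and that stdout is logged at priority 6 and stderr at priority 3. The names `route_entry` and `log_file_name` are hypothetical and are not part of the commenter's setup.

```python
import json

# Assumption: the Docker journald log driver writes stdout at priority 6 (info)
# and stderr at priority 3 (err).
STREAM_BY_PRIORITY = {"6": "stdout", "3": "stderr"}


def route_entry(line):
    """Map one `journalctl -o json` line to (container, stream, message).

    Returns None for entries that did not come from a container, or whose
    MESSAGE is not plain text (journalctl emits non-UTF-8 data as an array).
    """
    entry = json.loads(line)
    container = entry.get("CONTAINER_NAME")
    message = entry.get("MESSAGE")
    if container is None or not isinstance(message, str):
        return None
    # Unknown priorities fall back to stdout.
    stream = STREAM_BY_PRIORITY.get(str(entry.get("PRIORITY")), "stdout")
    return container, stream, message


def log_file_name(task, stream, index=0):
    # Hypothetical helper modeled on Nomad's <task>.<stream>.<index> log files.
    return f"{task}.{stream}.{index}"
```

A real job would also need to map container names back to allocations and handle file rotation, which this sketch leaves out.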
I forgot about this. The "disable log collection" option in #6820 fixed this issue for me, so I'm going to close it.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.9.6
Operating system and Environment details
Linux 3.10.0-957.27.2.el7.x86_64 / centos 7.7
Issue
We noticed some very high load on some of our cluster nodes.
Looking at top on such a system shows that it is mostly idle, with almost no wait states, yet the load average keeps increasing.
Further debugging pointed to the docker threads as the culprit: they are waiting in call_rwsem_down_read_failed and call_rwsem_down_write_failed.
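One way to check where a process's threads are waiting is to read each thread's wait channel from /proc. The sketch below is an illustration, not the commands used in this report. The function names are made up here, and on some kernels the wchan file only shows a symbol name to privileged users.

```python
import collections
import glob


def read_wchans(pid):
    """Return the kernel wait channel of every thread of `pid` (Linux only)."""
    names = []
    for path in glob.glob(f"/proc/{pid}/task/*/wchan"):
        try:
            with open(path) as fh:
                names.append(fh.read().strip())
        except OSError:
            continue  # thread exited while we were scanning
    return names


def tally_wchans(names):
    """Count threads per wait channel, most common first."""
    return collections.Counter(names).most_common()
```

Calling `tally_wchans(read_wchans(pid))` with the PID of the Docker daemon would show how many threads sit in each rwsem wait function.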
These are apparently minor page faults, which can also be seen with ps.
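The same counters that ps reports are also available in /proc/<pid>/stat, where minflt is field 10 and majflt is field 12. The following is a minimal sketch of reading them, with a hypothetical function name. Sampling it twice and taking the difference gives a fault rate.

```python
def parse_fault_counts(stat_line):
    """Extract (minflt, majflt) from one line of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ')'.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


def read_fault_counts(pid):
    """Read the current (minflt, majflt) counters for `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return parse_fault_counts(fh.read())
```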
So docker, together with systemd, is causing lots of page faults. After some rabbit holes, this seems to be caused by the mmap() calls in systemd-journald.
A first attempt was moving our journald logs from disk (persistent) to memory (volatile), but this didn't fix it. We also tried newer kernels up to 5.3.11, but fixing this issue in the kernel itself still seems to be a work in progress. For those who want to dive into a deeper rabbit hole: https://lwn.net/Articles/786105/
Since there doesn't seem to be a fix for the symptoms, we tried to fix the cause instead.
The process causing this is docker_logger from Nomad, which constantly reads from every container. We now have a quick fix that disables part of this plugin: adding a return on line 81 in docker_logger.go:
nomad/drivers/docker/docklog/docker_logger.go
Lines 79 to 82 in 6781d28
It seems we can't easily remove the whole logger / docker_logger plugin because it's hardcoded.
With the quick fix in place, we get normal load averages again.
Reproduction steps
Suggested solution
Add an option to make docker_logger optional.
When you're using journalbeat to send the journald logs to an ELK stack, there's no need for Nomad to collect the logging output as well (as in 0.8.x).