Too many open files #3686
Comments
Thanks for the thorough bug report and logs @janko-m! Does restarting the Nomad client node process making all of the connections fix the problem? (By default restarting the agent does not affect running allocations/tasks.) There are a few other things that would help us debug this:

lsof on client node: Just to be absolutely sure it's the Nomad client node process making too many connections, could you post the output of lsof for that process?

goroutine dump: Is it possible to enable the debug endpoints on client nodes? If you're able to enable that and the problem occurs again, please attach the output of http://localhost:4646/debug/pprof/goroutine?debug=2 (where localhost is the affected client node).

DEBUG log level: Lowest priority. I can't think of any debug log lines that would be particularly useful, so this is the lowest priority for me. However, debug-level logs would still be welcome if you're able to enable them.

Thanks again and sorry for the particularly nasty issue you've hit!
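For anyone gathering the data requested above, a minimal sketch of the commands involved, assuming the client agent's HTTP API listens on the default port 4646, the process is named nomad, and the debug/pprof endpoints have been enabled in the agent configuration:

```sh
# Confirm it is the Nomad client process holding the descriptors
# (process name "nomad" is an assumption):
sudo lsof -p "$(pgrep -x nomad)" | wc -l

# Capture a goroutine dump once the debug endpoints are enabled:
curl -s "http://localhost:4646/debug/pprof/goroutine?debug=2" -o goroutines.txt
```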
No details just yet, but we are having a large production outage right now and this is one of the errors we are getting.
If you do not include a raised open-file limit: every time I set up a new cluster and forget about this setting, I eventually get random client drops, crashy clusters, and all kinds of craziness.
We set ulimit to max during provisioning.
Well, I take that back. Seems like that got removed from the playbook.
@memelet putting it back will 99.9% fix your cluster instability :)
@schmichael Unfortunately I can't get any more information about the nodes in that state, as we had to cycle them out of the cluster. I think what caused this was that one of our jobs was frequently failing and restarting due to an invalid state. I think this caused Nomad to accumulate temporary files/directories and somehow retain all those connections from the Nomad server nodes. Since we stopped that frequently restarting job we haven't had this issue on our main cluster. On our staging cluster a similar thing happened: I noticed the Nomad client node accumulating a lot of temporary files/directories, and there we also identified a job that was restarting frequently.
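A rough way to check for the kind of leak described above (a sketch; the process name nomad is an assumption):

```sh
# Count descriptors the Nomad client still holds on files that have already
# been deleted -- a common sign of leaked temporary files from restarting tasks:
sudo lsof -p "$(pgrep -x nomad)" | grep -c '(deleted)'
```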
@janko-m what was/is your open file limit (ulimit)?
@jippi Unlimited 🙈
@jippi So far it looks very good. We still get lots of …
@memelet can you gist your nomad server config? :)
base.hcl:
agent.hcl:
@memelet okay, that config seems fine to me. Do you have the open file limit raised for the Nomad process?
@jippi Yes, in the startup script we have …
If this is such a guaranteed issue, could it be included in the docs somewhere? Maybe on this page: https://www.nomadproject.io/guides/cluster/requirements.html?
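For reference, one way to persist a higher limit for a systemd-managed agent; this is only a sketch, and the unit name nomad.service and the value 65536 are assumptions rather than official guidance:

```sh
# Create a drop-in override for the Nomad unit:
sudo systemctl edit nomad.service
# ...and add the following in the override file:
#   [Service]
#   LimitNOFILE=65536
# Reload and restart so the new limit takes effect:
sudo systemctl daemon-reload
sudo systemctl restart nomad.service
```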
@schmichael I don't know about the OP, but for the second time I've had a server instance go into a logging loop of

2018/11/27 13:35:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files

and proceed to fill up the logging directory VERY rapidly; it's clearly in a very tight loop, logging thousands of times a second. It is essentially out of control. Upping the open file limit seems to resolve (delay?) the issue. The excessive logging is a bug that needs to be fixed; it is unacceptable in its current state.
I've seen the "too many open files" issue on servers running in production. In regards to this comment:
What is a reasonable limit? Should this depend on host/node side? |
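One way to judge this on a given host is to compare the agent's current descriptor usage against its limit; a sketch for Linux, assuming the process is named nomad:

```sh
# Show the per-process open file limit and current usage for the Nomad agent:
NOMAD_PID="$(pgrep -x nomad)"
grep 'Max open files' "/proc/${NOMAD_PID}/limits"
sudo ls "/proc/${NOMAD_PID}/fd" | wc -l
```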
Nomad version
Operating system and Environment details
We have 5 Nomad server nodes and 113 Nomad client nodes on AWS EC2.
Issue
One of our Nomad server nodes ran out of file descriptors, and now the cluster is struggling to elect a leader. This is the 3rd time this has happened: previously it was happening on version 0.5.6, and it's still happening on 0.7.0 after we upgraded.
We can see from the lsof.log below that the vast majority (about 75%) of open file descriptors are towards our nomad-client-admin-4 node, which doesn't run more allocations than other nomad-client-admin-* nodes. I included the log for nomad-client-admin-4 as well, where the only thing I can see is that there is a nomad_exporter job which is being restarted frequently; I don't know if that might be the cause.
Reproduction steps
N/A
Nomad Server logs (if appropriate)
There are a lot of "too many file descriptors" log lines now, so I tried to extract something relevant:
Earliest errors we have
Today's errors: nomad.log
sudo lsof output: lsof.log
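The per-destination breakdown above can be produced from such a capture roughly like this (a sketch, assuming default lsof TCP output where the NAME column is local->remote):

```sh
# Group the server's TCP descriptors by remote address and count them:
sudo lsof -nP -i TCP -a -p "$(pgrep -x nomad)" \
  | awk '{print $9}' | awk -F'->' 'NF > 1 {print $2}' \
  | cut -d: -f1 | sort | uniq -c | sort -rn | head
```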
Nomad Client logs (if appropriate)
nomad-client-admin-4 log: nomad-admin-client-4.log
Job file (if appropriate)
prometheus_exporters.hcl