Too many open files #3686
Comments
Thanks for the thorough bug report and logs @janko-m! Does restarting the Nomad client node process making all of the connections fix the problem? (By default restarting the agent does not affect running allocations/tasks.) There are a few other things that would help us debug this:

lsof on client node: Just to be absolutely sure it's the Nomad client node process making too many connections, could you post the output of lsof for that process?

goroutine dump: Is it possible to enable the debug endpoints on client nodes? If you're able to enable that and the problem occurs again, please attach the output of http://localhost:4646/debug/pprof/goroutine?debug=2 (where localhost is the affected client node).

DEBUG log level: Lowest priority. I can't think of any debug log lines that would be particularly useful, so this is the lowest priority for me. However, debug-level logs would still be welcome if you're able to enable them.

Thanks again and sorry for the particularly nasty issue you've hit!
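For anyone gathering the data requested above, a minimal sketch of the commands involved, assuming the client agent's HTTP API listens on the default port 4646, the process is named nomad, and the debug/pprof endpoints have been enabled in the agent configuration:

```sh
# Confirm it is the Nomad client process holding the descriptors
# (process name "nomad" is an assumption):
sudo lsof -p "$(pgrep -x nomad)" | wc -l

# Capture a goroutine dump once the debug endpoints are enabled:
curl -s "http://localhost:4646/debug/pprof/goroutine?debug=2" -o goroutines.txt
```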
No details just yet, but we are having a large production outage right now and this is one of the errors we are getting.
If you do not include a raised open-file limit: every time I set up a new cluster and forget about this setting, I eventually get random client drops, crashy clusters, and all kinds of craziness.
We set ulimit to max during provisioning.
Well, I take that back. Seems like that got removed from the playbook.
@memelet putting it back will 99.9% fix your cluster instability :)
@schmichael Unfortunately I can't get any more information about the nodes in that state, as we had to cycle them out of the cluster. I think what caused this was that one of our jobs was frequently failing and restarting due to an invalid state. I think this caused Nomad to accumulate temporary files/directories and somehow retain all those connections from the Nomad server nodes. Since we stopped that frequently restarting job we haven't had this issue on our main cluster. On our staging cluster a similar thing happened: I noticed the Nomad client node accumulating a lot of temporary files/directories, and there we also identified a job that was restarting frequently.
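A rough way to check for the kind of leak described above (a sketch; the process name nomad is an assumption):

```sh
# Count descriptors the Nomad client still holds on files that have already
# been deleted -- a common sign of leaked temporary files from restarting tasks:
sudo lsof -p "$(pgrep -x nomad)" | grep -c '(deleted)'
```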
@janko-m what was/is your open file limit (ulimit)?
@jippi Unlimited 🙈
@jippi So far it looks very good. We still get lots of …
@memelet can you gist your nomad server config? :)
base.hcl:
agent.hcl:
@memelet okay, that config seems fine to me. Do you have the open file limit raised for the Nomad process?
@jippi Yes, in the startup script we have …
If this is such a guaranteed issue, could it be included in the docs somewhere? Maybe on this page: https://www.nomadproject.io/guides/cluster/requirements.html?
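For reference, one way to persist a higher limit for a systemd-managed agent; this is only a sketch, and the unit name nomad.service and the value 65536 are assumptions rather than official guidance:

```sh
# Create a drop-in override for the Nomad unit:
sudo systemctl edit nomad.service
# ...and add the following in the override file:
#   [Service]
#   LimitNOFILE=65536
# Reload and restart so the new limit takes effect:
sudo systemctl daemon-reload
sudo systemctl restart nomad.service
```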
@schmichael I don't know about the OP, but for the second time I've had a server instance go into a logging loop of

2018/11/27 13:35:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files

and proceed to fill up the logging directory VERY rapidly; it's clearly in a very tight loop, logging thousands of times a second. It is essentially out of control. Upping the open file limit seems to resolve (delay?) the issue. The excessive logging is a bug that needs to be fixed; it is unacceptable in its current state.
I've seen the "too many open files" issue on servers running in production. In regards to this comment:
What is a reasonable limit? Should this depend on host/node side? |
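One way to judge this on a given host is to compare the agent's current descriptor usage against its limit; a sketch for Linux, assuming the process is named nomad:

```sh
# Show the per-process open file limit and current usage for the Nomad agent:
NOMAD_PID="$(pgrep -x nomad)"
grep 'Max open files' "/proc/${NOMAD_PID}/limits"
sudo ls "/proc/${NOMAD_PID}/fd" | wc -l
```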
Nomad version
Operating system and Environment details
We have 5 Nomad server nodes and 113 Nomad client nodes on AWS EC2.
Issue
One of our Nomad server nodes ran out of file descriptors, and now the cluster is struggling to elect a leader. This is the 3rd time this has happened: previously it was happening on version 0.5.6, and it's still happening on 0.7.0 after we upgraded.
We can see from the lsof.log below that the vast majority (about 75%) of open file descriptors are towards our nomad-client-admin-4 node, which doesn't run more allocations than other nomad-client-admin-* nodes. I included the log for nomad-client-admin-4 as well, where the only thing I can see is that there is a nomad_exporter job which is being restarted frequently; I don't know if that might be the cause.
Reproduction steps
N/A
Nomad Server logs (if appropriate)
There are a lot of "too many file descriptors" log lines now, so I tried to extract something relevant:
Earliest errors we have
Today's errors: nomad.log
sudo lsof output: lsof.log
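The per-destination breakdown above can be produced from such a capture roughly like this (a sketch, assuming default lsof TCP output where the NAME column is local->remote):

```sh
# Group the server's TCP descriptors by remote address and count them:
sudo lsof -nP -i TCP -a -p "$(pgrep -x nomad)" \
  | awk '{print $9}' | awk -F'->' 'NF > 1 {print $2}' \
  | cut -d: -f1 | sort | uniq -c | sort -rn | head
```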
Nomad Client logs (if appropriate)
nomad-client-admin-4 log: nomad-admin-client-4.log
Job file (if appropriate)
prometheus_exporters.hcl