
Investigate too many open files error #1316

Closed
bowenwang1996 opened this issue Sep 16, 2019 · 36 comments

@bowenwang1996
Collaborator

Sometimes after running for a while, a node stops working because of a "too many open files" error. We need to investigate the cause and see whether we need to increase the limit on the number of file descriptors allowed.
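For sizing the problem on a running node, a minimal sketch like the following shows the per-process limit and current usage; the pgrep pattern is an assumption about how the binary is launched:

PID=$(pgrep -f 'near run' | head -n 1)   # assumes the node was started with "near run"
grep 'open files' /proc/$PID/limits      # per-process soft/hard limit
ls /proc/$PID/fd | wc -l                 # descriptors currently open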

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@bowenwang1996
Collaborator Author

We should figure out why the error occurs. It could be that there is something wrong with our code that causes unnecessary opening of file descriptors. If that's the case, simply increasing the limit might not help.

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@bowenwang1996
Collaborator Author

Default is 1024 on gcloud.
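For reference, a sketch of checking and raising that limit on such a VM; the 65535 value and the systemd unit are illustrative assumptions, not our current deployment:

ulimit -n            # soft limit for the current shell (1024 by default)
ulimit -n 65535      # raise it for this shell before starting the node
# If the node runs under a (hypothetical) systemd unit, set it there instead:
#   [Service]
#   LimitNOFILE=65535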

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@ailisp
Member

ailisp commented Sep 16, 2019

Some findings:

  1. The default handle limit (or it might already have been increased) is 8192.
  2. The current number on node1 is 1240. I'll keep monitoring this number to see whether it grows continuously (and therefore needs a fix).
  3. From a quick look at these file handles, 99% are sockets. Not sure whether this is how our node and the underlying tokio executor are intended to work, or whether there definitely shouldn't be 1000+ sockets open at the same time? @bowenwang1996
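A quick way to get the type breakdown behind point 3, assuming lsof is available and <pid> stands for the node's process id:

lsof -p <pid> | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn   # count descriptors by lsof TYPE (IPv4/IPv6 sockets, REG files, ...)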

@MaksymZavershynskyi
Contributor

That's definitely not intended; we are probably opening zombie sockets. Note that we are using Actix, which in turn uses Tokio. We might have some zombie actors that are holding sockets and doing nothing.

@bowenwang1996
Collaborator Author

It's possible that some file descriptors are not properly released. Maybe it is a problem with actix.

@ailisp
Member

ailisp commented Sep 18, 2019

Got it. Some new findings today:

  1. The file descriptor count has gone down to 11xx as of today, so it is not growing without bound.
  2. With @bowenwang1996's help, I found that more than 90% of the descriptors are sockets to peer nodes.
  3. This number does not seem to come from peer gossip messages: in a fresh 10-validator network each validator shows 9/9/40 but has only 65 descriptors. While the load tester is running this number increases slightly to 100+, then drops back to 70-80 after the load tester finishes.

@bowenwang1996
Collaborator Author

bowenwang1996 commented Sep 18, 2019

Actually it is still undesirable because we intend to have 100 validators in total and if we need ~70 file descriptors per validator the number is still too large. Besides, non-validators will also connect to validators, thereby creating even more sockets.

@MaksymZavershynskyi
Contributor

In general, 1 or 2 sockets per peer should be sufficient.

@ailisp
Member

ailisp commented Oct 4, 2019

@nearmax @bowenwang1996 Finally found the reason: it is telemetry.
On staging-node1, running ss | awk '{print $6}' | sort | uniq -c gives output in which the vast majority of connections are to these two IPs:

   1470 34.83.40.137:https
   1563 34.83.64.96:https

These are the two IPs of the near-explorer backend.

The node-to-node connections (via :24567) are actually fine: in the staging testnet there are 3 nodes, and every node has exactly one socket connected to each of the other two.

(A note I found while reading this output: a.b.c.d.bc.googleusercontent.com always has IP d.c.b.a; with this it is quite easy to identify each node and who connects to whom in the output above.)

(Also, when I tested the load-test network at several hundred TPS, there was never a too-many-open-files issue; even during load testing the number of open file handles did not increase compared to before the load test started, because no explorer backend connects to that network's telemetry.)
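The reverse-DNS note above, as a tiny shell sketch; the hostname is one of the googleusercontent names seen later in this thread, and the command simply reverses the leading octets:

echo 96.64.83.34.bc.googleusercontent.com | awk -F. '{printf "%s.%s.%s.%s\n", $4, $3, $2, $1}'
# prints 34.83.64.96, i.e. one of the two explorer backend IPs above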

@bowenwang1996
Collaborator Author

Interesting. @frol can you take a look?

@frol
Collaborator

frol commented Oct 4, 2019

Is this the node serving https://rpc.nearprotocol.com/? NEAR Explorer needs to sync all the blocks when I do a reset (I have done it several times in August and at most twice in September; the last time was on Wednesday [Oct 2]).

Syncing all the blocks over the current RPC generates a huge number of requests (it does 250 concurrent requests over HTTPS).

If the number of open descriptors is due to this load, we should increase the limit for this public node anyway (it is just general web-service tuning). Other steps that will help to eliminate the issue: do not reset Explorer often (it is already less frequent than it used to be in August), and implement a better API on the nearcore side (gRPC with streaming is a great option).

@frol
Collaborator

frol commented Oct 4, 2019

I have just learned that Node.js does not reuse (keep-alive) connections by default with node-fetch, so this also introduces quite an overhead. I have opened a PR with a fix: near/near-api-js#83

@ailisp
Member

ailisp commented Oct 4, 2019

@frol thanks for the fix! Besides this, I suspect that on the Rust side the node info is POSTed to the Node.js side in a way that also opens a new connection each time; I'll take a look.

@bowenwang1996
Collaborator Author

Actually, I observed this issue on nodes other than the default RPC one. It also happens on staging, and I don't think there is an explorer for staging. So maybe there's something else going on?

@ailisp
Member

ailisp commented Oct 4, 2019 via email

@frol
Collaborator

frol commented Oct 4, 2019

@bowenwang1996 I confirm that there is no Explorer pointing to staging (though I did test Explorer [from my laptop] against the staging RPC in mid-September), and telemetry should not be sent from the staging net (as far as I recall, there is a default telemetry config only for the testnet).

@ailisp
Member

ailisp commented Oct 4, 2019 via email

@ailisp
Member

ailisp commented Oct 4, 2019

@bowenwang1996 @nearmax I'm able to reproduce this locally and can locate exactly which line causes it, but I don't have an idea how to fix it yet; it looks like an actix_web awc client bug.

Steps to reproduce:

  1. near init, then modify the telemetry section of config.json:
  "telemetry": {
    "endpoints": [
      "https://explorer.nearprotocol.com/api/nodes"
    ]
  },
  2. near run, then find its pid with ps aux | grep near

  3. watch -n 10 'lsof -i -a -p <pid>'

  4. Wait 3-5 minutes; you will see connections to 96.64.83.34.bc.googleusercontent.com:https and 137.40.83.34.bc.googleusercontent.com:https gradually increase. Note that this does not happen every 10s but randomly every 80-100s; telemetry is sent every 10s, so the actix awc client closes some of the connections correctly but not all. (A one-liner for counting just these connections is sketched after the steps.)

  5. Comment out the call to telemetry.do_send in chain/telemetry/src/lib.rs

  6. cargo build -p near

  7. Redo steps 3 and 4

  8. Wait 10 minutes; you won't see the socket count increase.
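A one-liner for the counting in step 4, with <pid> the same placeholder as above; the grep pattern just matches the googleusercontent reverse-DNS names of the telemetry backend:

watch -n 10 "lsof -i -a -p <pid> | grep -c googleusercontent"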

@ilblackdragon
Member

@ailisp is this still an active issue?

@ailisp
Member

ailisp commented Nov 5, 2019

@ilblackdragon The fix has also been merged into master. Closed.

ailisp closed this as completed Nov 5, 2019
@bowenwang1996
Collaborator Author

It happened again on the stakewars node.

bowenwang1996 reopened this Nov 13, 2019
@ailisp
Member

ailisp commented Nov 18, 2019

@bowenwang1996 It's a different cause. This time it is not unclosed connections from the explorer but from RPC. On stakewars-1:

sudo lsof -p 28497 | wc -l # all open files
sudo lsof -p 28497 | grep TCP | wc -l # all tcp sockets
sudo lsof -p 28497 | grep TCP | sort -k 7

From the first two commands, the problem is still too many open TCP connections (>90%). From the last command we can see that the connections into tcp:3030 are all from different IP addresses, with each IP having only 1-3 connections. (Previously they were all from the IP address of the explorer on Render.)

Most likely this isn't an issue, because when I trace each connection it closes very soon after opening. The overall number of connections to tcp:3030 also stays at a constant level when the server load is constant (unlike the earlier case, where the explorer connected to pull data and the connections were never cleaned up).

Given that only stakewars-1 is behind the load balancer, and currently only stakewars-1 has many open connections (~600 vs ~100), I suggest load balancing across all of our stakewars nodes.
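One way to check the "closes very soon after opening" observation: snapshot the TCP sockets twice and count how many survive, reusing pid 28497 from the commands above:

sudo lsof -p 28497 | grep TCP | awk '{print $9}' | sort > /tmp/tcp_before
sleep 30
sudo lsof -p 28497 | grep TCP | awk '{print $9}' | sort > /tmp/tcp_after
comm -12 /tmp/tcp_before /tmp/tcp_after | wc -l    # connections still open after 30s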

@bowenwang1996
Collaborator Author

The problem I see with a load balancer is that if one node crashes or falls out of sync, the load balancer will give confusing information.

@ailisp
Member

ailisp commented Nov 18, 2019

The problem I see with a load balancer is that if one node crashes or falls out of sync, the load balancer will give confusing information.

IMO, to avoid this we need a health check endpoint that only returns 200 when the node is fully healthy (not crashed and fully synced). The load balancer is simple and stupid: it just attaches the nodes that pass the health check. With a load balancer the setup is more robust; a node1 crash doesn't break https://rpc.*.nearprotocol. As for reaching a specific node for debugging, you can still bypass the load balancer by using the node's IP.
Also, we can actually set up two load balancers: one pointing to stakewars-1 and one pointing to all nodes.
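A sketch of such a health check for the load balancer to call; the /status path and the sync_info.syncing field are assumptions about the node's status endpoint and may need adjusting:

curl -sf http://localhost:3030/status | jq -e '.sync_info.syncing == false' > /dev/null
# exit code 0 only when the node answers and reports it is not syncing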

@bowenwang1996
Collaborator Author

@ailisp I took a look at the open connections. It seems that most of them are connections to port 3030, which indicates that they might be from explorer or something that pulls data from rpc constantly.

@ailisp
Member

ailisp commented Nov 26, 2019

True. So now it's not an actix client issue but an actix server issue. Is it possible that there really are that many clients connecting to the server? When I compared lsof output, a few seconds later many connections had gone away and many new connections had been established. (In any single second there aren't that many, but consider what happens if every client has a keep-alive of 2 minutes.)

@bowenwang1996
Collaborator Author

bowenwang1996 commented Nov 26, 2019

Given the level of participation in stakewars, I don't think there are 600 connections even over 2 minutes.

@ailisp
Member

ailisp commented Nov 26, 2019

Correction: the default keep-alive in actix-server is 5s.
I observe that every ~5 seconds there is a cleanup of connections which removes most or all tcp:3030 connections. But it is still possible that some dead connections accumulate as the server keeps running. I'll keep monitoring.
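A simple way to watch that cleanup cadence, assuming the RPC port is 3030 as above (ss shows established sockets by default; tail drops the header line):

watch -n 5 "ss -tn '( sport = :3030 )' | tail -n +2 | wc -l"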

@mfornet
Member

mfornet commented Nov 26, 2019

Connections to port 3030 every ~5 seconds are very likely made by Prometheus. Maybe there is an issue with Prometheus metrics being served at *:3030/metrics?
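A quick way to test that hypothesis by hand: do a single scrape the way Prometheus would and then look at what is left on the port (URL taken from the comment above):

curl -s -o /dev/null http://localhost:3030/metrics
ss -tn '( sport = :3030 )'   # does the scrape's connection linger?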

@frol
Collaborator

frol commented Nov 27, 2019

I have checked Explorer, and it keeps 240 connections open in total for all three networks (testnet, staging, tatooine); over a period of a few minutes it did not open any new connections. It may benefit from using keep-alive so that the pool of connections gets shut down once it is no longer syncing, but it should not cause the load you describe.

ilblackdragon added this to the MainNet milestone Dec 8, 2019
@ailisp
Member

ailisp commented Dec 18, 2019

I have been unable to observe this recently in stakewars & the main testnet. For stakewars, the number of TCP connections on 3030 fluctuates between 30 and 100. IMO, if they are garbage collected in time, it's fine. This conforms to what @frol mentions, ~80 for each net. The part coming from Prometheus should also be fine: the number of Prometheus connections and their traffic will be small compared to normal users.

But today, when I ran target/debug/near run locally for the first time, it hit this error! I tried rebooting the laptop, near init, near run, and still got the error until I increased ulimit to more than 1024. This means that even with no traffic it can run into too many open files?

@bowenwang1996
Collaborator Author

Maybe it's caused by the database? If file descriptors are not cleaned up properly, there might be too many open files.
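If that were the case, the open descriptors would be database files rather than sockets; a rough check (nearcore's storage is RocksDB, whose data files end in .sst; <pid> is a placeholder):

lsof -p <pid> | grep -c '\.sst'   # RocksDB SST files currently held open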

@ailisp
Member

ailisp commented Jan 14, 2020

Unable to see this in the main testnet as of Jan 13, 2020.

ailisp closed this as completed Jan 14, 2020