
Investigate too many open files error #1316

Closed
bowenwang1996 opened this issue Sep 16, 2019 · 36 comments

@bowenwang1996
Collaborator

Sometimes after running for a while, a node stops working because of a "too many open files" error. We need to investigate the cause and see whether we need to increase the limit on the number of file descriptors allowed.
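For sizing the problem on a running node, a minimal sketch like the following shows the per-process limit and current usage; the pgrep pattern is an assumption about how the binary is launched:

PID=$(pgrep -f 'near run' | head -n 1)   # assumes the node was started with "near run"
grep 'open files' /proc/$PID/limits      # per-process soft/hard limit
ls /proc/$PID/fd | wc -l                 # descriptors currently open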

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@bowenwang1996
Collaborator Author

We should figure out why the error occurs. It could be that there is something wrong with our code that causes unnecessary opening of file descriptors. If that's the case, simply increasing the limit might not help.

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@bowenwang1996
Collaborator Author

Default is 1024 on gcloud.
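For reference, a sketch of checking and raising that limit on such a VM; the 65535 value and the systemd unit are illustrative assumptions, not our current deployment:

ulimit -n            # soft limit for the current shell (1024 by default)
ulimit -n 65535      # raise it for this shell before starting the node
# If the node runs under a (hypothetical) systemd unit, set it there instead:
#   [Service]
#   LimitNOFILE=65535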

@ailisp
Member

ailisp commented Sep 16, 2019 via email

@ailisp
Member

ailisp commented Sep 16, 2019

Some findings:

  1. The default handle limit (or it might already have been increased) is 8192.
  2. The current number on node1 is 1240. I'll keep monitoring this number to see whether it grows continuously (and therefore needs a fix).
  3. From a quick look at these file handles, 99% are sockets. Not sure whether this is how our node and the underlying tokio executor are intended to work, or whether there definitely shouldn't be 1000+ sockets open at the same time? @bowenwang1996
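A quick way to get the type breakdown behind point 3, assuming lsof is available and <pid> stands for the node's process id:

lsof -p <pid> | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn   # count descriptors by lsof TYPE (IPv4/IPv6 sockets, REG files, ...)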

@MaksymZavershynskyi
Contributor

That's definitely not intended; we are probably opening zombie sockets. Note that we are using Actix, which in turn uses Tokio. We might have some zombie actors that are holding sockets and doing nothing.

@bowenwang1996
Collaborator Author

It's possible that some file descriptors are not properly released. Maybe it is a problem with actix.

@ailisp
Member

ailisp commented Sep 18, 2019

Got it. Some new findings today:

  1. The file descriptor count has gone down to 11xx as of today, so it is not growing without bound.
  2. With @bowenwang1996's help, I found that more than 90% of the descriptors are sockets to peer nodes.
  3. This number does not seem to come from peer gossip messages: in a fresh 10-validator network each validator shows 9/9/40 but has only 65 descriptors. While the load tester is running this number increases slightly to 100+, then drops back to 70-80 after the load tester finishes.

@bowenwang1996
Collaborator Author

bowenwang1996 commented Sep 18, 2019

Actually it is still undesirable because we intend to have 100 validators in total and if we need ~70 file descriptors per validator the number is still too large. Besides, non-validators will also connect to validators, thereby creating even more sockets.

@MaksymZavershynskyi
Contributor

In general, 1 or 2 sockets per peer should be sufficient.

@ailisp
Member

ailisp commented Oct 4, 2019

@nearmax @bowenwang1996 Finally found the reason: it is telemetry.
On staging-node1, running ss | awk '{print $6}' | sort | uniq -c gives output in which the vast majority of connections are to these two IPs:

   1470 34.83.40.137:https
   1563 34.83.64.96:https

These are the two IPs of the near-explorer backend.

The node-to-node connections (via :24567) are actually fine: in the staging testnet there are 3 nodes, and every node has exactly one socket connected to each of the other two.

(A note I found while reading this output: a.b.c.d.bc.googleusercontent.com always has IP d.c.b.a; with this it is quite easy to identify each node and who connects to whom in the output above.)

(Also, when I tested the load-test network at several hundred TPS, there was never a too-many-open-files issue; even during load testing the number of open file handles did not increase compared to before the load test started, because no explorer backend connects to that network's telemetry.)
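The reverse-DNS note above, as a tiny shell sketch; the hostname is one of the googleusercontent names seen later in this thread, and the command simply reverses the leading octets:

echo 96.64.83.34.bc.googleusercontent.com | awk -F. '{printf "%s.%s.%s.%s\n", $4, $3, $2, $1}'
# prints 34.83.64.96, i.e. one of the two explorer backend IPs above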

@bowenwang1996
Collaborator Author

Interesting. @frol can you take a look?

@frol
Collaborator

frol commented Oct 4, 2019

Is this the node serving https://rpc.nearprotocol.com/? NEAR Explorer needs to sync all the blocks when I do a reset (I have done it several times in August and at most twice in September; the last time was on Wednesday [Oct 2]).

Syncing all the blocks over the current RPC generates a huge number of requests (it does 250 concurrent requests over HTTPS).

If the number of open descriptors is due to this load, we should increase the limit for this public node anyway (it is just general web-service tuning). Other steps that will help to eliminate the issue: do not reset Explorer often (it is already less frequent than it used to be in August), and implement a better API on the nearcore side (gRPC with streaming is a great option).

@frol
Collaborator

frol commented Oct 4, 2019

I have just learned that Node.js does not reuse (keep-alive) connections by default with node-fetch, so this also introduces quite an overhead. I have opened a PR with a fix: near/near-api-js#83

@ailisp
Member

ailisp commented Oct 4, 2019

@frol thanks for the fix! Besides this, I suspect that on the Rust side the node info is POSTed to the Node.js side in a way that also opens a new connection each time; I'll take a look.

@bowenwang1996
Collaborator Author

Actually, I observed this issue on nodes other than the default RPC one. It also happens on staging, and I don't think there is an explorer for staging. So maybe there's something else going on?

@ailisp
Member

ailisp commented Oct 4, 2019 via email

@frol
Collaborator

frol commented Oct 4, 2019

@bowenwang1996 I confirm that there is no Explorer pointing to staging (though I did test Explorer [from my laptop] against the staging RPC in mid-September), and telemetry should not be sent from the staging net (as far as I recall, there is a default telemetry config only for the testnet).

@ailisp
Member

ailisp commented Oct 4, 2019 via email

@ailisp
Member

ailisp commented Oct 4, 2019

@bowenwang1996 @nearmax I'm able to reproduce this locally and can locate exactly which line causes it, but I don't have an idea how to fix it yet; it looks like an actix_web awc client bug.

Steps to reproduce:

  1. near init, then modify the telemetry section of config.json:
  "telemetry": {
    "endpoints": [
      "https://explorer.nearprotocol.com/api/nodes"
    ]
  },
  2. near run, then find its pid with ps aux | grep near

  3. watch -n 10 'lsof -i -a -p <pid>'

  4. Wait 3-5 minutes; you will see connections to 96.64.83.34.bc.googleusercontent.com:https and 137.40.83.34.bc.googleusercontent.com:https gradually increase. Note that this does not happen every 10s but randomly every 80-100s; telemetry is sent every 10s, so the actix awc client closes some of the connections correctly but not all. (A one-liner for counting just these connections is sketched after the steps.)

  5. Comment out the call to telemetry.do_send in chain/telemetry/src/lib.rs

  6. cargo build -p near

  7. Redo steps 3 and 4

  8. Wait 10 minutes; you won't see the socket count increase.
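A one-liner for the counting in step 4, with <pid> the same placeholder as above; the grep pattern just matches the googleusercontent reverse-DNS names of the telemetry backend:

watch -n 10 "lsof -i -a -p <pid> | grep -c googleusercontent"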

@ilblackdragon
Member

@ailisp is this still an active issue?

@ailisp
Member

ailisp commented Nov 5, 2019

@ilblackdragon The fix has also been merged into master. Closed.

ailisp closed this as completed Nov 5, 2019
@bowenwang1996
Collaborator Author

It happened again on the stakewars node.

bowenwang1996 reopened this Nov 13, 2019
@ailisp
Member

ailisp commented Nov 18, 2019

@bowenwang1996 It's a different cause. This time it is not unclosed connections from the explorer but from RPC. On stakewars-1:

sudo lsof -p 28497 | wc -l # all open files
sudo lsof -p 28497 | grep TCP | wc -l # all tcp sockets
sudo lsof -p 28497 | grep TCP | sort -k 7

From the first two commands, the problem is still too many open TCP connections (>90%). From the last command we can see that the connections into tcp:3030 are all from different IP addresses, with each IP having only 1-3 connections. (Previously they were all from the IP address of the explorer on Render.)

Most likely this isn't an issue, because when I trace each connection it closes very soon after opening. The overall number of connections to tcp:3030 also stays at a constant level when the server load is constant (unlike the earlier case, where the explorer connected to pull data and the connections were never cleaned up).

Given that only stakewars-1 is behind the load balancer, and currently only stakewars-1 has many open connections (~600 vs ~100), I suggest load balancing across all of our stakewars nodes.
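One way to check the "closes very soon after opening" observation: snapshot the TCP sockets twice and count how many survive, reusing pid 28497 from the commands above:

sudo lsof -p 28497 | grep TCP | awk '{print $9}' | sort > /tmp/tcp_before
sleep 30
sudo lsof -p 28497 | grep TCP | awk '{print $9}' | sort > /tmp/tcp_after
comm -12 /tmp/tcp_before /tmp/tcp_after | wc -l    # connections still open after 30s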

@bowenwang1996
Collaborator Author

The problem I see with a load balancer is that if one node crashes or falls out of sync, the load balancer will give confusing information.

@ailisp
Member

ailisp commented Nov 18, 2019

The problem I see with a load balancer is that if one node crashes or falls out of sync, the load balancer will give confusing information.

IMO, to avoid this we need a health check endpoint that only returns 200 when the node is fully healthy (not crashed and fully synced). The load balancer is simple and stupid: it just attaches the nodes that pass the health check. With a load balancer the setup is more robust; a node1 crash doesn't break https://rpc.*.nearprotocol. As for reaching a specific node for debugging, you can still bypass the load balancer by using the node's IP.
Also, we can actually set up two load balancers: one pointing to stakewars-1 and one pointing to all nodes.
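A sketch of such a health check for the load balancer to call; the /status path and the sync_info.syncing field are assumptions about the node's status endpoint and may need adjusting:

curl -sf http://localhost:3030/status | jq -e '.sync_info.syncing == false' > /dev/null
# exit code 0 only when the node answers and reports it is not syncing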

@bowenwang1996
Collaborator Author

@ailisp I took a look at the open connections. It seems that most of them are connections to port 3030, which indicates that they might be from explorer or something that pulls data from rpc constantly.

@ailisp
Member

ailisp commented Nov 26, 2019

True. So now it's not an actix client issue but an actix server issue. Is it possible that there really are that many clients connecting to the server? When I compared lsof output, a few seconds later many connections had gone away and many new connections had been established. (In any single second there aren't that many, but consider what happens if every client has a keep-alive of 2 minutes.)

@bowenwang1996
Collaborator Author

bowenwang1996 commented Nov 26, 2019

Given the level of participation in stakewars, I don't think there are 600 connections even over 2 minutes.

@ailisp
Member

ailisp commented Nov 26, 2019

Correction: the default keep-alive in actix-server is 5s.
I observe that every ~5 seconds there is a cleanup of connections which removes most or all tcp:3030 connections. But it is still possible that some dead connections accumulate as the server keeps running. I'll keep monitoring.
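A simple way to watch that cleanup cadence, assuming the RPC port is 3030 as above (ss shows established sockets by default; tail drops the header line):

watch -n 5 "ss -tn '( sport = :3030 )' | tail -n +2 | wc -l"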

@mfornet
Member

mfornet commented Nov 26, 2019

Connections to port 3030 every ~5 seconds are very likely made by Prometheus. Maybe there is an issue with Prometheus metrics being served at *:3030/metrics?
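A quick way to test that hypothesis by hand: do a single scrape the way Prometheus would and then look at what is left on the port (URL taken from the comment above):

curl -s -o /dev/null http://localhost:3030/metrics
ss -tn '( sport = :3030 )'   # does the scrape's connection linger?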

@frol
Collaborator

frol commented Nov 27, 2019

I have checked Explorer, and it keeps 240 connections open in total for all three networks (testnet, staging, tatooine); over a period of a few minutes it did not open any new connections. It may benefit from using keep-alive so that the pool of connections gets shut down once it is no longer syncing, but it should not cause the load you describe.

ilblackdragon added this to the MainNet milestone Dec 8, 2019
@ailisp
Member

ailisp commented Dec 18, 2019

I have been unable to observe this recently in stakewars & the main testnet. For stakewars, the number of TCP connections on 3030 fluctuates between 30 and 100. IMO, if they are garbage collected in time, it's fine. This conforms to what @frol mentions, ~80 for each net. The part coming from Prometheus should also be fine: the number of Prometheus connections and their traffic will be small compared to normal users.

But today, when I ran target/debug/near run locally for the first time, it hit this error! I tried rebooting the laptop, near init, near run, and still got the error until I increased ulimit to more than 1024. This means that even with no traffic it can run into too many open files?

@bowenwang1996
Collaborator Author

Maybe it's caused by the database? If file descriptors are not cleaned up properly, there might be too many open files.
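If that were the case, the open descriptors would be database files rather than sockets; a rough check (nearcore's storage is RocksDB, whose data files end in .sst; <pid> is a placeholder):

lsof -p <pid> | grep -c '\.sst'   # RocksDB SST files currently held open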

@ailisp
Member

ailisp commented Jan 14, 2020

Unable to see this in the main testnet as of Jan 13, 2020.

ailisp closed this as completed Jan 14, 2020