Investigate too many open files error #1316
I happened to run into a similar issue last week where the Linux version of VS Code opened too many files, and the solution was to increase the per-process file descriptor limit. I should be able to apply a similar configuration here.
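For illustration, a minimal sketch of what raising the per-process limit from inside the node binary could look like, using the `libc` crate (an assumption for the example, not nearcore's actual startup code, and the target value is arbitrary):

```rust
use std::io;

/// Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.
/// Sketch only: the caller decides the target and how to handle failure.
fn raise_fd_limit(target: libc::rlim_t) -> io::Result<libc::rlim_t> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return Err(io::Error::last_os_error());
        }
        // Without extra privileges the soft limit can only be raised up to the hard limit.
        lim.rlim_cur = target.min(lim.rlim_max);
        if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(lim.rlim_cur)
    }
}
```

The same effect can also come from the service manager (e.g. a higher `LimitNOFILE` on a systemd unit) rather than from code.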
|
We should figure out why the error occurs. It could be that there is something wrong with our code that causes unnecessary opening of file descriptors. If that's the case, simply increasing the limit might not help. |
Sounds good. AFAIK the default limit is 65536, so I agree there is probably an unclosed-handle coding error that needs to be fixed. I'll investigate.
|
Default is 1024 on gcloud. |
👍 Good catch, that sounds too small.
|
Some findings: 1. The default handle limit (or it may already have been increased) is 8192: |
That's definitely not intended; we are probably opening zombie sockets. Note, we are using Actix, which in turn uses Tokio. We might have some zombie actors that are holding sockets and doing nothing. |
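To make the "zombie actor" concern concrete (a hypothetical sketch, not code from this repo): an actor that owns a socket keeps that file descriptor open for as long as the actor is alive, so a peer actor that is never stopped after its connection goes dead leaks a descriptor.

```rust
use actix::prelude::*;
use tokio::net::TcpStream;

// Hypothetical peer actor that owns a TCP socket.
struct PeerActor {
    stream: Option<TcpStream>,
}

impl Actor for PeerActor {
    type Context = Context<Self>;

    fn stopped(&mut self, _ctx: &mut Self::Context) {
        // Dropping the stream is what actually closes the fd. If nothing ever
        // stops this actor (ctx.stop() or a shutdown message) once the peer
        // disappears, the descriptor stays open indefinitely.
        self.stream.take();
    }
}
```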
It's possible that some file descriptors are not properly released. Maybe it is a problem with actix. |
Got it. Some new findings today:
|
Actually it is still undesirable, because we intend to have 100 validators in total, and if we need ~70 file descriptors per validator that is still too many (100 × 70 ≈ 7000 descriptors, well above a 1024 default). Besides, non-validators will also connect to validators, creating even more sockets. |
In general, 1 or 2 sockets per peer should be sufficient. |
@nearmax @bowenwang1996 Finally found the reason: it is telemetry.
These are two IPs belonging to the backend of near-explorer. The node-to-node connection (via (A note I found when reading this output: a.b.c.d.bc.googleusercontent.com always has IP d.c.b.a; using this, it's quite easy to identify each node and who connects to whom in the output above.) (Also, when I load test the network with several hundred TPS, there is never a "too many file handles" issue; even during the load test the number of file handles doesn't increase compared to before the load test started. The reason is that in that setup there is no explorer backend connecting to the telemetry.) |
Interesting. @frol can you take a look? |
Is this the node serving https://rpc.nearprotocol.com/? NEAR Explorer needs to sync all the blocks when I do a reset (I have done it several times in August and at most twice in September, and the last time was on Wed [Oct 2]). Syncing all the blocks over the current RPC generates a huge number of requests (it does 250 concurrent requests over HTTPS). If the root cause of the descriptor count is this load, we should increase the limit for this public node anyway (it is just general web-service tuning). Other steps that will help to eliminate the issue: do not reset Explorer often (it is already not as often as it used to be in August), and implement a better API on the nearcore side (gRPC with streaming is a great option). |
…e the number of new connections Ref: near/nearcore#1316
I have just learned that NodeJS does not reuse (keep-alive) connections by default using |
@frol thanks for the fix! Besides this, I suspect that on the Rust side, where node info is POSTed to the Node.js side, we also open a new connection each time. I'll take a look. |
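As a rough sketch of that direction (assuming a recent awc/actix-web client API; the URL, payload shape, and function name are made up for illustration, not taken from nearcore): build the HTTP client once and reuse its keep-alive connection pool for every telemetry report, instead of opening a fresh socket per POST.

```rust
use awc::Client;
use serde_json::json;

// Illustrative only: `client` is constructed once at startup and shared,
// so repeated telemetry POSTs reuse pooled connections.
async fn report_telemetry(
    client: &Client,
    telemetry_url: &str,
) -> Result<(), awc::error::SendRequestError> {
    let payload = json!({ "node_id": "example-node", "status": "ok" }); // assumed payload
    let mut response = client.post(telemetry_url).send_json(&payload).await?;
    // Draining the body lets the underlying connection return to the pool.
    let _ = response.body().await;
    Ok(())
}
```

The key design point is that the `Client` (e.g. `Client::default()`) lives in some long-lived state and is passed by reference, rather than being rebuilt per report.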
Actually I observed that this issue happened on nodes other than the default rpc one. Also it happens on staging and I don't think there is explorer for staging. So maybe there's something else going on? |
I think it's not entirely because of RPC. The RPC node has more open files than non-RPC nodes, but even non-RPC nodes have more file handles than usual, as you observed. The problem is the telemetry POST, which happens on every node that has a telemetry URL in its config. So it's likely that on the Rust side, when we POST info to the telemetry endpoint, we don't close the socket properly.
|
@bowenwang1996 I confirm that there is no Explorer pointing to staging (though I have tested Explorer [from my laptop] against the staging RPC in mid-September), and the telemetry should not be sent from the staging net (as far as I recall, there is a default config for telemetry only for the testnet). |
The telemetry URL config exists on staging nodes. So even if you did not do anything on the Node.js side, the Rust side still tries to open a connection to POST to the telemetry URL.
|
@bowenwang1996 @nearmax I'm able to reproduce this locally and I can locate exactly which line causes it, but I don't have an idea how to fix it yet; it looks like an actix_web awc client bug. Steps to reproduce:
|
@ailisp is this still an active issue? |
@ilblackdragon's fix is also merged into master. Closed. |
It happens again on the stakewars node. |
@bowenwang1996 It's a different cause. This time it is not too many unclosed connections from the explorer, but from rpc. In stakewars-1
From the first two commands, the problem is still too many open TCP connections (>90%). From the last command we can see that the connections to tcp:3000 are all from different IP addresses, and each IP only has 1-3 connections (previously they were all from the IP address of the explorer on Render). Most likely this isn't an issue, because when I trace each connection it closes very soon after opening. The overall number of connections to TCP:3030 also stays at a constant level when the server load is constant (this is different from the case where the explorer connects to pull data and the connections are never cleaned up). And given that only stakewars-1 is on the load balancer and currently only stakewars-1 has that many open files (~600 vs ~100), I suggest we load balance all our stakewars nodes. |
The problem I see with a load balancer is that if one node crashes or falls out of sync, the load balancer will give confusing information. |
IMO, to avoid this we need a health-check endpoint that only returns 200 when the node is totally healthy (not crashed + synced). The load balancer is simple and stupid: it just attaches the nodes that pass the health check. With a load balancer it's more robust; a node1 crash doesn't break everything. |
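A minimal sketch of the kind of health-check endpoint described above (assuming actix-web; the route, port, and `NodeStatus` type are illustrative, not nearcore's actual status API):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

use actix_web::{web, App, HttpResponse, HttpServer};

// Illustrative shared state; the real node would expose its own crash/sync status.
struct NodeStatus {
    synced: AtomicBool,
}

async fn health(status: web::Data<Arc<NodeStatus>>) -> HttpResponse {
    if status.synced.load(Ordering::Relaxed) {
        HttpResponse::Ok().finish()
    } else {
        // 503 makes a simple load balancer take this node out of rotation.
        HttpResponse::ServiceUnavailable().finish()
    }
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    let status = Arc::new(NodeStatus { synced: AtomicBool::new(false) });
    HttpServer::new(move || {
        App::new()
            .app_data(web::Data::new(status.clone()))
            .route("/health", web::get().to(health))
    })
    .bind("0.0.0.0:3030")?
    .run()
    .await
}
```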
@ailisp I took a look at the open connections. It seems that most of them are connections to port 3030, which indicates that they might be from explorer or something that pulls data from rpc constantly. |
True. So now it's not an actix-client issue but an actix-server issue. Is it possible that there really are that many clients connecting to the server? As I compared in the lsof output, a few seconds later many connections went away and many new connections were established. (In a single second there aren't that many, but consider what happens if every client has keepalive=2min.) |
Given the participation of stakewars I don't think there are 600 connections even in 2 minutes |
Correction: the default keep-alive in actix-server is 5s.
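For context, this is roughly where that timeout would be tuned on the server side (a sketch assuming actix-web 4, where `keep_alive` accepts a `Duration`; the route and port are illustrative):

```rust
use std::time::Duration;

use actix_web::{web, App, HttpResponse, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new().route("/status", web::get().to(|| async { HttpResponse::Ok().finish() }))
    })
    // Idle client connections are closed after this interval (5s mirrors the default above).
    .keep_alive(Duration::from_secs(5))
    .bind("0.0.0.0:3030")?
    .run()
    .await
}
```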
Connections to port 3030 every ~5 seconds are very likely made by Prometheus. Maybe there is an issue with serving Prometheus at *:3030/metrics? |
I have checked Explorer, and it keeps 240 connections open in total across all three networks (testnet, staging, tatooine); over a period of a few minutes, it did not open any new connections. Well, it may benefit from keep-alive timeouts that shut down the pool of connections when it is not syncing anymore, but it should not cause the load you describe. |
Unable to observe this recently in stakewars & the main testnet. For stakewars, TCP 3030 connections fluctuate between 30 and 100. IMO, if they are garbage collected on time it's fine. This conforms to what @frol mentions: ~80 for each net. The part from Prometheus should be fine - the number/traffic of Prometheus nodes will be small compared to normal users. But today when I run
Maybe it's caused by the database? If file descriptors are not cleaned up properly, there might be too many open files. |
Unable to see this as of Jan 13, 2020 in the main testnet.
Sometimes after running for a while, a node stops working because of a "too many open files" error. We need to investigate the cause and see whether we need to increase the limit on the number of file descriptors allowed.