Telemetry sending stuck once and never deliver after that: "not a result of an error" #2798
Comments
There are no TCP packets to Explorer when nearcore gets into this state, which supports my guess that there is nothing wrong with Explorer and that the problem is on the nearcore side. There is also no open connection to port 443 (the Keep-Alive HTTPS connection to Explorer is dead). Further investigation is needed. |
I could have a look at this next week. From a brief glance at telemetry, it seems like a custom implementation, so I might need some guidance. Is there a reason for the custom connector and huge conn_lifetime? actix/actix-web#1047 seems possibly related. |
Here is the implementation: https://github.com/near/nearcore/blob/dc73b0fbc36ad55f79fa7574a45adf1d25c60993/chain/telemetry/src/lib.rs There is nothing fancy, we just try to send a POST request every 10 seconds. We might need to implement a logic to re-connect the client. I assumed AWC should do it automatically, but looking at it now, I don't believe so. The huge conn_lifetime is to avoid too many open file descriptors: #1413 |
I see. Shall I just add exponential retry + reconnect? The large conn_lifetime seemed like a red flag to me exactly because of the file descriptors, but it might be some other heisenbug. I'd reckon having reconnect and exponential backoff there is a good feature anyways, since so many things can go wrong with network requests. |
Yes, let's do that. Just keep in mind that the telemetry heartbeat is 10 seconds, so we should either keep timeouts under 10 seconds, so that one heartbeat does not overlap the next, or add a guard that skips the heartbeat if the connection is not yet established. BTW, the telemetry heartbeat happens here: nearcore/chain/client/src/info.rs Line 197 in dc73b0f
|
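A minimal sketch of the two options discussed above, assuming nothing about the real telemetry actor: a capped exponential backoff whose delays stay under the 10-second heartbeat, and a guard that lets a heartbeat be skipped while an earlier request is still in flight. The names `backoff_delay`, `fits_in_heartbeat`, and `InFlightGuard` are hypothetical and do not exist in nearcore.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Heartbeat period mentioned in the thread (10 seconds).
const HEARTBEAT: Duration = Duration::from_secs(10);

/// Hypothetical helper: delay before retry number `attempt` (0-based),
/// doubling from `base` and never exceeding `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    // On overflow, fall back to the cap instead of panicking.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Hypothetical check: a timeout is acceptable only if it is shorter
/// than the heartbeat, so one attempt cannot overlap the next tick.
fn fits_in_heartbeat(timeout: Duration) -> bool {
    timeout < HEARTBEAT
}

/// Hypothetical guard: allows at most one telemetry request in flight.
struct InFlightGuard<'a>(&'a AtomicBool);

impl<'a> InFlightGuard<'a> {
    /// Returns `None` if a previous request is still running, in which
    /// case the caller can simply skip this heartbeat.
    fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| InFlightGuard(flag))
    }
}

impl Drop for InFlightGuard<'_> {
    /// Releases the flag when the request finishes, even on error paths.
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}
```

A heartbeat handler would call `InFlightGuard::try_acquire` first and return early on `None`; holding the guard for the lifetime of the request clears the flag automatically when the request completes or fails.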
I've had a closer look. I think implementing reconnection logic is a bit more difficult, due to the future spawned in Either solution feels a bit dirty; would be nicer to figure out what is causing this error exactly. Digging into the cause led me to: https://docs.rs/h2/0.3.0/h2/struct.Reason.html. Perhaps AWC is not handling reconnections correctly after receiving REASON(0)? |
We had file descriptors leak when we did not reuse it (#1413), so there might be another issue of not properly closing the connections. It also seems to be related to either HTTPS or HTTP2 (it felt like HTTP 1.1 without encryption worked absolutely fine)
This was also my main blame. |
I posted this on discord and @bowenwang1996 thought it might be related to this issue. I have reviewed this issue, but I am thinking my issue is different. Symptoms of this issue: Log messages indicating Telemetry couldn't be sent. No TCP sessions to explorer. My symptoms: I started validating on Mainnet for the first time. Validating Node status isn't updating and the Node Key is wrong. Online Node status information seems correct and the Node Key is correct. I have established HTTPS sessions that I believe are to explorer. I noticed this issue about an hour before I started validating and restarted neard, but the issue persisted. |
@bowenwang1996, I have a backup node without a validator key and that Node Key doesn't match. I had another validator that doesn't exist anymore, it was shutdown about a week before we were in the active set. |
Doesn't leak TCP descriptors and works after network interruption. Closes #2798
Made a simple PR which uses HTTP 1.1. Telemetry sending works after network interruptions, and file descriptors/sockets are cleaned up after each request. |
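To illustrate the idea of one short-lived HTTP/1.1 connection per report, here is a rough sketch using only the Rust standard library. It is not the code from the PR: it talks plain TCP with no TLS, whereas the real endpoint is HTTPS, and `build_post`, `post_once`, and the 5-second timeout are assumptions made for the example.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Hypothetical helper: format an HTTP/1.1 POST carrying a JSON body.
/// `Connection: close` asks the server not to keep the socket alive.
fn build_post(host: &str, path: &str, body: &str) -> String {
    format!(
        "POST {} HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        path,
        host,
        body.len(),
        body
    )
}

/// Hypothetical helper: send one report over a fresh connection and
/// return the response status line. Plain TCP only (assumption: no TLS).
fn post_once(addr: &str, host: &str, path: &str, body: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    // Keep I/O timeouts well under the 10-second heartbeat.
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;
    stream.set_write_timeout(Some(Duration::from_secs(5)))?;
    stream.write_all(build_post(host, path, body).as_bytes())?;
    // With `Connection: close` the server ends the response by closing,
    // so reading to EOF yields the whole reply.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    let status = response.lines().next().unwrap_or("").to_string();
    // `stream` is dropped on return, which closes the socket and
    // releases its file descriptor.
    Ok(status)
}
```

Because every call opens and drops its own `TcpStream`, a dead connection cannot poison later requests, at the cost of a new handshake every heartbeat.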
Describe the bug
nearcore constantly reports the following (every 10 seconds, each time it tries to send telemetry):
To Reproduce
Unknown.
Expected behavior
Telemetry should not fail to get delivered.
Version (please complete the following information):
Additional context
It appears to me that there is something wrong with the Keep-Alive implementation: once the error gets reported, it never recovers, even though the network is fine and requests to https://explorer.devnet.near.org/api/nodes can be sent just fine from the same server.