
Telemetry sending gets stuck once and never delivers after that: "not a result of an error" #2798

Closed
frol opened this issue Jun 4, 2020 · 11 comments · Fixed by #4300
Assignees
Labels

  • A-RPC — Area: rpc
  • C-bug — Category: This is a bug
  • C-good-first-issue — Category: issues that are self-contained and easy for newcomers to work on
  • T-public-interfaces — Team: issues relevant to the public interfaces team

Comments

frol (Collaborator) commented Jun 4, 2020

Describe the bug
nearcore constantly reports (every 10 seconds, each time it tries to send telemetry):

INFO telemetry: Telemetry data could not be sent due to: protocol error: not a result of an error

To Reproduce
Unknown.

Expected behavior
Telemetry should not fail to get delivered.

Version (please complete the following information):

  • all versions (master, devnet, betanet, testnet, mainnet)

Additional context
It appears to me that there is something wrong with the Keep-Alive implementation: once the error gets reported, it never recovers, even though the network is fine and requests to https://explorer.devnet.near.org/api/nodes can be sent just fine from the same server.

@frol frol added C-bug Category: This is a bug A-RPC Area: rpc labels Jun 4, 2020
@frol frol self-assigned this Jun 4, 2020
frol (Collaborator, author) commented Jun 8, 2020

There are no TCP packets to Explorer when nearcore gets into this state, which supports my guess that there is nothing wrong with Explorer and that it is actually a problem on the nearcore side. There is also no open connection to port 443 (the Keep-Alive HTTPS connection to Explorer is dead). Further investigation is needed.

@frol frol added T-public-interfaces Team: issues relevant to the public interfaces team C-good-first-issue Category: issues that are self-contained and easy for newcomers to work on. help wanted labels Feb 5, 2021
KaiserKarel (Contributor) commented Mar 6, 2021

I could have a look at this next week. From a brief glance at the telemetry code, it seems like a custom implementation, so I might need some guidance.

Is there a reason for the custom connector and huge conn_lifetime?

actix/actix-web#1047 seems possibly related.

frol (Collaborator, author) commented Mar 9, 2021

Here is the implementation: https://github.com/near/nearcore/blob/dc73b0fbc36ad55f79fa7574a45adf1d25c60993/chain/telemetry/src/lib.rs

There is nothing fancy; we just try to send a POST request every 10 seconds. We might need to implement logic to re-connect the client. I assumed AWC would do it automatically, but looking at it now, I don't believe it does.

The huge conn_lifetime is to avoid too many open file descriptors: #1413
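The flow described above — a fixed-interval loop that fires a send attempt and only logs failures, with no retry or reconnect of its own — can be sketched abstractly. This is an illustrative reduction, not the actual nearcore/AWC code; the function name and the tick-count parameter are invented for the sketch, and the real sleep between heartbeats is elided so it stays testable.

```rust
use std::time::Duration;

/// Minimal sketch of the telemetry heartbeat: call `send` once per tick
/// and report failures without any retry/reconnect (illustrative only).
fn run_heartbeat<F>(ticks: u32, _interval: Duration, mut send: F) -> u32
where
    F: FnMut() -> Result<(), String>,
{
    let mut failures = 0;
    for _ in 0..ticks {
        // In the real actor this fires every 10 seconds; the sleep is
        // elided here so the sketch stays self-contained.
        if let Err(e) = send() {
            failures += 1;
            eprintln!("telemetry: Telemetry data could not be sent due to: {}", e);
        }
    }
    failures
}

fn main() {
    let mut n = 0;
    // A send that fails on every other tick, mimicking a flaky connection.
    let failures = run_heartbeat(4, Duration::from_secs(10), || {
        n += 1;
        if n % 2 == 0 {
            Err("not a result of an error".to_string())
        } else {
            Ok(())
        }
    });
    println!("failures: {}", failures);
}
```

The point of the sketch is what is missing: nothing in the loop ever re-establishes the client, so a dead Keep-Alive connection fails on every subsequent tick.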

KaiserKarel (Contributor) commented Mar 9, 2021

I see. Shall I just add exponential retry + reconnect? The large conn_lifetime seemed like a red flag to me precisely because of the file descriptors, but it might be some other heisenbug.

I'd reckon having reconnect and exponential backoff there is a good feature anyway, since so many things can go wrong with network requests.

frol (Collaborator, author) commented Mar 9, 2021

Yes, let's do that. Just keep in mind that the telemetry heartbeat is 10 seconds, so we should either use timeouts of less than 10 seconds, so we don't step on ourselves, or have some guard to skip the heartbeat if the connection is not yet established.

BTW, telemetry heartbeat happens here:

telemetry(&self.telemetry_actor, content);
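One way to satisfy the first constraint is a capped exponential backoff whose delays never exceed a ceiling below the 10-second heartbeat. A minimal sketch, with illustrative base/cap values rather than anything from nearcore:

```rust
use std::time::Duration;

/// Capped exponential backoff: double the delay on each attempt but never
/// exceed `cap`, so retries stay inside the 10-second heartbeat window.
/// (Illustrative helper, not the actual nearcore configuration.)
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.checked_mul(1u32 << attempt.min(16)) // clamp the shift to avoid overflow
        .map(|d| d.min(cap))
        .unwrap_or(cap)
}

fn main() {
    let base = Duration::from_millis(500);
    let cap = Duration::from_secs(4); // stays well under the 10-second heartbeat
    for attempt in 0..5 {
        println!("retry {} after {:?}", attempt, backoff_delay(attempt, base, cap));
    }
}
```

With these values the schedule is 500 ms, 1 s, 2 s, 4 s, 4 s, …, so even several retries cannot spill past the next heartbeat tick.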

KaiserKarel (Contributor) commented

I've had a closer look. I think implementing reconnection logic is a bit more difficult, because the future spawned in near_performance_metrics::actix::spawn is 'static. I could do something with Arcs, Mutexes, weak references, etc. to update the client on an error, but perhaps it would be simpler to not reuse connections at all? The telemetry heartbeat is only performed once every 10 seconds anyway.

Either solution feels a bit dirty; it would be nicer to figure out what exactly is causing this error.

Digging into the cause led me to: https://docs.rs/h2/0.3.0/h2/struct.Reason.html. Perhaps AWC is not handling reconnections correctly after receiving REASON(0)?
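Whatever the root cause, the "guard to skip the heartbeat" idea mentioned earlier could take the shape of a shared atomic flag, which sidesteps the 'static-future problem because the spawned future just owns a clone. A minimal sketch (hypothetical helper, not actual nearcore code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared "a send is already in flight" flag; cloneable so a 'static
/// future can own its own handle. (Illustrative only.)
#[derive(Clone)]
struct InFlight(Arc<AtomicBool>);

impl InFlight {
    fn new() -> Self {
        InFlight(Arc::new(AtomicBool::new(false)))
    }

    /// Returns true if we acquired the slot (no send currently pending).
    fn try_begin(&self) -> bool {
        self.0
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Called by the spawned future when the request finishes (ok or not).
    fn finish(&self) {
        self.0.store(false, Ordering::Release);
    }
}

fn main() {
    let guard = InFlight::new();
    assert!(guard.try_begin());  // first heartbeat starts a send
    assert!(!guard.try_begin()); // next tick is skipped while pending
    guard.finish();
    assert!(guard.try_begin());  // after completion, sends resume
    println!("guard works");
}
```

Each heartbeat tick would call `try_begin()` and simply skip itself if a previous send has not completed, so a slow or hung request never stacks up behind the 10-second timer.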

frol (Collaborator, author) commented Mar 24, 2021

but perhaps it might be simpler to not reuse connections at all?

We had a file descriptor leak when we did not reuse connections (#1413), so there might be another issue of not properly closing the connections. It also seems to be related to either HTTPS or HTTP/2 (it felt like HTTP/1.1 without encryption worked absolutely fine).

Digging into the cause led me to: https://docs.rs/h2/0.3.0/h2/struct.Reason.html. Perhaps AWC is not handling reconnections correctly after receiving REASON(0)?

This was also my main suspect.

sdmurphy commented
I posted this on Discord and @bowenwang1996 thought it might be related to this issue. I have reviewed this issue, but I think mine is different.

Symptoms of this issue: Log messages indicating Telemetry couldn't be sent. No TCP sessions to explorer.

My symptoms: I started validating on Mainnet for the first time. Validating Node status isn't updating and the Node Key is wrong. Online Node status information seems correct and the Node Key is correct. I have established HTTPS sessions that I believe are to explorer. I noticed this issue about an hour before I started validating and restarted neard, but the issue persisted.

(screenshot attached)

bowenwang1996 (Collaborator) commented

@sdmurphy is it possible that you have another node with the same validator key running at the same time? cc @frol

sdmurphy commented
@bowenwang1996, I have a backup node without a validator key, and its Node Key doesn't match. I had another validator that doesn't exist anymore; it was shut down about a week before we were in the active set.

@frol frol assigned mhalambek and unassigned frol Apr 29, 2021
mhalambek added a commit that referenced this issue May 14, 2021
Doesn't leak TCP descriptors and works after network interruption.

Closes #2798
mhalambek (Contributor) commented

Made a simple PR which uses HTTP/1.1; telemetry sending works after network interruptions, and file descriptors/sockets are cleaned up after each request.

mhalambek added a commit that referenced this issue May 18, 2021
Doesn't leak TCP descriptors and works after network interruption.

Closes #2798