
Telemetry sending gets stuck once and never delivers after that: "not a result of an error" #2798

Closed
frol opened this issue Jun 4, 2020 · 11 comments · Fixed by #4300
Assignees
Labels

  • A-RPC — Area: rpc
  • C-bug — Category: This is a bug
  • C-good-first-issue — Category: issues that are self-contained and easy for newcomers to work on
  • T-public-interfaces — Team: issues relevant to the public interfaces team

Comments

frol (Collaborator) commented Jun 4, 2020

Describe the bug
nearcore constantly reports (every 10 seconds, each time it tries to send telemetry):

INFO telemetry: Telemetry data could not be sent due to: protocol error: not a result of an error

To Reproduce
Unknown.

Expected behavior
Telemetry should not fail to get delivered.

Version (please complete the following information):

  • all versions (master, devnet, betanet, testnet, mainnet)

Additional context
It appears to me that there is something wrong with the Keep-Alive implementation: once the error gets reported, it never recovers, even though the network is fine and requests to https://explorer.devnet.near.org/api/nodes can be sent just fine from the same server.

@frol frol added C-bug Category: This is a bug A-RPC Area: rpc labels Jun 4, 2020
@frol frol self-assigned this Jun 4, 2020
frol (Collaborator, author) commented Jun 8, 2020

There are no TCP packets to Explorer when nearcore gets into this state, which supports my guess that there is nothing wrong with Explorer and that it is actually a problem on the nearcore side. There is also no open connection to port 443 (the Keep-Alive HTTPS connection to Explorer is dead). Further investigation is needed.

@frol frol added T-public-interfaces Team: issues relevant to the public interfaces team C-good-first-issue Category: issues that are self-contained and easy for newcomers to work on. help wanted labels Feb 5, 2021
KaiserKarel (Contributor) commented Mar 6, 2021

I could have a look at this next week. From a brief glance at the telemetry code, it seems like a custom implementation, so I might need some guidance.

Is there a reason for the custom connector and huge conn_lifetime?

actix/actix-web#1047 seems possibly related.

frol (Collaborator, author) commented Mar 9, 2021

Here is the implementation: https://github.com/near/nearcore/blob/dc73b0fbc36ad55f79fa7574a45adf1d25c60993/chain/telemetry/src/lib.rs

There is nothing fancy; we just try to send a POST request every 10 seconds. We might need to implement logic to re-connect the client. I assumed AWC would do it automatically, but looking at it now, I don't believe it does.

The huge conn_lifetime is to avoid too many open file descriptors: #1413
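The flow described above — a fixed-interval loop that fires a send attempt and only logs failures, with no retry or reconnect of its own — can be sketched abstractly. This is an illustrative reduction, not the actual nearcore/AWC code; the function name and the tick-count parameter are invented for the sketch, and the real sleep between heartbeats is elided so it stays testable.

```rust
use std::time::Duration;

/// Minimal sketch of the telemetry heartbeat: call `send` once per tick
/// and report failures without any retry/reconnect (illustrative only).
fn run_heartbeat<F>(ticks: u32, _interval: Duration, mut send: F) -> u32
where
    F: FnMut() -> Result<(), String>,
{
    let mut failures = 0;
    for _ in 0..ticks {
        // In the real actor this fires every 10 seconds; the sleep is
        // elided here so the sketch stays self-contained.
        if let Err(e) = send() {
            failures += 1;
            eprintln!("telemetry: Telemetry data could not be sent due to: {}", e);
        }
    }
    failures
}

fn main() {
    let mut n = 0;
    // A send that fails on every other tick, mimicking a flaky connection.
    let failures = run_heartbeat(4, Duration::from_secs(10), || {
        n += 1;
        if n % 2 == 0 {
            Err("not a result of an error".to_string())
        } else {
            Ok(())
        }
    });
    println!("failures: {}", failures);
}
```

The point of the sketch is what is missing: nothing in the loop ever re-establishes the client, so a dead Keep-Alive connection fails on every subsequent tick.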

KaiserKarel (Contributor) commented Mar 9, 2021

I see. Shall I just add exponential retry + reconnect? The large conn_lifetime seemed like a red flag to me precisely because of the file descriptors, but it might be some other heisenbug.

I'd reckon having reconnect and exponential backoff there is a good feature anyway, since so many things can go wrong with network requests.

frol (Collaborator, author) commented Mar 9, 2021

Yes, let's do that. Just keep in mind that the telemetry heartbeat is 10 seconds, so we should either use timeouts of less than 10 seconds, so we don't step on ourselves, or have some guard to skip the heartbeat if the connection is not yet established.

BTW, telemetry heartbeat happens here:

telemetry(&self.telemetry_actor, content);
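One way to satisfy the first constraint is a capped exponential backoff whose delays never exceed a ceiling below the 10-second heartbeat. A minimal sketch, with illustrative base/cap values rather than anything from nearcore:

```rust
use std::time::Duration;

/// Capped exponential backoff: double the delay on each attempt but never
/// exceed `cap`, so retries stay inside the 10-second heartbeat window.
/// (Illustrative helper, not the actual nearcore configuration.)
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.checked_mul(1u32 << attempt.min(16)) // clamp the shift to avoid overflow
        .map(|d| d.min(cap))
        .unwrap_or(cap)
}

fn main() {
    let base = Duration::from_millis(500);
    let cap = Duration::from_secs(4); // stays well under the 10-second heartbeat
    for attempt in 0..5 {
        println!("retry {} after {:?}", attempt, backoff_delay(attempt, base, cap));
    }
}
```

With these values the schedule is 500 ms, 1 s, 2 s, 4 s, 4 s, …, so even several retries cannot spill past the next heartbeat tick.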

KaiserKarel (Contributor) commented

I've had a closer look. I think implementing reconnection logic is a bit more difficult, because the future spawned in near_performance_metrics::actix::spawn is 'static. I could do something with Arcs, Mutexes, weak references, etc. to update the client on an error, but perhaps it would be simpler to not reuse connections at all? The telemetry heartbeat is only performed once every 10 seconds anyway.

Either solution feels a bit dirty; it would be nicer to figure out what exactly is causing this error.

Digging into the cause led me to: https://docs.rs/h2/0.3.0/h2/struct.Reason.html. Perhaps AWC is not handling reconnections correctly after receiving REASON(0)?
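Whatever the root cause, the "guard to skip the heartbeat" idea mentioned earlier could take the shape of a shared atomic flag, which sidesteps the 'static-future problem because the spawned future just owns a clone. A minimal sketch (hypothetical helper, not actual nearcore code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared "a send is already in flight" flag; cloneable so a 'static
/// future can own its own handle. (Illustrative only.)
#[derive(Clone)]
struct InFlight(Arc<AtomicBool>);

impl InFlight {
    fn new() -> Self {
        InFlight(Arc::new(AtomicBool::new(false)))
    }

    /// Returns true if we acquired the slot (no send currently pending).
    fn try_begin(&self) -> bool {
        self.0
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Called by the spawned future when the request finishes (ok or not).
    fn finish(&self) {
        self.0.store(false, Ordering::Release);
    }
}

fn main() {
    let guard = InFlight::new();
    assert!(guard.try_begin());  // first heartbeat starts a send
    assert!(!guard.try_begin()); // next tick is skipped while pending
    guard.finish();
    assert!(guard.try_begin());  // after completion, sends resume
    println!("guard works");
}
```

Each heartbeat tick would call `try_begin()` and simply skip itself if a previous send has not completed, so a slow or hung request never stacks up behind the 10-second timer.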

frol (Collaborator, author) commented Mar 24, 2021

but perhaps it might be simpler to not reuse connections at all?

We had a file descriptor leak when we did not reuse connections (#1413), so there might be another issue of not properly closing the connections. It also seems to be related to either HTTPS or HTTP/2 (it felt like HTTP/1.1 without encryption worked absolutely fine).

Digging into the cause led me to: https://docs.rs/h2/0.3.0/h2/struct.Reason.html. Perhaps AWC is not handling reconnections correctly after receiving REASON(0)?

This was also my main suspect.

sdmurphy commented
I posted this on Discord and @bowenwang1996 thought it might be related to this issue. I have reviewed this issue, but I think mine is different.

Symptoms of this issue: Log messages indicating Telemetry couldn't be sent. No TCP sessions to explorer.

My symptoms: I started validating on Mainnet for the first time. Validating Node status isn't updating and the Node Key is wrong. Online Node status information seems correct and the Node Key is correct. I have established HTTPS sessions that I believe are to explorer. I noticed this issue about an hour before I started validating and restarted neard, but the issue persisted.

(screenshot attached)

bowenwang1996 (Collaborator) commented

@sdmurphy is it possible that you have another node with the same validator key running at the same time? cc @frol

sdmurphy commented
@bowenwang1996, I have a backup node without a validator key, and its Node Key doesn't match. I had another validator that doesn't exist anymore; it was shut down about a week before we were in the active set.

@frol frol assigned mhalambek and unassigned frol Apr 29, 2021
mhalambek added a commit that referenced this issue May 14, 2021
Doesn't leak TCP descriptors and works after network interruption.

Closes #2798
mhalambek (Contributor) commented

Made a simple PR which uses HTTP/1.1; telemetry sending works after network interruptions, and file descriptors/sockets are cleaned up after each request.

mhalambek added a commit that referenced this issue May 18, 2021
Doesn't leak TCP descriptors and works after network interruption.

Closes #2798