Occasionally when making an HTTP request, we get an IncompleteMessage error from reqwest. We have at least one report of this causing a missed block:
Jun 27 08:55:01.069 WARN Builder failed to reveal payload parent_hash: "X", block_root: Y, relay_response_ms: 0, error: Builder(Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(18550), path: "/eth/v1/builder/blinded_blocks", query: None, fragment: None }, source: hyper::Error(IncompleteMessage) })), info: this is common behaviour for some builders and may not indicate an issue, service: exec
This occurs when the server closes the connection while we are making the request. Because HTTP POST requests aren't usually idempotent, our HTTP client (reqwest) will not retry: it doesn't know whether the server has already acted on the request by the time it closes the connection. There's some more info here: hyperium/hyper#2136.
Steps to resolve
Option 1: Auto-Retry
We could add custom methods to our HTTP clients which automatically retry requests that fail with the IncompleteMessage error. I think this is probably the most robust solution. All of the requests we are making should fail gracefully if they are accidentally repeated, i.e. the worst case is that we get an application-level error from the CL/EL/relay on the other end of the connection. This should be better than what we have now.
The crates that would need modifying include:
common/eth2
beacon_node/execution_layer
beacon_node/builder_client
It might make sense to define a mixin trait on reqwest::Connection in its own crate, and then use the mixin methods from these other 3 crates.
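A minimal sketch of what the retry helper could look like, assuming reqwest 0.11 and hyper 0.14 (the names `send_with_retry` and `is_incomplete_message` are illustrative, not existing Lighthouse APIs, and the mixin trait would just wrap something like this):

```rust
use std::error::Error as _;

/// Walk the error's source chain looking for hyper's `IncompleteMessage`.
/// Note: the downcast only succeeds if we depend on the same hyper version as reqwest.
fn is_incomplete_message(err: &reqwest::Error) -> bool {
    let mut source = err.source();
    while let Some(cause) = source {
        if let Some(hyper_err) = cause.downcast_ref::<hyper::Error>() {
            if hyper_err.is_incomplete_message() {
                return true;
            }
        }
        source = cause.source();
    }
    false
}

/// Send a request, retrying once if the first attempt fails with `IncompleteMessage`.
/// `try_clone` returns `None` for streaming bodies, in which case we only try once.
async fn send_with_retry(
    request: reqwest::RequestBuilder,
) -> Result<reqwest::Response, reqwest::Error> {
    let retry = request.try_clone();
    match request.send().await {
        Err(e) if is_incomplete_message(&e) => match retry {
            Some(request) => request.send().await,
            None => Err(e),
        },
        other => other,
    }
}
```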
Option 2: Disable Connection Pooling
Some other reqwest users had success disabling connection pooling (see hyperium/hyper#2136 (comment)), which I think means we will initiate a new TCP connection for every request. This seems very wasteful, and we should probably avoid doing this if we can. Some users also report that the error still occurs after disabling pooling.
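For reference, disabling pooling is a one-liner on reqwest's client builder; a sketch (whether we'd actually want to ship this is exactly the question above):

```rust
/// A minimal sketch: with no idle connections kept per host, reqwest cannot reuse
/// a connection, so every request opens a fresh TCP (and possibly TLS) connection.
fn client_without_pooling() -> Result<reqwest::Client, reqwest::Error> {
    reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .build()
}
```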
Option 3: Adjustable Keep-Alive timeout
I think that we would be less likely to get these errors if we could ensure that the client's connection timeout is shorter than the server's. That avoids the scenario where the server hangs up on the client. There could be CLI parameters for each of the main HTTP connections (EL + builder connections for the BN, BN connection for the VC). The downside to this approach is that it requires manually fiddling with timeouts, and knowledge of the server's timeout. I think if we implement Option 1 this wouldn't be worth doing.
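If we went this way, the CLI flag would presumably just feed into reqwest's pool idle timeout, something like the sketch below (the function name and parameter are hypothetical):

```rust
use std::time::Duration;

/// A minimal sketch: keep idle connections for less time than the server does,
/// so the client is always the side that hangs up first. reqwest's default is 90s.
fn client_with_idle_timeout(idle_timeout_secs: u64) -> Result<reqwest::Client, reqwest::Error> {
    reqwest::Client::builder()
        .pool_idle_timeout(Duration::from_secs(idle_timeout_secs))
        .build()
}
```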
Regardless of which approach we take, testing is going to be a bit difficult.
Nice write-up by @nflaig of an issue observed with Lodestar as the client and Lighthouse as the server. It might contain some insights for this issue (I haven't read it in detail yet): ChainSafe/lodestar#5765
This has also been reported with web3signer connections. Luckily, web3signer includes a flag for tweaking the idle timeout. If my theory from Option 3 is correct, then running web3signer with --idle-connection-timeout-seconds 100 should ensure that Lighthouse closes the connection before web3signer does. The Lighthouse timeout is 90s (the default from reqwest).