Added retry mechanism in executionEngine for executePayload #3854
Conversation
Codecov Report
@@            Coverage Diff             @@
##           unstable    #3854      +/-   ##
=============================================
+ Coverage          0   36.11%    +36.11%
=============================================
  Files             0      325       +325
  Lines             0     9043      +9043
  Branches          0     1421      +1421
=============================================
+ Hits              0     3266      +3266
- Misses            0     5632      +5632
- Partials          0      145       +145
Performance Report
✔️ no performance regression detected. Full benchmark results
description: "Delay time between retries when retrying calls to the execution engine API",
type: "number",
defaultDescription:
  defaultOptions.executionEngine.mode === "http" ? String(defaultOptions.executionEngine.retryDelay) : "0",
Do we really need to tailor-make it for HTTP? Even if there were, for example, a non-HTTP mode someday (like WS), this would stay the same, I believe.
This was done to make it uniform with the already existing execution.urls and execution.timeout options, which are tailor-made for HTTP.
I would just do
defaultDescription: String(defaultOptions.executionEngine.retryDelay)
I think TypeScript isn't happy otherwise (it mixes it up with the mock options), so I guess this is alright!
- // treated seperate from being INVALID. For now, just pass the error upstream.
- .catch((e: Error): EngineApiRpcReturnTypes[typeof method] => {
+ // treated separate from being INVALID. For now, just pass the error upstream.
+ .catch(async (e: Error) => {
I'm not sure that's allowed. Can a catch callback handle rejections without causing an unhandled promise rejection?
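For what it's worth, a minimal standalone sketch (not Lodestar code) suggests the pattern is safe: an async catch callback may re-throw, and no unhandled rejection occurs as long as the promise it returns is itself awaited or caught downstream.

```ts
// Minimal sketch (not Lodestar code): an async catch callback can
// re-reject; there is no unhandled rejection as long as the promise
// it returns is awaited or caught further down the chain.
async function demo(): Promise<void> {
  const result = await Promise.reject(new Error("boom"))
    .catch(async (e: Error) => {
      // async work (e.g. cleanup or a retry) before deciding to re-reject
      await new Promise((resolve) => setTimeout(resolve, 10));
      throw e; // re-reject; handled by the next .catch
    })
    .catch((e: Error) => `recovered: ${e.message}`);
  console.log(result); // "recovered: boom"
}

demo().catch(console.error);
```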
    retryAttempts: this.retryAttempts,
    retryDelay: this.retryDelay,
  }
);
Here it would be very important to track retry behaviour, meaning adding metrics for:
- a histogram of retry attempts per call
- a histogram of overall request times to the EL

To de-duplicate code you can add a private method fetchEl that handles metrics and calls fetchWithRetries. You can even move the logic from fetchWithRetries here, since no one else is using it.
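As an illustration only, a hedged sketch of what such a wrapper could look like; fetchEl's shape, the Histogram type, the metric names, and the onRetry hook are all assumptions here, not the PR's actual code.

```ts
// Hypothetical sketch of the suggested fetchEl wrapper (names assumed,
// not actual Lodestar code). It centralizes the two suggested metrics
// around a generic retrying fetch.
type Histogram = {observe: (labels: {method: string}, value: number) => void};

async function fetchEl<R>(
  method: string,
  fetchWithRetries: (onRetry: () => void) => Promise<R>,
  metrics?: {retryAttempts: Histogram; requestTime: Histogram}
): Promise<R> {
  const start = Date.now();
  let retries = 0;
  try {
    // assume fetchWithRetries invokes onRetry before each re-attempt
    return await fetchWithRetries(() => retries++);
  } finally {
    // record per-call retry count and overall request time to the EL
    metrics?.retryAttempts.observe({method}, retries);
    metrics?.requestTime.observe({method}, (Date.now() - start) / 1000);
  }
}
```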
bump
> You can even move the logic in fetchWithRetries to here since no-one else is using it.

I decided to leave fetchWithRetries as is because, as mentioned by @g11tech here, it makes it easier to test the retry mechanism separately.
@dapplion can you help clarify what you meant by "histogram of retry attempts per call"? I interpreted that to mean another histogram that captures the duration of each retry, while @g11tech interpreted it as a counter incremented on each retried request.

Also, I was wondering: this suggestion only covers adding metrics to the retries done in executionEngine/http, while JsonRpcHttpClient is used in other places too, for example in eth1Provider.ts. Would it be an idea to move metrics tracking entirely into JsonRpcHttpClient? Then every client that uses JsonRpcHttpClient would not need separate metrics; they would be generated automatically by making requests through JsonRpcHttpClient.

If yes, then I can do that as another PR. Let me know what you think.
Actually it might be a good idea to add metrics to JsonRpcHttpClient. Eth1 calls (i.e. calls not to the execution engine) should have metrics too.
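A rough sketch of that direction, purely illustrative (the class shape and metric names are assumptions; the real JsonRpcHttpClient looks different): metrics are injected once into the shared client, so every consumer gets them for free.

```ts
// Illustrative only; not the actual Lodestar JsonRpcHttpClient.
interface IJsonRpcMetrics {
  requestTime: {observe: (labels: {routeId: string}, sec: number) => void};
  requestErrors: {inc: (labels: {routeId: string}) => void};
}

class JsonRpcHttpClientSketch {
  constructor(private readonly metrics?: IJsonRpcMetrics) {}

  // every caller (execution engine, eth1 provider, ...) is measured here
  async fetch<R>(method: string, doFetch: () => Promise<R>): Promise<R> {
    const start = Date.now();
    try {
      return await doFetch();
    } catch (e) {
      this.metrics?.requestErrors.inc({routeId: method});
      throw e;
    } finally {
      this.metrics?.requestTime.observe({routeId: method}, (Date.now() - start) / 1000);
    }
  }
}
```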
    shouldRetry: opts?.shouldRetry,
  }
);
return parseRpcResponse(res, payload);
The shouldRetry logic here should be based exclusively on networking errors: you only want to retry when the EL is unavailable. Here it's a bit tough because the underlying HTTP client can change, so you should add a lot of unit (or e2e) tests that spin up a real server and try different things like:
- bad URL
- bad port
- NGINX refusing to forward
- etc.

and ensure that only those cases are retried. Otherwise you can try to guess when a response is actually from an EL client, and skip retrying only when you know the error is an "app layer" EL error.
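For illustration, a minimal networking-only predicate could look like the sketch below; the error codes are standard Node.js errno values, but the function name and the usage line are assumptions, not the PR's code.

```ts
// Hedged sketch: retry only on transport-level failures, never on
// JSON-RPC "app layer" errors returned by a live EL client.
function isNetworkError(e: Error): boolean {
  const code = (e as NodeJS.ErrnoException).code;
  return (
    code === "ECONNREFUSED" || // bad port, or EL not running
    code === "ENOTFOUND" ||    // bad URL / DNS failure
    code === "ECONNRESET" ||   // connection dropped mid-request
    code === "ETIMEDOUT"       // no response in time
  );
}

// usage sketch: shouldRetry: (lastError) => isNetworkError(lastError)
```

Note that a proxy rejection (e.g. NGINX returning 502) arrives as an HTTP response rather than an errno, which is part of why real-server tests are needed.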
> The should retry logic here should be based exclusively in networking errors. You only want to retry when the EL is unavailable

I believe this is already the case. Line 94 (parseRpcResponse) would only be reached if the call returns 200, and parseRpcResponse parses that response and throws ErrorJsonRpcResponse. The request is retried only when the call fails for any other reason (i.e. the response is not ok, such as network errors).
> So you should add a lot of unit (or e2e) tests that spin up a real server

I'll add these.
describe("getPayload", async () => { | ||
it("getPayload with retry", async function () { | ||
this.timeout("10 min"); | ||
/** | ||
* curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"engine_getPayload","params":["0x0"],"id":67}' http://localhost:8545 | ||
*/ |
You don't need to rewrite all the execution API cases, as it will be difficult to maintain/update any changes in two places. Rather, just test the fetchWithRetries functionality exhaustively against hypothetical request/response pairs (in the same way it's happening here); basically, write its test cases independently of the execution API logic.
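As a sketch of testing the retry behaviour independently against a throwaway local server (the helper retryingFetch stands in for the real fetchWithRetries; none of these names are from the PR):

```ts
// Hypothetical e2e-style test sketch; assumes Node >= 18 (global fetch)
// and a mocha-style `it` in scope.
import http from "node:http";
import {strict as assert} from "node:assert";

async function retryingFetch(url: string, attempts: number): Promise<string> {
  let lastErr: Error = new Error("RetryError");
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      return await res.text();
    } catch (e) {
      lastErr = e as Error;
    }
  }
  throw lastErr;
}

it("consumes retries on a closed port, succeeds against a live server", async () => {
  // closed port: every attempt is a network error, so all retries are used
  await assert.rejects(retryingFetch("http://127.0.0.1:1", 3));

  const server = http.createServer((_req, res) => res.end("ok"));
  await new Promise<void>((resolve) => server.listen(0, () => resolve()));
  const port = (server.address() as {port: number}).port;
  assert.equal(await retryingFetch(`http://127.0.0.1:${port}`, 3), "ok");
  server.close();
});
```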
What do you say to removing the tests as you suggested, but leaving just the "notifyForkchoiceUpdate no retry when no payload attributes" and "notifyForkchoiceUpdate with retry when payload attributes" tests? These are specific to how the node calls the EL, and to when those calls should or should not be retried.
OK, let's leave it at that, no need to remove 👍
docker/grafana/datasource.local.yml (outdated)

@@ -4,6 +4,6 @@ datasources:
   - name: Prometheus
     type: prometheus
     access: proxy
-    url: http://localhost:9090
+    url: http://host.docker.internal:9090
Why is localhost:9090 not reachable on macOS?
I think that's because Docker on macOS does not have direct access to the host: it runs on top of a Linux VM, which makes it impossible to use localhost to refer to the actual host.
Then I would move these changes to another PR to keep this one scoped to dadepo/retry-el-executepayload.
These changes were made because of the request to add metrics to the retry mechanism, since it makes it easier to inspect the metrics via the local Prometheus/Grafana setup instead of having to first push and run on a Linux server.
But I can move them to a separate PR as you suggested and just use the /metrics endpoint exposed by the node.
@@ -113,6 +126,9 @@ export class ExecutionEngineHttp implements IExecutionEngine {
       {
         retryAttempts: this.retryAttempts,
         retryDelay: this.retryDelay,
+        onEachRetryFn: () => {
+          this?.metrics?.executionEngineRequestCount.inc({method});
Can this be undefined here?
I am removing this because I am moving metrics inside the JSON-RPC client; also, the intent is to capture the retry count, not the total count.

That method is called when the validator builds and proposes; this is just from my local setup without a validator on the Kiln network.
packages/lodestar/src/util/retry.ts (outdated)

@@ -28,7 +28,7 @@ export async function retry<A>(fn: (attempt: number) => A | Promise<A>, opts?: I
   const shouldRetry = opts?.shouldRetry;

   let lastError: Error = Error("RetryError");
-  for (let i = 1; i <= maxRetries; i++) {
+  for (let i = 0; i < maxRetries; i++) {
Warning! This change can break all existing usages of retry downstream! Please review carefully
Hi @g11tech, can you help look into this? Since you made the modifications, it is probably faster for you to share the reasoning behind the change. As far as I can see, everything still looks fine.
Hey @dadepo, I think I did it so that the first attempt is not counted as a retry. You can revert it, and maybe correct a test that might fail because of the retry comparison.
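To make the off-by-one concrete, here is a hedged, simplified sketch of the loop (IRetryOptions details omitted; not the exact util code):

```ts
// Simplified sketch of the retry loop being discussed.
// Original: for (let i = 1; i <= maxRetries; i++) -> fn sees attempts 1..maxRetries
// Changed:  for (let i = 0; i < maxRetries; i++)  -> fn sees attempts 0..maxRetries-1
// Both run fn maxRetries times; only the attempt index passed to fn
// (and anything keyed on it, e.g. tests comparing retry counts) differs.
async function retrySketch<A>(fn: (attempt: number) => Promise<A>, maxRetries: number): Promise<A> {
  let lastError: Error = new Error("RetryError");
  for (let i = 1; i <= maxRetries; i++) {
    try {
      return await fn(i);
    } catch (e) {
      lastError = e as Error;
    }
  }
  throw lastError;
}
```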
Overall looks good! Some changes I've made:
Check the issue with the packages/lodestar/src/util/retry.ts changes and it should be good to go. If I've broken the tests, I'd appreciate it if you could review them 🙏
force-pushed from 1b34c47 to 4c23e6c (compare)
force-pushed from 4c23e6c to e9ba839 (compare)
LGTM
Motivation
Adding retries when the CL calls the EL.
Closes #3567