Added retry mechanism in executionEngine for executePayload #3854

Merged: 6 commits, Jul 25, 2022

Conversation

dadepo
Contributor

@dadepo dadepo commented Mar 13, 2022

Motivation

Adding retries when CL calls EL

Closes #3567

@codecov

codecov bot commented Mar 13, 2022

Codecov Report

Merging #3854 (f8a6521) into unstable (684c2e3) will increase coverage by 36.11%.
The diff coverage is n/a.

❗ Current head f8a6521 differs from pull request most recent head 713f213. Consider uploading reports for the commit 713f213 to get more accurate results

@@              Coverage Diff              @@
##           unstable    #3854       +/-   ##
=============================================
+ Coverage          0   36.11%   +36.11%     
=============================================
  Files             0      325      +325     
  Lines             0     9043     +9043     
  Branches          0     1421     +1421     
=============================================
+ Hits              0     3266     +3266     
- Misses            0     5632     +5632     
- Partials          0      145      +145     

@github-actions
Contributor

github-actions bot commented Mar 13, 2022

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: d1864c0 Previous: 292b36d Ratio
getPubkeys - index2pubkey - req 1000 vs - 250000 vc 2.3306 ms/op 2.2164 ms/op 1.05
getPubkeys - validatorsArr - req 1000 vs - 250000 vc 77.744 us/op 67.183 us/op 1.16
BLS verify - blst-native 1.8497 ms/op 2.1676 ms/op 0.85
BLS verifyMultipleSignatures 3 - blst-native 3.8003 ms/op 4.4843 ms/op 0.85
BLS verifyMultipleSignatures 8 - blst-native 8.1823 ms/op 9.6956 ms/op 0.84
BLS verifyMultipleSignatures 32 - blst-native 29.665 ms/op 35.236 ms/op 0.84
BLS aggregatePubkeys 32 - blst-native 39.564 us/op 46.980 us/op 0.84
BLS aggregatePubkeys 128 - blst-native 152.06 us/op 182.79 us/op 0.83
getAttestationsForBlock 47.806 ms/op 44.068 ms/op 1.08
isKnown best case - 1 super set check 420.00 ns/op 474.00 ns/op 0.89
isKnown normal case - 2 super set checks 414.00 ns/op 466.00 ns/op 0.89
isKnown worse case - 16 super set checks 409.00 ns/op 464.00 ns/op 0.88
CheckpointStateCache - add get delete 9.2780 us/op 9.4910 us/op 0.98
validate gossip signedAggregateAndProof - struct 4.2945 ms/op 5.0248 ms/op 0.85
validate gossip attestation - struct 2.0359 ms/op 2.3669 ms/op 0.86
altair verifyImport mainnet_s3766816:31 8.9031 s/op 8.9948 s/op 0.99
pickEth1Vote - no votes 2.0698 ms/op 2.1036 ms/op 0.98
pickEth1Vote - max votes 21.915 ms/op 22.846 ms/op 0.96
pickEth1Vote - Eth1Data hashTreeRoot value x2048 11.878 ms/op 13.212 ms/op 0.90
pickEth1Vote - Eth1Data hashTreeRoot tree x2048 20.524 ms/op 21.857 ms/op 0.94
pickEth1Vote - Eth1Data fastSerialize value x2048 1.5257 ms/op 1.5243 ms/op 1.00
pickEth1Vote - Eth1Data fastSerialize tree x2048 17.078 ms/op 15.537 ms/op 1.10
bytes32 toHexString 1.0630 us/op 1.1950 us/op 0.89
bytes32 Buffer.toString(hex) 676.00 ns/op 822.00 ns/op 0.82
bytes32 Buffer.toString(hex) from Uint8Array 942.00 ns/op 1.0910 us/op 0.86
bytes32 Buffer.toString(hex) + 0x 688.00 ns/op 826.00 ns/op 0.83
Object access 1 prop 0.36200 ns/op 0.43400 ns/op 0.83
Map access 1 prop 0.29200 ns/op 0.30500 ns/op 0.96
Object get x1000 17.732 ns/op 11.571 ns/op 1.53
Map get x1000 1.1050 ns/op 1.1340 ns/op 0.97
Object set x1000 117.70 ns/op 89.066 ns/op 1.32
Map set x1000 71.693 ns/op 59.614 ns/op 1.20
Return object 10000 times 0.36790 ns/op 0.44360 ns/op 0.83
Throw Error 10000 times 6.0166 us/op 5.9495 us/op 1.01
enrSubnets - fastDeserialize 64 bits 2.7080 us/op 3.3000 us/op 0.82
enrSubnets - ssz BitVector 64 bits 753.00 ns/op 906.00 ns/op 0.83
enrSubnets - fastDeserialize 4 bits 384.00 ns/op 479.00 ns/op 0.80
enrSubnets - ssz BitVector 4 bits 729.00 ns/op 889.00 ns/op 0.82
prioritizePeers score -10:0 att 32-0.1 sync 2-0 95.221 us/op 95.712 us/op 0.99
prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 124.24 us/op 135.24 us/op 0.92
prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 214.12 us/op 247.53 us/op 0.87
prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 484.23 us/op 336.05 us/op 1.44
prioritizePeers score 0:0 att 64-1 sync 4-1 464.63 us/op 407.29 us/op 1.14
RateTracker 1000000 limit, 1 obj count per request 187.22 ns/op 202.64 ns/op 0.92
RateTracker 1000000 limit, 2 obj count per request 141.82 ns/op 152.16 ns/op 0.93
RateTracker 1000000 limit, 4 obj count per request 115.58 ns/op 126.26 ns/op 0.92
RateTracker 1000000 limit, 8 obj count per request 110.57 ns/op 111.02 ns/op 1.00
RateTracker with prune 5.0390 us/op 4.8710 us/op 1.03
array of 16000 items push then shift 3.2095 us/op 51.612 us/op 0.06
LinkedList of 16000 items push then shift 29.758 ns/op 17.391 ns/op 1.71
array of 16000 items push then pop 259.22 ns/op 231.28 ns/op 1.12
LinkedList of 16000 items push then pop 22.962 ns/op 14.823 ns/op 1.55
array of 24000 items push then shift 4.5769 us/op 77.388 us/op 0.06
LinkedList of 24000 items push then shift 32.876 ns/op 22.016 ns/op 1.49
array of 24000 items push then pop 215.89 ns/op 200.67 ns/op 1.08
LinkedList of 24000 items push then pop 23.880 ns/op 16.495 ns/op 1.45
intersect bitArray bitLen 8 11.757 ns/op 11.003 ns/op 1.07
intersect array and set length 8 169.04 ns/op 160.85 ns/op 1.05
intersect bitArray bitLen 128 61.864 ns/op 55.513 ns/op 1.11
intersect array and set length 128 2.3625 us/op 2.0491 us/op 1.15
pass gossip attestations to forkchoice per slot 3.6214 ms/op 2.8437 ms/op 1.27
computeDeltas 3.9844 ms/op 3.1766 ms/op 1.25
computeProposerBoostScoreFromBalances 907.60 us/op 804.27 us/op 1.13
altair processAttestation - 250000 vs - 7PWei normalcase 3.9875 ms/op 4.0912 ms/op 0.97
altair processAttestation - 250000 vs - 7PWei worstcase 5.8850 ms/op 5.9621 ms/op 0.99
altair processAttestation - setStatus - 1/6 committees join 209.30 us/op 179.37 us/op 1.17
altair processAttestation - setStatus - 1/3 committees join 393.60 us/op 343.90 us/op 1.14
altair processAttestation - setStatus - 1/2 committees join 560.68 us/op 484.51 us/op 1.16
altair processAttestation - setStatus - 2/3 committees join 712.92 us/op 633.80 us/op 1.12
altair processAttestation - setStatus - 4/5 committees join 996.41 us/op 890.37 us/op 1.12
altair processAttestation - setStatus - 100% committees join 1.1800 ms/op 1.0869 ms/op 1.09
altair processBlock - 250000 vs - 7PWei normalcase 28.444 ms/op 24.584 ms/op 1.16
altair processBlock - 250000 vs - 7PWei normalcase hashState 41.769 ms/op 34.341 ms/op 1.22
altair processBlock - 250000 vs - 7PWei worstcase 81.984 ms/op 88.620 ms/op 0.93
altair processBlock - 250000 vs - 7PWei worstcase hashState 103.44 ms/op 97.535 ms/op 1.06
phase0 processBlock - 250000 vs - 7PWei normalcase 4.5582 ms/op 4.1568 ms/op 1.10
phase0 processBlock - 250000 vs - 7PWei worstcase 49.830 ms/op 53.595 ms/op 0.93
altair processEth1Data - 250000 vs - 7PWei normalcase 1.0605 ms/op 826.60 us/op 1.28
Tree 40 250000 create 902.48 ms/op 826.25 ms/op 1.09
Tree 40 250000 get(125000) 296.02 ns/op 246.14 ns/op 1.20
Tree 40 250000 set(125000) 2.6400 us/op 2.3106 us/op 1.14
Tree 40 250000 toArray() 32.935 ms/op 27.942 ms/op 1.18
Tree 40 250000 iterate all - toArray() + loop 34.312 ms/op 28.200 ms/op 1.22
Tree 40 250000 iterate all - get(i) 113.79 ms/op 111.33 ms/op 1.02
MutableVector 250000 create 16.278 ms/op 14.161 ms/op 1.15
MutableVector 250000 get(125000) 13.059 ns/op 10.890 ns/op 1.20
MutableVector 250000 set(125000) 684.13 ns/op 604.95 ns/op 1.13
MutableVector 250000 toArray() 7.7284 ms/op 6.7037 ms/op 1.15
MutableVector 250000 iterate all - toArray() + loop 12.687 ms/op 6.8739 ms/op 1.85
MutableVector 250000 iterate all - get(i) 3.3053 ms/op 2.6940 ms/op 1.23
Array 250000 create 6.6270 ms/op 6.6154 ms/op 1.00
Array 250000 clone - spread 2.7669 ms/op 3.5862 ms/op 0.77
Array 250000 get(125000) 1.1620 ns/op 1.6720 ns/op 0.69
Array 250000 set(125000) 1.1310 ns/op 1.6420 ns/op 0.69
Array 250000 iterate all - loop 167.87 us/op 152.96 us/op 1.10
effectiveBalanceIncrements clone Uint8Array 300000 73.143 us/op 60.893 us/op 1.20
effectiveBalanceIncrements clone MutableVector 300000 781.00 ns/op 1.0940 us/op 0.71
effectiveBalanceIncrements rw all Uint8Array 300000 252.50 us/op 247.59 us/op 1.02
effectiveBalanceIncrements rw all MutableVector 300000 184.79 ms/op 192.02 ms/op 0.96
phase0 afterProcessEpoch - 250000 vs - 7PWei 180.66 ms/op 189.44 ms/op 0.95
phase0 beforeProcessEpoch - 250000 vs - 7PWei 91.580 ms/op 81.313 ms/op 1.13
altair processEpoch - mainnet_e81889 598.85 ms/op 553.51 ms/op 1.08
mainnet_e81889 - altair beforeProcessEpoch 164.92 ms/op 127.02 ms/op 1.30
mainnet_e81889 - altair processJustificationAndFinalization 30.674 us/op 16.496 us/op 1.86
mainnet_e81889 - altair processInactivityUpdates 12.051 ms/op 9.1157 ms/op 1.32
mainnet_e81889 - altair processRewardsAndPenalties 99.154 ms/op 82.473 ms/op 1.20
mainnet_e81889 - altair processRegistryUpdates 4.9400 us/op 2.7910 us/op 1.77
mainnet_e81889 - altair processSlashings 1.2420 us/op 676.00 ns/op 1.84
mainnet_e81889 - altair processEth1DataReset 1.4720 us/op 657.00 ns/op 2.24
mainnet_e81889 - altair processEffectiveBalanceUpdates 2.5095 ms/op 1.9856 ms/op 1.26
mainnet_e81889 - altair processSlashingsReset 8.6240 us/op 4.4220 us/op 1.95
mainnet_e81889 - altair processRandaoMixesReset 5.6050 us/op 4.6520 us/op 1.20
mainnet_e81889 - altair processHistoricalRootsUpdate 1.3250 us/op 710.00 ns/op 1.87
mainnet_e81889 - altair processParticipationFlagUpdates 4.1900 us/op 4.2690 us/op 0.98
mainnet_e81889 - altair processSyncCommitteeUpdates 744.00 ns/op 582.00 ns/op 1.28
mainnet_e81889 - altair afterProcessEpoch 192.53 ms/op 219.46 ms/op 0.88
phase0 processEpoch - mainnet_e58758 549.08 ms/op 498.87 ms/op 1.10
mainnet_e58758 - phase0 beforeProcessEpoch 251.64 ms/op 184.25 ms/op 1.37
mainnet_e58758 - phase0 processJustificationAndFinalization 27.949 us/op 18.502 us/op 1.51
mainnet_e58758 - phase0 processRewardsAndPenalties 129.73 ms/op 104.83 ms/op 1.24
mainnet_e58758 - phase0 processRegistryUpdates 13.852 us/op 9.1840 us/op 1.51
mainnet_e58758 - phase0 processSlashings 1.0450 us/op 688.00 ns/op 1.52
mainnet_e58758 - phase0 processEth1DataReset 994.00 ns/op 646.00 ns/op 1.54
mainnet_e58758 - phase0 processEffectiveBalanceUpdates 2.1173 ms/op 1.9674 ms/op 1.08
mainnet_e58758 - phase0 processSlashingsReset 5.0650 us/op 4.8380 us/op 1.05
mainnet_e58758 - phase0 processRandaoMixesReset 5.8210 us/op 4.2930 us/op 1.36
mainnet_e58758 - phase0 processHistoricalRootsUpdate 887.00 ns/op 713.00 ns/op 1.24
mainnet_e58758 - phase0 processParticipationRecordUpdates 6.0710 us/op 4.5490 us/op 1.33
mainnet_e58758 - phase0 afterProcessEpoch 157.25 ms/op 163.05 ms/op 0.96
phase0 processEffectiveBalanceUpdates - 250000 normalcase 2.6486 ms/op 1.9893 ms/op 1.33
phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 2.9990 ms/op 2.2588 ms/op 1.33
altair processInactivityUpdates - 250000 normalcase 43.238 ms/op 34.449 ms/op 1.26
altair processInactivityUpdates - 250000 worstcase 42.755 ms/op 36.828 ms/op 1.16
phase0 processRegistryUpdates - 250000 normalcase 8.0430 us/op 6.1030 us/op 1.32
phase0 processRegistryUpdates - 250000 badcase_full_deposits 461.25 us/op 373.94 us/op 1.23
phase0 processRegistryUpdates - 250000 worstcase 0.5 221.16 ms/op 182.90 ms/op 1.21
altair processRewardsAndPenalties - 250000 normalcase 93.474 ms/op 113.84 ms/op 0.82
altair processRewardsAndPenalties - 250000 worstcase 120.48 ms/op 77.572 ms/op 1.55
phase0 getAttestationDeltas - 250000 normalcase 13.269 ms/op 13.250 ms/op 1.00
phase0 getAttestationDeltas - 250000 worstcase 12.796 ms/op 14.194 ms/op 0.90
phase0 processSlashings - 250000 worstcase 5.3566 ms/op 5.2039 ms/op 1.03
altair processSyncCommitteeUpdates - 250000 277.76 ms/op 298.35 ms/op 0.93
BeaconState.hashTreeRoot - No change 471.00 ns/op 537.00 ns/op 0.88
BeaconState.hashTreeRoot - 1 full validator 63.223 us/op 72.511 us/op 0.87
BeaconState.hashTreeRoot - 32 full validator 635.90 us/op 720.24 us/op 0.88
BeaconState.hashTreeRoot - 512 full validator 5.8907 ms/op 8.8544 ms/op 0.67
BeaconState.hashTreeRoot - 1 validator.effectiveBalance 78.828 us/op 90.601 us/op 0.87
BeaconState.hashTreeRoot - 32 validator.effectiveBalance 1.1649 ms/op 1.2572 ms/op 0.93
BeaconState.hashTreeRoot - 512 validator.effectiveBalance 15.895 ms/op 17.606 ms/op 0.90
BeaconState.hashTreeRoot - 1 balances 58.422 us/op 70.232 us/op 0.83
BeaconState.hashTreeRoot - 32 balances 568.56 us/op 647.20 us/op 0.88
BeaconState.hashTreeRoot - 512 balances 6.4830 ms/op 6.3652 ms/op 1.02
BeaconState.hashTreeRoot - 250000 balances 90.537 ms/op 103.61 ms/op 0.87
aggregationBits - 2048 els - zipIndexesInBitList 32.002 us/op 34.338 us/op 0.93
regular array get 100000 times 67.467 us/op 60.626 us/op 1.11
wrappedArray get 100000 times 67.448 us/op 60.709 us/op 1.11
arrayWithProxy get 100000 times 28.862 ms/op 29.090 ms/op 0.99
ssz.Root.equals 473.00 ns/op 580.00 ns/op 0.82
byteArrayEquals 448.00 ns/op 585.00 ns/op 0.77
shuffle list - 16384 els 10.995 ms/op 11.332 ms/op 0.97
shuffle list - 250000 els 161.98 ms/op 167.51 ms/op 0.97
processSlot - 1 slots 12.025 us/op 13.741 us/op 0.88
processSlot - 32 slots 1.7401 ms/op 1.9738 ms/op 0.88
getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei 963.21 us/op 392.94 us/op 2.45
getCommitteeAssignments - req 1 vs - 250000 vc 5.2777 ms/op 5.3968 ms/op 0.98
getCommitteeAssignments - req 100 vs - 250000 vc 7.3146 ms/op 7.8735 ms/op 0.93
getCommitteeAssignments - req 1000 vs - 250000 vc 7.7576 ms/op 8.4522 ms/op 0.92
computeProposers - vc 250000 18.454 ms/op 19.129 ms/op 0.96
computeEpochShuffling - vc 250000 165.80 ms/op 171.22 ms/op 0.97
getNextSyncCommittee - vc 250000 269.97 ms/op 293.24 ms/op 0.92

by benchmarkbot/action

description: "Delay time between retries when retrying calls to the execution engine API",
type: "number",
defaultDescription:
  defaultOptions.executionEngine.mode === "http" ? String(defaultOptions.executionEngine.retryDelay) : "0",
Contributor

Do we really need to tailor this to http? Even if a non-http mode (like ws) is ever added, this would stay the same, I believe.

Contributor Author

This was done to keep it uniform with the existing execution.urls and execution.timeout options, which are tailored to http.

Contributor

I would just do

defaultDescription: String(defaultOptions.executionEngine.retryDelay)

Contributor

I think TypeScript isn't happy otherwise: it mixes it up with the mock options, so I guess this is alright!

- // treated seperate from being INVALID. For now, just pass the error upstream.
- .catch((e: Error): EngineApiRpcReturnTypes[typeof method] => {
+ // treated separate from being INVALID. For now, just pass the error upstream.
+ .catch(async (e: Error) => {
Contributor

I'm not sure that's allowed. Can a catch callback handle rejections without causing an unhandled promise rejection?
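On the question above: under standard Promise semantics, a `.catch` callback may be `async` and may itself throw. The rejection then moves to the promise returned by `.catch`, and it only becomes an unhandled rejection if nothing downstream awaits or catches that promise. A minimal sketch of that behaviour follows; `callEngine`, `rethrowAsEngineError`, and `main` are hypothetical names for illustration and are not taken from the PR.

```typescript
// Hypothetical stand-in for a JSON-RPC call that can fail at the network level.
function callEngine(fail: boolean): Promise<string> {
  return fail ? Promise.reject(new Error("connection refused")) : Promise.resolve("ok");
}

// An async catch callback that rethrows: the promise returned by .catch()
// rejects with the new error instead of swallowing the original one.
function rethrowAsEngineError(fail: boolean): Promise<string> {
  return callEngine(fail).catch(async (e: Error) => {
    throw new Error(`engine unavailable: ${e.message}`);
  });
}

// The caller awaits inside try/catch, so the rejection is handled here
// and never surfaces as an unhandled rejection.
async function main(): Promise<string> {
  try {
    return await rethrowAsEngineError(true);
  } catch (e) {
    return (e as Error).message;
  }
}
```

The rejection stays handled only as long as every caller of the returned promise awaits it or attaches its own handler.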

retryAttempts: this.retryAttempts,
retryDelay: this.retryDelay,
}
);
Contributor

Here it would be very important to track retry behaviour. That means adding metrics for:

  • histogram of retry attempts per call
  • histogram of overall request times to the EL

To de-duplicate code you can add a private method fetchEl that handles metrics and calls fetchWithRetries. You can even move the logic in fetchWithRetries to here since no-one else is using it.

Contributor

bump

Contributor Author

> You can even move the logic in fetchWithRetries to here since no-one else is using it.

I decided to leave fetchWithRetries as is because, as @g11tech mentioned, it makes it easier to test the retry mechanism separately.

@dapplion can you clarify what you meant by "histogram of retry attempts per call"? I interpreted it as another histogram that captures the duration of each retry, while @g11tech interpreted it as a counter of the tries for each request.

Also, this suggestion only covers adding metrics to the retries done in executionEngine/http, while JsonRpcHttpClient is used in other places too, for example in eth1Provider.ts. Would it be an idea to move metrics tracking entirely into JsonRpcHttpClient? Then clients that use JsonRpcHttpClient would not need separate metrics; they would be generated automatically whenever JsonRpcHttpClient makes a request.

If yes, I can do that in another PR. Let me know what you think.

Contributor

Actually, it might be a good idea to add metrics to JsonRpcHttpClient. eth1 calls that do not go to the execution engine should have metrics too.
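The two metrics discussed in this thread (retries per call and overall request time) could be recorded by one wrapper around the request, roughly as sketched below. This is only an illustration under assumptions: `InMemoryHistogram`, `RpcMetrics`, and `fetchWithMetrics` are hypothetical names, the histogram is a plain in-memory array rather than a real Prometheus client, and `retryAttempts` is taken to mean the total number of attempts.

```typescript
// Minimal in-memory histogram (assumption: a real node would use a Prometheus client).
class InMemoryHistogram {
  readonly observations: number[] = [];
  observe(value: number): void {
    this.observations.push(value);
  }
}

interface RpcMetrics {
  // Retries needed per call (0 means the first attempt succeeded).
  retryCount: InMemoryHistogram;
  // Total wall time per call in seconds, including all retries.
  requestTimeSec: InMemoryHistogram;
}

// Hypothetical wrapper: runs `request` up to `retryAttempts` times in total
// and records both metrics exactly once per call, on success or failure.
async function fetchWithMetrics<T>(
  request: () => Promise<T>,
  retryAttempts: number,
  metrics: RpcMetrics
): Promise<T> {
  const start = Date.now();
  let retries = 0;
  try {
    for (;;) {
      try {
        return await request();
      } catch (e) {
        // Give up once the attempt budget is used; otherwise count one more retry.
        if (retries + 1 >= retryAttempts) throw e;
        retries++;
      }
    }
  } finally {
    metrics.retryCount.observe(retries);
    metrics.requestTimeSec.observe((Date.now() - start) / 1000);
  }
}
```

Placing such a wrapper inside the JSON-RPC client, as suggested above, would give eth1 and execution engine calls the same metrics without each caller adding its own.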

shouldRetry: opts?.shouldRetry,
}
);
return parseRpcResponse(res, payload);
Contributor

The shouldRetry logic here should be based exclusively on networking errors. You only want to retry when the EL is unavailable. It's a bit tough here because the underlying HTTP client can change, so you should add a lot of unit (or e2e) tests that spin up a real server and try different things like:

  • bad URL
  • bad port
  • NGINX refuses to forward
  • etc

And ensure that only those are retried. Otherwise you can try to guess when a response is actually from an EL client, and skip the retry only when you know the error is an "app layer" EL error.

Contributor Author

> The shouldRetry logic here should be based exclusively on networking errors. You only want to retry when the EL is unavailable.

I believe this is the case already. Line 94 (parseRpcResponse) is only reached if the call returns 200; parseRpcResponse then parses the response and throws ErrorJsonRpcResponse. The request is retried only if the call fails for any other reason (the response is not ok, i.e. network errors).

> So you should add a lot of unit (or e2e) tests that spin up a real server

I'll add these.
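The distinction drawn in this thread, retrying transport failures but not errors returned by the EL itself, can be sketched as a retry helper with a `shouldRetry` predicate. Everything here is an assumption for illustration: `RpcFailure`, `makeFailure`, and `retryOnTransportError` are hypothetical names, and the `kind` tag stands in for whatever the real client uses to tell a network error from a JSON-RPC error response.

```typescript
// Hypothetical tagging of failures: "transport" for network-level problems
// (bad URL, bad port, proxy refusing to forward), "app" for errors the EL returned.
type FailureKind = "transport" | "app";
type RpcFailure = Error & {kind: FailureKind};

function makeFailure(kind: FailureKind, message: string): RpcFailure {
  return Object.assign(new Error(message), {kind});
}

// Retry only when the EL was unreachable, never on an app-layer error.
function shouldRetry(e: Error): boolean {
  return (e as Partial<RpcFailure>).kind === "transport";
}

async function retryOnTransportError<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: Error = new Error("no attempts made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e as Error;
      // Stop immediately on errors that a retry cannot fix.
      if (!shouldRetry(lastError)) break;
    }
  }
  throw lastError;
}
```

With this shape, a test can script a sequence of failures and check both the number of calls and the error that finally surfaces, without depending on any execution API method.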

Comment on lines 60 to 65
describe("getPayload", async () => {
  it("getPayload with retry", async function () {
    this.timeout("10 min");
    /**
     * curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"engine_getPayload","params":["0x0"],"id":67}' http://localhost:8545
     */
Contributor

You don't need to rewrite all the execution API cases, as it will be difficult to maintain changes in two places. Rather, just test the fetchWithRetries functionality exhaustively against any hypothetical request/response (in the same way it's happening here). Basically, write its test cases independently of the execution API logic.

Contributor Author

What do you say to removing the tests as you suggested but keeping just these two:

  • notifyForkchoiceUpdate no retry when no pay load attributes
  • notifyForkchoiceUpdate with retry when pay load attributes

These are specific to how the node calls the EL, and to when those calls should or should not be retried.

Contributor

OK, let's leave it at that, no need to remove 👍

@dadepo dadepo dismissed a stale review via 77276a9 March 18, 2022 04:51
@@ -4,6 +4,6 @@ datasources:
   - name: Prometheus
     type: prometheus
     access: proxy
-    url: http://localhost:9090
+    url: http://host.docker.internal:9090
Contributor

Why is localhost:9090 not reachable on macOS?

Contributor Author

I think it is because Docker on macOS runs inside a Linux VM and does not have direct access to the host, so localhost inside the container does not refer to the actual host.

Contributor

Then I would move these changes to another PR to keep this one scoped to dadepo/retry-el-executepayload.

Contributor Author

This change was made because of the request to add metrics to the retry mechanism: it makes it easier to see the metrics locally via the Prometheus/Grafana setup, instead of having to first push and run on a Linux server.

But I can move it to a separate PR as you suggested and just use the /metrics endpoint exposed by the node.

@g11tech
Contributor

g11tech commented Mar 24, 2022

@dadepo @dapplion doing a bit of refac, moving metrics inside jsonrpc, adding dashboard, hopefully things will become cleaner and abstracted out from execution engine view point!

@@ -113,6 +126,9 @@ export class ExecutionEngineHttp implements IExecutionEngine {
{
retryAttempts: this.retryAttempts,
retryDelay: this.retryDelay,
onEachRetryFn: () => {
this?.metrics?.executionEngineRequestCount.inc({method});
Contributor

Can `this` be undefined here?

Contributor

@g11tech g11tech Mar 25, 2022

I am removing this because I am moving the metrics inside json rpc. Also, the intent is to capture the retry count, not the total count.

@g11tech g11tech dismissed a stale review via 2119b44 March 25, 2022 11:30
@g11tech
Contributor

g11tech commented Mar 25, 2022

Grafana dashboard, cc @dadepo @dapplion:
[image: Grafana dashboard screenshot]

@dadepo
Contributor Author

dadepo commented Mar 25, 2022

> Grafana dashboard, cc @dadepo @dapplion: [image]

The graph looks good. I just noticed that engine_getPayloadV1 is missing from the screenshot. Is it actually missing, or was it just not part of the screenshot?

@g11tech
Contributor

g11tech commented Mar 25, 2022

> The graph looks good. I just noticed that engine_getPayloadV1 is missing from the screenshot. Is it actually missing, or was it just not part of the screenshot?

That method is called when the validator builds and proposes; this is just from my local setup on the kiln network, without a validator.

g11tech
g11tech previously approved these changes Mar 25, 2022
@g11tech g11tech mentioned this pull request Apr 22, 2022
22 tasks
@dadepo dadepo dismissed stale reviews from ghost and g11tech via bfc911b April 28, 2022 09:40
@dadepo dadepo requested a review from a team as a code owner April 28, 2022 09:40
@dadepo
Contributor Author

dadepo commented Apr 28, 2022

> @dadepo @dapplion doing a bit of refac, moving metrics inside jsonrpc, adding dashboard, hopefully things will become cleaner and abstracted out from execution engine view point!

Hi @dapplion. Is this PR fine by you, and can it be merged? Or are there things you would still like improved?

@@ -28,7 +28,7 @@ export async function retry<A>(fn: (attempt: number) => A | Promise<A>, opts?: I
   const shouldRetry = opts?.shouldRetry;

   let lastError: Error = Error("RetryError");
-  for (let i = 1; i <= maxRetries; i++) {
+  for (let i = 0; i < maxRetries; i++) {
Contributor

Warning! This change can break all existing usages of retry downstream! Please review carefully

Contributor Author

Hi @g11tech, can you help look into this? Since you made the modification, it is probably faster for you to share the reasoning behind the change. As far as I can see, everything still looks fine.

Contributor

Hey @dadepo, I think I did it so the first attempt is not counted as a retry. You can revert it and maybe correct a test that might fail because of the retry comparison.
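To make the concern in this thread concrete: both loop forms call the function the same number of times, but they pass different `attempt` values to it, which matters to any caller or test that reads that argument. The sketch below is a simplified stand-in for the helper in the diff, not the actual implementation; `retrySketch`, `observedAttempts`, and the `startIndex` parameter are hypothetical and exist only to compare the two variants side by side.

```typescript
// Simplified stand-in for the retry helper; `startIndex` selects which loop form to mimic:
//   1 -> for (let i = 1; i <= maxRetries; i++)
//   0 -> for (let i = 0; i < maxRetries; i++)
async function retrySketch<A>(
  fn: (attempt: number) => A | Promise<A>,
  maxRetries: number,
  startIndex: 0 | 1
): Promise<A> {
  let lastError: Error = Error("RetryError");
  for (let i = startIndex; i < maxRetries + startIndex; i++) {
    try {
      return await fn(i);
    } catch (e) {
      lastError = e as Error;
    }
  }
  throw lastError;
}

// Records the attempt numbers seen by a callback that fails every time.
async function observedAttempts(startIndex: 0 | 1): Promise<number[]> {
  const seen: number[] = [];
  await retrySketch<void>(
    (attempt) => {
      seen.push(attempt);
      throw Error("always fails");
    },
    3,
    startIndex
  ).catch(() => undefined);
  return seen;
}
```

With three attempts, the 1-based form passes 1, 2, 3 to the callback and the 0-based form passes 0, 1, 2, so downstream code that logs or compares `attempt` would observe shifted values after the change.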

@dapplion
Contributor

dapplion commented May 9, 2022

> Hi @dapplion. Is this PR fine by you, and can it be merged? Or are there things you would still like improved?

Overall looks good! Some changes I've made:

  • merged the metrics definitions into the lodestar file, no need to keep them separate
  • de-duplicated test code

Check the issue with the packages/lodestar/src/util/retry.ts changes and it should be good to go.

If I've broken the tests, I'd appreciate it if you could review them 🙏

@dapplion dapplion changed the base branch from master to unstable May 27, 2022 04:33
@g11tech g11tech force-pushed the dadepo/retry-el-executepayload branch from 1b34c47 to 4c23e6c Compare July 9, 2022 12:55
@g11tech g11tech force-pushed the dadepo/retry-el-executepayload branch from 4c23e6c to e9ba839 Compare July 19, 2022 15:06
@g11tech g11tech enabled auto-merge (squash) July 19, 2022 19:09
Member

@wemeetagain wemeetagain left a comment

LGTM

@g11tech g11tech requested a review from dapplion July 22, 2022 10:59
@g11tech g11tech merged commit 8b3fef2 into unstable Jul 25, 2022
@g11tech g11tech deleted the dadepo/retry-el-executepayload branch July 25, 2022 15:53