
Stuck/deadlocked using engine.experimental #10649

Closed
1 task done
michaelsproul opened this issue Sep 2, 2024 · 10 comments
Labels
A-consensus Related to the consensus engine C-bug An unexpected or incorrect behavior S-needs-investigation This issue requires detective work to figure out what's going wrong

Comments

@michaelsproul
Contributor

Describe the bug

My mainnet Reth node got stuck for an hour doing seemingly nothing. A manual restart was required to bring it back.

Note the dead time in the logs below between 00:58 and 02:01. During this time Lighthouse was timing out trying to contact Reth.

Steps to reproduce

Run Reth v1.0.6 with --engine.experimental.

Node logs

2024-09-02T00:58:38.885978Z DEBUG reth::cli: Event: Handler(ForkchoiceUpdated(ForkchoiceState { head_block_hash: 0xe6e8afb1787d3d391291573193d6989b65142f4063b3614b6c0fa419fec358cc, safe_block_hash: 0xa08819cf7adc2316c89402ec388617337eccdf199c86baaef3936bbc6e6aac5b, finalized_block_hash: 0xa27e1375062f4795303301f3f4a8b7c922d9a065aa7d03db82f74871d091f086 }, Valid))
2024-09-02T00:58:38.886007Z  INFO reth_node_events::node: Canonical chain committed number=20659449 hash=0xe6e8afb1787d3d391291573193d6989b65142f4063b3614b6c0fa419fec358cc elapsed=635.901µs
2024-09-02T00:58:38.926802Z DEBUG providers::db: Inserted block block_number=20659443 actions=[(InsertCanonicalHeaders, 113.002µs), (InsertHeaders, 71.098µs), (InsertHeaderNumbers, 77.942µs), (GetParentTD, 1.956µs), (InsertHeaderTerminalDifficulties, 34.501µs), (GetNextTxNum, 8.102µs), (InsertTransactionSenders, 0ns), (InsertTransactions, 1.254415ms), (InsertTransactionHashNumbers, 38.972791ms), (InsertBlockWithdrawals, 61.949µs), (InsertBlockBodyIndices, 55.873µs), (InsertTransactionBlocks, 55.314µs)]
2024-09-02T00:58:38.926823Z DEBUG provider::storage_writer: Writing headers and transactions.
2024-09-02T00:58:38.926841Z DEBUG provider::storage_writer: Updating transaction metadata after writing block_num=20659443
2024-09-02T00:58:39.014573Z DEBUG provider::storage_writer: Appended block data range=20659443..=20659443
2024-09-02T00:58:39.050799Z DEBUG provider::static_file: Commit segment=Headers path="/mnt/storage/reth/data/static_files/static_file_headers_20500000_20999999" duration=36.208908ms
2024-09-02T00:58:39.075622Z DEBUG provider::static_file: Commit segment=Transactions path="/mnt/storage/reth/data/static_files/static_file_transactions_20500000_20999999" duration=24.690548ms
2024-09-02T00:58:39.149682Z DEBUG storage::db::mdbx: Commit total_duration=73.947958ms commit_latency=Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 81, audit: 0, write: 0, sync: 4759, ending: 2, whole: 4843, gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work_majflt: 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } })) is_read_only=false
2024-09-02T00:58:39.385860Z DEBUG engine: received no engine message for some time, while waiting for persistence task to complete
2024-09-02T00:58:59.380610Z DEBUG reqwest::connect: starting new connection: http://ipinfo.io/    
2024-09-02T00:58:59.380999Z DEBUG hyper_util::client::legacy::connect::dns: resolving host="ipinfo.io"
2024-09-02T00:58:59.382832Z DEBUG reqwest::connect: starting new connection: http://icanhazip.com/    
2024-09-02T00:58:59.383261Z DEBUG hyper_util::client::legacy::connect::dns: resolving host="icanhazip.com"
2024-09-02T00:58:59.385011Z DEBUG reqwest::connect: starting new connection: http://ifconfig.me/    
2024-09-02T00:58:59.385076Z DEBUG hyper_util::client::legacy::connect::dns: resolving host="ifconfig.me"
2024-09-02T00:58:59.395451Z DEBUG hyper_util::client::legacy::connect::http: connecting to 34.117.59.81:80
2024-09-02T00:58:59.395532Z DEBUG hyper_util::client::legacy::connect::http: connecting to 104.16.185.241:80
2024-09-02T00:58:59.403577Z DEBUG hyper_util::client::legacy::connect::http: connecting to 34.160.111.145:80
2024-09-02T00:58:59.408870Z DEBUG hyper_util::client::legacy::connect::http: connected to 34.117.59.81:80
2024-09-02T00:58:59.410315Z DEBUG hyper_util::client::legacy::connect::http: connected to 104.16.185.241:80
2024-09-02T00:58:59.414266Z DEBUG hyper_util::client::legacy::connect::http: connected to 34.160.111.145:80
2024-09-02T00:58:59.463722Z DEBUG hyper_util::client::legacy::pool: pooling idle connection for ("http", icanhazip.com)
2024-09-02T02:01:14.932288Z  INFO reth::cli: Initialized tracing, debug log directory: /mnt/storage/reth/logs/mainnet
2024-09-02T02:01:14.934454Z  INFO reth::cli: Starting reth version="1.0.6 (c228fe15)"

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 1.0.6
Commit SHA: c228fe1
Build Timestamp: 2024-08-29T18:18:33.351492336Z
Build Features: asm_keccak,jemalloc
Build Profile: maxperf

What database version are you on?

2024-09-02T02:09:44.148521Z INFO Initialized tracing, debug log directory: /home/michael/.cache/reth/logs/mainnet
Current database version: 2
Local database version: 2

Which chain / network are you on?

Mainnet

What type of node are you running?

Full via --full flag

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • I agree to follow the Code of Conduct
@michaelsproul michaelsproul added C-bug An unexpected or incorrect behavior S-needs-triage This issue needs to be labelled labels Sep 2, 2024
@mattsse
Collaborator

mattsse commented Sep 2, 2024

@joshieDo I wonder if this has something to do with --full node?

@mattsse mattsse added A-consensus Related to the consensus engine S-needs-investigation This issue requires detective work to figure out what's going wrong and removed S-needs-triage This issue needs to be labelled labels Sep 2, 2024
@joshieDo
Collaborator

joshieDo commented Sep 2, 2024

maybe... but the logs show that it finished the persistence task in on_save_blocks, so it should have sent the result back
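To make the hypothesized failure mode concrete, here is a toy model (Python, purely illustrative; none of this is Reth code, and the names `persistence_task`/`engine_loop` are made up for the sketch): if the persistence work itself completes but the completion signal is never delivered back, the engine loop waits indefinitely, matching the "received no engine message for some time, while waiting for persistence task to complete" log followed by an hour of dead time.

```python
import asyncio

async def persistence_task(done: asyncio.Future) -> None:
    # The actual persistence work completes...
    await asyncio.sleep(0.01)
    # ...but the result is never sent back: done.set_result() is missing.

async def engine_loop(done: asyncio.Future) -> str:
    # The engine defers further messages until persistence reports back.
    try:
        await asyncio.wait_for(done, timeout=0.1)
        return "resumed"
    except asyncio.TimeoutError:
        return "stuck"  # no signal ever arrives, so the loop stalls

async def main() -> str:
    done = asyncio.get_running_loop().create_future()
    asyncio.create_task(persistence_task(done))
    return await engine_loop(done)

print(asyncio.run(main()))  # prints "stuck"
```

In the real node there is no timeout on that wait, which is why the observed stall only ended with a manual restart.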

@mattsse
Collaborator

mattsse commented Sep 2, 2024

there's also something off with the log order, because

2024-09-02T02:01:14.934454Z INFO reth::cli: Starting reth version="1.0.6 (c228fe1)"

this should get emitted first

@shekhirin
Collaborator

shekhirin commented Sep 2, 2024

@mattsse it was a restart (check the timestamps), looks fine to me

@mattsse
Collaborator

mattsse commented Sep 2, 2024

During this time Lighthouse was timing out trying to contact Reth.

okay, perhaps this is an issue with the RPC request routing, e.g. the task isn't being polled anymore, resulting in unhandled requests, but then we should see the events warning about the CL client...

so I assume this is indeed a deadlock of some sort

@emhane
Member

emhane commented Sep 23, 2024

have you upgraded to 1.0.7 @michaelsproul ?

@michaelsproul
Contributor Author

@emhane Yeah upgraded a few days ago and it's been smooth sailing. Happy to close this issue!

@github-project-automation github-project-automation bot moved this from Todo to Done in Reth Tracker Sep 27, 2024
@michaelsproul
Contributor Author

Even though I haven't got any more deadlocks, performance has been quite poor with engine.experimental on v1.0.7, so I'm turning it off.

Lighthouse's metrics report:

execution_layer_request_times_bucket{method="forkchoice_updated",le="0.01"} 73178
execution_layer_request_times_bucket{method="forkchoice_updated",le="0.02"} 73401
execution_layer_request_times_bucket{method="forkchoice_updated",le="0.05"} 73444
execution_layer_request_times_bucket{method="forkchoice_updated",le="0.1"} 73490
execution_layer_request_times_bucket{method="forkchoice_updated",le="0.2"} 73562
execution_layer_request_times_bucket{method="forkchoice_updated",le="0.5"} 73770
execution_layer_request_times_bucket{method="forkchoice_updated",le="1"} 73983
execution_layer_request_times_bucket{method="forkchoice_updated",le="2"} 74168
execution_layer_request_times_bucket{method="forkchoice_updated",le="5"} 74272
execution_layer_request_times_bucket{method="forkchoice_updated",le="10"} 74284
execution_layer_request_times_bucket{method="forkchoice_updated",le="20"} 74287
execution_layer_request_times_bucket{method="forkchoice_updated",le="50"} 74367
execution_layer_request_times_bucket{method="forkchoice_updated",le="+Inf"} 74367
execution_layer_request_times_sum{method="forkchoice_updated"} 4304.689681778987
execution_layer_request_times_count{method="forkchoice_updated"} 74367
execution_layer_request_times_bucket{method="new_payload",le="0.01"} 292
execution_layer_request_times_bucket{method="new_payload",le="0.02"} 344
execution_layer_request_times_bucket{method="new_payload",le="0.05"} 439
execution_layer_request_times_bucket{method="new_payload",le="0.1"} 666
execution_layer_request_times_bucket{method="new_payload",le="0.2"} 1629
execution_layer_request_times_bucket{method="new_payload",le="0.5"} 22295
execution_layer_request_times_bucket{method="new_payload",le="1"} 67355
execution_layer_request_times_bucket{method="new_payload",le="2"} 85217
execution_layer_request_times_bucket{method="new_payload",le="5"} 91210
execution_layer_request_times_bucket{method="new_payload",le="10"} 91825
execution_layer_request_times_bucket{method="new_payload",le="20"} 91970
execution_layer_request_times_bucket{method="new_payload",le="50"} 95292
execution_layer_request_times_bucket{method="new_payload",le="+Inf"} 95292
execution_layer_request_times_sum{method="new_payload"} 217682.050178281
execution_layer_request_times_count{method="new_payload"} 95292

fcU is quick at 57ms on average, but newPayload is taking 2.28s.
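For reference, those averages fall out of the Prometheus histogram convention that mean latency = `_sum` / `_count` (a quick sketch; the numbers are copied from the metrics above, and the 57 ms figure is the average truncated rather than rounded):

```python
# Mean latency from a Prometheus histogram is _sum / _count.
# Values copied from the Lighthouse metrics above.
fcu_sum, fcu_count = 4304.689681778987, 74367
np_sum, np_count = 217682.050178281, 95292

fcu_avg = fcu_sum / fcu_count  # ~0.0579 s, i.e. ~57 ms
np_avg = np_sum / np_count     # ~2.28 s

print(f"forkchoice_updated avg: {fcu_avg:.4f} s")
print(f"new_payload avg: {np_avg:.2f} s")
```

The bucket counts tell the same story: nearly all `forkchoice_updated` calls land in the 10 ms bucket, while the bulk of `new_payload` calls fall between 0.5 s and 5 s.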

@mattsse
Collaborator

mattsse commented Oct 7, 2024

we fixed one issue in 1.0.8 that was likely responsible for this

newPayload on this server is now steady ~100ms "Engine API newPayload Latency"

https://reth.paradigm.xyz/d/2k8BXz24x/reth?orgId=1&refresh=30s

[screenshot: "Engine API newPayload Latency" panel]

@michaelsproul
Contributor Author

ohhh I didn't even see 1.0.8, will try! thanks!
