[TxPool] Broadcasting stuck transaction #3666

Closed

Conversation

JackyWYX
Contributor

Issue

Further fix to harmony-one/go-sdk#242

Observed that some transactions go missing during broadcasting over pub-sub. This fix broadcasts "stuck" executable transactions periodically to the network.
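
For illustration, a minimal sketch of what such a periodic re-broadcast loop could look like. This is not the actual PR code: the interval, threshold, and the pending/pendingSince/broadcastTx helpers are hypothetical, and upstream go-ethereum types are used for brevity.

```go
// Sketch of a periodic re-broadcast loop for "stuck" executable transactions.
// All identifiers below are illustrative, not the names used in this PR.
package txpool

import (
	"time"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

const (
	stuckCheckInterval = 1 * time.Minute // how often to scan the pending pool
	stuckThreshold     = 5 * time.Minute // pending longer than this counts as stuck
)

// reBroadcastStuckTxs periodically scans executable (pending) transactions and
// re-broadcasts any that have been in the pool longer than stuckThreshold.
func reBroadcastStuckTxs(
	pending func() map[common.Hash]*types.Transaction, // snapshot of pending txs
	pendingSince func(common.Hash) time.Time,          // when a tx entered the pool
	broadcastTx func(*types.Transaction) error,        // pub-sub broadcast
	stop <-chan struct{},
) {
	ticker := time.NewTicker(stuckCheckInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			for hash, tx := range pending() {
				if time.Since(pendingSince(hash)) > stuckThreshold {
					// Best effort: a failed broadcast is retried on the next tick.
					_ = broadcastTx(tx)
				}
			}
		case <-stop:
			return
		}
	}
}
```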

Test

Tested locally with code: https://github.com/JackyWYX/harmony/tree/txpool_staking_check_test.

A broadcast log message is observed when a transaction gets stuck.

@JackyWYX JackyWYX requested review from rlan35 and LeoHChen April 21, 2021 07:37
@LeoHChen
Contributor

LeoHChen commented Apr 23, 2021

Overall, I think it is a better design to replace the old plain retry logic with a smarter retry mechanism. However, how do you know the tx is missing from pubsub? Do you have any data collected on mainnet? Did you test it on mainnet to push the stuck tx? I also suggest you add a metric to keep track of any stuck tx on mainnet before we merge this PR.

@JackyWYX
Contributor Author

How do you know the tx is missing from pubsub?

I checked the logs on both the "probe" node and the leader node. Here are the observations:

  1. The local node receives transactions from pub-sub, and running hmy blockchain pool shows some transactions stuck in its txPool for some time.
  2. Running the same command to dump the leader's transaction pool shows that some of these transactions do not exist there.
  3. Grepping the transaction hash in the leader's log finds not a single message.

Because the transaction hash is printed in the log at least once at txPool.add (core/tx_pool.go:916) when a transaction pub-sub message is processed, I inferred that the specific transaction message was never received by the leader node.
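
For context, the kind of per-transaction log line this inference relies on could look roughly like the sketch below; the logger setup and function name are illustrative, and the actual statement at core/tx_pool.go:916 may differ.

```go
// Illustrative only: a zerolog-style per-transaction log call. If the leader
// had processed the pub-sub message, a line carrying the tx hash would appear
// at least once in its log.
package txpool

import (
	"os"

	"github.com/rs/zerolog"
)

func logDiscardedTx(hash string, err error) {
	logger := zerolog.New(os.Stderr).With().Timestamp().Logger()
	logger.Warn().Str("hash", hash).Err(err).Msg("Discarding invalid transaction")
}
```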

Did you test it on mainnet to push the stuck tx? I also suggest you add a metric to keep track of any stuck tx on mainnet before we merge this PR.

Currently no. Prometheus is a really good point; we'd better run the "probe" node with Prometheus turned on. Let me update it shortly.
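
For illustration, the hmy_txPool_stuck counter reported later in this thread could be registered with the standard Prometheus Go client roughly as follows; this is a sketch, the markTxStuck helper is hypothetical, and the actual wiring in the PR may differ.

```go
// Sketch of how a stuck-transaction counter could be registered with the
// Prometheus Go client. The metric name matches hmy_txPool_stuck from the
// collected metrics; everything else here is illustrative.
package txpool

import (
	"github.com/prometheus/client_golang/prometheus"
)

var stuckTxCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "hmy",
	Subsystem: "txPool",
	Name:      "stuck",
	Help:      "number of transactions observed to be stuck and broadcast again",
})

func init() {
	// Register once at startup; the harmony node would use its own registry.
	prometheus.MustRegister(stuckTxCounter)
}

// markTxStuck is called each time a stuck transaction is re-broadcast.
func markTxStuck() {
	stuckTxCounter.Inc()
}
```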

Do you have any data collected on mainnet?

I discarded the data after the investigation. My bad. Let me collect some data later to support my first point.

@rlan35
Contributor

rlan35 commented Apr 26, 2021

Did you make sure that the txn was indeed broadcasted by the probe node in the first place? Maybe the initial broadcasting failed or didn't happen?

@JackyWYX
Contributor Author

Did you make sure that the txn was indeed broadcasted by the probe node in the first place? Maybe the initial broadcasting failed or didn't happen?

Not very likely. I grepped the error log on three of the shard 0 explorer nodes, and none of them shows the corresponding error.

cat zerolog-harmony.log | grep "Error when trying to broadcast"

@LeoHChen
Contributor

LeoHChen commented May 5, 2021

Can we just try to replace one mainnet node with the new logic? Will it work? I see you added metrics; we can set up some alerts so that we also know when it is happening on mainnet.

@JackyWYX
Contributor Author

JackyWYX commented May 13, 2021

Can we just try to replace one mainnet node with the new logic? Will it work? I see you added metrics; we can set up some alerts so that we also know when it is happening on mainnet.

These are the metrics collected from a probe node that started running the txpool code 2 days ago:

# HELP hmy_txPool_invalid transactions failed validation
# TYPE hmy_txPool_invalid counter
hmy_txPool_invalid{err="exceeds block gas limit"} 93
hmy_txPool_invalid{err="insufficient funds for gas * price + value"} 87594
hmy_txPool_invalid{err="intrinsic gas too low"} 22
hmy_txPool_invalid{err="nonce too low"} 4
hmy_txPool_invalid{err="staking transaction validation failed"} 3136
hmy_txPool_invalid{err="transaction underpriced"} 732
# HELP hmy_txPool_known number of known transaction received
# TYPE hmy_txPool_known counter
hmy_txPool_known 8175
# HELP hmy_txPool_low_nonce number of transactions removed because of low nonce (including processed transactions)
# TYPE hmy_txPool_low_nonce counter
hmy_txPool_low_nonce 921
# HELP hmy_txPool_pending number of executable transactions
# TYPE hmy_txPool_pending gauge
hmy_txPool_pending 4088
# HELP hmy_txPool_queued number of queued non-executable transactions
# TYPE hmy_txPool_queued gauge
hmy_txPool_queued 1024
# HELP hmy_txPool_received number of transactions received
# TYPE hmy_txPool_received counter
hmy_txPool_received 244793
# HELP hmy_txPool_stuck number of transactions observed to be stuck and broadcast again
# TYPE hmy_txPool_stuck counter
hmy_txPool_stuck 163

The node runs well, and we can see 163 transactions were stuck and re-broadcast. Other than that, the pending pool seems a little overwhelmed. The reason is that the node stopped syncing about one day ago. Let me restart it and obtain the metrics later.

@JackyWYX
Contributor Author

Latest metrics:

hmy_txPool_invalid{err="insufficient funds for gas * price + value"} 109213
hmy_txPool_invalid{err="intrinsic gas too low"} 29
hmy_txPool_invalid{err="invalid sender"} 17
hmy_txPool_invalid{err="nonce too low"} 1892
hmy_txPool_invalid{err="staking transaction validation failed"} 4370
hmy_txPool_invalid{err="transaction underpriced"} 1970
# HELP hmy_txPool_known number of known transaction received
# TYPE hmy_txPool_known counter
hmy_txPool_known 48421
# HELP hmy_txPool_low_nonce number of transactions removed because of low nonce (including processed transactions)
# TYPE hmy_txPool_low_nonce counter
hmy_txPool_low_nonce 213526
# HELP hmy_txPool_pending number of executable transactions
# TYPE hmy_txPool_pending gauge
hmy_txPool_pending 0
# HELP hmy_txPool_queued number of queued non-executable transactions
# TYPE hmy_txPool_queued gauge
hmy_txPool_queued 18
# HELP hmy_txPool_received number of transactions received
# TYPE hmy_txPool_received counter
hmy_txPool_received 457465
# HELP hmy_txPool_stuck number of transactions observed to be stuck and broadcast again
# TYPE hmy_txPool_stuck counter
hmy_txPool_stuck 45

This time the node is caught up, and the data all look good.

@LeoHChen LeoHChen left a comment
Contributor

@JackyWYX, does the data you collected on mainnet show a better result with the new algorithm?

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

@JackyWYX, does the data you collected on mainnet show a better result with the new algorithm?

Yes. From the data I collected, "stuck" transactions are re-broadcast and eventually added to blocks. Previously, these transactions never got a second chance to be broadcast to the leader node when the leader missed them, and the user needed to wait for the transaction to time out on the explorer node before sending the next one.

@rlan35
Contributor

rlan35 commented Jun 8, 2021

Leader/validators are seeing lots of invalid txns:

{"level":"warn","error":"transaction nonce is 52739: nonce too low","hash":"0x85263f6058e1f17736a1d097ad85d0a1f15a86bea3de9345a6d0a2155f1ce4f1","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.804417907Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.804510896Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 56980: nonce too low","hash":"0xd666419e4e3d29e5ea5d957caa8d5c69e306f800fd648f0f670c508830defab8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.818869918Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.818981274Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40365: nonce too low","hash":"0x191eba4ae51804df7cf934d7a6e51f8ecded7ada7d33e72200431891be0cbbed","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.820340037Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.82039532Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40370: nonce too low","hash":"0x0185fa2d1f6542cc8711296a21ad11350bb67d35489b21ee4aea901094135386","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.821678753Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.821717203Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40374: nonce too low","hash":"0x34d3715e4e7f57ce42aab17684ad411645ccf93e26603a869bfda4526b19c6c8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.823051551Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.823081838Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 48677: nonce too low","hash":"0x7407b7bc448f0abe8d5c829db8b0ff627de749c617ef0d1511e95363f8ade538","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.826876763Z","message":"Discarding invalid transaction"}

I think the explorer nodes are broadcasting invalid txns, which is spamming the consensus nodes. We should fix this ASAP since shard 0 block time is being affected.

@rlan35
Contributor

rlan35 commented Jun 8, 2021

leader/validator are seeing lots of invalid txns:

{"level":"warn","error":"transaction nonce is 52739: nonce too low","hash":"0x85263f6058e1f17736a1d097ad85d0a1f15a86bea3de9345a6d0a2155f1ce4f1","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.804417907Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.804510896Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 56980: nonce too low","hash":"0xd666419e4e3d29e5ea5d957caa8d5c69e306f800fd648f0f670c508830defab8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.818869918Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.818981274Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40365: nonce too low","hash":"0x191eba4ae51804df7cf934d7a6e51f8ecded7ada7d33e72200431891be0cbbed","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.820340037Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.82039532Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40370: nonce too low","hash":"0x0185fa2d1f6542cc8711296a21ad11350bb67d35489b21ee4aea901094135386","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.821678753Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.821717203Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40374: nonce too low","hash":"0x34d3715e4e7f57ce42aab17684ad411645ccf93e26603a869bfda4526b19c6c8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.823051551Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.823081838Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 48677: nonce too low","hash":"0x7407b7bc448f0abe8d5c829db8b0ff627de749c617ef0d1511e95363f8ade538","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.826876763Z","message":"Discarding invalid transaction"}

I think the explorer nodes are broadcasting invalid txns, which is spamming the consensus nodes. We should fix this ASAP since shard 0 block time is being affected.

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

This is another issue, and it can be tricky. The code change to stop broadcasting invalid transactions is simple, but it will cause explorer nodes to drop transactions if the explorer node falls behind in sync.
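
For illustration, a minimal sketch of that kind of validity check before re-broadcast, assuming a hypothetical currentNonce lookup against the node's local state; this is not the actual patch.

```go
// Sketch: skip re-broadcasting transactions that are already invalid against
// the node's local state (e.g. nonce too low). currentNonce and broadcastTx
// are hypothetical helpers; the real fix may hook into tx_pool validation instead.
package txpool

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

func reBroadcastIfValid(
	tx *types.Transaction,
	sender common.Address,
	currentNonce func(common.Address) uint64, // nonce from the node's latest state
	broadcastTx func(*types.Transaction) error,
) error {
	// If the local node is behind in sync, its state nonce may be stale, so a
	// valid transaction could be dropped here; that is the trade-off discussed above.
	if tx.Nonce() < currentNonce(sender) {
		return nil // nonce too low locally; do not spam the network with it
	}
	return broadcastTx(tx)
}
```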

@rlan35
Contributor

rlan35 commented Jun 8, 2021

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

This is another issue, and it can be tricky. The code change to stop broadcasting invalid transactions is simple, but it will cause explorer nodes to drop transactions if the explorer node falls behind in sync.

Compared to block time being delayed, it's less of an issue if some transactions get dropped. Also, after the sync issue is fixed, we don't expect explorer nodes to fall out of sync often, so it shouldn't be a big problem.

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

Compared to block time being delayed, it's less of an issue if some transactions get dropped. Also, after the sync issue is fixed, we don't expect explorer nodes to fall out of sync often, so it shouldn't be a big problem.

Agreed. Sure, let me work out a fix now.
