[TxPool] Broadcasting stuck transaction #3666

Closed

Conversation

JackyWYX
Contributor

Issue

Further fix to harmony-one/go-sdk#242

Observed that some transactions go missing during broadcasting over pub-sub. This fix broadcasts "stuck" executable transactions periodically to the network.
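
For illustration, a minimal sketch of what such a periodic re-broadcast loop could look like. This is not the actual PR code: the interval, threshold, and the pending/pendingSince/broadcastTx helpers are hypothetical, and upstream go-ethereum types are used for brevity.

```go
// Sketch of a periodic re-broadcast loop for "stuck" executable transactions.
// All identifiers below are illustrative, not the names used in this PR.
package txpool

import (
	"time"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

const (
	stuckCheckInterval = 1 * time.Minute // how often to scan the pending pool
	stuckThreshold     = 5 * time.Minute // pending longer than this counts as stuck
)

// reBroadcastStuckTxs periodically scans executable (pending) transactions and
// re-broadcasts any that have been in the pool longer than stuckThreshold.
func reBroadcastStuckTxs(
	pending func() map[common.Hash]*types.Transaction, // snapshot of pending txs
	pendingSince func(common.Hash) time.Time,          // when a tx entered the pool
	broadcastTx func(*types.Transaction) error,        // pub-sub broadcast
	stop <-chan struct{},
) {
	ticker := time.NewTicker(stuckCheckInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			for hash, tx := range pending() {
				if time.Since(pendingSince(hash)) > stuckThreshold {
					// Best effort: a failed broadcast is retried on the next tick.
					_ = broadcastTx(tx)
				}
			}
		case <-stop:
			return
		}
	}
}
```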

Test

Tested locally with code: https://github.com/JackyWYX/harmony/tree/txpool_staking_check_test.

A broadcast log message is observed when a transaction gets stuck.

@JackyWYX JackyWYX requested review from rlan35 and LeoHChen April 21, 2021 07:37
@LeoHChen
Contributor

LeoHChen commented Apr 23, 2021

Overall, I think it is a better design to replace the old plain retry logic with a smarter retry mechanism. However, how do you know the tx is missing from pubsub? Do you have any data collected on mainnet? Did you test it on mainnet to push the stuck tx? I also suggest you add a metric to keep track of any stuck tx on mainnet before we merge this PR.

@JackyWYX
Contributor Author

How do you know the tx is missing from pubsub?

I checked the logs on both the "probe" node and the leader node. Here are the observations:

  1. The local node receives transactions from pub-sub, and running hmy blockchain pool shows some transactions stuck in its txPool for some time.
  2. Running the same command to dump the leader's transaction pool shows that some of these transactions do not exist there.
  3. Grepping the transaction hash in the leader's log finds not a single message.

Because the transaction hash is printed in the log at least once at txPool.add (core/tx_pool.go:916) when a transaction pub-sub message is processed, I inferred that the specific transaction message was never received by the leader node.
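
For context, the kind of per-transaction log line this inference relies on could look roughly like the sketch below; the logger setup and function name are illustrative, and the actual statement at core/tx_pool.go:916 may differ.

```go
// Illustrative only: a zerolog-style per-transaction log call. If the leader
// had processed the pub-sub message, a line carrying the tx hash would appear
// at least once in its log.
package txpool

import (
	"os"

	"github.com/rs/zerolog"
)

func logDiscardedTx(hash string, err error) {
	logger := zerolog.New(os.Stderr).With().Timestamp().Logger()
	logger.Warn().Str("hash", hash).Err(err).Msg("Discarding invalid transaction")
}
```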

Did you test it on mainnet to push the stuck tx? I also suggest you add a metric to keep track of any stuck tx on mainnet before we merge this PR.

Currently no. Prometheus is a really good point; we'd better run the "probe" node with Prometheus turned on. Let me update it shortly.
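
For illustration, the hmy_txPool_stuck counter reported later in this thread could be registered with the standard Prometheus Go client roughly as follows; this is a sketch, the markTxStuck helper is hypothetical, and the actual wiring in the PR may differ.

```go
// Sketch of how a stuck-transaction counter could be registered with the
// Prometheus Go client. The metric name matches hmy_txPool_stuck from the
// collected metrics; everything else here is illustrative.
package txpool

import (
	"github.com/prometheus/client_golang/prometheus"
)

var stuckTxCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "hmy",
	Subsystem: "txPool",
	Name:      "stuck",
	Help:      "number of transactions observed to be stuck and broadcast again",
})

func init() {
	// Register once at startup; the harmony node would use its own registry.
	prometheus.MustRegister(stuckTxCounter)
}

// markTxStuck is called each time a stuck transaction is re-broadcast.
func markTxStuck() {
	stuckTxCounter.Inc()
}
```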

Do you have any data collected on mainnet?

I discarded the data after the investigation. My bad. Let me collect some data later to support my first point.

@rlan35
Contributor

rlan35 commented Apr 26, 2021

Did you make sure that the txn was indeed broadcasted by the probe node in the first place? Maybe the initial broadcasting failed or didn't happen?

@JackyWYX
Contributor Author

Did you make sure that the txn was indeed broadcasted by the probe node in the first place? Maybe the initial broadcasting failed or didn't happen?

Not very likely. I grepped the error log on three of the shard 0 explorer nodes, and none of them shows the corresponding error.

cat zerolog-harmony.log | grep "Error when trying to broadcast"

@LeoHChen
Contributor

LeoHChen commented May 5, 2021

Can we just try to replace one mainnet node with the new logic? Will it work? I see you added metrics; we can set up some alerts so that we also know when it is happening on mainnet.

@JackyWYX
Contributor Author

JackyWYX commented May 13, 2021

Can we just try to replace one mainnet node with the new logic? Will it work? I see you added metrics; we can set up some alerts so that we also know when it is happening on mainnet.

These are the metrics collected from a probe node that started running the txpool code 2 days ago:

# HELP hmy_txPool_invalid transactions failed validation
# TYPE hmy_txPool_invalid counter
hmy_txPool_invalid{err="exceeds block gas limit"} 93
hmy_txPool_invalid{err="insufficient funds for gas * price + value"} 87594
hmy_txPool_invalid{err="intrinsic gas too low"} 22
hmy_txPool_invalid{err="nonce too low"} 4
hmy_txPool_invalid{err="staking transaction validation failed"} 3136
hmy_txPool_invalid{err="transaction underpriced"} 732
# HELP hmy_txPool_known number of known transaction received
# TYPE hmy_txPool_known counter
hmy_txPool_known 8175
# HELP hmy_txPool_low_nonce number of transactions removed because of low nonce (including processed transactions)
# TYPE hmy_txPool_low_nonce counter
hmy_txPool_low_nonce 921
# HELP hmy_txPool_pending number of executable transactions
# TYPE hmy_txPool_pending gauge
hmy_txPool_pending 4088
# HELP hmy_txPool_queued number of queued non-executable transactions
# TYPE hmy_txPool_queued gauge
hmy_txPool_queued 1024
# HELP hmy_txPool_received number of transactions received
# TYPE hmy_txPool_received counter
hmy_txPool_received 244793
# HELP hmy_txPool_stuck number of transactions observed to be stuck and broadcast again
# TYPE hmy_txPool_stuck counter
hmy_txPool_stuck 163

The node runs well, and we can see 163 transactions were stuck and re-broadcast. Other than that, the pending pool seems a little overwhelmed. The reason is that the node stopped syncing about one day ago. Let me restart it and obtain the metrics later.

@JackyWYX
Contributor Author

Latest metrics:

hmy_txPool_invalid{err="insufficient funds for gas * price + value"} 109213
hmy_txPool_invalid{err="intrinsic gas too low"} 29
hmy_txPool_invalid{err="invalid sender"} 17
hmy_txPool_invalid{err="nonce too low"} 1892
hmy_txPool_invalid{err="staking transaction validation failed"} 4370
hmy_txPool_invalid{err="transaction underpriced"} 1970
# HELP hmy_txPool_known number of known transaction received
# TYPE hmy_txPool_known counter
hmy_txPool_known 48421
# HELP hmy_txPool_low_nonce number of transactions removed because of low nonce (including processed transactions)
# TYPE hmy_txPool_low_nonce counter
hmy_txPool_low_nonce 213526
# HELP hmy_txPool_pending number of executable transactions
# TYPE hmy_txPool_pending gauge
hmy_txPool_pending 0
# HELP hmy_txPool_queued number of queued non-executable transactions
# TYPE hmy_txPool_queued gauge
hmy_txPool_queued 18
# HELP hmy_txPool_received number of transactions received
# TYPE hmy_txPool_received counter
hmy_txPool_received 457465
# HELP hmy_txPool_stuck number of transactions observed to be stuck and broadcast again
# TYPE hmy_txPool_stuck counter
hmy_txPool_stuck 45

This time the node is caught up, and the data all look good.

@LeoHChen LeoHChen left a comment
Contributor

@JackyWYX, does the data you collected on mainnet show a better result with the new algorithm?

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

@JackyWYX, does the data you collected on mainnet show a better result with the new algorithm?

Yes. From the data I collected, "stuck" transactions are re-broadcast and eventually added to blocks. Previously, these transactions never got a second chance to be broadcast to the leader node when the leader missed them, and the user needed to wait for the transaction to time out on the explorer node before sending the next one.

@rlan35
Contributor

rlan35 commented Jun 8, 2021

Leader/validators are seeing lots of invalid txns:

{"level":"warn","error":"transaction nonce is 52739: nonce too low","hash":"0x85263f6058e1f17736a1d097ad85d0a1f15a86bea3de9345a6d0a2155f1ce4f1","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.804417907Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.804510896Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 56980: nonce too low","hash":"0xd666419e4e3d29e5ea5d957caa8d5c69e306f800fd648f0f670c508830defab8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.818869918Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.818981274Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40365: nonce too low","hash":"0x191eba4ae51804df7cf934d7a6e51f8ecded7ada7d33e72200431891be0cbbed","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.820340037Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.82039532Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40370: nonce too low","hash":"0x0185fa2d1f6542cc8711296a21ad11350bb67d35489b21ee4aea901094135386","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.821678753Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.821717203Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40374: nonce too low","hash":"0x34d3715e4e7f57ce42aab17684ad411645ccf93e26603a869bfda4526b19c6c8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.823051551Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.823081838Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 48677: nonce too low","hash":"0x7407b7bc448f0abe8d5c829db8b0ff627de749c617ef0d1511e95363f8ade538","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.826876763Z","message":"Discarding invalid transaction"}

I think the explorer nodes are broadcasting invalid txns, which is spamming the consensus nodes. We should fix this ASAP since shard 0 block time is being affected.

@rlan35
Contributor

rlan35 commented Jun 8, 2021

leader/validator are seeing lots of invalid txns:

{"level":"warn","error":"transaction nonce is 52739: nonce too low","hash":"0x85263f6058e1f17736a1d097ad85d0a1f15a86bea3de9345a6d0a2155f1ce4f1","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.804417907Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.804510896Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 56980: nonce too low","hash":"0xd666419e4e3d29e5ea5d957caa8d5c69e306f800fd648f0f670c508830defab8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.818869918Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.818981274Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40365: nonce too low","hash":"0x191eba4ae51804df7cf934d7a6e51f8ecded7ada7d33e72200431891be0cbbed","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.820340037Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.82039532Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40370: nonce too low","hash":"0x0185fa2d1f6542cc8711296a21ad11350bb67d35489b21ee4aea901094135386","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.821678753Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.821717203Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 40374: nonce too low","hash":"0x34d3715e4e7f57ce42aab17684ad411645ccf93e26603a869bfda4526b19c6c8","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.823051551Z","message":"Discarding invalid transaction"}
{"level":"info","err":[{}],"length of newTxs":1,"totalPending":90,"totalQueued":2,"caller":"/mnt/jenkins/workspace/harmony-release/harmony/node/node.go:216","time":"2021-06-08T21:59:03.823081838Z","message":"[addPendingTransactions] Adding more transactions"}
{"level":"warn","error":"transaction nonce is 48677: nonce too low","hash":"0x7407b7bc448f0abe8d5c829db8b0ff627de749c617ef0d1511e95363f8ade538","caller":"/mnt/jenkins/workspace/harmony-release/harmony/core/tx_pool.go:930","time":"2021-06-08T21:59:03.826876763Z","message":"Discarding invalid transaction"}

I think the explorer nodes are broadcasting invalid txns, which is spamming the consensus nodes. We should fix this ASAP since shard 0 block time is being affected.

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

This is another issue, and it can be tricky. The code change to stop broadcasting invalid transactions is simple, but it will cause explorer nodes to drop transactions if the explorer node falls behind in sync.
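
For illustration, a minimal sketch of that kind of validity check before re-broadcast, assuming a hypothetical currentNonce lookup against the node's local state; this is not the actual patch.

```go
// Sketch: skip re-broadcasting transactions that are already invalid against
// the node's local state (e.g. nonce too low). currentNonce and broadcastTx
// are hypothetical helpers; the real fix may hook into tx_pool validation instead.
package txpool

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

func reBroadcastIfValid(
	tx *types.Transaction,
	sender common.Address,
	currentNonce func(common.Address) uint64, // nonce from the node's latest state
	broadcastTx func(*types.Transaction) error,
) error {
	// If the local node is behind in sync, its state nonce may be stale, so a
	// valid transaction could be dropped here; that is the trade-off discussed above.
	if tx.Nonce() < currentNonce(sender) {
		return nil // nonce too low locally; do not spam the network with it
	}
	return broadcastTx(tx)
}
```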

@rlan35
Contributor

rlan35 commented Jun 8, 2021

However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.

This is another issue, and it can be tricky. The code change to stop broadcasting invalid transactions is simple, but it will cause explorer nodes to drop transactions if the explorer node falls behind in sync.

Compared to block time being delayed, it's less of an issue if some transactions get dropped. Also, after the sync issue is fixed, we don't expect explorer nodes to fall out of sync often, so it shouldn't be a big problem.

@JackyWYX
Contributor Author

JackyWYX commented Jun 8, 2021

Compared to block time being delayed, it's less of an issue if some transactions get dropped. Also, after the sync issue is fixed, we don't expect explorer nodes to fall out of sync often, so it shouldn't be a big problem.

Agreed. Sure, let me work out a fix now.
