[TxPool] Broadcasting stuck transaction #3666
Conversation
Overall, I think it is a better design to replace the old plain retry logic with a smart retry mechanism. However, how do you know the tx is missing from pubsub? Do you have any data collected on mainnet? Did you test it on mainnet to push the stuck tx? I also suggest you add metrics to keep track of stuck txs, if any, on mainnet before we merge this PR.
I checked the log on both the "probe" node and the leader node. Here are the observations:
Because the transaction hash will be printed in the log at least once at
Currently no. And Prometheus is a really good point; we'd better do the "probe" with Prometheus turned on. Let me update a minute later.
I discarded the data after investigation. My bad. Let me collect some data to support my first point later.
Did you make sure that the txn was indeed broadcast by the probe node in the first place? Maybe the initial broadcasting failed or didn't happen?
Not very likely. I grepped the error logs on three of the shard 0 explorer nodes, and none of them shows the corresponding error.
Can we just try replacing one mainnet node with the new logic? Will it work? I see you added metrics; we can set up some alerts so that we also know when it is happening on mainnet.
These are the metrics collected from a probe node that started running the txpool code 2 days ago:
The node runs well, and we can see 163 transactions got stuck and were re-broadcast. Other than that, the pending pool seems a little overwhelmed. The reason behind this is that the node stopped syncing about one day ago. Let me restart it and obtain the metrics later.
Latest metrics:
This time the node is caught up, and the data are all good.
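For context, here is a minimal sketch of what stuck-transaction metrics like these could look like with the Prometheus Go client; the metric names and the recordRebroadcast helper are hypothetical and may not match what the PR actually registers:

```go
package txpool

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical instrumentation for tracking stuck transactions.
var (
	// stuckTxGauge tracks how many executable transactions are currently
	// considered stuck (pending longer than the re-broadcast threshold).
	stuckTxGauge = promauto.NewGauge(prometheus.GaugeOpts{
		Namespace: "hmy",
		Subsystem: "txpool",
		Name:      "stuck_txs",
		Help:      "Number of executable transactions currently considered stuck.",
	})

	// rebroadcastCounter counts every stuck transaction pushed to pub-sub again.
	rebroadcastCounter = promauto.NewCounter(prometheus.CounterOpts{
		Namespace: "hmy",
		Subsystem: "txpool",
		Name:      "stuck_tx_rebroadcasts_total",
		Help:      "Total number of stuck transactions re-broadcast to the network.",
	})
)

// recordRebroadcast would be called whenever a stuck transaction is re-broadcast.
func recordRebroadcast(currentStuck int) {
	stuckTxGauge.Set(float64(currentStuck))
	rebroadcastCounter.Inc()
}
```

Counters of this shape also make the alerting mentioned above straightforward, e.g. alerting on a sustained increase of the re-broadcast total.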
@JackyWYX, does your data collected on mainnet show a better result with the new algorithm?
Yes. From the data I collected, "stuck" transactions are re-broadcast and eventually added to blocks. Previously, these transactions never got a second chance to be broadcast to the leader node when the leader missed them, and the user needed to wait for the transaction to time out at the explorer node before sending the next one.
Leader/validators are seeing lots of invalid txns:
I think the explorer nodes are broadcasting invalid txns, which is spamming the consensus nodes. We should fix this ASAP since shard 0 block time is being affected.
However, it may not be the case: the txn may be broadcast directly through the p2p network if the sender is not using the RPC endpoint. Some evidence suggests this, because the leader received the invalid txn before an explorer node did.
This is another issue, and it can be tricky. The code change to not broadcast invalid transactions is simple, but it will cause explorer nodes to drop transactions if the explorer node is left behind in sync.
Compared to block time being delayed, it's less of an issue if some transactions get dropped. Also, after the sync issue is fixed, we don't expect explorer nodes to fall out of sync often, so it shouldn't be a big problem.
Agreed. Sure, let me work out a fix now.
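For illustration, a minimal sketch of the kind of validity filter discussed in this exchange, assuming hypothetical Tx and StateReader interfaces rather than the actual txpool types:

```go
package txpool

// Tx and StateReader are hypothetical stand-ins for the node's transaction
// and state interfaces; this is not the actual PR code.

type Tx interface {
	Sender() string
	Nonce() uint64
}

type StateReader interface {
	// CurrentNonce returns the next expected nonce for an account
	// according to the node's latest synced state.
	CurrentNonce(addr string) uint64
}

// filterBroadcastable drops transactions whose nonce is already behind the
// node's view of the sender's account, so an explorer node does not spam
// consensus nodes with txns that can no longer be valid. The trade-off
// mentioned above shows up directly here: if the node itself is behind in
// sync, CurrentNonce lags and otherwise valid transactions get dropped.
func filterBroadcastable(txs []Tx, state StateReader) []Tx {
	valid := make([]Tx, 0, len(txs))
	for _, tx := range txs {
		if tx.Nonce() < state.CurrentNonce(tx.Sender()) {
			continue // stale nonce: skip broadcasting this txn
		}
		valid = append(valid, tx)
	}
	return valid
}
```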
Issue
Further fix to harmony-one/go-sdk#242
We observed that some transactions go missing during broadcasting in pub-sub. This fix broadcasts "stuck" executable transactions periodically to the network.
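A minimal sketch of the periodic re-broadcast idea described above, using hypothetical pool and pub-sub accessors rather than the actual harmony APIs:

```go
package txpool

import "time"

// stuckBroadcaster periodically re-broadcasts executable transactions that
// have been pending longer than a threshold. The field types are hypothetical
// placeholders, not the interfaces used in this PR.
type stuckBroadcaster struct {
	interval  time.Duration            // how often to scan for stuck txs
	threshold time.Duration            // how long a pending tx may sit before it counts as stuck
	firstSeen map[string]time.Time     // tx hash -> time it became executable
	pending   func() map[string][]byte // hash -> encoded tx, hypothetical pool accessor
	broadcast func(rawTx []byte) error // hypothetical pub-sub send
	quit      chan struct{}
}

func (b *stuckBroadcaster) loop() {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			now := time.Now()
			for hash, raw := range b.pending() {
				seen, ok := b.firstSeen[hash]
				if !ok {
					// First time we see this executable tx; start its timer.
					b.firstSeen[hash] = now
					continue
				}
				if now.Sub(seen) >= b.threshold {
					// The tx has been executable for too long without being
					// included in a block; push it to pub-sub again.
					if err := b.broadcast(raw); err == nil {
						b.firstSeen[hash] = now // reset the timer after re-broadcast
					}
				}
			}
		case <-b.quit:
			return
		}
	}
}
```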
Test
Tested locally with code: https://github.com/JackyWYX/harmony/tree/txpool_staking_check_test.
The broadcasting print message is observed when a transaction gets stuck.