Sequence number caching in Hermes is unreliable #1264
I've done a preliminary investigation. Here are some thoughts on what might be happening. The sequence is incremented at the end of send_tx, here: https://github.com/informalsystems/ibc-rs/blob/142c7dc29c7636d1c9ece51ce8fd6f7c50ca3477/relayer/src/chain/cosmos.rs#L384

I believe the sequence is only retrieved once and cached, here: https://github.com/informalsystems/ibc-rs/blob/142c7dc29c7636d1c9ece51ce8fd6f7c50ca3477/relayer/src/chain/cosmos.rs#L521 An account is returned by the logic below that line, and the sequence is read from that account. I believe this sequence-caching logic was implemented to improve performance, since otherwise every transaction would require an additional query to fetch the sequence, which could slow things down.

Maybe another approach is to have an internal retry counter, or logic to "refresh" the sequence before all retries are exhausted and Hermes gives up on that transaction (see the sketch below). It looks like restarting Hermes lets it relay the transaction again, presumably because the sequence is then re-fetched from the chain.
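As a rough illustration of that retry-with-refresh idea, here is a minimal Rust sketch. This is not the actual ibc-rs code; `broadcast` and `query_account_sequence` are hypothetical stand-ins for the real send and account-query paths in relayer/src/chain/cosmos.rs.

```rust
struct SequenceCache {
    sequence: u64,
}

/// Hypothetical send loop: on failure, re-query the chain for the account
/// sequence instead of trusting the cached value, before retries run out.
fn send_with_refresh(
    cache: &mut SequenceCache,
    broadcast: impl Fn(u64) -> Result<(), String>,
    query_account_sequence: impl Fn() -> u64,
    max_retries: u32,
) -> Result<(), String> {
    for attempt in 0..max_retries {
        match broadcast(cache.sequence) {
            Ok(()) => {
                // Mirror the current behavior: bump the cache after a send.
                cache.sequence += 1;
                return Ok(());
            }
            Err(err) => {
                // Fall back to the chain as the source of truth.
                eprintln!("attempt {} failed: {}; refreshing sequence", attempt, err);
                cache.sequence = query_account_sequence();
            }
        }
    }
    Err("giving up: retries exhausted".to_string())
}
```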
I think what we might try to do in Hermes is to parse the exact error message and update the cached sequence number on-the-fly, only when an error occurs. The message looks as follows, and should be easy for Hermes to interpret and use to update the cached value:
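The message itself did not survive in this capture. Assuming it follows the Cosmos SDK wording "account sequence mismatch, expected N, got M: incorrect account sequence" (an assumption to confirm against real node output), a minimal parsing sketch could look like this:

```rust
/// Extract the "expected" sequence from an assumed SDK-style error message.
fn expected_sequence(err_msg: &str) -> Option<u64> {
    let marker = "expected ";
    let start = err_msg.find(marker)? + marker.len();
    let digits: String = err_msg[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn main() {
    // Assumed error text; the exact wording should be double-checked.
    let msg = "account sequence mismatch, expected 5, got 4: incorrect account sequence";
    assert_eq!(expected_sequence(msg), Some(5));
    // On such an error, Hermes would overwrite its cached sequence with 5.
}
```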
I managed to reproduce this issue using a slightly more complex scenario, with help from Greg and Mircea. The goal we're trying to reach in the scenario I'll describe below, which mimics production issues, is as follows:
Notice that the sequence number fetched is … The only solution we know of to fix this is to flush the mempool of this node. Even that will only partially fix it, because this same node might have broadcast the problematic tx (stuck with s.n. …). To reproduce this problem, the steps are as follows, using …
Would it be useful if Tendermint full-nodes communicated a cryptographic commitment/hash of their (consensus-related) config to other nodes and warned users if there was a mismatch?
I think the problem is that there are multiple levels of configuration. The Tendermint-specific configuration is not at fault here; it's the app.toml (the application-specific configuration) that causes transactions to hang in the mempool in the scenario outlined above. I'm not sure a commitment proof would help there. But note that there may be different reasons why txs could remain stuck in the mempool (connectivity issues, misbehavior).
Looks like this goes a bit deeper. I've just discovered that the sequence number issue comes up even with a fully synced node. Not sure why. I've attached a couple of log snippets. The first log shows the timeline from a successful Osmosis node tx to the next tx, which throws a sequence number error.
The second one shows that, even after 35 minutes, Hermes was still getting the same error, which in turn caused all Osmosis tx submissions to fail and relaying to stop. Only after a Hermes restart did the issue clear.
You did not try restarting the Osmosis node, right (i.e., the mempool flushing approach)? Only Hermes was restarted. Looking at the logs, it's possible that the error message is trustworthy:
The top of the log you shared shows a transaction that has s.n. … We might consider reviving Soares' PR #1349, which had an account refresh method (ref), though it's not clear from these logs whether that would fix it. But this is all speculation. We need to test whether we can trust either the …
I see several ways:
The node was not restarted. Only Hermes was restarted and resumed operations normally.
Why is block mode not recommended? I still think that incrementing the account sequence is not fully correct with broadcast-mode sync.
I don't have a complete answer to this question. We simply noticed it was not reliable, so we switched away from it.
Indeed. Part of the reason is that Hermes can be quite aggressive in submitting transactions, and, as this comment hints (cosmos/cosmos-sdk#4186 (comment)), this approach is prone to failure. We just haven't found a more robust option yet.
I think we'll resurrect the refresh-on-error approach and give that a go.
A bit more context here. If the node is slow in any way to gossip transactions, Hermes will keep submitting new ones and incrementing the counter in the cache. Even after the mempool is cleared, Hermes is left with an incorrect sequence number and keeps trying to submit txs with it (a toy illustration follows). There is no way to recover from this unless Hermes is restarted.
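To make that failure mode concrete, here is a toy Rust illustration with made-up numbers; the two counters stand in for the chain's committed account sequence and Hermes' cache.

```rust
fn main() {
    let on_chain: u64 = 100;   // committed account sequence on the chain
    let mut cached: u64 = 100; // Hermes' cached sequence

    // Hermes submits three txs; CheckTx accepts them, so the cache is
    // bumped each time, but the node is slow to gossip them and none
    // ever reaches a block.
    for _ in 0..3 {
        cached += 1;
    }

    // The mempool is later cleared (e.g. by a node restart), so those
    // txs vanish and the committed sequence never advances.
    assert_eq!(on_chain, 100);
    assert_eq!(cached, 103);

    // Every new tx now carries sequence 103 while the chain expects 100,
    // so submissions keep failing until Hermes is restarted (or the
    // cache is refreshed from the chain).
}
```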
Thanks Mircea! We would like to move forward with a potential solution, and we'll prepare a branch for testing it. We'd appreciate it if you could help us test it!
It seems that upgrading to IBC 2.0 is causing a lot more sequence number issues. All networks that have been upgraded to the latest version of the SDK now generate these errors where they did not before.
PR #1349 is ready for testing. Would you consider running from that dev branch, Mircea? We're also waiting on feedback from another relayer operator who was willing to try this branch. It would be great to have you on it as well!
When I restart the Osmosis node, I get into a situation where a timed-out transaction sends the Hermes instance into an infinite loop:
Thanks for the detailed answer. The reasons for the problems with the sequence number are now clearer.
Since you are noticing the sequence number caching problem (according to your logs above), consider running Hermes from this dev branch #1349, and let us know if it resolves the problem. Appreciate any feedback!
Crate
Hermes
Summary
Query the state of every chain in the config every X seconds for the relayer account's sequence number, and update the cached sequence number with it. Make X a configurable parameter (a sketch follows the Problem Definition below).
Problem Definition
At some point (around height 1008 on Cosmos hub), for some reason, the cached seq number on our relayer instance got incremented, but the actual on-chain seq number wasn't. From there, our relayer hasn't been able to relay, producing the following error:
While we don't know what caused this mismatch, we also couldn't detect it, and thought our relayer was running properly. It would be good to figure out why the mismatch happened, but it would also be good to have a failsafe that periodically updates the cached seq number with the actual on-chain seq number, so that the relayer can recover on its own.
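A minimal sketch of what such a failsafe could look like, assuming a hypothetical `ChainHandle` holding the cached sequence and a placeholder account query; this is not the actual Hermes architecture:

```rust
use std::time::Duration;

struct ChainHandle {
    id: String,
    cached_sequence: u64,
}

impl ChainHandle {
    fn query_on_chain_sequence(&self) -> u64 {
        // Placeholder: the real implementation would query the chain's
        // auth module for the relayer account's sequence.
        0
    }
}

/// Every `interval_secs` (the configurable "X"), re-query each chain and
/// overwrite the cached sequence so a mismatch heals on its own.
fn spawn_sequence_refresher(mut chains: Vec<ChainHandle>, interval_secs: u64) {
    std::thread::spawn(move || loop {
        for chain in chains.iter_mut() {
            let on_chain = chain.query_on_chain_sequence();
            if on_chain != chain.cached_sequence {
                println!(
                    "chain {}: cached sequence {} != on-chain {}; resetting",
                    chain.id, chain.cached_sequence, on_chain
                );
                chain.cached_sequence = on_chain;
            }
        }
        std::thread::sleep(Duration::from_secs(interval_secs));
    });
}
```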