Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry send_tx on account sequence mismatch #1349

Merged
merged 16 commits into from
Dec 21, 2021

Conversation

soareschen
Copy link
Contributor

@soareschen soareschen commented Sep 14, 2021

Closes: #1264

Description

This PR makes Hermes retry on send_tx for up to 5 times, in the event that the submission failed with a incorrect account sequence error. This error can occur either at the gas estimation step (send_tx_simulate), or at the transaction submission step (broadcast_tx_sync). In both cases, we parse the error message to match on its details and infer if it is account sequence mismatch.

  • For the case when the error appears at the gas estimation step, we match on the "account sequence mismatch" status message. We do not retry in the runtime, but rather at the worker level (step function).

  • For the case when the account sequence mismatch occurs at transaction submission step, we detect the error by checking for the error code 32 defined in Cosmos SDK, and then perform the retry accordingly directly in the runtime.


For contributor use:

  • Added a changelog entry, using unclog.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@soareschen soareschen mentioned this pull request Sep 14, 2021
10 tasks
@romac romac mentioned this pull request Sep 14, 2021
11 tasks
relayer/src/chain/cosmos.rs Outdated Show resolved Hide resolved
relayer/src/chain/cosmos.rs Outdated Show resolved Hide resolved
@soareschen
Copy link
Contributor Author

were you able to reproduce the error, and could you describe the scenario? Is it as simple as using two instances of hermes, a start and an ft-transfer concurrently, or did you stumble upon more advanced scenarios that we should be aware while testing?

Yes, you can test it out with a two raw tx transfers using the same wallet as the relayer. In the log you should see the relayer refetch the account sequence when the first attempt failed.

Sep 15 11:50:46.655 DEBUG send_messages_and_wait_check_tx with 3 messages
Sep 15 11:50:46.655 DEBUG [ibc-0] send_tx: sending 3 messages using nonce 12
Sep 15 11:50:46.670 DEBUG [ibc-0] send_tx: broadcast_tx_sync: Response { code: Err(32), data: Data([]), log: Log("account sequence mismatch, expected 13, got 12: incorrect account sequence"), hash: transaction::Hash(8AD74142F7A0EF604FBFE91291F212DE4353CC88B21BE2130210556CA4A58FDA) }
Sep 15 11:50:46.670  WARN send_tx failed with incorrect account sequence. retrying with account sequence refetched from the chain.
Sep 15 11:50:46.674 DEBUG [ibc-0] send_tx: retrieved account sequence=13 number=0
Sep 15 11:50:46.674 DEBUG [ibc-0] send_tx: sending 3 messages using nonce 13
Sep 15 11:50:46.690 DEBUG [ibc-0] send_tx: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(AD0F9E06FDCED226E19FD78A60F4AD2BD3AD59C9FB95128D7D6801B275578EF7) }

relayer/src/chain/cosmos.rs Outdated Show resolved Hide resolved
relayer/src/chain/cosmos.rs Outdated Show resolved Hide resolved
relayer/src/chain/cosmos.rs Outdated Show resolved Hide resolved
@adizere
Copy link
Member

adizere commented Sep 20, 2021

FYI --
there is a slightly more complex scenario that occurs in production, which is very painful for the ops team to deal with. I detailed a way to reproduce the problem easily. I think the fix in this PR should be informed by the problem documented there, and at the very least we should adjust Hermes logs to describe the issue more accurately. Concretely, it would be ideal if we follow one or more of these ideas:

@adizere
Copy link
Member

adizere commented Sep 29, 2021

I tried to summarize the status of the issue in here #1397.

The solution in the present PR is incomplete, and we'll deprioritize this until we investigate further on the tm-go side. In the meantim, let's close this PR. We'll reuse some of the code later if it proves useful.

Let's keep the branch around!

@adizere adizere changed the title Retry send_tx on account sequence mismatch Retry send_tx & estimate_gas on account sequence mismatch Dec 15, 2021
@karzak
Copy link

karzak commented Dec 17, 2021

@adizere - thanks for your work on this!

We're running this branch on a public testnet and it has been avoiding the account sequence deadlock that we were seeing on v0.9.0 so far.

On v0.9.0 whenever we got into this state:

2021-12-17T18:55:38.202492Z ERROR ThreadId(20) [loan-wars] estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed 2021-12-17T18:55:38.202492Z ERROR ThreadId(20) [loan-wars] estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 10190, got 10191: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"date": "Fri, 17 Dec 2021 18:55:38 GMT", "content-type": "application/grpc", "server": "nginx/1.18.0 (Ubuntu)", "x-cosmos-block-height": "31204"} }
2021-12-17T18:55:38.202539Z  WARN ThreadId(20) send_tx failed at estimate_gas step mismatching account sequence: dropping the tx & refreshing account sequence number
2021-12-17T18:55:38.215376Z ERROR ThreadId(26) [loan-wars-ibc:transfer/channel-0 -> loan-wars] worker: schedule execution encountered error: failed with underlying error: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 10190, got 10191: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"date": "Fri, 17 Dec 2021 18:55:38 GMT", "content-type": "application/grpc", "server": "nginx/1.18.0 (Ubuntu)", "x-cosmos-block-height": "31204"} } status: status: Unknown, message: "account sequence mismatch, expected 10190, got 10191: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"date": "Fri, 17 Dec 2021 18:55:38 GMT", "content-type": "application/grpc", "server": "nginx/1.18.0 (Ubuntu)", "x-cosmos-block-height": "31204"} }

It was un-resolvable without restarting the hermes process. Running this branch, I've seen what looks like graceful recovery from this state. Will keep monitoring and update if that changes.

@adizere adizere changed the title Retry send_tx & estimate_gas on account sequence mismatch Retry send_tx on account sequence mismatch Dec 20, 2021
Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got positive feedback from 2 relayer operators, still waiting on 1 of them. Specifically, the feedback is that this PR seems to alleviate the problem of needing to restart Hermes when acct. seq. mismatch occurs. I will merge this within the day, unless a relayer operator highlights any issue.

@adizere adizere merged commit c06dc34 into master Dec 21, 2021
@adizere adizere deleted the soares/retry-account-sequence branch December 21, 2021 12:35
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Retry send_tx on account sequence mismatch

* Move MAX_ACCOUNT_SEQUENCE_RETRY to retry_strategy module

* Introduce INCORRECT_ACCOUNT_SEQUENCE_ERR constant for error code 32

* Inline constant matching

* Retry on simulation failure

* Refactored retry to catch both estimate and bcast errors

* Fixed off-by-one error in re-fetching logic

* Basic backoff mechanism

* More idiomatic account seq. incremental

* Adapted to catch acct.seq more generally

* retry counter logging

* Bumped the multipler from 100 to 200

* Non-retrying fix upon gas estimation failure

* Adjusted retry params; added ref. to informalsystems#1153.

* changelog

Co-authored-by: Adi Seredinschi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sequence number caching in Hermes is unreliable
4 participants