Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hermes refresh_schedule bug #1143

Closed
5 of 6 tasks
adizere opened this issue Jul 1, 2021 · 0 comments · Fixed by #1144
Closed
5 of 6 tasks

Hermes refresh_schedule bug #1143

adizere opened this issue Jul 1, 2021 · 0 comments · Fixed by #1144
Assignees
Labels
A: bug Admin: something isn't working
Milestone

Comments

@adizere
Copy link
Member

adizere commented Jul 1, 2021

Crate

ibc-relayer

Summary of Bug

The refresh_schedule method in the relayer library has the job to trim out events which are scheduled for sending but which are outdated (because they were handled in the meantime). It does so by removing some of the TransitMessages that are registered in self.dst_operational_data.

The way it is currently implemented, this method can exit in two ways:

  • an early return in case no new timeouts message have been detected (that's line 1360 below).
  • If timeouts have been detected, then the method effectively removes items from self.dst_operational_data (line 1381).

The problem is that the modification for line 1381 should always be applied, regardless of the early exit or not.

https://github.com/informalsystems/ibc-rs/blob/c78b793d99571df4781cec4c2cfcb18ed68098d1/relayer/src/link.rs#L1360-L1381

Version

c509f88

This was observed in production while doing a log walkthrough with @mircea-c and @greg-szabo.

The original logs capturing how this problem manifested are as follows:

Jun 29 13:50:24 hermes hermes[2534]: Jun 29 13:50:24.298  INFO [cosmoshub-4:transfer/channel-141 -> osmosis-1] scheduling op. data with 2 msg(s) for Destination chain (height 4-6757828)
Jun 29 13:50:24 hermes hermes[2534]: Jun 29 13:50:24.318 DEBUG [cosmoshub-4:transfer/channel-141 -> osmosis-1] WriteAcknowledgement - h:4-6757827, seq:10169, path:channel-0/transfer->channel-141/transfer, toh:4-6757911, tos:Timestamp(NoTimestamp)) already handled
Jun 29 13:50:24 hermes hermes[2534]: Jun 29 13:50:24.365  INFO [cosmoshub-4:transfer/channel-141 -> osmosis-1] relay op. data to Destination, proofs height 4-6757827, (delayed by: 67.138219ms) [try 1/5]

These logs are imperfect, but roughly speaking on the second line Hermes detects that the packet with sn 10169 was already handled it should avoid sending it (by removing that packet from its operational data). On the last line in the log, Hermes proceed to send the same packet (so the packet was never removed from op. data).

Acceptance Criteria

  • in case of early exit, refresh_schedule should account for updating self.dst_operational_data.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere adizere added the A: bug Admin: something isn't working label Jul 1, 2021
@adizere adizere self-assigned this Jul 1, 2021
@adizere adizere added this to the 06.2021 milestone Jul 1, 2021
adizere added a commit that referenced this issue Jul 1, 2021
adizere added a commit that referenced this issue Jul 7, 2021
* Adjusted logs for better msg correlation.

* Fixes #1143.

* Selective backoff in path worker.

* Fmt & clippy

* Adjusted the backoff mechanism to depend on current chan length

* Apply @romac's suggestion

* Changelog entry
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022
* Adjusted logs for better msg correlation.

* Fixes informalsystems#1143.

* Selective backoff in path worker.

* Fmt & clippy

* Adjusted the backoff mechanism to depend on current chan length

* Apply @romac's suggestion

* Changelog entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: bug Admin: something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant