Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hermes operational data processing time #1362

Closed
5 tasks
mircea-c opened this issue Sep 17, 2021 · 5 comments · Fixed by #1379
Closed
5 tasks

Hermes operational data processing time #1362

mircea-c opened this issue Sep 17, 2021 · 5 comments · Fixed by #1379
Labels
I: logic Internal: related to the relaying logic
Milestone

Comments

@mircea-c
Copy link

Crate

Hermes ibc_rs v0.7.1

Summary of Bug

Hermes is only able to process ~3 transactions / minute. This is woefully inadequate for periods of high traffic where > 10 transactions are submitted every minute.

This is more of a design flaw than a bug, but I don't think it qualifies as a feature request.

Version

hermes 0.7.1+5d38857a

Steps to Reproduce

Run hermes start submit a transaction to either of the chains hermes is relaying on. Notice, in this example log output, the time difference between prepending the client update and batch assembly:

Sep 16 22:11:18 hermes hermes-osmosis[2426]: Sep 17 02:11:18.004  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] relay op. data of 1 msgs(s) to Destination (height 1-1214262), delayed by: 
9.499863ms [try 1/5]
Sep 16 22:11:18 hermes hermes-osmosis[2426]: Sep 17 02:11:18.004  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] prepending Destination client update @ height 1-1214262

Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.477  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] assembled batch of 2 message(s)
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.507 DEBUG send_messages_and_wait_check_tx with 2 messages
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.508 DEBUG [cosmoshub-4] send_tx: sending 2 messages using nonce 85816
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.523 DEBUG [cosmoshub-4] send_tx: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(14473E99A9B3523A069F8CFEE9AC5E71FDB2E5FF668AE7ECB76EA70F5F5C27F4) }
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.524  INFO [Async~>cosmoshub-4] response(s): 1; Ok:14473E99A9B3523A069F8CFEE9AC5E71FDB2E5FF668AE7ECB76EA70F5F5C27F4
Sep 16 22:11:37 hermes hermes-osmosis[2426]: 
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.524  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] success 

Acceptance Criteria

Hermes processes op data significantly faster.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere adizere modified the milestones: 09.2021, 10.2021 Sep 17, 2021
@adizere adizere added the I: logic Internal: related to the relaying logic label Sep 17, 2021
@adizere
Copy link
Member

adizere commented Sep 17, 2021

To add more context here: Hermes spends 19 seconds between these two subsequence actions:

Sep 16 22:11:18 hermes hermes-osmosis[2426]: Sep 17 02:11:18.004  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] prepending Destination client update @ height 1-1214262
Sep 16 22:11:37 hermes hermes-osmosis[2426]: Sep 17 02:11:37.477  INFO [osmosis-1:transfer/channel-0 -> cosmoshub-4] assembled batch of 2 message(s)

This time is spent doing roughly four queries, all of them involved in building the message to update the client hosted on hub-4 which is underlying channel transfer/channel-0 that connects osmosis-1 with hub-4.

  1. [from osmosis] query the latest height https://github.com/informalsystems/ibc-rs/blob/63cf1922cb903f27dc130d2d737c1275b58330a0/relayer/src/foreign_client.rs#L565
  2. [from hub-4] query the client's state https://github.com/informalsystems/ibc-rs/blob/63cf1922cb903f27dc130d2d737c1275b58330a0/relayer/src/foreign_client.rs#L579
  3. [from hub-4] query all of client's consensus stateshttps://github.com/informalsystems/ibc-rs/blob/63cf1922cb903f27dc130d2d737c1275b58330a0/relayer/src/foreign_client.rs#L590
  4. [from osmosis] query a block for building the client update header https://github.com/informalsystems/ibc-rs/blob/63cf1922cb903f27dc130d2d737c1275b58330a0/relayer/src/foreign_client.rs#L622

Step 3 is the most time intensive and wasteful (because we pull all consensus states indiscriminately but only use the list of obtained states to figure out the trusted height. We should either make smarted use of pagination or optimistically query for the height we think we need.

@ancazamfir
Copy link
Collaborator

Have you done time measurements for the steps above? And how? (ideally we should run with https://hermes.informal.systems/help.html#profiling instead of checking log timestamps). Is the 19 sec above typical? Or only in some cases?
I would expect that in normal operations (passive relayer reacting on packet events) step 1 to take most of the time: the event is received from src chain at height H and we wait for the src chain to advance at height H+1. So that can be around 8-10 seconds which is the typical block time on the live chains hermes operates on. There is not much we can do here.

re: 3. - we can optimize and do the all-consensus-state queries only if the client state latest_height is higher than the target height, otherwise we can use that as trusted height. This should be the case for passive packet relaying in 99% of cases.
In addition could also ask SDK team for a gRPC query for highest lower height consensus state.
But again, can you confirm this takes a long time? I didn't think that was the case.

@ancazamfir
Copy link
Collaborator

ancazamfir commented Sep 21, 2021

One more thing, we discussed this before. Workers should be able to query the chain directly. There is no reason for the queries to be serialized via the runtime, this is only required for Tx.

@mircea-c
Copy link
Author

This is all based on log timestamps, but it is consistent. Every single operation takes ~20s, regardless of the chain it is on.

@ancazamfir
Copy link
Collaborator

ancazamfir commented Sep 21, 2021

This is all based on log timestamps, but it is consistent. Every single operation takes ~20s, regardless of the chain it is on.

Ok, thanks. then we should be able to do a quick run with profiling enabled to see where exactly we spend the 20sec. We initially try with a simple hermes update client.. CLI and check the logs.
(need to check that we have profiling set in the right places) -> all good

@mircea-c , could you build from relayer-cli on master with:

cd relayer-cli
cargo build --features=profiling

and then hermes update client... for one client on cosmoshub or osmosis? Also maybe with trace level?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I: logic Internal: related to the relaying logic
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants