Report on slow/stalled channel traffic #2175
Labels
E: osmosis
External: related to Osmosis
I: CLI
Internal: related to the relayer's CLI
I: logic
Internal: related to the relaying logic
I: telemetry
Internal: related to Telemetry & metrics
O: usability
Objective: cause to improve the user experience (UX) and ease using the product
Milestone
Summary
User story: I need to better understand when a channel is being relayed properly. For monitoring and alerting purposes, I need to be able to query the oldest sequence number that is still in the queue for a specific channel and find out how old (date) the packet is.
Problem Definition
We need to better monitor whether a channel is being relayed properly. Out-of-band monitoring has the benefit of not relying on the technology that actually does the relaying, but it has the disadvantage that it has to describe the infrastructure and application setup yet again from scratch. For example, the Hermes config details the relationship among networks so well that if I say "channel-0" on the Osmosis network, everyone (including a program) understands exactly what that means (which endpoint represents it, which wallet I can use to manage it, etc.).
Implementing this (and subsequent monitoring-related) feature in Hermes takes advantage of the already existing configuration and library knowledge of endpoints. (Writing curl scripts to poll endpoint health is not fun, especially on gRPC.)
Including this and similar requests makes Hermes "ready with batteries" for production use, including monitoring assets. (Well, Prometheus endpoints or HTTP API calls or some such. The operator still needs to gather the data somewhere and present it, say, using Grafana.)
The disadvantage of this kind of feature is that it opens up a topic that is not strictly about IBC as a protocol but more about "IBC as a product used in servers". Personally, I think it shows the maturity of a project, but others might have differing opinions. This request is fairly specific, which might be good (when everyone needs it) or not so good (when it only serves one specific use case of one operator).
Proposal
There is a monitoring bot on Discord that essentially does something similar. The goal is to find out if a channel has "stuck" packets: we define "stuck" packets as packets that haven't been relayed for 5 minutes.
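The "stuck" definition above can be sketched as a small check. This is only an illustration in Python, not Hermes code: the function name `is_stuck` and its parameters are hypothetical, and the only value taken from this issue is the 5-minute threshold.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the definition above: a packet not relayed for 5 minutes.
STUCK_THRESHOLD = timedelta(minutes=5)


def is_stuck(submitted_at: datetime, now: datetime,
             threshold: timedelta = STUCK_THRESHOLD) -> bool:
    """Return True if the oldest pending packet has waited at least `threshold`.

    `submitted_at` is the submission time of the oldest packet still in the
    queue; how that time is obtained is outside the scope of this sketch.
    """
    return now - submitted_at >= threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stuck(now - timedelta(minutes=7), now))
```

Passing `now` in explicitly (rather than reading the clock inside the function) keeps the check easy to test and lets the operator choose a different threshold.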
One implementation idea:
One or more Prometheus metrics for each channel configured in Hermes, exposing the oldest sequence number on the channel that is still in the queue, as well as the submission date associated with that sequence number. (This needs an extra query to the channel.)
Any monitoring tool could pick this up and alert on it every 5 minutes (or at whatever interval the operator configures).
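To make the idea concrete, here is a rough sketch of what two such per-channel gauges could look like in the Prometheus text exposition format. The metric names (`oldest_pending_sequence`, `oldest_pending_timestamp_seconds`), the label names, and the `OldestPending` type are all made up for illustration; they are not existing Hermes metrics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class OldestPending:
    """Oldest packet still in the queue for one channel (hypothetical shape)."""
    chain_id: str
    channel_id: str
    sequence: int
    submitted_unix: int  # submission time as Unix seconds


def render_metrics(samples: Iterable[OldestPending]) -> str:
    """Render two gauges per channel in Prometheus text exposition format."""
    samples = list(samples)
    lines = [
        "# HELP oldest_pending_sequence Oldest packet sequence still in the queue.",
        "# TYPE oldest_pending_sequence gauge",
    ]
    for s in samples:
        labels = f'chain="{s.chain_id}",channel="{s.channel_id}"'
        lines.append(f"oldest_pending_sequence{{{labels}}} {s.sequence}")
    lines += [
        "# HELP oldest_pending_timestamp_seconds Submission time of that packet.",
        "# TYPE oldest_pending_timestamp_seconds gauge",
    ]
    for s in samples:
        labels = f'chain="{s.chain_id}",channel="{s.channel_id}"'
        lines.append(f"oldest_pending_timestamp_seconds{{{labels}}} {s.submitted_unix}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values only; not real channel data.
    print(render_metrics([OldestPending("osmosis-1", "channel-0", 42, 1650000000)]))
```

With gauges shaped like these, an alert rule could compare the current time against the timestamp gauge, for example an expression along the lines of `time() - oldest_pending_timestamp_seconds > 300`, again assuming these hypothetical metric names.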
Alternatively, if this doesn't fit the Prometheus metrics specs, it could be an HTTP web API call that responds with the data in a JSON object. Somehow, I feel Prometheus should fit here, but we're open to other implementations (even a CLI command, if necessary). The one implementation that doesn't work for us is writing this data into the log file. The data has to be independently queryable, mostly separated from the current operational state of Hermes. (As mentioned in the disadvantages.)
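For the HTTP alternative, the JSON object could look roughly like what the sketch below produces. The field names and the function `oldest_pending_report` are hypothetical, chosen only to show the data this request asks for (channel, oldest sequence number, and how old the packet is); no such Hermes endpoint is implied.

```python
import json
from datetime import datetime, timezone


def oldest_pending_report(chain_id: str, channel_id: str, sequence: int,
                          submitted_unix: int, now_unix: int) -> str:
    """Build a JSON document describing the oldest packet still in the queue."""
    submitted = datetime.fromtimestamp(submitted_unix, tz=timezone.utc)
    return json.dumps({
        "chain_id": chain_id,
        "channel_id": channel_id,
        "oldest_pending_sequence": sequence,
        # ISO 8601 timestamp in UTC, so the age is readable without conversion.
        "submitted_at": submitted.isoformat(),
        # Age is precomputed so that simple alerting scripts need no date math.
        "age_seconds": now_unix - submitted_unix,
    })


if __name__ == "__main__":
    # Sample values only; not real channel data.
    print(oldest_pending_report("osmosis-1", "channel-0", 42,
                                1650000000, 1650000400))
```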
Acceptance Criteria
For Admin Use