Basic health state in OpAMP spec #62
@pmm-sumo I am curious what you think about this. You had some ideas about own telemetry.
I shared my thoughts a few months ago and I think the concepts have evolved quite a bit since. There are several ideas floating around; let me try to summarize some of them (I'm sure there are more). Broadly speaking, we can assume there's a supervisor which can report one of the following:
Out of these, I find the last one especially compelling. It perhaps revolves around the OpenTelemetry Collector, but I think it's generic enough to be considered as part of OpAMP. The way it could work is that we extend the Collector component interface with a function that emits the current health status (either periodically, on each state change, or similar). Then we would have more fine-grained details, which can make diagnostics much more efficient. It could solve the following cases:
Many of the above could be solved by a mix of own metrics or logs, BTW. The benefit of modeling them as health events is that there would be certain associated semantics (component ID, severity, timestamp started, timestamp ended) which could then be used to quickly pinpoint collector outages, causes of problems, etc. This could follow an immutable-records approach and emit changes only. The model could be something like the following:
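A rough sketch of what such an immutable health-event record could look like in Go; the type and field names below are illustrative assumptions, not part of OpAMP or the Collector:

```go
package health

import "time"

// HealthEvent is an illustrative, immutable record emitted only when a
// component's health changes, carrying the semantics mentioned above
// (component ID, severity, start/end timestamps). Field names are assumptions.
type HealthEvent struct {
	ComponentID string    // e.g. "receiver/otlp" (illustrative)
	Healthy     bool      // the state after this change
	Severity    string    // e.g. "error", "warning"
	Message     string    // human-readable detail
	StartedAt   time.Time // when this state began
	EndedAt     time.Time // zero value while the state is still ongoing
}
```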
This is much more fine-grained than just up/down, but it is also much more actionable.
I think this is definitely worth exploring. One possible approach is to define a set of semantic conventions for component metrics. Then these metrics can be included in "own metrics" and don't require any special facility in OpAMP.
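As an illustration of that approach, a component could expose a health gauge through the regular OpenTelemetry metrics API and let it flow out with "own metrics". The metric name and attribute below are made up, since no such semantic convention exists yet:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// registerComponentHealthGauge registers an observable gauge that reports
// per-component health. The metric name and attribute key are hypothetical.
func registerComponentHealthGauge() error {
	meter := otel.Meter("collector-health")

	gauge, err := meter.Int64ObservableGauge(
		"otelcol.component.healthy", // hypothetical name, pending semantic conventions
		metric.WithDescription("1 if the component is healthy, 0 otherwise"),
	)
	if err != nil {
		return err
	}

	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		// Report the current state for each component; hard-coded here for brevity.
		o.ObserveInt64(gauge, 1, metric.WithAttributes(
			attribute.String("component.id", "receiver/otlp"), // illustrative attribute
		))
		return nil
	}, gauge)
	return err
}

func main() { _ = registerComponentHealthGauge() }
```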
It is probably better to discuss component health/metrics in a separate issue. Going back to the original topic of this issue, which is somewhat different: my question is whether we want the basic health information in the OpAMP protocol itself.
I believe it will add value, especially for cases where the Agent was not able to load the current config or had other issues starting. The Supervisor could detect such problems and report back the health status, which might be used as a trigger to roll back a deployment and such.
An example of logic that the Server may want to implement which requires basic health knowledge: after pushing an update to the fleet, roll the update back if fewer than, say, 90% of the updated Agents report themselves healthy within some period of time.
Note that logic like this cannot be implemented in the Supervisor, since it requires aggregate knowledge of the fleet (the 90% in the example).
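For illustration, here is a sketch of such Server-side logic under assumed types; nothing here is part of OpAMP itself. The Server watches health reports after pushing an update and rolls back once the healthy fraction drops below a threshold:

```go
package server

// AgentStatus is an assumed Server-side view of one Agent after an update push.
type AgentStatus struct {
	UpdatedToNewVersion bool
	Healthy             bool
}

// shouldRollback returns true if, among Agents already running the new version,
// fewer than minHealthyFraction (e.g. 0.90) report themselves healthy.
// This needs fleet-wide knowledge, which is why it lives in the Server and not
// in an individual Supervisor.
func shouldRollback(fleet []AgentStatus, minHealthyFraction float64) bool {
	var updated, healthy int
	for _, a := range fleet {
		if !a.UpdatedToNewVersion {
			continue
		}
		updated++
		if a.Healthy {
			healthy++
		}
	}
	if updated == 0 {
		return false // nothing to judge yet
	}
	return float64(healthy)/float64(updated) < minHealthyFraction
}
```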
Discussed this briefly in the workgroup, and @andykellr also thought that this is not a bad idea. I suggest that we post here the list of possible basic health metrics that we think are useful to have in OpAMP itself, and then we can decide which of those we want to add to the spec.
I think we can start by having […]. Note that when using the WebSocket transport the Agent doesn't have to report […].
I think I would have a slight preference to use something like […]. Computing […].

Edit: Thinking about it more, clock skew is probably worse and latency would be low enough to be insignificant for reporting purposes. Just thought it might be worth mentioning.
This is my feeling as well. The latency is going to be limited, typically under a second and likely never more than tens of seconds even if the request handling is very slow. Clock skew can in theory be arbitrarily large (although that would also be very atypical in a properly managed environment).
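To make the trade-off concrete, here is a small Go sketch of how a Server could derive uptime from a reported start timestamp (field name assumed to match the draft message discussed below). The error in that derivation is the Agent-to-Server clock skew; report latency does not enter, because the start time itself does not change:

```go
package server

import "time"

// uptimeFromStart derives an Agent's uptime from the start timestamp it reported
// (UNIX epoch nanoseconds). The result is off by the clock skew between the
// Agent's and the Server's clocks; report latency does not matter because the
// reported value is constant for the lifetime of the Agent process.
func uptimeFromStart(startTimeUnixNano uint64) time.Duration {
	start := time.Unix(0, int64(startTimeUnixNano))
	return time.Since(start)
}
```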
This is a draft implementation to help with the spec discussion: open-telemetry/opamp-spec#62
I created a draft PR with basic health added. I wanted to see what it looks like to use it in the Supervisor. I ended up adding the following health message:

```protobuf
message AgentHealth {
    // The hash of the content of all other fields (even if the other fields are omitted
    // for compression).
    bytes hash = 1;

    // Set to true if the Agent is up and running.
    bool up = 2;

    // Time since the Agent is up.
    int64 uptime_nanoseconds = 3;

    // Human readable error message if the Agent is in an erroneous state. Typically set
    // when up==false.
    string last_error = 4;
}
```

I did not include the […] field. Here is the draft PR: open-telemetry/opamp-go#92. What do you think?
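The hash field's role, as described in the comment above, is to let the receiver detect whether anything changed without re-sending the full content. A hedged sketch of how the Agent side could use it (not the opamp-go implementation; the hash function choice is arbitrary here):

```go
package agent

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// healthHash computes a hash over the health fields so that a later message can
// carry only the hash when nothing changed.
func healthHash(up bool, uptimeNanos int64, lastError string) []byte {
	h := sha256.New()
	fmt.Fprintf(h, "%t|%d|%s", up, uptimeNanos, lastError)
	return h.Sum(nil)
}

// needsFullReport returns true when the current health differs from what was
// last reported, i.e. when the full AgentHealth content should be sent again.
func needsFullReport(lastSentHash, currentHash []byte) bool {
	return !bytes.Equal(lastSentHash, currentHash)
}
```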
I think that implementation looks good. I have a few small pieces of feedback for the PR if you think it's ready for review, but in general I like the implementation.
Thanks for looking, @andykellr. @pmm-sumo I am also interested in your opinion. :-)
As mentioned in the SIG meeting, I think we should change uptime_nanoseconds to a StartedAt timestamp.
Apologies for the late response; I think it's a nice idea to have. We've actually been discussing internally whether OpAMP could be a replacement for the "heartbeat" capability we currently have for our collector, and the above would be a nice fit there.
Agreed. Another benefit of […].
Draft PR updated to use StartTime (that's the name we use in OTLP, so I went with it instead of StartedAt): open-telemetry/opamp-go#92
I updated the draft Go implementation to reflect the discussion above and the recent changes to hashes/compression. It now uses the following health message:

```protobuf
message AgentHealth {
    // Set to true if the Agent is up and running.
    bool up = 1;

    // Timestamp since which the Agent has been up, i.e. when the Agent was started.
    // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
    // If "up" is false this field is unused.
    fixed64 start_time_unix_nano = 2;

    // Human readable error message if the Agent is in an erroneous state. Typically set
    // when up==false.
    string last_error = 3;
}
```
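To show how a Supervisor might fill this in, here is a hedged Go sketch. The struct mirrors the message above and would in practice be generated from the .proto file; how the report is then handed to the OpAMP client is left out, since the opamp-go API was still a draft at this point:

```go
package supervisor

import "time"

// AgentHealth mirrors the protobuf message above; in practice this type would be
// generated from the .proto definition rather than written by hand.
type AgentHealth struct {
	Up                bool
	StartTimeUnixNano uint64
	LastError         string
}

// currentHealth builds the health report for a running or failed Agent.
func currentHealth(agentStart time.Time, startErr error) *AgentHealth {
	if startErr != nil {
		// The Agent failed to start: report it as down with the error message.
		return &AgentHealth{Up: false, LastError: startErr.Error()}
	}
	return &AgentHealth{
		Up:                true,
		StartTimeUnixNano: uint64(agentStart.UnixNano()),
	}
}
```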
Resolves #62. Here is an implementation in Go that demonstrates how health works: open-telemetry/opamp-go#92
We previously had "health" as a concept in OpAMP. It was removed in favour of reporting it as a metric that was planned to be added to Otel spec. However the metric is still undefined in Otel spec and unclear when it will be.
On the other hand, it may be beneficial to have a basic health state reported as part of the OpAMP spec. This can be useful if the OpAMP Server wants to make decisions based on the health state and does not want to depend on fetching it from the independent, and potentially unavailable, metric store to which the Agent sends its metrics.
For example, if the Server performs a mass update of Agents to a newer version, it may want to monitor the health of the fleet and make a decision about rolling back the update if a certain percentage of Agents is unhealthy some period of time after the update is pushed to the Agents. For the Server to be able to do so, it probably needs some basic health information, such as Healthy (bool), Uptime (when did the Agent start), etc.