Basic health state in OpAMP spec #62
@pmm-sumo I am curious what you think about this. You had some ideas about own telemetry.
I shared my thoughts a few months ago and I think the concepts have evolved quite a bit since. There are several ideas floating around; let me try to summarize some of them (I'm sure there are more). Broadly speaking, we can assume there's a supervisor which can report one of the following:
Out of these, I find the last one especially compelling. It perhaps revolves around the OpenTelemetry Collector, but I think it's generic enough to be considered as part of OpAMP. The way it could work is that we extend the Collector component interface with a function that emits the current health status (either periodically, on each state change, or similar). Then we would have more fine-grained details, which can make diagnostics much more efficient. It could solve the following cases:
Many of the above could be solved by a mix of own metrics or logs, BTW. The benefit of modeling them as health events is that there would be certain associated semantics (component ID, severity, timestamp started, timestamp ended) which could then be used to quickly pinpoint collector outages, causes of problems, etc. This could follow an immutable-records approach and emit changes only. The model could be something like the following:
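A rough sketch of what such an immutable health-event record could look like in Go; the type and field names below are illustrative assumptions, not part of OpAMP or the Collector:

```go
package health

import "time"

// HealthEvent is an illustrative, immutable record emitted only when a
// component's health changes, carrying the semantics mentioned above
// (component ID, severity, start/end timestamps). Field names are assumptions.
type HealthEvent struct {
	ComponentID string    // e.g. "receiver/otlp" (illustrative)
	Healthy     bool      // the state after this change
	Severity    string    // e.g. "error", "warning"
	Message     string    // human-readable detail
	StartedAt   time.Time // when this state began
	EndedAt     time.Time // zero value while the state is still ongoing
}
```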
This is much more fine-grained than just up/down, but it is also much more actionable.
I think this is definitely worth exploring. One possible approach is to define a set of semantic conventions for component metrics. Then these metrics can be included in "own metrics" and don't require any special facility in OpAMP.
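As an illustration of that approach, a component could expose a health gauge through the regular OpenTelemetry metrics API and let it flow out with "own metrics". The metric name and attribute below are made up, since no such semantic convention exists yet:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// registerComponentHealthGauge registers an observable gauge that reports
// per-component health. The metric name and attribute key are hypothetical.
func registerComponentHealthGauge() error {
	meter := otel.Meter("collector-health")

	gauge, err := meter.Int64ObservableGauge(
		"otelcol.component.healthy", // hypothetical name, pending semantic conventions
		metric.WithDescription("1 if the component is healthy, 0 otherwise"),
	)
	if err != nil {
		return err
	}

	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		// Report the current state for each component; hard-coded here for brevity.
		o.ObserveInt64(gauge, 1, metric.WithAttributes(
			attribute.String("component.id", "receiver/otlp"), // illustrative attribute
		))
		return nil
	}, gauge)
	return err
}

func main() { _ = registerComponentHealthGauge() }
```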
It is probably better to discuss component health/metrics in a separate issue. Going back to the original topic of this issue, which is somewhat different: my question is whether we want the basic health information in the OpAMP protocol itself.
I believe it will add value, especially for cases where the Agent was not able to load the current config or had other issues starting. The Supervisor could detect such problems and report back the health status, which might be used as a trigger to roll back a deployment and such.
An example of logic that the Server may want to implement which requires basic health knowledge: after pushing an update to the fleet, roll the update back if fewer than, say, 90% of the updated Agents report themselves healthy within some period of time.
Note that logic like this cannot be implemented in the Supervisor, since it requires aggregate knowledge of the fleet (the 90% in the example).
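For illustration, here is a sketch of such Server-side logic under assumed types; nothing here is part of OpAMP itself. The Server watches health reports after pushing an update and rolls back once the healthy fraction drops below a threshold:

```go
package server

// AgentStatus is an assumed Server-side view of one Agent after an update push.
type AgentStatus struct {
	UpdatedToNewVersion bool
	Healthy             bool
}

// shouldRollback returns true if, among Agents already running the new version,
// fewer than minHealthyFraction (e.g. 0.90) report themselves healthy.
// This needs fleet-wide knowledge, which is why it lives in the Server and not
// in an individual Supervisor.
func shouldRollback(fleet []AgentStatus, minHealthyFraction float64) bool {
	var updated, healthy int
	for _, a := range fleet {
		if !a.UpdatedToNewVersion {
			continue
		}
		updated++
		if a.Healthy {
			healthy++
		}
	}
	if updated == 0 {
		return false // nothing to judge yet
	}
	return float64(healthy)/float64(updated) < minHealthyFraction
}
```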
Discussed this briefly in the workgroup, and @andykellr also thought that this is not a bad idea. I suggest that we post here the list of possible basic health metrics that we think are useful to have in OpAMP itself, and then we can decide which of those we want to add to the spec.
I think we can start by having […]. Note that when using the WebSocket transport the Agent doesn't have to report […].
I think I would have a slight preference to use something like […]. Computing […].

Edit: Thinking about it more, clock skew is probably worse and latency would be low enough to be insignificant for reporting purposes. Just thought it might be worth mentioning.
This is my feeling as well. The latency is going to be limited, typically under a second and likely never more than tens of seconds even if the request handling is very slow. Clock skew can in theory be arbitrarily large (although that would also be very atypical in a properly managed environment).
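To make the trade-off concrete, here is a small Go sketch of how a Server could derive uptime from a reported start timestamp (field name assumed to match the draft message discussed below). The error in that derivation is the Agent-to-Server clock skew; report latency does not enter, because the start time itself does not change:

```go
package server

import "time"

// uptimeFromStart derives an Agent's uptime from the start timestamp it reported
// (UNIX epoch nanoseconds). The result is off by the clock skew between the
// Agent's and the Server's clocks; report latency does not matter because the
// reported value is constant for the lifetime of the Agent process.
func uptimeFromStart(startTimeUnixNano uint64) time.Duration {
	start := time.Unix(0, int64(startTimeUnixNano))
	return time.Since(start)
}
```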
This is a draft implementation to help with the spec discussion: open-telemetry/opamp-spec#62
I created a draft PR with basic health added. I wanted to see what it looks like to use it in the Supervisor. I ended up adding the following health message:

```protobuf
message AgentHealth {
    // The hash of the content of all other fields (even if the other fields are omitted
    // for compression).
    bytes hash = 1;

    // Set to true if the Agent is up and running.
    bool up = 2;

    // Time since the Agent is up.
    int64 uptime_nanoseconds = 3;

    // Human readable error message if the Agent is in an erroneous state. Typically set
    // when up==false.
    string last_error = 4;
}
```

I did not include the […] field. Here is the draft PR: open-telemetry/opamp-go#92. What do you think?
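The hash field's role, as described in the comment above, is to let the receiver detect whether anything changed without re-sending the full content. A hedged sketch of how the Agent side could use it (not the opamp-go implementation; the hash function choice is arbitrary here):

```go
package agent

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// healthHash computes a hash over the health fields so that a later message can
// carry only the hash when nothing changed.
func healthHash(up bool, uptimeNanos int64, lastError string) []byte {
	h := sha256.New()
	fmt.Fprintf(h, "%t|%d|%s", up, uptimeNanos, lastError)
	return h.Sum(nil)
}

// needsFullReport returns true when the current health differs from what was
// last reported, i.e. when the full AgentHealth content should be sent again.
func needsFullReport(lastSentHash, currentHash []byte) bool {
	return !bytes.Equal(lastSentHash, currentHash)
}
```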
I think that implementation looks good. I have a few small pieces of feedback for the PR if you think it's ready for review, but in general I like the implementation.
Thanks for looking, @andykellr. @pmm-sumo I am also interested in your opinion. :-)
As mentioned in the SIG meeting, I think we should change uptime_nanoseconds to a StartedAt timestamp.
Apologies for the late response; I think it's a nice idea to have. We've actually been discussing internally whether OpAMP could be a replacement for the "heartbeat" capability we currently have for our collector, and the above would be a nice fit there.
Agreed. Another benefit of […].
Draft PR updated to use StartTime (that's the name we use in OTLP, so I went with it instead of StartedAt): open-telemetry/opamp-go#92
I updated the draft Go implementation to reflect the discussion above and the recent changes to hashes/compression. It now uses the following health message:

```protobuf
message AgentHealth {
    // Set to true if the Agent is up and running.
    bool up = 1;

    // Timestamp since which the Agent has been up, i.e. when the Agent was started.
    // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
    // If "up" is false this field is unused.
    fixed64 start_time_unix_nano = 2;

    // Human readable error message if the Agent is in an erroneous state. Typically set
    // when up==false.
    string last_error = 3;
}
```
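To show how a Supervisor might fill this in, here is a hedged Go sketch. The struct mirrors the message above and would in practice be generated from the .proto file; how the report is then handed to the OpAMP client is left out, since the opamp-go API was still a draft at this point:

```go
package supervisor

import "time"

// AgentHealth mirrors the protobuf message above; in practice this type would be
// generated from the .proto definition rather than written by hand.
type AgentHealth struct {
	Up                bool
	StartTimeUnixNano uint64
	LastError         string
}

// currentHealth builds the health report for a running or failed Agent.
func currentHealth(agentStart time.Time, startErr error) *AgentHealth {
	if startErr != nil {
		// The Agent failed to start: report it as down with the error message.
		return &AgentHealth{Up: false, LastError: startErr.Error()}
	}
	return &AgentHealth{
		Up:                true,
		StartTimeUnixNano: uint64(agentStart.UnixNano()),
	}
}
```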
Resolves #62. Here is an implementation in Go that demonstrates how health works: open-telemetry/opamp-go#92
We previously had "health" as a concept in OpAMP. It was removed in favour of reporting it as a metric that was planned to be added to Otel spec. However the metric is still undefined in Otel spec and unclear when it will be.
On the other hand, it may be beneficial to have a basic health state reported as part of the OpAMP spec. This can be useful if the OpAMP Server wants to make decisions based on the health state and does not want to depend on fetching it from the independent, and potentially unavailable, metric store to which the Agent sends its metrics.
For example, if the Server performs a mass update of Agents to a newer version, it may want to monitor the health of the fleet and make a decision about rolling back the update if a certain percentage of Agents is unhealthy some period of time after the update is pushed to the Agents. For the Server to be able to do so, it probably needs some basic health information, such as Healthy (bool), Uptime (when did the Agent start), etc.