[receiver/kubeletstats] Add pod and container state metrics #29157

Closed
sirianni wants to merge 3 commits

Contributor

@sirianni sirianni commented Nov 13, 2023

Description

Add the following state metrics to the kubeletstats receiver:

  • k8s.pod.state
  • k8s.container.state
  • k8s.container.last_termination_state

Note

There is potential overlap here with the following metrics from k8sclusterreceiver

The above metrics don't work well for our use case for two main reasons:

  1. These metrics encode the enumeration in the metric value, instead of as an attribute. See [k8sclusterreceiver] refactoring pod status phase #24425 for more details.
  2. There are scalability issues in collecting this from the k8sclusterreceiver singleton. Collecting these metrics directly from the kubelet, using a daemonset OTel Collector deployment pattern, scales nicely with the size of the k8s cluster.
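
To make the first point concrete, here is a minimal Python sketch contrasting the two encodings. It is not the receiver's code: the pod names, the numeric phase mapping, and the helper names are assumptions chosen purely for illustration.

```python
from collections import Counter

# Illustrative numeric encoding of pod phases (an assumption for this sketch).
PHASE_TO_VALUE = {"Pending": 1, "Running": 2, "Succeeded": 3, "Failed": 4, "Unknown": 5}

# Hypothetical pods and their current phases.
pods = {"web-1": "Running", "web-2": "Running", "job-1": "Failed", "db-1": "Pending"}

# Encoding 1: the enumeration is the metric value, one data point per pod.
value_encoded = [
    {"attributes": {"k8s.pod.name": name}, "value": PHASE_TO_VALUE[phase]}
    for name, phase in pods.items()
]

# Encoding 2: the enumeration is an attribute and the value is always 1.
attribute_encoded = [
    {"attributes": {"k8s.pod.name": name, "phase": phase}, "value": 1}
    for name, phase in pods.items()
]


def sum_values(points):
    """Sum data point values, as a typical backend aggregation would."""
    return sum(p["value"] for p in points)


def count_by_phase(points):
    """Group attribute-encoded points by phase and sum their values."""
    counts = Counter()
    for p in points:
        counts[p["attributes"]["phase"]] += p["value"]
    return dict(counts)


# Summing value-encoded points adds up enum codes, which has no useful meaning.
print(sum_values(value_encoded))
# Summing attribute-encoded points per phase gives the number of pods in each phase.
print(count_by_phase(attribute_encoded))
```

With the attribute encoding, an ordinary sum grouped by `phase` answers "how many pods are in each phase", whereas the value encoding needs an equality filter on a magic number for every phase.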

Testing

A new unit test was added using the pods.json test fixture.

This was also tested manually using the debug exporter.

k8s.container.state metric

ResourceMetrics #11
Resource SchemaURL: 
Resource attributes:
     -> k8s.pod.uid: Str(317c571a-affc-4039-8fba-7fa6992b84ed)
     -> k8s.pod.name: Str(nginx-75bf547d49-27jhb)
     -> k8s.namespace.name: Str(default)
     -> k8s.container.name: Str(nginx)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/kubeletstatsreceiver development
Metric #0
Descriptor:
     -> Name: k8s.container.state
     -> Description: Current state of the container
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> state: Str(waiting)
     -> reason: Str(ImagePullBackOff)
StartTimestamp: 2023-10-25 15:09:54.571128664 +0000 UTC
Timestamp: 2023-10-25 15:09:55.612832344 +0000 UTC
Value: 1
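
For readers unfamiliar with where the `state` and `reason` attributes come from, the following Python sketch shows one way they could be derived from a Kubernetes container status object. It is an illustration only, not the code in this PR; the function name and the sample dictionaries are hypothetical, and the input shape assumes the JSON form of the Kubernetes `ContainerStatus` type.

```python
def container_state_attributes(container_status):
    """Derive data point attributes from a container status dictionary.

    Assumes the JSON shape of a Kubernetes ContainerStatus, where `state`
    holds exactly one of `waiting`, `running`, or `terminated`.
    """
    state = container_status.get("state") or {}
    for name in ("waiting", "running", "terminated"):
        if name in state:
            details = state[name] or {}
            # `running` carries no reason, so fall back to an empty string.
            return {"state": name, "reason": details.get("reason", "")}
    # The Kubernetes API documents `waiting` as the default when no member is set.
    return {"state": "waiting", "reason": ""}


# Hypothetical sample resembling the nginx container shown above.
waiting = {"name": "nginx", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
running = {"name": "nginx", "state": {"running": {"startedAt": "2023-10-25T15:09:54Z"}}}

print(container_state_attributes(waiting))
print(container_state_attributes(running))
```

Each container would then emit a single data point with value 1 carrying these attributes, matching the shape of the debug exporter output above.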

k8s.pod.state metric

ResourceMetrics #20
Resource SchemaURL: 
Resource attributes:
     -> k8s.pod.uid: Str(e4becd6e-cf43-4b24-9724-3e7faf757124)
     -> k8s.pod.name: Str(sirianni-test)
     -> k8s.namespace.name: Str(default)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/kubeletstatsreceiver development
Metric #0
Descriptor:
     -> Name: k8s.pod.state
     -> Description: Current state of the pod
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> phase: Str(Running)
     -> reason: Str()
StartTimestamp: 2023-10-25 15:35:28.270921058 +0000 UTC
Timestamp: 2023-10-25 15:35:29.318908321 +0000 UTC
Value: 1

My company is also running this code in production. Here are some screenshots showing how we use it in the Datadog UI
[two screenshots of the Datadog UI]

Documentation

New metrics are added automatically to documentation.md

* `k8s.pod.state`
* `k8s.container.state`
* `k8s.container.last_termination_state`
@TylerHelmuth
Member

@sirianni are you available to bring this issue/pr up at a SIG meeting?

@sirianni
Contributor Author

sirianni commented Nov 13, 2023

> @sirianni are you available to bring this issue/pr up at a SIG meeting?

When is the meeting?

Found it

Every Wednesday at 09:00 PT plus monthly on first Wednesday at 00:00 PT and on third Wednesday at 16:00 PT

Yes, I can join this Wednesday 11/15.

@sirianni
Contributor Author

Here is the Datadog Agent PR I referenced in today's SIG call. It mentions the scalability issues of using a cluster-wide collector (i.e. kube-state-metrics) for this data.

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 30, 2023
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 15, 2023
@cmergenthaler
Contributor

Why has this been closed? If there are scaling issues with the current solution, wouldn't it make sense to include it in the kubelet receiver?

@povilasv
Contributor

FYI this looks more like resource attributes not metrics and we recently added last terminated state to cluster receiver.

See #31282 for the discussion

@sirianni
Contributor Author

> this looks more like resource attributes not metrics

I'm not following.

Resources are not first-class signals in OTel. They are only relevant when attached to a metric, log, or trace. Without a metric, how do you envision the state change actually being encoded and transmitted? Would it piggyback on an existing metric? How would it be queried? Resources are nothing but metric tags in most vendors' data models.

This seems to be leaking vendor-specific semantics into the OTel model. Splunk seems to be pushing some odd semantics with k8sclusterreceiver that are incompatible with how other systems model Kubernetes metrics (kube-state-metrics/prometheus and Datadog being two prominent examples).

@povilasv
Contributor

IMO the Resource model is well documented in OTel. It's not really a signal but the way OTel represents entities.

> A resource represents the entity producing telemetry as resource attributes. For example, a process producing telemetry that is running in a container on Kubernetes has a process name, a pod name, a namespace, and possibly a deployment name. All four of these attributes can be included in the resource.

Also, Prometheus is working on supporting the resource model: https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry/#native-support-for-resource-attributes

@sirianni
Contributor Author

> IMO the Resource model is well documented in OTel. It's not really a signal but the way OTel represents entities.

Yes, I am aware of that. But can you answer my specific questions?

OTel does not specify a mechanism to track, encode, and export "state changes" to resource attributes as first-class things.

@povilasv
Contributor

I don't follow. So the resource attributes would change; we transmit them with logs, traces, metrics.

@sirianni
Contributor Author

sirianni commented Jun 7, 2024

> we transmit them with logs, traces, metrics.

But what metric would you transmit it with? All of them? And how would this be queried? In most backends you can't query "resource attributes" as first-class things. Resource attributes are simply extra tags on metrics.

This is what I meant by "piggybacking on an existing metric".
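
The piggybacking concern can be sketched in a few lines of Python. This assumes a backend that flattens resource attributes into per-series tags, as described above. The resource attribute name `k8s.pod.state`, the sample values, and both helper functions are hypothetical, and the two metric names merely stand in for whatever metrics the resource happens to be attached to.

```python
def flatten(resource_attrs, points):
    """Merge resource attributes into each data point's tags.

    Models a backend that has no separate notion of a resource.
    """
    return [
        {
            "metric": p["metric"],
            "tags": {**resource_attrs, **p["attributes"]},
            "value": p["value"],
        }
        for p in points
    ]


def pods_in_state(series, carrier_metric, state):
    """Count distinct pods whose flattened tags report the given state.

    The caller must name a carrier metric; its numeric value is ignored.
    """
    return len({
        s["tags"]["k8s.pod.name"]
        for s in series
        if s["metric"] == carrier_metric and s["tags"].get("k8s.pod.state") == state
    })


# Hypothetical resource carrying the pod state as a resource attribute.
resource = {"k8s.pod.name": "sirianni-test", "k8s.pod.state": "Running"}
points = [
    {"metric": "k8s.pod.cpu.utilization", "attributes": {}, "value": 0.25},
    {"metric": "k8s.pod.memory.usage", "attributes": {}, "value": 1048576},
]

series = flatten(resource, points)
# The state is duplicated onto every series, and a query has to choose one of them.
print(pods_in_state(series, "k8s.pod.memory.usage", "Running"))
```

In this model the state never appears on its own: it is copied onto every series, and answering "how many pods are Running" means choosing an unrelated metric and discarding its value.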
