Add kubernetes-cluster receiver #175
Conversation
Still working through it.
Still working on it
I did a shallow pass and left some comments.
Since this is a pretty big chunk of code ported from SmartAgent, please either point to the new code or break this down into smaller PRs.
#### node_conditions_to_report

The node conditions this receiver should report. See [here](https://kubernetes.io/docs/concepts/architecture/nodes/#condition)
for a list of node conditions. If this option is set to `[Ready]`, a metric called …
It would be useful to explain the format for this option. Is it an array? Can we list multiple values in the brackets? How are other condition names mapped to metric names?
Added an example depicting metrics created for a particular config value.
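For readers of the thread, a hedged sketch of what such a config and its effect might look like; the option name comes from the README excerpt above, while the exact per-condition metric naming is an assumption here (the example added in the PR is authoritative):

```yaml
receivers:
  k8s_cluster:
    # A YAML array: multiple conditions can be listed in the brackets.
    # Presumably each listed condition maps to its own metric, one per
    # node condition (e.g. Ready and MemoryPressure each get a metric).
    node_conditions_to_report: [Ready, MemoryPressure]
```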

    @@ -0,0 +1,215 @@
    # Kubernetes Cluster Receiver
Nice doc!

    )

    // TODO: Consider moving some of these constants to
    // https://github.com/open-telemetry/opentelemetry-collector/blob/master/translator/conventions/opentelemetry.go.
Yes. And we probably also need to make a proposal to add some of these constants to OpenTelemetry semantic conventions.
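As a hedged illustration of what that proposal could look like, here are the receiver's string keys promoted into a shared conventions file; the constant names below are invented for the sketch, and only the string values appear in this PR:

```go
// Hypothetical constants for a shared conventions package; the names are
// illustrative, the values are the keys this receiver already uses.
const (
	AttributeK8sPodUID   = "k8s.pod.uid"
	AttributeContainerID = "container.id"
)
```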

    }

    func (kr *kubernetesReceiver) Shutdown(context.Context) error {
        kr.cancel()
👍
Nice that cancellation is taken care of.
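For context, a minimal sketch of the pattern being praised, assuming a ticker-driven collection loop; the run method and the hard-coded interval are stand-ins for whatever the PR actually wires to collection_interval:

```go
// Sketch: Start stores kr.cancel from context.WithCancel, and the loop
// below exits cleanly once Shutdown invokes that cancel function.
func (kr *kubernetesReceiver) run(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second) // stand-in for collection_interval
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			kr.dispatchMetricData(ctx)
		case <-ctx.Done():
			return // Shutdown called kr.cancel()
		}
	}
}
```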

    func (kr *kubernetesReceiver) Start(ctx context.Context, host component.Host) error {
        var c context.Context
        c, kr.cancel = context.WithCancel(ctx)
👍
Still working on it, a couple more things I noticed.
I've moved all refactored code from SignalFx Agent to a package called …

    func (kr *kubernetesReceiver) dispatchMetricData(ctx context.Context) {
        for _, m := range kr.resourceWatcher.dataCollector.CollectMetricData() {
            kr.consumer.ConsumeMetricsData(ctx, m)
This can return an error. It needs to be handled using obsreport; see for example the prometheus receiver or the opencensus receiver. If we don't do this, it will be silently ignored.
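A rough sketch of the shape this suggestion takes, modeled on how other receivers of that era wrapped ConsumeMetricsData with obsreport; the obsreport argument lists and the transport string here are assumptions, not the PR's final code:

```go
// Sketch only: obsreport signatures are assumed from the collector API of
// the time and may not match the merged code exactly.
func (kr *kubernetesReceiver) dispatchMetricData(ctx context.Context) {
	for _, md := range kr.resourceWatcher.dataCollector.CollectMetricData() {
		opCtx := obsreport.StartMetricsReceiveOp(ctx, "k8s_cluster", "")
		err := kr.consumer.ConsumeMetricsData(opCtx, md)

		// Count timeseries and points in the OpenCensus-style MetricsData
		// so the receive operation can record them instead of dropping err.
		numTimeSeries, numPoints := 0, 0
		for _, metric := range md.Metrics {
			numTimeSeries += len(metric.Timeseries)
			for _, ts := range metric.Timeseries {
				numPoints += len(ts.Points)
			}
		}
		obsreport.EndMetricsReceiveOp(opCtx, "k8s_cluster", numPoints, numTimeSeries, err)
	}
}
```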
Updated to handle errors returned by ConsumeMetricsData. I noticed that in the other receivers these errors appear to be handled per ConsumeMetricsData call, but in those cases the metrics also seem to come from a single source. For this receiver, metrics originate from various sources, so I've combined the errors from all ConsumeMetricsData calls in an interval. Let me know what you think.
Also, the internal metrics about the number of timeseries and the number of datapoints: are these sent to the exporter too?
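A self-contained sketch of the "combined errors per interval" approach described above; combineErrors is a hand-rolled placeholder for whatever helper the collector provides:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// combineErrors folds the errors gathered in one collection interval into a
// single error, so no individual ConsumeMetricsData failure is dropped.
func combineErrors(errs []error) error {
	if len(errs) == 0 {
		return nil
	}
	parts := make([]string, 0, len(errs))
	for _, err := range errs {
		parts = append(parts, err.Error())
	}
	return fmt.Errorf("[%s]", strings.Join(parts, "; "))
}

func main() {
	errs := []error{errors.New("source A failed"), errors.New("source B failed")}
	fmt.Println(combineErrors(errs)) // [source A failed; source B failed]
}
```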

    // Resource labels for container.
    containerKeyID       = "container.id"
    containerKeySpecName = "container.spec.name"
This should be just "container.name". See https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/resource/semantic_conventions#container
This is the name of the container provided in a Pod spec. I don't think this will reflect the name of the container that's actually running. If I understand correctly, container.name should be the name of the actual running container, right?
Not sure I understand. Can you provide an example of the config plus the name that will be created, something from a real Kubernetes setup?
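To ground the question, a sketch using the real corev1 fields involved: the name written in the Pod spec is echoed verbatim in each ContainerStatus, while the container runtime assigns a separate, prefixed ContainerID. The helper name and label keys mirror this PR and are illustrative, not an official API:

```go
package main

import corev1 "k8s.io/api/core/v1"

// containerResourceLabels shows where the two values come from on a live
// Pod object: cs.Name repeats spec.containers[i].name, and cs.ContainerID
// is the runtime-assigned ID such as "docker://3f0c...".
func containerResourceLabels(pod *corev1.Pod) []map[string]string {
	var all []map[string]string
	for _, cs := range pod.Status.ContainerStatuses {
		all = append(all, map[string]string{
			"container.spec.name": cs.Name,
			"container.id":        cs.ContainerID,
		})
	}
	return all
}
```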

    k8sKeyResourceQuotaName = "k8s.resourcequota.name"

    // Resource labels for container.
    containerKeyID = "container.id"
Is "container.id"
a k8s concept or is this generic enough? Currently this is not defined in the semantic conventions.
It is generic as far as I know. I know Docker's got it and also CRI-O. I think we could include it in the convention.
Do you have any documentation?

    containerKeyStatus       = "container.status"
    containerKeyStatusReason = "container.status.reason"

    // Values for container properties
    containerStatusRunning    = "running"
    containerStatusWaiting    = "waiting"
    containerStatusTerminated = "terminated"
Should this be a metric or attributes in the resource/properties?
The container status would be part of the properties associated with a container.id. That way one could group containers by status while doing some analytics on container metrics.
Why? What is the rationale for having it that way vs. a proper metric or event?

    const (
        // Keys for K8s properties
        k8sKeyWorkLoad = "k8s.workload"
This should be "k8s.workload.kind"? Also not sure where is this used, can you point me to where do we use this?
Based on the usage in this file this seems to be unnecessary, you do have the kind/type in the resourceIDKey
.
Yes, k8s.workload.kind is better.
Will ping you offline with some details about these workload properties.

    const (
        // Keys for K8s properties
        k8sKeyWorkLoad     = "k8s.workload"
        k8sKeyWorkLoadName = "k8s.workload.name"
Should we use the k8s.{type}.name?
I think it is nice to have this key be resource type agnostic.
"container.id": "container-id", | ||
"container.spec.name": "container-name", | ||
"container.image.name": "container-image-name", | ||
"k8s.pod.uid": "test-pod-1-uid", |
It is very unclear to me when we expect to see a uid vs. a name in the resource. Why do we not have a k8s.node.uid in this resource? It may be worth adding an issue to clarify when to use one or the other.
Node information for a Pod is obtained from the Pod spec. In this case we don't add k8s.node.uid because it is not available in the Pod spec, whereas the node name is.
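A short sketch of that point using the real corev1 field: the Pod spec carries the scheduled node's name but no node UID, so only a name label can be set from the Pod object itself. The label key below mirrors the k8s.node.uid naming above and is an assumption:

```go
package main

import corev1 "k8s.io/api/core/v1"

// nodeLabelsForPod derives node labels from a Pod alone. pod.Spec.NodeName
// exists on the spec; there is no node-UID field there, so k8s.node.uid
// would need a separate lookup of the Node object via the API.
func nodeLabelsForPod(pod *corev1.Pod) map[string]string {
	labels := map[string]string{}
	if pod.Spec.NodeName != "" {
		labels["k8s.node.name"] = pod.Spec.NodeName
	}
	return labels
}
```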
Where can I find what is or isn't available in the Pod spec?
Let's file issues for the 3 remaining items and move forward with this PR. I don't want to block this work, but it is good that we identify these inconsistencies.
Also, we should revisit the metric names to be consistent with the OpenTelemetry conventions, which are not yet defined.
@bogdandrutu I am merging this since you appear to be OK (issues were created).
### Description

Add `k8s_cluster` receiver. This receiver monitors resources in a cluster and collects metrics and metadata to correlate between resources. This receiver watches for changes using the K8s API.

##### Key Data Structures

- `kubernetesReceiver`: Sends metrics along the pipeline.
- `resourceWatcher`: Handles events from the K8s API, collecting metrics and metadata from the cluster. This struct uses an `informers.SharedInformerFactory` to watch for events.
- `dataCollector`: Handles collection of data from Kubernetes resources. The collector has a `metricsStore` to keep track of the latest metrics representing the cluster state and a `metadataStore` (a wrapper around `cache.Store`) to track the latest metadata from the cluster.

##### Workflow

- **resourceWatcher setup** - Set up the SharedInformerFactory and add event handlers to the informers. The `onAdd`, `onUpdate` and `onDelete` methods on the `resourceWatcher` handle resource creations, deletions and updates (see the informer sketch after this list).
- **Event Handling**
  - _Add/Update_: On receiving an event corresponding to a resource creation or update, the latest metrics are collected by the `syncMetrics` method on `dataCollector`. The collected metrics are cached in `metricsStore`. Methods responsible for collecting data from each supported Kubernetes resource type are prefixed with `getMetricsFor`. For example, `getMetricsForPod` collects metrics from a Pod.
  - _Delete_: On deletion of resources, the cached entry is removed from `metricsStore`.
  - Note that only metric collection is turned on right now. The metadata collection code is currently inactive (it is controlled by the `collectMedata` field).
- **Metric Syncing**: Metrics from the `metricsStore` are sent along the pipeline once every `collection_interval` seconds.
- **Metadata Syncing**: TODO (the metadata collection code is inactive).

### Testing

- Unit tests for each resource type
- Integration test for the receiver
- Manual testing with the SignalFx exporter

### Documentation

https://github.com/signalfx/opentelemetry-collector-contrib/blob/k8s-cluster/receiver/kubernetesclusterreceiver/README.md
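To make the workflow above concrete, here is a hedged, self-contained sketch of the informer wiring the description refers to. The client-go APIs are real; the function name and handler bodies are illustrative stand-ins for `resourceWatcher`'s onAdd/onUpdate/onDelete:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPods wires add/update/delete callbacks through a
// SharedInformerFactory, as the resourceWatcher does for each resource type.
func watchPods(client kubernetes.Interface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("add:", pod.Name) // receiver would call syncMetrics here
		},
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			fmt.Println("update:", pod.Name) // re-collect and cache in metricsStore
		},
		DeleteFunc: func(obj interface{}) {
			// receiver would evict the cached entry from metricsStore
		},
	})
	factory.Start(stop) // begins watching the K8s API
	factory.WaitForCacheSync(stop)
}
```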
…y-picked commits (open-telemetry#168)" (open-telemetry#175) This reverts commit c9246c3.