Add kubernetes-cluster receiver #175

Merged (20 commits, Apr 27, 2020)

Conversation

asuresh4 (Member)

Description

Add the k8s_cluster receiver. This receiver monitors resources in a cluster and
collects metrics and metadata that can be used to correlate resources. It
watches for changes using the K8s API.

Key Data Structures (a rough sketch of these types follows this list)
  • kubernetesReceiver: Sends metrics along the pipeline.
  • resourceWatcher: Handles events from the K8s API, collecting metrics and
    metadata from the cluster. This struct uses an informers.SharedInformerFactory
    to watch for events.
  • dataCollector: Handles collection of data from Kubernetes resources. The
    collector has a metricsStore to keep track of the latest metrics representing the
    cluster state and a metadataStore (a wrapper around cache.Store) to track
    the latest metadata from the cluster.
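The following is a minimal, illustrative sketch of how these three types could fit together. It is not the receiver's actual code; any field names beyond those mentioned above are assumptions.

```go
// Illustrative sketch only: simplified versions of the types described above.
// Field names not mentioned in the description are assumptions.
package sketch

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// kubernetesReceiver drives collection and sends metrics along the pipeline.
type kubernetesReceiver struct {
	resourceWatcher *resourceWatcher
	// consumer, collection interval and cancel func omitted for brevity.
}

// resourceWatcher handles events from the K8s API via shared informers.
type resourceWatcher struct {
	informerFactory informers.SharedInformerFactory
	dataCollector   *dataCollector
}

// dataCollector turns watched Kubernetes objects into metrics and metadata.
type dataCollector struct {
	metricsStore  *metricsStore  // latest metrics representing cluster state
	metadataStore *metadataStore // latest metadata from the cluster
}

// metricsStore caches the most recent metrics per resource.
type metricsStore struct {
	metricsCache map[string]interface{} // keyed, e.g., by resource UID
}

// metadataStore is a thin wrapper around cache.Store.
type metadataStore struct {
	store cache.Store
}
```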
Workflow
  • resourceWatcher setup
    • Set up the SharedInformerFactory and add event handlers to the informers. The
      onAdd, onUpdate and onDelete methods on the resourceWatcher handle
      resource creations, updates and deletions (see the wiring sketch after this list).
  • Event Handling
    • Add/Update: On receiving an event corresponding to a resource creation or
      update, the latest metrics are collected by the syncMetrics method on dataCollector.
      The collected metrics are cached in metricsStore. Methods responsible for
      collecting data from each supported Kubernetes resource type are prefixed with
      getMetricsFor. For example, getMetricsForPod collects metrics from a Pod.
    • Delete: On deletion of a resource, the cached entry is removed from metricsStore.
    • Note that only metric collection is turned on right now. The metadata collection code
      is currently inactive (it is controlled by the collectMedata field).
  • Metric Syncing: Metrics from the metricsStore are sent along the pipeline once
    every collection_interval seconds.
  • Metadata Syncing: TODO (the metadata collection code is inactive)
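The handler wiring described above follows the standard client-go shared-informer pattern. Below is a hedged sketch of that wiring: the stand-in types, the setupInformers name, and the choice of resources (Pods and Nodes only) are illustrative assumptions, not the receiver's actual code.

```go
// Illustrative wiring of resourceWatcher handlers onto shared informers.
// Minimal stand-in types are declared here so the sketch compiles on its own.
package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type dataCollector struct{}

// syncMetrics stands in for the real method: compute the latest metrics for
// obj and cache them in the metricsStore.
func (d *dataCollector) syncMetrics(obj interface{}) {}

type resourceWatcher struct{ dataCollector *dataCollector }

func (rw *resourceWatcher) onAdd(obj interface{})          { rw.dataCollector.syncMetrics(obj) }
func (rw *resourceWatcher) onUpdate(_, newObj interface{}) { rw.dataCollector.syncMetrics(newObj) }
func (rw *resourceWatcher) onDelete(obj interface{})       { /* remove cached entry from metricsStore */ }

// setupInformers registers the watcher's handlers on each watched resource
// type (only Pods and Nodes shown here for brevity).
func setupInformers(client kubernetes.Interface, rw *resourceWatcher) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	handler := cache.ResourceEventHandlerFuncs{
		AddFunc:    rw.onAdd,
		UpdateFunc: rw.onUpdate,
		DeleteFunc: rw.onDelete,
	}
	factory.Core().V1().Pods().Informer().AddEventHandler(handler)
	factory.Core().V1().Nodes().Informer().AddEventHandler(handler)

	return factory
}
```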

Testing

  • Unit tests for each resource type
  • Integration test for receiver
  • Manual testing with SignalFx exporter

Documentation

https://github.com/signalfx/opentelemetry-collector-contrib/blob/k8s-cluster/receiver/kubernetesclusterreceiver/README.md

@jrcamp (Contributor) left a comment:

Still working through it.

Resolved review threads on: receiver/kubernetesclusterreceiver/README.md, collector.go, metricsstore.go, watcher.go
@jrcamp (Contributor) left a comment:

Still working on it

Resolved review threads on: receiver/k8sclusterreceiver/containers.go, daemonsets.go, deployments.go, factory.go
@tigrannajaryan (Member) left a comment:

I did a shallow pass and left some comments.

Since this is a pretty big chunk of code ported from SmartAgent, please either point to the new code or break this down into smaller PRs.

#### node_conditions_to_report

The node conditions this receiver should report. See [here](https://kubernetes.io/docs/concepts/architecture/nodes/#condition)
for a list of node conditions. If this option is set to `[Ready]`, a metric called
Member:

It would be useful to explain the format for this option. Is it an array? Can we list multiple values in the brackets? How are other condition names mapped to metric names?

asuresh4 (Member Author):

Added an example depicting metrics created for a particular config value.
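For illustration only, here is a hedged sketch of how a configured condition type could map to a metric name. The exact naming scheme is defined by the receiver's README (the excerpt above is truncated), so the conditionMetricName helper and the assumed "k8s.node.condition_<snake_case_type>" pattern are assumptions.

```go
// Hypothetical mapping from a configured node condition type (e.g. "Ready",
// "MemoryPressure") to a metric name such as "k8s.node.condition_ready".
// The receiver's README is authoritative; this is only a sketch.
package sketch

import (
	"fmt"
	"strings"
)

// conditionMetricName lower-snake-cases the condition type and prefixes it.
func conditionMetricName(conditionType string) string {
	var b strings.Builder
	for i, r := range conditionType {
		if r >= 'A' && r <= 'Z' {
			if i > 0 {
				b.WriteRune('_')
			}
			b.WriteRune(r - 'A' + 'a')
		} else {
			b.WriteRune(r)
		}
	}
	return fmt.Sprintf("k8s.node.condition_%s", b.String())
}
```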

@@ -0,0 +1,215 @@
# Kubernetes Cluster Receiver
Member:

Nice doc!

Resolved review thread on: receiver/k8sclusterreceiver/containers.go
)

// TODO: Consider moving some of these constants to
// https://github.com/open-telemetry/opentelemetry-collector/blob/master/translator/conventions/opentelemetry.go.
Member:

Yes. And we probably also need to make a proposal to add some of these constants to OpenTelemetry semantic conventions.

Resolved review threads on: receiver/k8sclusterreceiver/collector.go, metricsstore.go, pods.go
}

func (kr *kubernetesReceiver) Shutdown(context.Context) error {
kr.cancel()
Member:

👍
Nice that cancellation is taken care of.


func (kr *kubernetesReceiver) Start(ctx context.Context, host component.Host) error {
var c context.Context
c, kr.cancel = context.WithCancel(ctx)
Member:

👍
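The cancellation pattern praised in these two snippets, sketched below in a self-contained form. It uses only the standard library; the real receiver's Start takes the collector's component.Host and feeds a metrics consumer, and the receiverSketch type and interval field here are assumptions.

```go
// Sketch of the Start/Shutdown cancellation pattern shown above, simplified
// so it compiles on its own (no collector interfaces).
package sketch

import (
	"context"
	"time"
)

type receiverSketch struct {
	cancel   context.CancelFunc
	interval time.Duration
}

// Start derives a cancellable context and launches the collection loop.
func (r *receiverSketch) Start(ctx context.Context) error {
	var c context.Context
	c, r.cancel = context.WithCancel(ctx)

	go func() {
		ticker := time.NewTicker(r.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// Dispatch cached metrics along the pipeline here.
			case <-c.Done():
				return // Shutdown was called or the parent context ended.
			}
		}
	}()
	return nil
}

// Shutdown cancels the derived context, stopping the collection loop.
func (r *receiverSketch) Shutdown(context.Context) error {
	r.cancel()
	return nil
}
```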

@jrcamp (Contributor) left a comment:

Still working on it, couple more things I noticed.

Resolved review threads on: receiver/k8sclusterreceiver/collection/metadata_test.go, collection/metadatastore.go, go.mod
@asuresh4 (Member Author):

I did a shallow pass and left some comments.

Since this is a pretty big chunk of code ported from SmartAgent, please either point to the new code or break this down into smaller PRs.

I've moved all refactored code from SignalFx Agent to a package called collection. The new code is in the k8sclusterreceiver package and it uses collection.DataCollector to interact with the refactored collection logic. Let me know if that'll make it easier to review.


func (kr *kubernetesReceiver) dispatchMetricData(ctx context.Context) {
for _, m := range kr.resourceWatcher.dataCollector.CollectMetricData() {
kr.consumer.ConsumeMetricsData(ctx, m)
Member:

This can return an error. It needs to be handled using obsreport. See for example prometheus receiver or opencensus receiver.
If we don't do this it will be silently ignored.

asuresh4 (Member Author):

Updated to handle errors returned by ConsumeMetricsData. In the other receivers, these errors appear to be handled per ConsumeMetricsData call, but in those cases the metrics also seem to come from a single source. For this receiver, metrics originate from various sources, so I've combined the errors from all ConsumeMetricsData calls in an interval (a rough sketch of the idea follows). Let me know what you think.

Also, are these internal metrics about the number of timeseries and datapoints sent to the exporter too?
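A rough sketch of the "combine errors from all calls in one interval" idea described in this reply. The dispatchAll function, the consumeFn parameter, and the interface{} batch type are stand-ins for the consumer's ConsumeMetricsData; the actual change also records outcomes via obsreport, which is omitted here.

```go
// Sketch: keep dispatching all cached metric batches in an interval and
// aggregate any consumer errors into a single error, instead of stopping at
// the first failure.
package sketch

import (
	"context"
	"fmt"
	"strings"
)

func dispatchAll(ctx context.Context, batches []interface{},
	consumeFn func(context.Context, interface{}) error) error {

	var errs []string
	for _, m := range batches {
		// Continue with the remaining batches even if one consumer call fails.
		if err := consumeFn(ctx, m); err != nil {
			errs = append(errs, err.Error())
		}
	}
	if len(errs) == 0 {
		return nil
	}
	return fmt.Errorf("errors dispatching metrics: %s", strings.Join(errs, "; "))
}
```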


// Resource labels for container.
containerKeyID = "container.id"
containerKeySpecName = "container.spec.name"

asuresh4 (Member Author):

This is the name of the container provided in a Pod spec. I don't think this will reflect the name of the container that's actually running. If I understand correctly, container.name should be the name of the actual running container, right?

Member:

Not sure I understand. Can you provide an example of the config and the name that will be created, using something from a real Kubernetes cluster?

k8sKeyResourceQuotaName = "k8s.resourcequota.name"

// Resource labels for container.
containerKeyID = "container.id"
Member:

Is "container.id" a k8s concept or is this generic enough? Currently this is not defined in the semantic conventions.

@asuresh4 (Member Author) on Apr 23, 2020:

It is generic as far as I know. I know Docker's got it and also CRI-O. I think we could include it in the convention.

Member:

Do you have any documentation?

Comment on lines +28 to +34
containerKeyStatus = "container.status"
containerKeyStatusReason = "container.status.reason"

// Values for container properties
containerStatusRunning = "running"
containerStatusWaiting = "waiting"
containerStatusTerminated = "terminated"
Member:

Should this be a metric or attributes in the resource/properties?

asuresh4 (Member Author):

The container status would be part of the properties associated with a container.id. That way one could group containers by status while doing some analytics on container metrics.

@bogdandrutu (Member) on Apr 24, 2020:

Why? What is the rationale for having it that way vs. a proper metric or event?


const (
// Keys for K8s properties
k8sKeyWorkLoad = "k8s.workload"
Member:

This should be "k8s.workload.kind"? Also, I'm not sure where this is used; can you point me to where we use it?

Based on the usage in this file, it seems unnecessary; you already have the kind/type in the resourceIDKey.

asuresh4 (Member Author):

Yes, k8s.workload.kind is better.

Will ping you offline with some details about these workload properties.

Resolved review threads on: receiver/k8sclusterreceiver/README.md, collection/metadata.go
const (
// Keys for K8s properties
k8sKeyWorkLoad = "k8s.workload"
k8sKeyWorkLoadName = "k8s.workload.name"
Member:

Should we use the k8s.{type}.name?

asuresh4 (Member Author):

I think it is nice to have this key be resource type agnostic.

"container.id": "container-id",
"container.spec.name": "container-name",
"container.image.name": "container-image-name",
"k8s.pod.uid": "test-pod-1-uid",
Member:

It is very unclear to me when we expect to see uid vs. name in the resource. Why do we not have a k8s.node.uid in this resource? It may be worth adding an issue to clarify when to use one or the other.

asuresh4 (Member Author):

Node information for a Pod is obtained from the Pod spec. In this case we don't add k8s.node.uid because it is not available in the Pod spec whereas the node name is available.

Member:

Where can I find what is available or not in the Pod Specs?


@bogdandrutu (Member):

Let's file issues for the 3 remaining items and move forward with this PR. I don't want to block this work but it is good that we identify these inconsistencies.

@bogdandrutu (Member):

Also, we should revisit the metric names to be consistent with the OpenTelemetry conventions, which are not yet defined.

@tigrannajaryan (Member):

@bogdandrutu I am merging this since you appear to be OK with it (issues were created).

@tigrannajaryan merged commit 9ee65b1 into open-telemetry:master on Apr 27, 2020
@asuresh4 deleted the k8s-cluster branch on April 29, 2020 (20:16)
wyTrivail referenced this pull request in mxiamxia/opentelemetry-collector-contrib Jul 13, 2020
bogdandrutu pushed a commit that referenced this pull request May 12, 2022
bjrara pushed a commit to bjrara/opentelemetry-collector-contrib that referenced this pull request Apr 3, 2024