Set ContinueOnError for prometheus http handler. #1679

Random-Liu · 2017-06-20T19:02:25Z

PR #1460 upgraded the prometheus client library to v0.8.0.

However, prometheus v0.8.0 enforces a consistency check that some of our current metrics do not pass. This caused #1671 and kubernetes/kubernetes#47744.

This PR configures the prometheus HTTP handler with ContinueOnError so that metrics are still collected and exposed when the check reports an error. This is also the solution suggested in prometheus/client_golang#214.

This is only a quick fix for the Kubernetes 1.7 release. We should figure out what in our metrics causes the inconsistency and fix it.
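One possible direction for such a fix, offered here only as an assumption and not as what cAdvisor adopted: give every series in a family the same label names by padding each label set to the union of names, using an empty value where a container lacks a label. Prometheus treats a label with an empty value the same as an absent label, so queries are unaffected. `padLabels` and `labelNames` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// padLabels returns copies of the given label sets in which every set has
// the same label names: the union across all inputs. Names missing from a
// set get an empty value. Hypothetical helper, not cAdvisor code.
func padLabels(sets []map[string]string) []map[string]string {
	union := map[string]struct{}{}
	for _, set := range sets {
		for name := range set {
			union[name] = struct{}{}
		}
	}
	out := make([]map[string]string, len(sets))
	for i, set := range sets {
		padded := make(map[string]string, len(union))
		for name := range union {
			padded[name] = set[name] // "" when name is missing from set
		}
		out[i] = padded
	}
	return out
}

// labelNames returns the sorted label names of a set, for printing.
func labelNames(set map[string]string) []string {
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	sets := []map[string]string{
		{"name": "heapster-nanny"},
		{"name": "kubedns", "container_label_annotation_io_kubernetes_container_ports": `[{"name":"metrics","containerPort":10055,"protocol":"TCP"}]`},
	}
	for _, set := range padLabels(sets) {
		fmt.Println(labelNames(set))
	}
}
```

The trade-off is that every series then carries every label name seen on any container, and the full set of names must be known before any sample in the family is emitted.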

@dchen1107 @google/cadvisor @grobie

Random-Liu · 2017-06-20T19:04:50Z

Without this PR:

* collected metric container_tasks_state label:<name:"container_label_annotation_io_kubernetes_container_hash" value:"228e8d0c" > label:<name:"container_label_annotation_io_kubernetes_container_ports" value:"[{\"name\":\"dns-local\",\"containerPort\":10053,\"protocol\":\"UDP\"},{\"name\":\"dns-tcp-local\",\"containerPort\":10053,\"protocol\":\"TCP\"},{\"name\":\"metrics\",\"containerPort\":10055,\"protocol\":\"TCP\"}]" > label:<name:"container_label_annotation_io_kubernetes_container_restartCount" value:"0" > label:<name:"container_label_annotation_io_kubernetes_container_terminationMessagePath" value:"/dev/termination-log" > label:<name:"container_label_annotation_io_kubernetes_container_terminationMessagePolicy" value:"File" > label:<name:"container_label_annotation_io_kubernetes_pod_terminationGracePeriod" value:"30" > label:<name:"container_label_io_kubernetes_container_logpath" value:"/var/log/pods/432b46fd-55e0-11e7-b764-42010af00002/kubedns_0.log" > label:<name:"container_label_io_kubernetes_container_name" value:"kubedns" > label:<name:"container_label_io_kubernetes_docker_type" value:"container" > label:<name:"container_label_io_kubernetes_pod_name" value:"kube-dns-2673147055-32wcs" > label:<name:"container_label_io_kubernetes_pod_namespace" value:"kube-system" > label:<name:"container_label_io_kubernetes_pod_uid" value:"432b46fd-55e0-11e7-b764-42010af00002" > label:<name:"container_label_io_kubernetes_sandbox_id" value:"20c8e397df74c045ed8297bc53b91271dd84e9fbd23ba85d9007e9c21abba526" > label:<name:"id" value:"/kubepods/burstable/pod432b46fd-55e0-11e7-b764-42010af00002/893eb7e95242dd196284f7e10d5b04277c423ac2ff69c26284494b4706a92c21" > label:<name:"image" value:"sha256:ca8759c215c9c2377bee9425bb3ba547ebf85e511759652bdb5d2d980e4f4a21" > label:<name:"name" value:"k8s_kubedns_kube-dns-2673147055-32wcs_kube-system_432b46fd-55e0-11e7-b764-42010af00002_0" > label:<name:"state" value:"stopped" > gauge:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
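The error above compares label *names*, not values: within one metric family, every sample must carry the same set of label names as the samples collected before it. One plausible source visible in this output is the container labels cAdvisor exports, since the kubedns sample carries `container_label_annotation_io_kubernetes_container_ports` while the heapster-nanny sample shown in the output below does not. The sketch is a simplified model written from the error text, not client_golang's implementation; `Sample`, `dims` and `checkConsistency` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical, simplified metric sample.
type Sample struct {
	Family string
	Labels map[string]string
	Value  float64
}

// dims returns the sorted label names of a sample joined into one key,
// i.e. its label dimensions independent of the label values.
func dims(s Sample) string {
	names := make([]string, 0, len(s.Labels))
	for name := range s.Labels {
		names = append(names, name)
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

// checkConsistency keeps, per family, only samples whose label dimensions
// match the first sample seen for that family, and reports the rest.
func checkConsistency(samples []Sample) (kept []Sample, errs []error) {
	first := map[string]string{} // family -> dimensions of first sample
	for _, s := range samples {
		d := dims(s)
		want, seen := first[s.Family]
		if !seen {
			first[s.Family] = d
		} else if d != want {
			errs = append(errs, fmt.Errorf(
				"collected metric %s {%s} has label dimensions inconsistent with previously collected metrics in the same metric family",
				s.Family, d))
			continue // the sample is dropped, not exposed
		}
		kept = append(kept, s)
	}
	return kept, errs
}

func main() {
	samples := []Sample{
		{"container_tasks_state", map[string]string{"name": "heapster-nanny", "state": "stopped"}, 0},
		// This container has an extra label the first one lacks.
		{"container_tasks_state", map[string]string{"name": "kubedns", "state": "stopped",
			"container_label_annotation_io_kubernetes_container_ports": `[{"name":"metrics","containerPort":10055,"protocol":"TCP"}]`}, 0},
	}
	kept, errs := checkConsistency(samples)
	fmt.Println(len(kept), len(errs)) // 1 1
	fmt.Println(errs[0])
}
```

Because the first sample seen fixes the expected dimensions, which samples get rejected depends on the order in which they happen to be collected.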

With this PR:

container_spec_cpu_quota{container_label_annotation_io_kubernetes_container_hash="d7c8bd4a",container_label_annotation_io_kubernetes_container_restartCount="0",container_label_annotation_io_kubernetes_container_terminationMessagePath="/dev/termination-log",container_label_annotation_io_kubernetes_container_terminationMessagePolicy="File",container_label_annotation_io_kubernetes_pod_terminationGracePeriod="30",container_label_io_kubernetes_container_logpath="/var/log/pods/51605191-55e0-11e7-b764-42010af00002/heapster-nanny_0.log",container_label_io_kubernetes_container_name="heapster-nanny",container_label_io_kubernetes_docker_type="container",container_label_io_kubernetes_pod_name="heapster-v1.3.0-1514755676-z24f5",container_label_io_kubernetes_pod_namespace="kube-system",container_label_io_kubernetes_pod_uid="51605191-55e0-11e7-b764-42010af00002",container_label_io_kubernetes_sandbox_id="89d4ce90493e28474622dc057c5132262198868b789e233e5a983632db580248",id="/kubepods/pod51605191-55e0-11e7-b764-42010af00002/457a9ce101c77b6bfa3a7f03c5072375853e11c7613a6f5bf1952abbcba2e64f",image="sha256:9b0815c8711889802a3081d1a609cc4251357e6ec0a28ac5963aac72bec67691",name="k8s_heapster-nanny_heapster-v1.3.0-1514755676-z24f5_kube-system_51605191-55e0-11e7-b764-42010af00002_0"} 5000

dchen1107 · 2017-06-20T19:07:03Z

/lgtm


grobie · 2017-06-21T08:12:18Z

Thanks @Random-Liu. We should fix these metrics indeed.

bboreham · 2017-08-31T16:49:40Z

For the sake of anyone coming across this, I'll note this resulted in the output being reduced to some random subset of all metrics: it "continues" but it still drops the ones it doesn't like. See #1704 for more discussion.

Set ContinueOnError for prometheus http handler.

66c95c7

dchen1107 self-requested a review June 20, 2017 19:04

Random-Liu mentioned this pull request Jun 20, 2017

Stop using the global prometheus registry #1460

Merged

dchen1107 approved these changes Jun 20, 2017


dchen1107 merged commit f3d900b into google:master Jun 20, 2017

Random-Liu deleted the fix-metrics branch June 20, 2017 19:09

Random-Liu mentioned this pull request Jun 20, 2017

Set ContinueOnError for prometheus http handler. #1681

Merged

dchen1107 referenced this pull request Jun 20, 2017

Merge pull request #1681 from Random-Liu/cherrypick-v0.26-#1679

eb454b7

Set ContinueOnError for prometheus http handler.

dashpole mentioned this pull request Jul 6, 2017

cadvisor 0.26.0 /metrics is not work #1671

Closed

matthiasr mentioned this pull request Aug 16, 2017

Inconsistent container metrics in prometheus route #1704

Closed

derekwaynecarr mentioned this pull request Sep 18, 2017

cadvisor/runc updates openshift/origin#16419

Merged

brian-brazil mentioned this pull request Jan 29, 2018

Switch to promhttp handler and continue on error prometheus/collectd_exporter#57

Closed
