
[Bug]: Shortly after a (re)start of the ADOT pod, the flow of metrics to managed Prometheus stops for most metrics #171

Closed
1 task done
dzilbermanvmw opened this issue Jun 2, 2023 · 8 comments
Labels: bug (Something isn't working)

Comments

@dzilbermanvmw

Welcome to the AWS Observability Accelerator

  • Yes, I've searched similar issues on GitHub and didn't find any.

AWS Observability Accelerator Release version

2.4.0

What is your environment, configuration and the example used?

EKS API v1.24.13-eks-0a21954
Managed Grafana: version 9.4
Managed Prometheus:
ADOT: v0.74.0 - eksbuild.1

What did you do and what did you see instead?

I deployed the AWS Observability Accelerator blueprint example following the instructions. Right after the restart ALL metrics show up OK; then, after about 4 minutes, they stop and exceptions like the following are observed in the ADOT pod (namespace adot-collector-kubeprometheus):
2023-06-02T23:27:24.158Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1685748444141, "target_labels": "{__name__=\"up\", cluster=\"do-eks-tf-dz\", instance=\"10.11.3.29:9100\", job=\"node-exporter\", nodename=\"ip-10-11-3-29.ec2.internal\", region=\"us-east-1\"}"}
2023-06-02T23:27:25.083Z warn scrape/scrape.go:1372 Append failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "kubelet", "target": "https://kubernetes.default.svc.cluster.local:443/api/v1/nodes/ip-10-11-11-126.ec2.internal/proxy/metrics/cadvisor", "error": "invalid sample: non-unique label names: \"cluster\""}
2023-06-02T23:27:25.083Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1685748445038, "target_labels": "{__name__=\"up\", beta_kubernetes_io_arch=\"arm64\", beta_kubernetes_io_instance_type=\"c7g.4xlarge\", beta_kubernetes_io_os=\"linux\", cluster=\"do-eks-tf-dz\", eks_amazonaws_com_capacityType=\"ON_DEMAND\", eks_amazonaws_com_nodegroup=\"cpu-graviton-man\", eks_amazonaws_com_nodegroup_image=\"ami-06326b7ef5c114349\", failure_domain_beta_kubernetes_io_region=\"us-east-1\", failure_domain_beta_kubernetes_io_zone=\"us-east-1a\", instance=\"ip-10-11-11-126.ec2.internal\", job=\"kubelet\", k8s_io_cloud_provider_aws=\"51d0ed1b12453098a108c272e71e962f\", kubernetes_io_arch=\"arm64\", kubernetes_io_hostname=\"ip-10-11-11-126.ec2.internal\", kubernetes_io_os=\"linux\", node_cpu=\"graviton\", node_kubernetes_io_instance_type=\"c7g.4xlarge\", node_role=\"compute\", region=\"us-east-1\", scale_model=\"bert\", topology_kubernetes_io_region=\"us-east-1\", topology_kubernetes_io_zone=

Additional Information

The root cause is likely in the following [source file](https://github.com/aws-observability/terraform-aws-observability-accelerator/blob/main/modules/eks-monitoring/otel-config/templates/opentelemetrycollector.yaml), in the sections for job_name: 'kube-state-metrics' and job_name: 'kubernetes-kubelet', particularly in their relabel_configs subsections; a generic sketch follows.
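
For context, here is a generic sketch of the kind of relabel rule those jobs use to attach a target-level cluster label. This is an illustration only, not the contents of the linked template and not the fix that was eventually merged; the job name, the labelmap rule and the replacement value are placeholders. The point is that when the scraped series (for example the cadvisor metrics in the log above) already carry a cluster label, the combined sample can end up with that label name twice, which Prometheus rejects with the "non-unique label names" error.

```yaml
# Illustrative sketch only: placeholder job and values, not the actual
# opentelemetrycollector.yaml template and not the merged fix.
scrape_configs:
  - job_name: 'kubernetes-kubelet'
    # kubernetes_sd_configs, scheme and tls_config omitted for brevity
    relabel_configs:
      # Map Kubernetes node labels onto the target (a common kubelet-job pattern)
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Attach a static, target-level 'cluster' label
      # ('my-eks-cluster' is a placeholder; the accelerator templates this value)
      - target_label: cluster
        replacement: my-eks-cluster
      # If the scraped series already expose a 'cluster' label, the resulting
      # sample can carry the label name twice and is rejected with:
      #   invalid sample: non-unique label names: "cluster"
```
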
@dzilbermanvmw dzilbermanvmw added the bug Something isn't working label Jun 2, 2023
@dzilbermanvmw dzilbermanvmw changed the title [Bug]: After a short time after (re)start, metrics flow to managed Prometheus stops for most of metrics [Bug]: After a short time after (re)start of ADOT pod, metrics flow to managed Prometheus stops for most of metrics Jun 2, 2023
@bonclay7 bonclay7 self-assigned this Jun 5, 2023
@ktibi
Contributor

ktibi commented Jun 5, 2023

I opened the same issue here: aws-observability/aws-otel-collector#2091

@bonclay7
Member

bonclay7 commented Jun 5, 2023

We will find a workaround for that in a PR shortly, but thanks for opening on the ADOT repo @ktibi

@bonclay7
Member

bonclay7 commented Jun 6, 2023

PR submitted, might need to run this for a few days to confirm @dzilbermanvmw @ktibi

@ktibi
Contributor

ktibi commented Jun 7, 2023

@bonclay7 I have had your fix deployed since this morning. So far all is good.

2023-06-07T11:25:37.708Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 286, "data points": 624}
2023-06-07T11:25:47.127Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 316, "data points": 2181}
2023-06-07T11:25:54.482Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 72, "data points": 2055}
2023-06-07T11:25:59.680Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 336, "data points": 909}
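
The summaries above come from the collector's logging exporter. As a rough sketch of how such an exporter can be wired in to verify metric flow (assuming a standard OpenTelemetry Collector configuration; the prometheus receiver and prometheusremotewrite exporter referenced below are assumptions about the surrounding config, not taken from the accelerator template):

```yaml
# Sketch: enable the logging exporter alongside remote write so the collector
# prints "MetricsExporter ... data points: N" summaries like the lines above.
# The prometheus receiver and prometheusremotewrite exporter are assumed to be
# defined elsewhere in the configuration.
exporters:
  logging: {}
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, prometheusremotewrite]
```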

@bonclay7
Member

bonclay7 commented Jun 7, 2023

Awesome #174 (comment)

@bonclay7
Member

bonclay7 commented Jun 7, 2023

We merged a fix in #174; it will go out in the next release. Please reopen if you observe the same behaviour.

@bonclay7 bonclay7 closed this as completed Jun 7, 2023
@dzilbermanvmw
Author

> @bonclay7 I have had your fix deployed since this morning. So far all is good.
>
> 2023-06-07T11:25:37.708Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 286, "data points": 624}
> 2023-06-07T11:25:47.127Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 316, "data points": 2181}
> 2023-06-07T11:25:54.482Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 72, "data points": 2055}
> 2023-06-07T11:25:59.680Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 336, "data points": 909}

Many thanks @bonclay7 and team for addressing this issue. Happy to confirm it is working just fine, and I'm seeing an uninterrupted stream of metrics all the way from ADOT to AMP and AMG!
