
[Bug]: Shortly after a (re)start of the ADOT pod, the flow of metrics to managed Prometheus stops for most metrics #171

Closed
1 task done
dzilbermanvmw opened this issue Jun 2, 2023 · 8 comments
Labels: bug (Something isn't working)

Comments

@dzilbermanvmw

Welcome to the AWS Observability Accelerator

  • Yes, I've searched similar issues on GitHub and didn't find any.

AWS Observability Accelerator Release version

2.4.0

What is your environment, configuration and the example used?

EKS API v1.24.13-eks-0a21954
Managed Grafana: version 9.4
Managed Prometheus:
ADOT: v0.74.0 - eksbuild.1

What did you do and what did you see instead?

I deployed the AWS Observability Accelerator blueprint example following the instructions. Right after the restart ALL metrics show up OK; then, after about 4 minutes, they stop and exceptions like the following are observed in the ADOT pod (namespace adot-collector-kubeprometheus):
2023-06-02T23:27:24.158Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1685748444141, "target_labels": "{__name__=\"up\", cluster=\"do-eks-tf-dz\", instance=\"10.11.3.29:9100\", job=\"node-exporter\", nodename=\"ip-10-11-3-29.ec2.internal\", region=\"us-east-1\"}"}
2023-06-02T23:27:25.083Z warn scrape/scrape.go:1372 Append failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "kubelet", "target": "https://kubernetes.default.svc.cluster.local:443/api/v1/nodes/ip-10-11-11-126.ec2.internal/proxy/metrics/cadvisor", "error": "invalid sample: non-unique label names: \"cluster\""}
2023-06-02T23:27:25.083Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1685748445038, "target_labels": "{__name__=\"up\", beta_kubernetes_io_arch=\"arm64\", beta_kubernetes_io_instance_type=\"c7g.4xlarge\", beta_kubernetes_io_os=\"linux\", cluster=\"do-eks-tf-dz\", eks_amazonaws_com_capacityType=\"ON_DEMAND\", eks_amazonaws_com_nodegroup=\"cpu-graviton-man\", eks_amazonaws_com_nodegroup_image=\"ami-06326b7ef5c114349\", failure_domain_beta_kubernetes_io_region=\"us-east-1\", failure_domain_beta_kubernetes_io_zone=\"us-east-1a\", instance=\"ip-10-11-11-126.ec2.internal\", job=\"kubelet\", k8s_io_cloud_provider_aws=\"51d0ed1b12453098a108c272e71e962f\", kubernetes_io_arch=\"arm64\", kubernetes_io_hostname=\"ip-10-11-11-126.ec2.internal\", kubernetes_io_os=\"linux\", node_cpu=\"graviton\", node_kubernetes_io_instance_type=\"c7g.4xlarge\", node_role=\"compute\", region=\"us-east-1\", scale_model=\"bert\", topology_kubernetes_io_region=\"us-east-1\", topology_kubernetes_io_zone=

Additional Information

The root cause is likely in the following [source file](https://github.com/aws-observability/terraform-aws-observability-accelerator/blob/main/modules/eks-monitoring/otel-config/templates/opentelemetrycollector.yaml), in the sections for job_name: 'kube-state-metrics' and job_name: 'kubernetes-kubelet', particularly in their relabel_configs subsections; a generic sketch follows.
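
For context, here is a generic sketch of the kind of relabel rule those jobs use to attach a target-level cluster label. This is an illustration only, not the contents of the linked template and not the fix that was eventually merged; the job name, the labelmap rule and the replacement value are placeholders. The point is that when the scraped series (for example the cadvisor metrics in the log above) already carry a cluster label, the combined sample can end up with that label name twice, which Prometheus rejects with the "non-unique label names" error.

```yaml
# Illustrative sketch only: placeholder job and values, not the actual
# opentelemetrycollector.yaml template and not the merged fix.
scrape_configs:
  - job_name: 'kubernetes-kubelet'
    # kubernetes_sd_configs, scheme and tls_config omitted for brevity
    relabel_configs:
      # Map Kubernetes node labels onto the target (a common kubelet-job pattern)
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Attach a static, target-level 'cluster' label
      # ('my-eks-cluster' is a placeholder; the accelerator templates this value)
      - target_label: cluster
        replacement: my-eks-cluster
      # If the scraped series already expose a 'cluster' label, the resulting
      # sample can carry the label name twice and is rejected with:
      #   invalid sample: non-unique label names: "cluster"
```
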
@dzilbermanvmw dzilbermanvmw added the bug Something isn't working label Jun 2, 2023
@dzilbermanvmw dzilbermanvmw changed the title [Bug]: After a short time after (re)start, metrics flow to managed Prometheus stops for most of metrics [Bug]: After a short time after (re)start of ADOT pod, metrics flow to managed Prometheus stops for most of metrics Jun 2, 2023
@bonclay7 bonclay7 self-assigned this Jun 5, 2023
@ktibi
Contributor

ktibi commented Jun 5, 2023

I opened the same issue here: aws-observability/aws-otel-collector#2091

@bonclay7
Member

bonclay7 commented Jun 5, 2023

We will find a workaround for that in a PR shortly, but thanks for opening on the ADOT repo @ktibi

@bonclay7
Member

bonclay7 commented Jun 6, 2023

PR submitted, might need to run this for a few days to confirm @dzilbermanvmw @ktibi

@ktibi
Contributor

ktibi commented Jun 7, 2023

@bonclay7 I have had your fix deployed since this morning. So far all is good.

2023-06-07T11:25:37.708Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 286, "data points": 624}
2023-06-07T11:25:47.127Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 316, "data points": 2181}
2023-06-07T11:25:54.482Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 72, "data points": 2055}
2023-06-07T11:25:59.680Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 336, "data points": 909}
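
The summaries above come from the collector's logging exporter. As a rough sketch of how such an exporter can be wired in to verify metric flow (assuming a standard OpenTelemetry Collector configuration; the prometheus receiver and prometheusremotewrite exporter referenced below are assumptions about the surrounding config, not taken from the accelerator template):

```yaml
# Sketch: enable the logging exporter alongside remote write so the collector
# prints "MetricsExporter ... data points: N" summaries like the lines above.
# The prometheus receiver and prometheusremotewrite exporter are assumed to be
# defined elsewhere in the configuration.
exporters:
  logging: {}
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, prometheusremotewrite]
```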

@bonclay7
Member

bonclay7 commented Jun 7, 2023

Awesome #174 (comment)

@bonclay7
Member

bonclay7 commented Jun 7, 2023

We merged a fix in #174; it will go out in the next release. Please reopen if you observe the same behaviour.

@bonclay7 bonclay7 closed this as completed Jun 7, 2023
@dzilbermanvmw
Author

> @bonclay7 I have had your fix deployed since this morning. So far all is good.
>
> 2023-06-07T11:25:37.708Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 286, "data points": 624}
> 2023-06-07T11:25:47.127Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 3, "metrics": 316, "data points": 2181}
> 2023-06-07T11:25:54.482Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 72, "data points": 2055}
> 2023-06-07T11:25:59.680Z	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 336, "data points": 909}

Many thanks @bonclay7 and team for addressing this issue. Happy to confirm it is working just fine, and I'm seeing an uninterrupted stream of metrics all the way from ADOT to AMP and AMG!
