eks-prow-build-cluster: Monitoring solution #5165

xmudrii · 2023-04-25T15:23:45Z

We have a simple monitoring solution based on Prometheus and Grafana in eks-prow-build-cluster. However, that monitoring stack is not exposed at all and we should look into unifying monitoring for GKE and EKS clusters.

Tasks

Initial monitoring stack for eks-prow-build-cluster
Considering deploying node-problem-detector (npd)
Expose the monitoring stack (e.g. Grafana) so it can be accessed publicly
Integrate the existing GKE monitoring stack with the new monitoring stack
/priority important-longterm

xmudrii · 2023-04-25T15:50:47Z

/milestone v1.28
/sig k8s-infra
/area infra
/area infra/aws
/kind cleanup

memetics19 · 2023-05-12T22:21:03Z

@xmudrii Can I pick this up?

xmudrii · 2023-05-13T12:36:21Z

@memetics19 As far as I know @wozniakjan is already working on this.
/assign @wozniakjan

wozniakjan · 2023-05-13T13:51:14Z

Hey, yeah, thanks for assigning me. I haven't found the time yet, but next week I should finally have the capacity to move this forward.

wozniakjan · 2023-05-17T09:20:17Z

Considering deploying node-problem-detector (npd)

~~draft PR under work~~ under review #5291

wozniakjan · 2023-05-17T10:47:22Z

Expose the monitoring stack (e.g. Grafana) so it can be accessed publicly

I guess exposing it as documented in https://repost.aws/knowledge-center/eks-kubernetes-services-cluster should be sufficient, or do we want to get a dedicated domain for it as well?

xmudrii · 2023-05-17T12:49:17Z

@wozniakjan I think we can get a dedicated domain for it.

wozniakjan · 2023-05-23T15:22:18Z

Expose the monitoring stack (e.g. Grafana) so it can be accessed publicly

~~draft PR #5316, figuring out how to TLS~~

TLS figured out by #5320 (thanks @xmudrii!), I guess #5316 is ready for review.

wozniakjan · 2023-05-24T12:05:18Z

Integrate the existing GKE monitoring stack with the new monitoring stack

@xmudrii, @ameukam is this the GKE monitoring stack mentioned above? https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-monitoring
I guess it's https://cloud.google.com/monitoring, correct?

I will take a look if there are any tools to aggregate GCP cloud monitoring with the self-hosted Prometheus we use in AWS. Judging from the wording of the task, it's desired to display GCP metrics in the AWS Prometheus, not the other way around.

ameukam · 2023-05-24T13:20:27Z

Integrate the existing GKE monitoring stack with the new monitoring stack

@xmudrii, @ameukam is this the GKE monitoring stack mentioned above? https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-monitoring I guess it's https://cloud.google.com/monitoring, correct?

I will take a look if there are any tools to aggregate GCP cloud monitoring with the self-hosted Prometheus we use in AWS. Judging from the wording of the task, it's desired to display GCP metrics in the AWS Prometheus, not the other way around.

This is inaccurate. The link you provided is only for specific resources and is not related to the build clusters.

We already aggregate metrics in https://monitoring.prow.k8s.io. You can find resources for this monitoring stack here: https://github.com/kubernetes/test-infra/tree/master/config/prow/cluster/monitoring

wozniakjan · 2023-05-24T13:35:59Z

We already aggregate metrics in https://monitoring.prow.k8s.io

that is perfect, thank you very much!

wozniakjan · 2023-05-26T14:50:10Z

I browsed the monitoring stack for the GKE Prow build clusters, and I think that before connecting both stacks it makes sense to reach parity between the dashboards. I am currently checking how many of the original dashboards make sense for the EKS build cluster in #5324.

Then my idea is to leverage Prometheus remote-write capability. One of the Prometheus instances would provide a single pane of glass and the other would expose its metrics for scraping. I considered Prometheus agent mode, but I don't think it matters much in this case: keeping the remote-write Prometheus capable of serving its own metrics as well (which wouldn't be possible in agent mode) is worth more than the resource optimization, especially since we are already exposing it through its own Grafana.
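
As a rough illustration of the remote-write idea above, the sending Prometheus would need a `remote_write` entry pointing at the aggregating instance. The sketch below only builds that config fragment. The receiver URL is a made-up placeholder, and it assumes the aggregating Prometheus accepts writes on its standard `/api/v1/write` endpoint (which requires starting it with `--web.enable-remote-write-receiver`). The fragment is printed as JSON, which YAML parsers generally accept.

```python
import json


def remote_write_config(receiver_base_url):
    """Build a minimal Prometheus config fragment for remote-write.

    receiver_base_url is the base URL of the aggregating Prometheus
    (hypothetical here); samples are pushed to its /api/v1/write endpoint.
    """
    url = receiver_base_url.rstrip("/") + "/api/v1/write"
    return {"remote_write": [{"url": url}]}


# Placeholder host, not a real endpoint from this issue.
config = remote_write_config("https://prometheus.example.invalid")
print(json.dumps(config, indent=2))
```

This covers only the sending side; authentication, relabeling, and queue tuning are left out.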

wozniakjan · 2023-06-07T13:47:53Z

Increased parity between the GKE and EKS Grafana instances is getting merged in #5324.

However, after #5316 merged, a side quest popped up. There is a desire to restrict some boards for public access. Namely, https://monitoring-eks.prow.k8s.io/d/node-exporter-full/node-exporter-full is considered to be potentially oversharing. There was also a valid opinion that this dashboard could be very useful and we shouldn't get rid of it entirely. @pkprzekwas and I had the following idea:

create a public org with a view for only a subset of dashboards that are considered fully public and safe
switch unauthenticated (anonymous) users to that org
keep the too-sensitive boards visible only for the authenticated users in the original grafana org.
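
A minimal sketch of what the anonymous-org part of this plan could look like in Grafana's configuration, assuming the standard `[auth.anonymous]` settings (`enabled`, `org_name`, `org_role`). The org name `Public` is a hypothetical placeholder, and creating the org and copying the safe dashboards into it would still have to happen separately.

```python
import configparser
import io


def anonymous_org_config(org_name, role="Viewer"):
    """Render a grafana.ini fragment mapping anonymous users to one org."""
    cfg = configparser.ConfigParser()
    cfg["auth.anonymous"] = {
        "enabled": "true",
        "org_name": org_name,  # org that unauthenticated visitors land in
        "org_role": role,      # read-only role for anonymous users
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# "Public" is an assumed name for the org holding the safe dashboards.
fragment = anonymous_org_config("Public")
print(fragment)
```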

wozniakjan · 2023-06-08T08:14:20Z

There is a desire to restrict some boards for public access. Namely, https://monitoring-eks.prow.k8s.io/d/node-exporter-full/node-exporter-full is considered to be potentially oversharing.

disabling sensitive boards in #5387

wozniakjan · 2023-06-08T14:00:08Z

Integrate the existing GKE monitoring stack with the new monitoring stack

@pkprzekwas @xmudrii, I have a much simpler idea for monitoring integration than Prometheus remote-write. How about we just embed Grafana dashboards as iframes?

Given that anonymous (read-only) access is enabled on both, we could just set this on the exporting Grafana instance:

[security]
allow_embedding = true

and on the importing Grafana instance (the single pane of glass) set this:

[panels]
disable_sanitize_html = true

The independent panels can then be integrated as iframes, for example:

{
  "type": "text",
  "content": "<iframe src=\"https://monitoring-eks.prow.k8s.io/d-solo/g4Okc0_4k/boskos-server-dashboard?orgId=1&panelId=2\" width=\"450\" height=\"200\" frameborder=\"0\"></iframe>",
  "mode": "html"
}

We wouldn't be able to query across multiple clusters, but it could be an OK first step to just have a common place to see what is going on.
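
If many panels were embedded this way, the text-panel JSON could be generated rather than written by hand. The helper below is a hypothetical sketch, not part of any Grafana tooling. It reproduces the panel shown above from the dashboard UID, slug, and panel ID taken from the example URL.

```python
import json
from urllib.parse import urlencode


def iframe_text_panel(base_url, dashboard_uid, slug, panel_id,
                      org_id=1, width=450, height=200):
    """Return a Grafana text-panel dict that embeds one remote panel."""
    query = urlencode({"orgId": org_id, "panelId": panel_id})
    # /d-solo/ renders a single panel without the surrounding dashboard UI.
    src = f"{base_url}/d-solo/{dashboard_uid}/{slug}?{query}"
    iframe = (
        f'<iframe src="{src}" width="{width}" height="{height}" '
        'frameborder="0"></iframe>'
    )
    return {"type": "text", "content": iframe, "mode": "html"}


panel = iframe_text_panel(
    "https://monitoring-eks.prow.k8s.io",
    "g4Okc0_4k",
    "boskos-server-dashboard",
    panel_id=2,
)
print(json.dumps(panel, indent=2))
```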

pkprzekwas · 2023-06-09T13:29:12Z

That's a decent low-hanging fruit. As our Grafana instances are public and read-only, there shouldn't be much difference between interacting with the original ones and with a facade of embedded iframes.

wozniakjan · 2023-06-23T09:27:03Z

kubernetes/test-infra#29920 proposing to allow dashboard embedding, let's see how that goes.

k8s-triage-robot · 2024-01-26T19:35:55Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

xmudrii · 2024-01-28T20:22:59Z

/remove-lifecycle stale

xmudrii · 2024-02-12T15:32:38Z

Most of the tasks are done. We have yet to come up with a single pane of glass monitoring solution, but I'll create a new issue to track that.
/close

k8s-ci-robot · 2024-02-12T15:32:44Z

@xmudrii: Closing this issue.

In response to this:

Most of the tasks are done. We have yet to come up with a single pane of glass monitoring solution, but I'll create a new issue to track that.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the priority/important-longterm label Apr 25, 2023

xmudrii mentioned this issue Apr 25, 2023

eks-prow-bulild-cluster improvements and enhancements #5169

Closed

k8s-ci-robot added this to the v1.28 milestone Apr 25, 2023

k8s-ci-robot assigned wozniakjan May 13, 2023

wozniakjan mentioned this issue May 17, 2023

eks-prow-build-cluster: install npd #5291

Merged

wozniakjan mentioned this issue May 23, 2023

eks-prow-build-cluster: expose grafana as lb svc #5316

Merged

wozniakjan mentioned this issue Jun 23, 2023

grafana: allow dashboard embedding kubernetes/test-infra#29920

Closed

wozniakjan removed their assignment Aug 22, 2023

k8s-ci-robot added the lifecycle/stale label Jan 26, 2024

k8s-ci-robot removed the lifecycle/stale label Jan 28, 2024

k8s-ci-robot closed this as completed Feb 12, 2024
