eks-prow-build-cluster: Monitoring solution #5165

Closed
3 of 4 tasks
Tracked by #5169
xmudrii opened this issue Apr 25, 2023 · 21 comments
Labels
area/infra/aws Issues or PRs related to Kubernetes AWS infrastructure area/infra Infrastructure management, infrastructure design, code in infra/ kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
Milestone

Comments

@xmudrii
Member

xmudrii commented Apr 25, 2023

We have a simple monitoring solution based on Prometheus and Grafana in eks-prow-build-cluster. However, that monitoring stack is not exposed at all and we should look into unifying monitoring for GKE and EKS clusters.

Tasks


/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 25, 2023
@xmudrii
Member Author

xmudrii commented Apr 25, 2023

/milestone v1.28
/sig k8s-infra
/area infra
/area infra/aws
/kind cleanup

@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Apr 25, 2023
@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. area/infra Infrastructure management, infrastructure design, code in infra/ area/infra/aws Issues or PRs related to Kubernetes AWS infrastructure kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 25, 2023
@memetics19

@xmudrii Can I pick this up?

@xmudrii
Member Author

xmudrii commented May 13, 2023

@memetics19 As far as I know @wozniakjan is already working on this.
/assign @wozniakjan

@wozniakjan
Member

wozniakjan commented May 13, 2023

hey, yeah, thanks for assigning me. I haven't found the time yet, but next week I should finally have the capacity to move this forward.

@wozniakjan
Member

wozniakjan commented May 17, 2023

  • Considering deploying node-problem-detector (npd)

draft PR #5291, now under review

@wozniakjan
Member

Expose the monitoring stack (e.g. Grafana) so it can be accessed publicly

I guess exposing it as documented in https://repost.aws/knowledge-center/eks-kubernetes-services-cluster should be sufficient, or do we want to get a dedicated domain for it as well?

@xmudrii
Member Author

xmudrii commented May 17, 2023

@wozniakjan I think we can get a dedicated domain for it.

@wozniakjan
Member

wozniakjan commented May 23, 2023

Expose the monitoring stack (e.g. Grafana) so it can be accessed publicly

draft PR #5316, figuring out how to TLS

TLS figured out by #5320 (thanks @xmudrii!), I guess #5316 is ready for review.

@wozniakjan
Member

  • Integrate the existing GKE monitoring stack with the new monitoring stack

@xmudrii, @ameukam is this the GKE monitoring stack mentioned above? https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-monitoring
I guess it's https://cloud.google.com/monitoring, correct?

I will take a look if there are any tools to aggregate GCP cloud monitoring with the self-hosted Prometheus we use in AWS. Judging from the wording of the task, it's desired to display GCP metrics in the AWS Prometheus, not the other way around.

@ameukam
Member

ameukam commented May 24, 2023


  • Integrate the existing GKE monitoring stack with the new monitoring stack

@xmudrii, @ameukam is this the GKE monitoring stack mentioned above? https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-monitoring I guess it's https://cloud.google.com/monitoring, correct?

I will take a look if there are any tools to aggregate GCP cloud monitoring with the self-hosted Prometheus we use in AWS. Judging from the wording of the task, it's desired to display GCP metrics in the AWS Prometheus, not the other way around.

This is inaccurate. The link you provided is only for specific resources and is not related to the build clusters.

We already aggregate metrics in https://monitoring.prow.k8s.io. You can find the resources for this monitoring stack here: https://github.com/kubernetes/test-infra/tree/master/config/prow/cluster/monitoring

@wozniakjan
Member

We already aggregate metrics in https://monitoring.prow.k8s.io

that is perfect, thank you very much!

@wozniakjan
Member

I browsed the monitoring stack for the GKE prow build clusters, and I think that before connecting both stacks it makes sense to reach parity between the dashboards. I am currently checking how many of the original dashboards make sense for the EKS build cluster in #5324.

Then my idea is to leverage Prometheus's remote-write capability. One of the Prometheus instances would provide the single pane of glass, and the other would expose its metrics for scraping. I considered Prometheus agent mode, but I don't think it matters much in this case: keeping the remote-writing Prometheus able to serve its own metrics (which wouldn't be possible in agent mode) is worth more than the resource savings, especially since we already expose it through its own Grafana.
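For illustration, the sending side of a remote-write setup could look roughly like the sketch below. It only renders a `prometheus.yml` fragment as text. The receiver URL and the `cluster` label value are hypothetical placeholders, not values from this issue. It assumes the receiving Prometheus runs with `--web.enable-remote-write-receiver`, which makes it accept samples on `/api/v1/write`.

```python
def render_remote_write(receiver_url: str, cluster_label: str) -> str:
    """Render a prometheus.yml fragment for the Prometheus that pushes samples.

    Both arguments are illustrative assumptions: `receiver_url` should point
    at the /api/v1/write endpoint of the central Prometheus, and
    `cluster_label` is attached to every pushed series so the two clusters
    can be told apart on the receiving side.
    """
    return (
        "global:\n"
        "  external_labels:\n"
        f"    cluster: {cluster_label}\n"
        "remote_write:\n"
        f"  - url: {receiver_url}\n"
    )


if __name__ == "__main__":
    # Hypothetical endpoint and label, for demonstration only.
    print(render_remote_write(
        "https://prometheus-central.example.com/api/v1/write",
        "eks-prow-build-cluster",
    ))
```

Because the sender is a regular Prometheus rather than one in agent mode, it keeps its local TSDB and can still back its own Grafana.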

@wozniakjan
Member

Increased parity between the GKE and EKS Grafana dashboards is getting merged in #5324.

However, after #5316 merged, a side quest popped up. There is a desire to restrict public access to some dashboards. Namely, https://monitoring-eks.prow.k8s.io/d/node-exporter-full/node-exporter-full is considered to be potentially oversharing. There was also a valid opinion that this dashboard could be very useful and that we shouldn't get rid of it entirely. @pkprzekwas and I had the following idea:

  1. create a public org with a view for only a subset of dashboards that are considered fully public and safe
  2. switch unauthenticated (anonymous) users to that org
  3. keep the too-sensitive boards visible only for the authenticated users in the original grafana org.
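As a rough sketch of step 2, anonymous users can be pointed at a dedicated org through Grafana's `[auth.anonymous]` section (`enabled`, `org_name`, `org_role`). The org name `Public` and the helper function below are assumptions for illustration. The org itself would have to be created in Grafana beforehand and populated with only the dashboards considered safe.

```python
import configparser
import io


def render_anonymous_org_config(org_name: str = "Public") -> str:
    """Render a grafana.ini fragment that maps anonymous users to one org.

    `org_name` is a hypothetical name; the org must already exist in Grafana.
    Anonymous users get the read-only Viewer role in that org.
    """
    config = configparser.ConfigParser()
    config["auth.anonymous"] = {
        "enabled": "true",
        "org_name": org_name,
        "org_role": "Viewer",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_anonymous_org_config())
```

Authenticated users would keep seeing the original org, including the more sensitive dashboards.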

@wozniakjan
Member

There is a desire to restrict some boards for public access. Namely, https://monitoring-eks.prow.k8s.io/d/node-exporter-full/node-exporter-full is considered to be potentially oversharing.

disabling sensitive boards in #5387

@wozniakjan
Member

wozniakjan commented Jun 8, 2023

  • Integrate the existing GKE monitoring stack with the new monitoring stack

@pkprzekwas @xmudrii, I have a much simpler idea for monitoring integration than Prometheus remote-write: how about we just embed Grafana dashboards as iframes?

Given anonymous access is enabled (for readonly) on both, we could just set this on the exporting dashboard:

[security]
allow_embedding = true

and on the importing dashboard (the single pane of glass) set this:

[panels]
disable_sanitize_html = true

the independent panels can then be integrated as iframes, for example:

{
  "type": "text",
  "content": "<iframe src=\"https://monitoring-eks.prow.k8s.io/d-solo/g4Okc0_4k/boskos-server-dashboard?orgId=1&panelId=2\" width=\"450\" height=\"200\" frameborder=\"0\"></iframe>",
  "mode": "html"
}

we wouldn't be able to query across multiple clusters, but it could be an OK first step to just have a common place to see what is going on.
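If many panels were embedded this way, the text-panel JSON could be generated rather than written by hand. A minimal sketch, with hypothetical helper names, that reproduces the example panel above from its parts (the dashboard UID, slug, and panel ID are the ones from that example):

```python
import json
from urllib.parse import urlencode


def solo_panel_url(base_url: str, dashboard_uid: str, slug: str,
                   panel_id: int, org_id: int = 1) -> str:
    """Build a Grafana `d-solo` URL, which renders a single panel on its own."""
    query = urlencode({"orgId": org_id, "panelId": panel_id})
    return f"{base_url.rstrip('/')}/d-solo/{dashboard_uid}/{slug}?{query}"


def iframe_text_panel(src: str, width: int = 450, height: int = 200) -> dict:
    """Return a text panel definition that embeds `src` in an iframe.

    The importing Grafana needs `disable_sanitize_html = true` for the
    iframe to survive, and the exporting one needs `allow_embedding = true`.
    `src` is inserted as-is, so it should come from a trusted source.
    """
    iframe = (
        f'<iframe src="{src}" width="{width}" height="{height}" '
        'frameborder="0"></iframe>'
    )
    return {"type": "text", "content": iframe, "mode": "html"}


if __name__ == "__main__":
    url = solo_panel_url(
        "https://monitoring-eks.prow.k8s.io",
        "g4Okc0_4k",
        "boskos-server-dashboard",
        panel_id=2,
    )
    print(json.dumps(iframe_text_panel(url), indent=2))
```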

@pkprzekwas
Contributor

That's a decent low-hanging fruit. As our Grafana instances are public and read-only, there shouldn't be much difference between interacting with the original dashboards and interacting with ones behind a facade of embedded iframes.

@wozniakjan
Member

kubernetes/test-infra#29920 proposing to allow dashboard embedding, let's see how that goes.

@wozniakjan wozniakjan removed their assignment Aug 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@xmudrii
Member Author

xmudrii commented Jan 28, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2024
@xmudrii
Member Author

xmudrii commented Feb 12, 2024

Most of the tasks are done. We have yet to come up with a single-pane-of-glass monitoring solution, but I'll create a new issue to track that.
/close

@k8s-ci-robot
Contributor

@xmudrii: Closing this issue.

In response to this:

Most of the tasks are done. We have yet to come up with a single-pane-of-glass monitoring solution, but I'll create a new issue to track that.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
