Dashboard support in Kueue #940

Open
2 of 3 tasks
kerthcet opened this issue Jul 3, 2023 · 34 comments · Fixed by #3727
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kerthcet
Contributor

kerthcet commented Jul 3, 2023

What would you like to be added:

It would be great to have insight into what our queueing system looks like in real time:

  • for administrators, it helps to understand the total resource usage across cluster queues and whether we should group them into a cohort
  • for batch users, it gives an overview of job queueing: how many jobs are pending scheduling and how long jobs have been waiting.

Overall, it's a great enhancement, especially for production environments.

Why is this needed:

It would be a big enhancement and would give great insight into the Kueue system.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Some suggestions:

  • For administrators:
    • clusterQueue resource groups
    • clusterQueue resource utilizations
    • localQueues
    • cohorts
  • For users:
    • jobs with their status
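
As a rough illustration of the user-facing view requested above (how many jobs are pending per queue and how long they have been waiting), here is a minimal Python sketch. The record fields (`name`, `queue`, `status`, `created`), the sample data, and the function name are all hypothetical and are not Kueue's API; a real dashboard would read this information from the cluster.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified workload records used only for illustration.
WORKLOADS = [
    {"name": "train-a", "queue": "team-a", "status": "Pending",
     "created": datetime(2023, 7, 3, 10, 0, tzinfo=timezone.utc)},
    {"name": "train-b", "queue": "team-a", "status": "Admitted",
     "created": datetime(2023, 7, 3, 10, 5, tzinfo=timezone.utc)},
    {"name": "eval-c", "queue": "team-b", "status": "Pending",
     "created": datetime(2023, 7, 3, 10, 20, tzinfo=timezone.utc)},
]


def summarize_pending(workloads, now):
    """Return {queue: {"pending": count, "longest_wait_s": seconds}}."""
    summary = defaultdict(lambda: {"pending": 0, "longest_wait_s": 0.0})
    for w in workloads:
        # Only workloads that are still waiting count towards the summary.
        if w["status"] != "Pending":
            continue
        entry = summary[w["queue"]]
        entry["pending"] += 1
        waited = (now - w["created"]).total_seconds()
        entry["longest_wait_s"] = max(entry["longest_wait_s"], waited)
    return dict(summary)


if __name__ == "__main__":
    now = datetime(2023, 7, 3, 10, 30, tzinfo=timezone.utc)
    for queue, stats in sorted(summarize_pending(WORKLOADS, now).items()):
        print(queue, stats)
```

The same aggregation could back either a web view or a CLI, since it only depends on a list of records.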
@kerthcet kerthcet added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 3, 2023
@alculquicondor
Contributor

Do you have a high level idea of how to get there?

For example, would metrics + grafana yamls be enough for the administrative side?

For end users, Grafana certainly wouldn't be viable. But what could be the MVP that would keep Kueue largely non-opinionated and reusable (so you can integrate it with your own UI, if you already have one)? Could we offer a CLI instead?

@moficodes
Contributor

Kueue already spits out Prometheus metrics. Building a UI based on those could be useful, and the UI should be optional to deploy.

I do wonder if it would be more useful for us to provide a general-purpose Grafana dashboard and make it available at https://grafana.com/grafana/dashboards/
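
To make the metrics-based idea concrete, here is a minimal Python sketch that reads a scrape in the Prometheus text exposition format and sums a pending-workloads gauge per cluster queue. The metric name `kueue_pending_workloads`, its labels, and the sample values are assumptions made for this sketch, and the parser is deliberately simplified (it ignores timestamps and escaped quotes in label values).

```python
import re

# Sample scrape output; the metric name, labels, and values are assumed
# for illustration and were not copied from a running Kueue instance.
SAMPLE = """\
# HELP kueue_pending_workloads Number of pending workloads
# TYPE kueue_pending_workloads gauge
kueue_pending_workloads{cluster_queue="team-a",status="active"} 3
kueue_pending_workloads{cluster_queue="team-a",status="inadmissible"} 1
kueue_pending_workloads{cluster_queue="team-b",status="active"} 0
"""

# name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pending_by_queue(text, metric="kueue_pending_workloads"):
    """Sum the samples of `metric` per cluster_queue label."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip HELP/TYPE comments and blank lines.
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        queue = labels.get("cluster_queue", "")
        totals[queue] = totals.get(queue, 0.0) + float(m.group("value"))
    return totals


if __name__ == "__main__":
    print(pending_by_queue(SAMPLE))
```

In practice a Grafana dashboard would query Prometheus directly; the sketch only shows that the data an administrator view needs can be derived from metrics alone.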

@kerthcet
Contributor Author

Building on metrics is helpful, I think, but the dashboard is more than that. It would display basic information about the system: how many queues there are, what their names are, and how many jobs are in each queue. It could also be interactive. We can get the information via the APIs directly, or we can keep a lightweight database inside as a cache, such as SQLite.

We may need some frontend volunteers if we want to finish this work. As an MVP, IMHO, it should include:

  • Most of the API objects (clusterQueue, localQueue, Job, resourceFlavor, workload) at least, including their relationships
  • Some exported metrics
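
The lightweight-cache option mentioned above could look roughly like the following Python sketch, which keeps workload records in an in-memory SQLite database and answers a "jobs per cluster queue" query. The table layout, column names, and helper functions are hypothetical and chosen only for illustration; how the cache would be kept in sync with the cluster is out of scope here.

```python
import sqlite3


def open_cache():
    """Create an in-memory cache with a single, simplified workloads table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workloads ("
        " name TEXT PRIMARY KEY,"
        " local_queue TEXT NOT NULL,"
        " cluster_queue TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def upsert_workload(conn, name, local_queue, cluster_queue, status):
    """Insert a workload, or overwrite the cached row with the same name."""
    conn.execute(
        "INSERT OR REPLACE INTO workloads VALUES (?, ?, ?, ?)",
        (name, local_queue, cluster_queue, status),
    )


def jobs_per_queue(conn):
    """Return {cluster_queue: number of cached workloads}."""
    rows = conn.execute(
        "SELECT cluster_queue, COUNT(*) FROM workloads"
        " GROUP BY cluster_queue ORDER BY cluster_queue"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = open_cache()
    upsert_workload(conn, "train-a", "lq-a", "cq-shared", "Pending")
    upsert_workload(conn, "train-b", "lq-b", "cq-shared", "Admitted")
    upsert_workload(conn, "eval-c", "lq-c", "cq-batch", "Pending")
    print(jobs_per_queue(conn))
```

Querying the APIs directly avoids this extra state; a cache mainly pays off if the dashboard needs joins or history that would be expensive to recompute per request.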

@ahg-g
Contributor

ahg-g commented Jul 11, 2023

+1000, this is an experience gap that badly needs filling; I would be happy to review proposals.

@kerthcet maybe we could start by looking at similar batch schedulers and see what "screens" they offer, to inform and help seed what we need to build?

@kerthcet
Contributor Author

@kerthcet maybe we could start by looking at similar batch schedulers and see what "screens" they offer, to inform and help seed what we need to build?

Yes, that's a good approach. Let me do some research first and then I'll share it with you all. I know YuniKorn has a dashboard. Also cc @BinL233

@zeddit

zeddit commented Nov 3, 2023

@ahg-g
It's a core feature that will make Kueue easy to use.
An alternative would be Airflow, whose UI looks like this:
image
It contains task status, code, and audit logs, which are useful information for users inspecting their jobs.
Besides, Airflow integrates with IdPs such as LDAP or OIDC and provides access control and permission management features.

However, Airflow doesn't provide an API to submit one-time jobs such as ML training jobs, which are the core use case for Kueue.

@kerthcet
Contributor Author

kerthcet commented Nov 6, 2023

cc @B1F030 we also did some research on the popular queueing systems. I think we can provide a summary of the essential elements of a dashboard, or even a prototype. Can you help with this, @B1F030?

@samzong

samzong commented Nov 7, 2023

Hi folks, I'd like to get involved in the prototyping part of the dashboard design and provide a prototype in a tool like Figma.

@kerthcet
Contributor Author

kerthcet commented Nov 7, 2023

Thanks @samzong
We can provide a basic design, share it with the community for feedback, and then start development. Any concerns, @alculquicondor?

@alculquicondor
Contributor

Maybe we can start with a list of views you would like to have and do priority sorting

@kerthcet
Contributor Author

kerthcet commented Nov 8, 2023

Maybe we can start with a list of views you would like to have and do priority sorting

@B1F030 is doing this.

@Sharpz7

Sharpz7 commented Dec 11, 2023

Hey folks, https://github.com/armadaproject/armada has a UI in the form of lookout.

Our demo UI is here: https://ui.demo.armadaproject.io/

Let us know what you think of it - I think many parts of it could be suitable for lookout and we would be interested in contributing.

Thanks!

@kerthcet
Contributor Author

Thanks @Sharpz7, that's helpful, and we have a general idea now. @samzong is doing the prototyping; once it's done, we'll share a Google Doc/Figma with you all. Hope to work together.

@Sharpz7

Sharpz7 commented Dec 11, 2023

Great, Thank you :))

@alculquicondor
Contributor

@Sharpz7 what is a lookout in this context?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2024
@tenzen-y
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2024
@kerthcet
Contributor Author

Are you still following this @samzong ?

@samzong

samzong commented Jun 12, 2024

@kerthcet

Absolutely, I'm still very interested in this. The good news is that I'll have more time to contribute to open source in the coming period. I'll make sure to push this forward as soon as possible.

@trashadewan

Hey,
there is a Grafana dashboard shown in https://www.youtube.com/watch?v=B63vT2_UYE4. Do you know if it is available for use somewhere?

@alculquicondor
Contributor

@alizaidis
Contributor

Yep that's the one!

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 17, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 16, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@akram
Contributor

akram commented Nov 18, 2024

Hi everyone,

As a side project, just to run a demo, I have contributed kueue-viz as a Kueue dashboard.
The project is available here: https://github.com/akram/kueue-viz

It is still very basic, but all contributions and feedback are welcome.

image

@kannon92
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Nov 18, 2024
@k8s-ci-robot
Contributor

@kannon92: Reopened this issue.

In response to this:

/reopen


@kannon92
Contributor

@mwielgus has mentioned that this is still of interest to the Kueue project.

@kerthcet
Contributor Author

Thanks @akram for the work!

@mimowo
Contributor

mimowo commented Dec 5, 2024

/reopen
let's use this issue to continue with the next steps of productizing the kueue-viz sub-project. Here are some steps:

  • I think it would be good to publish the image
  • move it out of the cmd/experimental
  • align the release process with the main Kueue
    (possibly more)
  • reference the project from the main Kueue documentation page (https://kueue.sigs.k8s.io/); for starters we can do something similar to the kueuectl plugin, but later it deserves more advertisement, I think
  • e2e tests for the dashboard backend and frontend

I'm also open to tracking the improvements as dedicated issues; I'm just listing them here as a starting point.

@k8s-ci-robot
Contributor

@mimowo: Reopened this issue.

In response to this:

/reopen
let's use this issue to continue with the next steps of productizing the kueue-viz sub-project. I post some steps in the comment


@k8s-ci-robot k8s-ci-robot reopened this Dec 5, 2024
@mimowo
Contributor

mimowo commented Dec 5, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 5, 2024