Prometheus metric for memory/time used per task prefix #7341
We have some metrics around the number of tasks completed over time. I think that we could use a couple of others:

1. Time used per task prefix, as a cumulative total
2. Memory used per task prefix, as an instantaneous gauge

This would be useful for after-the-fact debugging of clusters. @dchudz was curious about this. @fjetter is this easy for someone on your team to do?
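A minimal sketch of how such per-prefix metrics could be exposed through a `prometheus_client` custom collector. The metric names and the per-prefix attributes read below (`total_compute_seconds`, `n_completed`, `nbytes_in_memory`) are hypothetical stand-ins, not distributed's actual API:

```python
# A hedged sketch, not distributed's actual implementation. Metric names
# and the per-prefix attributes are hypothetical.
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

class PrefixMetricsCollector:
    """Export per-task-prefix metrics from a scheduler-like object."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def collect(self):
        # (1) cumulative compute time per prefix: a counter, so dashboards
        # can take rates over it.
        compute_time = CounterMetricFamily(
            "dask_scheduler_prefix_compute_seconds_total",  # hypothetical name
            "Cumulative task runtime, grouped by task prefix",
            labels=["prefix"],
        )
        # Completed-task count per prefix, needed to derive mean runtimes.
        completed = CounterMetricFamily(
            "dask_scheduler_prefix_tasks_completed_total",  # hypothetical name
            "Number of completed tasks, grouped by task prefix",
            labels=["prefix"],
        )
        # (2) memory held per prefix: an instantaneous gauge.
        memory = GaugeMetricFamily(
            "dask_scheduler_prefix_memory_bytes",  # hypothetical name
            "Bytes of managed memory held, grouped by task prefix",
            labels=["prefix"],
        )
        for name, prefix in self.scheduler.task_prefixes.items():
            # The attributes below stand in for whatever the scheduler
            # actually tracks per prefix.
            compute_time.add_metric([name], prefix.total_compute_seconds)
            completed.add_metric([name], prefix.n_completed)
            memory.add_metric([name], prefix.nbytes_in_memory)
        yield compute_time
        yield completed
        yield memory

# Registration would look something like:
# from prometheus_client.core import REGISTRY
# REGISTRY.register(PrefixMetricsCollector(scheduler))
```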
Counts of currently queued/processing tasks per prefix might also be interesting.

For (1), if we can get that as a cumulative total, that would be great. For the memory gauge, what's the time resolution at which that's useful data?
I think that this is already stored in
I think instantaneous is fine. For example, when looking at a cluster that appears to be globally paused (what we saw earlier), I'm curious: which task prefixes are responsible for all of this data? Which task prefixes have yet to run?

@fjetter any sense of where this fits into your next few days/weeks/months?

Do we already have the number of completed tasks per prefix? We need it in order to calculate mean task runtime grouped by prefix.
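For the queued/processing idea above, a per-prefix, per-state gauge would answer "which prefixes have yet to run". Again a hedged sketch: the metric name and the `state_counts` mapping are hypothetical, not the scheduler's real bookkeeping:

```python
# Hypothetical sketch: tasks per (prefix, state), as an instantaneous gauge.
from prometheus_client.core import GaugeMetricFamily

def collect_prefix_states(scheduler):
    gauge = GaugeMetricFamily(
        "dask_scheduler_prefix_tasks",  # hypothetical name
        "Number of tasks per prefix and state",
        labels=["prefix", "state"],
    )
    for name, prefix in scheduler.task_prefixes.items():
        # Assumes each prefix tracks a mapping like {"queued": 10, ...};
        # the real bookkeeping may differ.
        for state, count in prefix.state_counts.items():
            gauge.add_metric([name, state], count)
    yield gauge
```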
Yes. We already record an exponentially weighted moving average of task runtime by prefix. This was the very first motivation for task prefixes, and it's pretty core to the scheduling heuristics. The functionality actually pre-dates the TaskPrefix class itself 🙂
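For reference, an exponentially weighted moving average needs only a constant-time update per completed task. A generic sketch; `alpha=0.5` is illustrative, not the scheduler's actual smoothing factor:

```python
def ewma(old_average, new_sample, alpha=0.5):
    """One EWMA update step; alpha=0.5 is illustrative only."""
    return alpha * new_sample + (1 - alpha) * old_average

# Each time a task in a prefix completes:
#   prefix_duration_average = ewma(prefix_duration_average, measured_duration)
```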
@mrocklin is your answer specific to Prometheus, or are you talking about internal metrics in the scheduler?

Internal metrics on the scheduler.

We need to export the count of completed tasks per prefix to Prometheus, so that grafana/whatever can calculate mean runtimes.
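Assuming per-prefix counters like those sketched earlier exist, the Grafana side is just a ratio of rates. Hypothetical PromQL, using the made-up metric names from the sketch above:

```python
# Hypothetical PromQL, using the metric names from the earlier sketch.
# Mean task runtime per prefix over the last 5 minutes:
MEAN_RUNTIME_PER_PREFIX = """
  rate(dask_scheduler_prefix_compute_seconds_total[5m])
/ rate(dask_scheduler_prefix_tasks_completed_total[5m])
"""
```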