
Prometheus metric for memory/time used per task prefix #7341

Closed
Tracked by #7345
mrocklin opened this issue Nov 21, 2022 · 9 comments · Fixed by #7406
@mrocklin (Member) commented Nov 21, 2022:

We have some metrics around the number of tasks completed over time. I think we could use a couple of others:

  1. Amount of time spent in each prefix. I think we already store this data in all_durations (I raised this during @ntabris's earlier work as probably what we actually wanted, rather than the number of tasks completed).
  2. Amount of memory currently used by each prefix. This is probably not a counter, but instead an instantaneous measurement of current state, i.e. a gauge.

This would be useful for after-the-fact debugging of clusters. @dchudz was curious about this. @fjetter, is this easy for someone on your team to do?
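The two proposed metrics could be sketched as Prometheus-style samples like the following. This is a minimal illustration, not distributed's actual code: the `prefixes` mapping stands in for the scheduler's TaskPrefix state, and the field names (`all_durations`, `nbytes_in_memory`) and metric names are illustrative assumptions.

```python
def prefix_metric_samples(prefixes):
    """Yield (metric_name, labels, value) tuples, one sample per task prefix."""
    for name, stats in prefixes.items():
        # Cumulative time: sum the per-action durations the scheduler
        # already tracks (the thread mentions all_durations). A counter,
        # since it only ever grows.
        total_seconds = sum(stats["all_durations"].values())
        yield ("dask_scheduler_task_prefix_duration_seconds_total",
               {"prefix": name}, total_seconds)
        # Instantaneous memory: a gauge, not a counter, since it moves
        # up and down with the cluster's current state.
        yield ("dask_scheduler_task_prefix_memory_bytes",
               {"prefix": name}, stats["nbytes_in_memory"])

# Example with made-up numbers for a single prefix:
samples = list(prefix_metric_samples({
    "read_parquet": {"all_durations": {"compute": 12.5, "transfer": 1.5},
                     "nbytes_in_memory": 2_000_000},
}))
```

A real exporter would wrap these samples in a Prometheus client library's collector, but the shape of the data is the point here.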

@mrocklin (Member, Author) commented:

The number of tasks currently queued/processing per prefix might also be interesting.

@ntabris (Contributor) commented Nov 21, 2022:

For (1) if we can get that as a cumulative total, that would be great.

For the memory gauge, what's the time resolution at which that's useful data?

@mrocklin (Member, Author) commented:

> For (1) if we can get that as a cumulative total, that would be great.

I think this is already stored in all_durations on the TaskGroup/TaskPrefix objects.

> For the memory gauge, what's the time resolution at which that's useful data?

I think instantaneous is fine. For example, when looking at a cluster that appears to be globally paused (what we saw earlier), I'm curious: which task prefixes are responsible for all of this data? Which task prefixes have yet to run?

@dchudz (Contributor) commented Dec 5, 2022:

@fjetter any sense of where this fits into your next few days/weeks/months?

@fjetter assigned graingert and unassigned graingert Dec 13, 2022
graingert added three commits to graingert/distributed that referenced this issue Dec 14, 2022
@crusaderky (Collaborator) commented:

Do we already have the number of completed tasks per prefix? We need it in order to calculate mean task runtime grouped by prefix.

@mrocklin (Member, Author) commented:

> Do we already have number of completed tasks per prefix?

Yes.

> We need it in order to calculate mean task runtime grouped by prefix

We already record an exponentially weighted moving average of task runtime by prefix. This was the very first motivation for task prefixes, and it's pretty core to the scheduling heuristics. The functionality actually pre-dates the TaskPrefix class itself 🙂
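The exponentially weighted moving average mentioned above has the following general shape. This is a sketch of the technique, not distributed's actual code, and the mixing constant `alpha=0.5` is illustrative, not the value distributed uses.

```python
def ewma(old_avg, new_sample, alpha=0.5):
    """Blend a new task duration into the running average for a prefix.

    alpha controls how quickly old samples are forgotten; alpha=0.5 here
    is an illustrative choice, not distributed's constant.
    """
    if old_avg is None:
        return new_sample  # the first observation seeds the average
    return alpha * new_sample + (1 - alpha) * old_avg

# Feed in three task durations; the average tracks recent runtimes
# without storing the full history.
avg = None
for duration in [2.0, 4.0, 4.0]:
    avg = ewma(avg, duration)
```

Because the update is O(1) per task and keeps no history, it is cheap enough to run inside the scheduler's hot path, which is what makes it usable for scheduling heuristics.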

@crusaderky (Collaborator) commented:

@mrocklin is your answer specific to Prometheus, or are you talking about internal metrics in the scheduler?

@mrocklin (Member, Author) commented:

Internal metrics on the scheduler.

@crusaderky (Collaborator) commented Dec 22, 2022:

We need to export the count of completed tasks per prefix to Prometheus, so that Grafana (or whatever dashboard) can calculate mean runtimes.
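Given cumulative per-prefix counters for both total seconds and completed-task count, the mean runtime over a window falls out as a ratio of deltas between two scrapes, which is what a dashboard's rate-over-rate division computes. A sketch with made-up sample values:

```python
def mean_runtime(seconds_then, count_then, seconds_now, count_now):
    """Mean task runtime over a window, from two scrapes of the
    cumulative counters (total seconds spent, tasks completed)."""
    completed = count_now - count_then
    if completed == 0:
        return None  # no tasks finished in this window
    return (seconds_now - seconds_then) / completed

# Two scrapes of a hypothetical prefix's counters, some interval apart:
m = mean_runtime(seconds_then=100.0, count_then=40,
                 seconds_now=130.0, count_now=55)
```

This is why the completed-task count has to be exported alongside the duration total: neither counter alone yields a mean.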


6 participants