
Prometheus metric for memory/time used per task prefix #7341

Closed
Tracked by #7345
mrocklin opened this issue Nov 21, 2022 · 9 comments · Fixed by #7406
@mrocklin (Member) commented Nov 21, 2022:

We have some metrics around the number of tasks completed over time. I think we could use a couple of others:

  1. Amount of time spent in each prefix. I think we already store this data in all_durations (I raised this during @ntabris's earlier work as probably what we actually wanted, rather than the number of tasks completed).
  2. Amount of memory currently used by each prefix. This is probably not a counter, but instead an instantaneous measurement of current state, i.e. a gauge.

This would be useful for after-the-fact debugging of clusters. @dchudz was curious about this. @fjetter, is this easy for someone on your team to do?
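The two proposed metrics could be sketched as Prometheus-style samples like the following. This is a minimal illustration, not distributed's actual code: the `prefixes` mapping stands in for the scheduler's TaskPrefix state, and the field names (`all_durations`, `nbytes_in_memory`) and metric names are illustrative assumptions.

```python
def prefix_metric_samples(prefixes):
    """Yield (metric_name, labels, value) tuples, one sample per task prefix."""
    for name, stats in prefixes.items():
        # Cumulative time: sum the per-action durations the scheduler
        # already tracks (the thread mentions all_durations). A counter,
        # since it only ever grows.
        total_seconds = sum(stats["all_durations"].values())
        yield ("dask_scheduler_task_prefix_duration_seconds_total",
               {"prefix": name}, total_seconds)
        # Instantaneous memory: a gauge, not a counter, since it moves
        # up and down with the cluster's current state.
        yield ("dask_scheduler_task_prefix_memory_bytes",
               {"prefix": name}, stats["nbytes_in_memory"])

# Example with made-up numbers for a single prefix:
samples = list(prefix_metric_samples({
    "read_parquet": {"all_durations": {"compute": 12.5, "transfer": 1.5},
                     "nbytes_in_memory": 2_000_000},
}))
```

A real exporter would wrap these samples in a Prometheus client library's collector, but the shape of the data is the point here.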

@mrocklin (Member, Author) commented:

The number of tasks currently queued/processing per prefix might also be interesting.

@ntabris (Contributor) commented Nov 21, 2022:

For (1) if we can get that as a cumulative total, that would be great.

For the memory gauge, what's the time resolution at which that's useful data?

@mrocklin (Member, Author) commented:

> For (1) if we can get that as a cumulative total, that would be great.

I think this is already stored in all_durations on the TaskGroup/TaskPrefix objects.

> For the memory gauge, what's the time resolution at which that's useful data?

I think instantaneous is fine. For example, when looking at a cluster that appears to be globally paused (what we saw earlier), I'm curious: which task prefixes are responsible for all of this data? Which task prefixes have yet to run?

@dchudz (Contributor) commented Dec 5, 2022:

@fjetter any sense of where this fits into your next few days/weeks/months?

@fjetter assigned graingert and unassigned graingert Dec 13, 2022
graingert added three commits to graingert/distributed that referenced this issue Dec 14, 2022
@crusaderky (Collaborator) commented:

Do we already have the number of completed tasks per prefix? We need it in order to calculate mean task runtime grouped by prefix.

@mrocklin (Member, Author) commented:

> Do we already have number of completed tasks per prefix?

Yes.

> We need it in order to calculate mean task runtime grouped by prefix

We already record an exponentially weighted moving average of task runtime by prefix. This was the very first motivation for task prefixes, and it's pretty core to the scheduling heuristics. The functionality actually pre-dates the TaskPrefix class itself 🙂
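The exponentially weighted moving average mentioned above has the following general shape. This is a sketch of the technique, not distributed's actual code, and the mixing constant `alpha=0.5` is illustrative, not the value distributed uses.

```python
def ewma(old_avg, new_sample, alpha=0.5):
    """Blend a new task duration into the running average for a prefix.

    alpha controls how quickly old samples are forgotten; alpha=0.5 here
    is an illustrative choice, not distributed's constant.
    """
    if old_avg is None:
        return new_sample  # the first observation seeds the average
    return alpha * new_sample + (1 - alpha) * old_avg

# Feed in three task durations; the average tracks recent runtimes
# without storing the full history.
avg = None
for duration in [2.0, 4.0, 4.0]:
    avg = ewma(avg, duration)
```

Because the update is O(1) per task and keeps no history, it is cheap enough to run inside the scheduler's hot path, which is what makes it usable for scheduling heuristics.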

@crusaderky (Collaborator) commented:

@mrocklin is your answer specific to Prometheus, or are you talking about internal metrics in the scheduler?

@mrocklin (Member, Author) commented:

Internal metrics on the scheduler.

@crusaderky (Collaborator) commented Dec 22, 2022:

We need to export the count of completed tasks per prefix to Prometheus, so that Grafana (or whatever dashboard) can calculate mean runtimes.
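Given cumulative per-prefix counters for both total seconds and completed-task count, the mean runtime over a window falls out as a ratio of deltas between two scrapes, which is what a dashboard's rate-over-rate division computes. A sketch with made-up sample values:

```python
def mean_runtime(seconds_then, count_then, seconds_now, count_now):
    """Mean task runtime over a window, from two scrapes of the
    cumulative counters (total seconds spent, tasks completed)."""
    completed = count_now - count_then
    if completed == 0:
        return None  # no tasks finished in this window
    return (seconds_now - seconds_then) / completed

# Two scrapes of a hypothetical prefix's counters, some interval apart:
m = mean_runtime(seconds_then=100.0, count_then=40,
                 seconds_now=130.0, count_now=55)
```

This is why the completed-task count has to be exported alongside the duration total: neither counter alone yields a mean.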


6 participants