
Tighter coupling of task metadata to associated worker metrics #5288

Open
charlesbluca opened this issue Aug 30, 2021 · 5 comments

@charlesbluca
Member

As part of the work on dask-sql, there has been some demand for machine-readable logs of worker metrics (such as GPU utilization / memory usage) coupled with the tasks those workers are currently running or have recently run, along with additional task metadata such as when each task was scheduled, started, and completed. With this data readily available, it would be easier to diagnose why certain tasks are bottlenecks in a given computation by tracking what was happening on the worker while the task was running.

To give an idea of what might be wanted here, some RAPIDS folks have developed and are currently using dask-metrics for this purpose, which can generate per-worker CSV files containing this information (currently limited to GPU-relevant metrics). @jakirkham also suggested adding something like an "N slowest running tasks" table to the performance reports, although I think we would want the granular data as well.
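
As a rough, untested sketch of the "N slowest tasks" idea, the task stream already carries enough information to compute it client-side; the `startstops` field layout is assumed here:

```python
# Sketch: rank tasks by total time spent computing, using task-stream records.
durations = []
for task in client.get_task_stream():
    compute = [ss for ss in task.get("startstops", ()) if ss["action"] == "compute"]
    if compute:
        durations.append((task["key"], sum(ss["stop"] - ss["start"] for ss in compute)))

slowest = sorted(durations, key=lambda kv: kv[1], reverse=True)[:10]  # 10 slowest tasks
```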

For context, all of this information is readily available through the scheduler, though it currently has to be merged manually:

cluster.scheduler.get_task_stream()  # gives us task metadata along with workers running the tasks
await cluster.scheduler.get_worker_monitor_info()  # gives us timestamped worker metrics 
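
A rough (untested) sketch of that manual merge, joining each task's compute interval to the SystemMonitor samples taken on its worker during that interval; this assumes an async context for the `await`, and the exact shape returned by `get_worker_monitor_info` may differ between versions:

```python
import pandas as pd

tasks = cluster.scheduler.get_task_stream()                   # task metadata + worker
metrics = await cluster.scheduler.get_worker_monitor_info()   # per-worker timestamped samples

rows = []
for task in tasks:
    worker = task["worker"]
    info = metrics[worker]
    samples = info.get("range_query", info)  # assumed shape: metric name -> list of samples
    for ss in task.get("startstops", ()):
        if ss["action"] != "compute":
            continue
        for i, t in enumerate(samples["time"]):
            if ss["start"] <= t <= ss["stop"]:
                rows.append({
                    "key": task["key"],
                    "worker": worker,
                    "time": t,
                    "cpu": samples["cpu"][i],
                    "memory": samples["memory"][i],
                })

df = pd.DataFrame(rows)  # one row per (task, worker-metrics sample) pair
```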

Some options I've considered for this:

  • Adding task metadata to the SystemMonitor or WorkerState metrics; I'm not sure if/how this could be done, but it would make it easier to stream this data somewhere (see the sketch after this list)
  • Adding a scheduler function that merges the existing task/worker metadata and returns it in a machine-readable format (essentially the manual merge shown above, done server-side)
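
To make the first option a little more concrete, here is one possible shape it could take, sketched as a WorkerPlugin; none of this is an existing distributed API beyond the plugin hooks, and `monitor.recent()` is assumed to return the latest SystemMonitor sample:

```python
from distributed.diagnostics.plugin import WorkerPlugin

class TaskMetricsPlugin(WorkerPlugin):
    """Record a worker-metrics sample every time a task changes state (sketch)."""

    def setup(self, worker):
        self.worker = worker
        self.records = []

    def transition(self, key, start, finish, **kwargs):
        sample = self.worker.monitor.recent()  # assumed: latest SystemMonitor sample
        self.records.append({"key": key, "start": start, "finish": finish, **sample})

# registered with: client.register_worker_plugin(TaskMetricsPlugin())
```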

It would be nice to have some discussion about whether this is doable and worthwhile for troubleshooting performance in Distributed.

cc @randerzander

@jakirkham
Member

Thanks for writing this up Charles! 😄

cc @jrbourbeau @mrocklin @quasiben

@mrocklin
Member

Thank you for writing this up. If you all aren't already aware, you might want to take a look at the log_event system and the recent client-side handlers being added in #5217.

The plugin/handler system you're using looks sensible to me, and it's nice that we can build add-ons this easily. I just bring up log_event and client-side handlers in case those serve some of your needs. Those may provide a different, perhaps higher level, base on which to build.
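
For instance, here is a minimal sketch of emitting a structured event from inside a task and reading it back on the client; the topic name and payload are made up, and the `subscribe_topic` call assumes the handler API from #5217:

```python
from distributed import Client, get_worker

def traced_inc(x):
    worker = get_worker()
    # illustrative payload: current task key plus the latest SystemMonitor sample
    worker.log_event("task-metrics", {"key": worker.get_current_task(), **worker.monitor.recent()})
    return x + 1

client = Client()
client.gather(client.map(traced_inc, range(10)))

events = client.get_events("task-metrics")  # pull the logged events back on the client

# or, with the client-side handlers from #5217, react to events as they arrive:
# client.subscribe_topic("task-metrics", lambda event: print(event))
```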

@jrbourbeau
Member

+1 for Matt's suggestion on checking out the structured log event system (here are the relevant docs https://distributed.dask.org/en/latest/logging.html#structured-logs).

> there has been some demand for machine readable logs

If you have thoughts on this topic, I'm sure @fjetter (who's out of the office this week) would welcome any feedback over in #4762

@jrbourbeau
Member

Just checking in here. @charlesbluca did you get a chance to try out the structured logging system?

@charlesbluca
Member Author

Not yet; I've been tied up with dask-sql / sorting work, but I still hope to look into this.
