
Tighter coupling of task metadata to associated worker metrics #5288

Open
charlesbluca opened this issue Aug 30, 2021 · 5 comments

@charlesbluca
Member

As part of the work on dask-sql, there has been some demand for machine-readable logs of worker metrics (such as GPU utilization / memory usage) coupled with the tasks those workers are currently running or have recently run, along with additional task metadata such as when each task was scheduled, started, and completed. With this data readily available, it would be easier to diagnose why certain tasks are bottlenecks in a given computation by tracking what was happening on the worker while the task was running.

To give an idea of what might be wanted here, some RAPIDS folks have developed and are currently using dask-metrics for this purpose, which can generate per-worker CSV files containing this information (currently limited to GPU-relevant metrics). @jakirkham also suggested adding something like an "N slowest running tasks" table to the performance reports, although I think we would want the granular data as well.
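
As a rough, untested sketch of the "N slowest tasks" idea, the task stream already carries enough information to compute it client-side; the `startstops` field layout is assumed here:

```python
# Sketch: rank tasks by total time spent computing, using task-stream records.
durations = []
for task in client.get_task_stream():
    compute = [ss for ss in task.get("startstops", ()) if ss["action"] == "compute"]
    if compute:
        durations.append((task["key"], sum(ss["stop"] - ss["start"] for ss in compute)))

slowest = sorted(durations, key=lambda kv: kv[1], reverse=True)[:10]  # 10 slowest tasks
```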

For context, all of this information is readily available through the scheduler, though it currently has to be merged manually:

cluster.scheduler.get_task_stream()  # gives us task metadata along with workers running the tasks
await cluster.scheduler.get_worker_monitor_info()  # gives us timestamped worker metrics 
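
A rough (untested) sketch of that manual merge, joining each task's compute interval to the SystemMonitor samples taken on its worker during that interval; this assumes an async context for the `await`, and the exact shape returned by `get_worker_monitor_info` may differ between versions:

```python
import pandas as pd

tasks = cluster.scheduler.get_task_stream()                   # task metadata + worker
metrics = await cluster.scheduler.get_worker_monitor_info()   # per-worker timestamped samples

rows = []
for task in tasks:
    worker = task["worker"]
    info = metrics[worker]
    samples = info.get("range_query", info)  # assumed shape: metric name -> list of samples
    for ss in task.get("startstops", ()):
        if ss["action"] != "compute":
            continue
        for i, t in enumerate(samples["time"]):
            if ss["start"] <= t <= ss["stop"]:
                rows.append({
                    "key": task["key"],
                    "worker": worker,
                    "time": t,
                    "cpu": samples["cpu"][i],
                    "memory": samples["memory"][i],
                })

df = pd.DataFrame(rows)  # one row per (task, worker-metrics sample) pair
```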

Some options I've considered for this:

  • Adding task metadata to the SystemMonitor or WorkerState metrics; I'm not sure if/how this could be done, but it would make it easier to stream this data somewhere (see the sketch after this list)
  • Adding a scheduler function that merges the existing task/worker metadata and returns it in a machine-readable format (essentially the manual merge shown above, done server-side)
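
To make the first option a little more concrete, here is one possible shape it could take, sketched as a WorkerPlugin; none of this is an existing distributed API beyond the plugin hooks, and `monitor.recent()` is assumed to return the latest SystemMonitor sample:

```python
from distributed.diagnostics.plugin import WorkerPlugin

class TaskMetricsPlugin(WorkerPlugin):
    """Record a worker-metrics sample every time a task changes state (sketch)."""

    def setup(self, worker):
        self.worker = worker
        self.records = []

    def transition(self, key, start, finish, **kwargs):
        sample = self.worker.monitor.recent()  # assumed: latest SystemMonitor sample
        self.records.append({"key": key, "start": start, "finish": finish, **sample})

# registered with: client.register_worker_plugin(TaskMetricsPlugin())
```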

It would be nice to have some discussion about whether this is doable and worthwhile for troubleshooting performance in Distributed.

cc @randerzander

@jakirkham
Member

Thanks for writing this up Charles! 😄

cc @jrbourbeau @mrocklin @quasiben

@mrocklin
Member

Thank you for writing this up. If you all aren't already aware, you might want to take a look at the log_event system and the recent client-side handlers being added in #5217.

The plugin/handler system you're using looks sensible to me, and it's nice that we can build add-ons this easily. I just bring up log_event and client-side handlers in case those serve some of your needs. Those may provide a different, perhaps higher level, base on which to build.
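
For instance, here is a minimal sketch of emitting a structured event from inside a task and reading it back on the client; the topic name and payload are made up, and the `subscribe_topic` call assumes the handler API from #5217:

```python
from distributed import Client, get_worker

def traced_inc(x):
    worker = get_worker()
    # illustrative payload: current task key plus the latest SystemMonitor sample
    worker.log_event("task-metrics", {"key": worker.get_current_task(), **worker.monitor.recent()})
    return x + 1

client = Client()
client.gather(client.map(traced_inc, range(10)))

events = client.get_events("task-metrics")  # pull the logged events back on the client

# or, with the client-side handlers from #5217, react to events as they arrive:
# client.subscribe_topic("task-metrics", lambda event: print(event))
```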

@jrbourbeau
Member

+1 for Matt's suggestion on checking out the structured log event system (here are the relevant docs https://distributed.dask.org/en/latest/logging.html#structured-logs).

> there has been some demand for machine readable logs

If you have thoughts on this topic, I'm sure @fjetter (who's out of the office this week) would welcome any feedback over in #4762

@jrbourbeau
Member

Just checking in here. @charlesbluca did you get a chance to try out the structured logging system?

@charlesbluca
Member Author

Not yet; I've been tied up with dask-sql / sorting work, but I still hope to look into this.
