Tighter coupling of task metadata to associated worker metrics #5288
Comments
Thanks for writing this up Charles! 😄
Thank you for writing this up. If you all aren't already aware, you might want to take a look at the log_event system and the recent client-side handlers that are being added in #5217. The plugin/handler system you're using looks sensible to me, and it's nice that we can build add-ons this easily. I just bring up log_event and client-side handlers in case those serve some of your needs. Those may provide a different, perhaps higher-level, base on which to build.
+1 for Matt's suggestion on checking out the structured log event system (here are the relevant docs https://distributed.dask.org/en/latest/logging.html#structured-logs).
If you have thoughts on this topic, I'm sure @fjetter (who's out of the office this week) would welcome any feedback over in #4762
Just checking in here. @charlesbluca did you get a chance to try out the structured logging system?
Not yet; I've been distracted with dask-sql / sorting stuff, but I still hope to look into this.
As part of the work on dask-sql, there has been some demand for machine-readable logs of worker metrics (such as GPU utilization and memory usage) coupled with the tasks those workers are currently running or have recently run, along with additional task metadata such as when each task was scheduled, started, and completed. With this data readily available, it would be easier to diagnose why certain tasks are bottlenecks in a given computation by tracking what was happening on the worker while the task was running.
To give an idea of what might be wanted here, some RAPIDS folk have developed and are currently using dask-metrics for this purpose, which is able to generate per-worker CSV files containing this information (with only GPU-relevant metrics). @jakirkham also suggested adding something like an "N slowest running tasks" table to the performance reports, although I think we would want the granular data as well.
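To make the idea of coupling task metadata to worker metrics a bit more concrete, here is a minimal sketch in plain Python. It does not use any Distributed API: the `TaskRecord` and `MetricSample` shapes, their field names, and both helper functions are hypothetical and only stand in for whatever the scheduler or a plugin would actually record. The sketch joins each task to the metric samples its worker took while the task was running, and then picks the N slowest tasks.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    # Hypothetical shape for per-task metadata.
    key: str
    worker: str
    start: float  # seconds, same clock as MetricSample.time
    stop: float


@dataclass
class MetricSample:
    # Hypothetical shape for one periodic worker metrics sample.
    worker: str
    time: float
    gpu_utilization: float  # percent
    gpu_memory_used: int    # bytes


def couple_tasks_to_metrics(tasks, samples):
    """Attach to each task a summary of the samples its worker took while it ran."""
    # Group samples per worker, sorted by time so we can bisect on timestamps.
    by_worker = {}
    for s in sorted(samples, key=lambda s: s.time):
        by_worker.setdefault(s.worker, []).append(s)

    rows = []
    for t in tasks:
        worker_samples = by_worker.get(t.worker, [])
        times = [s.time for s in worker_samples]
        # Samples with start <= time <= stop fall inside the task's run window.
        window = worker_samples[bisect_left(times, t.start):bisect_right(times, t.stop)]
        rows.append({
            "key": t.key,
            "worker": t.worker,
            "duration": t.stop - t.start,
            "n_samples": len(window),
            "mean_gpu_utilization": (
                mean(s.gpu_utilization for s in window) if window else None
            ),
            "peak_gpu_memory_used": max(
                (s.gpu_memory_used for s in window), default=None
            ),
        })
    return rows


def slowest_tasks(rows, n):
    """Return the n rows with the longest duration, slowest first."""
    return sorted(rows, key=lambda r: r["duration"], reverse=True)[:n]
```

The rows are plain dictionaries, so they could be written out as CSV or fed into a table; a task with no samples in its window gets `None` for the summary fields rather than a misleading zero.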
For context, all of this information is readily available through the scheduler, though it would need to be merged together manually:
Some options I've considered for this:
SystemMonitor or WorkerState metrics; not sure if/how this could be done but would make it easier to stream this data somewhere
It would be nice to have some discussion on if this is doable and worthwhile for troubleshooting performance in Distributed.
cc @randerzander