Fine performance metrics: Break down idle time on the Scheduler #7672

Open
crusaderky opened this issue Mar 17, 2023 · 0 comments

With #7671 done, we know how much time workers spend idle because they are not getting enough Compute messages from the scheduler.
This idle time can be broken down further on the scheduler side by adding negative corrections to Scheduler.cumulative_worker_metrics["execute", "n/a", "idle", "seconds"].
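
A minimal sketch of what such a negative correction could look like, assuming a plain dict stands in for Scheduler.cumulative_worker_metrics. The helper `reclassify_idle`, the label `"idle-no-tasks-on-cluster"`, and the clamping to the available idle time are all hypothetical choices made for illustration, not part of the existing code:

```python
from collections import defaultdict

# Key quoted in the issue text; the dict below is only a stand-in for
# Scheduler.cumulative_worker_metrics.
IDLE_KEY = ("execute", "n/a", "idle", "seconds")


def reclassify_idle(metrics, new_activity, seconds):
    """Move `seconds` out of the generic idle bucket into a more specific one.

    `new_activity` is a hypothetical label. The amount is clamped so that the
    idle bucket never goes negative (an assumption of this sketch).
    """
    seconds = min(seconds, metrics[IDLE_KEY])
    new_key = ("execute", "n/a", new_activity, "seconds")
    metrics[IDLE_KEY] -= seconds  # the negative correction
    metrics[new_key] += seconds   # same time, recorded under a finer label
    return seconds


metrics = defaultdict(float)
metrics[IDLE_KEY] = 10.0
moved = reclassify_idle(metrics, "idle-no-tasks-on-cluster", 4.0)
total = sum(metrics.values())
```

Because every correction is matched by an equal addition elsewhere, the total time accounted for stays unchanged; only its labelling becomes finer.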

On the scheduler, we know for each worker:

  • time spent with tasks in processing state. The delta between this and the sum of worker metrics other than 'idle' shows, for example, time spent on imperfectly pipelined RTTs between worker and scheduler; it should increase when distributed.scheduler.worker-saturation is too low.

  • time spent with not enough tasks in processing state on the worker, but at least one task processing somewhere on the cluster, e.g. because the workload is not fully parallelisable.

  • time spent with zero tasks in processing state anywhere on the cluster, e.g. while waiting for the Client. This should include the initial decision time between the moment the scheduler receives update_graph and the moment it releases the event loop.
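
The three categories above could be told apart roughly as follows. This is a sketch under stated assumptions, not the scheduler's implementation: the function names, the labels, the sampled `(duration, {worker: n_processing})` input, and the reading of "not enough tasks" as "fewer tasks than worker threads" are all hypothetical.

```python
def classify(n_worker, n_threads, n_cluster):
    """Label one worker's state for one time slice.

    n_worker:  tasks in processing state on this worker
    n_threads: threads on this worker (assumed saturation threshold)
    n_cluster: tasks in processing state on the whole cluster
    """
    if n_cluster == 0:
        return "no-tasks-on-cluster"         # e.g. waiting for the Client
    if n_worker < n_threads:
        return "not-enough-tasks-on-worker"  # e.g. limited parallelism
    return "tasks-processing"                # worker has enough work assigned


def break_down(samples, n_threads):
    """Accumulate seconds per (worker, label).

    samples:   iterable of (duration_seconds, {worker: n_processing})
    n_threads: {worker: thread count}
    """
    totals = {}
    for duration, processing in samples:
        n_cluster = sum(processing.values())
        for worker, n in processing.items():
            key = (worker, classify(n, n_threads[worker], n_cluster))
            totals[key] = totals.get(key, 0.0) + duration
    return totals


samples = [
    (1.0, {"a": 2, "b": 0}),  # only worker a is busy
    (2.0, {"a": 0, "b": 0}),  # nothing processing anywhere
    (0.5, {"a": 2, "b": 2}),  # both workers saturated
]
totals = break_down(samples, n_threads={"a": 2, "b": 2})
```

In this sketch the time labelled `"tasks-processing"` is the quantity that would be compared against the sum of the non-idle worker metrics to expose imperfectly pipelined round trips.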
