Observability of resource utilization #182
Comments
https://mozilla-hub.atlassian.net/browse/RELOPS-636 tracks more progress here, with two subtickets:
We decided:
https://bugzilla.mozilla.org/show_bug.cgi?id=1876326 is tracking the rollout of a new image that has the GCP agent installed, which should fix this.
I'm testing out the new image right now. The GPU and RAM monitoring appears to be working well. One thing I noticed, however, is that the data seems to become inaccessible after the instance is deleted (which happens shortly after it shuts down). I'm going to see if there's a way to work around this, as it greatly reduces the usefulness of this monitoring if we can't, e.g., assess the hardware utilization of a training task that ran overnight.
I figured out a way around this, as described in #398. With that fixed, what do you want to do with this issue, @eu9ene? I think all of the worker-side work is done for now. (We may try to push data to Grafana at some point, but that's not planned for the moment.)
It looks good. It would also be useful to add disk throughput to the dashboard to get a better picture of what's happening on a machine. Otherwise, we can close this issue and create new ones for W&B and on-prem infrastructure as needed.
We should be able to see CPU, GPU, RAM, and IO utilization in real time while running the tasks. A GCP dashboard for the workers would work initially.
It would be great to see the utilization dashboards not only per worker but also per TC task in real time.
Later on, we could integrate it with an experiment tracking platform (see #164) to see everything in one place including the historical utilization of resources per task.
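To make the per-task idea concrete, here is a minimal sketch of what a utilization sample tagged with a task ID could look like. It is not how the workers collect metrics (that is handled by the GCP agent mentioned in the comments); the function names, the record fields, and the `task_id` argument are all assumptions made for illustration. It uses only the Python standard library, reads Linux `/proc` text, and covers RAM and load average only; GPU and disk throughput are out of scope here.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def ram_utilization(meminfo):
    """Fraction of RAM in use, derived from MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def make_sample(task_id, meminfo_text, loadavg_text, timestamp=None):
    """Build one utilization record tagged with a (hypothetical) task ID."""
    meminfo = parse_meminfo(meminfo_text)
    return {
        "task_id": task_id,
        "timestamp": time.time() if timestamp is None else timestamp,
        "ram_utilization": round(ram_utilization(meminfo), 4),
        # First field of /proc/loadavg is the 1-minute load average.
        "load_1min": float(loadavg_text.split()[0]),
    }


def read_local_sample(task_id):
    """Take a sample from the local machine (Linux only)."""
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    with open("/proc/loadavg") as f:
        loadavg_text = f.read()
    return make_sample(task_id, meminfo_text, loadavg_text)
```

Calling `read_local_sample("some-task-id")` at a fixed interval and shipping the records to a metrics backend would give historical utilization per task, assuming the task ID is available to the process doing the sampling.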