Observability of resource utilization #182

Closed
Tracked by #311
eu9ene opened this issue Sep 5, 2023 · 7 comments
Labels
cost & perf: Speeding up and lowering cost for the pipeline
taskcluster: Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Sep 5, 2023

We should be able to see CPU, GPU, RAM, and IO utilization in real time while tasks are running. A GCP dashboard for the workers would work initially.

It would be great to see the utilization dashboards in real time not only per worker but also per TC task.

Later on, we could integrate it with an experiment tracking platform (see #164) to see everything in one place including the historical utilization of resources per task.

@eu9ene added the taskcluster, cost & perf, and p1 labels on Sep 5, 2023
@bhearsum
Collaborator

https://mozilla-hub.atlassian.net/browse/RELOPS-636 tracks more progress here, with two subtickets:

@eu9ene
Collaborator Author

eu9ene commented Jan 11, 2024

We decided to:

  • extend the GCP dashboard in the short term (we need GPU and RAM there)
  • use W&B tracking for per-task utilization
  • add Grafana dashboards with per-node info when we start supporting on-premises infrastructure, since a node can be reused by multiple tasks there
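
To make the per-task idea in the second bullet concrete, here is a minimal sketch. It is an illustration only, not the pipeline's implementation: `TaskUtilizationRecorder`, the `probe` callable, and the metric names are hypothetical, and the fake probe stands in for a real reading of CPU, GPU, RAM, or IO. Each sample is tagged with a task ID, so a node that is reused by several tasks still yields separate per-task summaries.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

# One reading of the machine: metric name -> value (hypothetical names).
Sample = Dict[str, float]


class TaskUtilizationRecorder:
    """Collects utilization samples and groups them by task ID."""

    def __init__(self, probe: Callable[[], Sample]) -> None:
        # `probe` is an assumption: any callable returning current metrics.
        self._probe = probe
        self._samples: Dict[str, List[Sample]] = defaultdict(list)

    def sample(self, task_id: str) -> Sample:
        """Take one reading and attribute it to the given task."""
        reading = dict(self._probe())
        self._samples[task_id].append(reading)
        return reading

    def summary(self, task_id: str) -> Dict[str, Dict[str, float]]:
        """Return the mean and max of every metric recorded for a task."""
        samples = self._samples.get(task_id, [])
        names = sorted({name for s in samples for name in s})
        result: Dict[str, Dict[str, float]] = {}
        for name in names:
            values = [s[name] for s in samples if name in s]
            result[name] = {"mean": mean(values), "max": max(values)}
        return result


# Fake probe with canned readings, standing in for real measurements.
readings = iter([
    {"cpu_percent": 40.0, "ram_percent": 20.0},
    {"cpu_percent": 80.0, "ram_percent": 30.0},
    {"cpu_percent": 10.0, "ram_percent": 50.0},
])
recorder = TaskUtilizationRecorder(lambda: next(readings))

# Two tasks share the same node; samples stay separated by task ID.
recorder.sample("task-a")
recorder.sample("task-a")
recorder.sample("task-b")

task_a = recorder.summary("task-a")
task_b = recorder.summary("task-b")
```

In a real setup the summaries (or the raw samples) would be forwarded to whichever tracking backend is chosen; that step is left out here because it depends on the backend's own API.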

@bhearsum bhearsum self-assigned this Jan 11, 2024
@bhearsum
Collaborator

https://bugzilla.mozilla.org/show_bug.cgi?id=1876326 is tracking rolling out a new image that has the GCP agent installed, and should fix this.

@bhearsum
Collaborator

I'm testing out the new image right now. The GPU & RAM monitoring appears to be working well. One thing I noticed, however, is that the data seems to become inaccessible after the instance is deleted (which happens shortly after it shuts down). I'm going to see if there's a way to work around this, as it greatly reduces the usefulness of this monitoring if we can't, eg: assess the hardware utilization of a training task that ran overnight.

@bhearsum
Collaborator

> I'm testing out the new image right now. The GPU & RAM monitoring appears to be working well. One thing I noticed, however, is that the data seems to become inaccessible after the instance is deleted (which happens shortly after it shuts down). I'm going to see if there's a way to work around this, as it greatly reduces the usefulness of this monitoring if we can't, eg: assess the hardware utilization of a training task that ran overnight.

I figured out a way around this, as described in #398. With that fixed, what do you want to do with this issue @eu9ene? I think all of the worker side stuff is done for now. (We may try to push data to Grafana at some point - but that's not planned for the moment.)

@eu9ene
Collaborator Author

eu9ene commented Jan 29, 2024

It looks good. It would also be useful to add disk throughput to the dashboard to get a better picture of what's happening on a machine. Otherwise, we can close this issue and create new ones for W&B and on-prem infrastructure as needed.

@bhearsum
Collaborator

> It looks good. It would also be useful to add disk throughput to the dashboard to get a better picture of what's happening on a machine. Otherwise, we can close this issue and create new ones for W&B and on-prem infrastructure as needed.

Those are available already (I just didn't add them to the initial dashboard):
[image]

You should be able to do that yourself though!


2 participants