Observability of resource utilization #182

eu9ene · 2023-09-05T18:54:20Z

We should be able to see CPU, GPU, RAM, and I/O utilization in real time while tasks are running. A GCP dashboard for the workers would work initially.

It would be great to see the utilization dashboards not only per worker but also per TC task in real time.

Later on, we could integrate this with an experiment tracking platform (see #164) to see everything in one place, including the historical resource utilization per task.

bhearsum · 2023-11-20T20:05:12Z

https://mozilla-hub.atlassian.net/browse/RELOPS-636 tracks more progress here, with two subtickets:

https://mozilla-hub.atlassian.net/browse/RELOPS-729 is filed to get the GCP monitoring agent installed to allow GPU reporting to work
https://mozilla-hub.atlassian.net/browse/RELOPS-728 is filed for a longer-term solution: getting all of this data into Grafana.

eu9ene · 2024-01-11T17:20:57Z

We decided to:

extend the GCP dashboard in the short term (we need GPU and RAM there)
use W&B tracking for per-task utilization
add Grafana dashboards with per-node info when we start supporting on-premises infrastructure, since a node can be reused by multiple tasks there
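
For illustration only, here is a minimal sketch of the kind of per-task GPU sampling the W&B item implies. It assumes `nvidia-smi` is available on the worker; the function names `parse_gpu_csv` and `sample_gpus` are hypothetical and not part of the pipeline, and forwarding the samples to a tracker is deliberately left out.

```python
import csv
import io
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    Returns one dict per GPU. Assumes every field is numeric; a value such
    as "[N/A]" would raise ValueError.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        util, used, total = (float(f.strip()) for f in fields)
        rows.append(
            {"gpu_util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def sample_gpus():
    """Take one sample from every GPU on this machine (hypothetical helper)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

A task wrapper could call `sample_gpus()` on a timer and attach the task ID to each sample, so the data stays tied to the task rather than to the node.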

bhearsum · 2024-01-24T17:08:38Z

https://bugzilla.mozilla.org/show_bug.cgi?id=1876326 is tracking rolling out a new image that has the GCP agent installed, and should fix this.

bhearsum · 2024-01-24T19:53:20Z

I'm testing out the new image right now. The GPU & RAM monitoring appears to be working well. One thing I noticed, however, is that the data seems to become inaccessible after the instance is deleted (which happens shortly after it shuts down). I'm going to see if there's a way to work around this, as it greatly reduces the usefulness of this monitoring if we can't, e.g., assess the hardware utilization of a training task that ran overnight.

bhearsum · 2024-01-29T16:35:27Z

> I'm testing out the new image right now. The GPU & RAM monitoring appears to be working well. One thing I noticed, however, is that the data seems to become inaccessible after the instance is deleted (which happens shortly after it shuts down). I'm going to see if there's a way to work around this, as it greatly reduces the usefulness of this monitoring if we can't, e.g., assess the hardware utilization of a training task that ran overnight.

I figured out a way around this, as described in #398. With that fixed, what do you want to do with this issue, @eu9ene? I think all of the worker-side work is done for now. (We may try to push data to Grafana at some point, but that's not planned at the moment.)

eu9ene · 2024-01-29T16:45:58Z

It looks good. It would also be useful to add disk throughput to the dashboard to get a better picture of what's happening on a machine. Otherwise, we can close this issue and create new ones for W&B and on-prem infrastructure as needed.
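
As a rough illustration of what a disk throughput metric measures, the sketch below derives MiB/s from two readings of the cumulative sector counters in Linux's `/proc/diskstats`. The function names are hypothetical, and this is not how the GCP dashboard computes its values.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats_line(line):
    """Return (device, bytes_read, bytes_written) from one /proc/diskstats line."""
    fields = line.split()
    device = fields[2]
    sectors_read = int(fields[5])
    sectors_written = int(fields[9])
    return device, sectors_read * SECTOR_BYTES, sectors_written * SECTOR_BYTES


def throughput_mib_per_s(bytes_before, bytes_after, seconds):
    """Average throughput between two cumulative byte counters."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_after - bytes_before) / seconds / (1024 * 1024)
```

Sampling the same device twice, a known number of seconds apart, and passing the two byte counts to `throughput_mib_per_s` gives the average rate over that interval.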

bhearsum · 2024-01-29T17:28:09Z

> It looks good. It would also be useful to add disk throughput to the dashboard to get a better picture of what's happening on a machine. Otherwise, we can close this issue and create new ones for W&B and on-prem infrastructure as needed.

Those are available already (I just didn't add them to the initial dashboard).

You should be able to do that yourself though!

eu9ene added the taskcluster, cost & perf, and p1 labels Sep 5, 2023

gabrielBusta mentioned this issue Sep 21, 2023

Run the translations pipeline on fxci level 1 workers #206

Closed

gregtatum mentioned this issue Dec 21, 2023

[meta] Make the pipeline reliable enough to train many languages #311

Open

bhearsum self-assigned this Jan 11, 2024

bhearsum closed this as completed Jan 29, 2024
