Add documentation on how to monitor CPU, GPU, etc. on training instances (#398)
bhearsum authored Jan 29, 2024
1 parent 6df9a73 commit 437ceac
Showing 2 changed files with 11 additions and 0 deletions.
Binary file added docs/img/gcp-monitoring.png
11 changes: 11 additions & 0 deletions docs/task-cluster.md
@@ -57,6 +57,17 @@ We use [Taskcluster taskgraph](https://taskcluster-taskgraph.readthedocs.io/en/l

![All tasks](img/tc-all-tasks.png)

## Resource monitoring

CPU, GPU, RAM, and other metrics are available in GCP. The [Firefox Translations Worker Monitoring Dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/a6c8749a-75e2-490a-a7ea-628960c70ea8;startTime=2024-01-25T14:43:04Z;endTime=2024-01-25T20:43:04Z?project=fxci-production-level1-workers) is a good starting point for observing resource utilization during training. Filter this dashboard on the `name` of the instance running your training task. You can find this name at the top of the training log: it is the first part of the `public-hostname` value. For example:
```
[taskcluster 2024-01-24T18:43:50.869Z] "public-hostname": "translations-1-b-linux-v100-gpu-4-300g-uwfi5olorq6omun0mr1wgq.c.fxci-production-level1-workers.internal",
```
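
If you have the training log saved locally, the lookup described above can be scripted. The following is a minimal sketch, not part of the project's tooling: the function name `instance_name_from_log` is hypothetical, and it assumes the `public-hostname` entry is formatted as in the example line above.

```python
import re
from typing import Optional

# Matches the "public-hostname" entry as formatted in the example log line.
# Assumption: the hostname is a double-quoted string following the key.
HOSTNAME_RE = re.compile(r'"public-hostname":\s*"([^"]+)"')


def instance_name_from_log(log_text: str) -> Optional[str]:
    """Return the first label of the first public-hostname in the log, or None."""
    match = HOSTNAME_RE.search(log_text)
    if match is None:
        return None
    # The instance name is everything before the first dot of the hostname.
    return match.group(1).split(".", 1)[0]


line = (
    '[taskcluster 2024-01-24T18:43:50.869Z] "public-hostname": '
    '"translations-1-b-linux-v100-gpu-4-300g-uwfi5olorq6omun0mr1wgq'
    '.c.fxci-production-level1-workers.internal",'
)
print(instance_name_from_log(line))
# prints: translations-1-b-linux-v100-gpu-4-300g-uwfi5olorq6omun0mr1wgq
```

The printed value is the string to use when filtering on `name`.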

Once you have the name, use the "Add filter" button near the top of the page to limit the data shown. When done, you should end up with a dashboard similar to this:
![Firefox Translations Worker Monitoring Dashboard filtered to show CPU, RAM, and GPU usage of a single instance](img/gcp-monitoring.png)

If you want your own dashboard with different widgets, you can create a new one by clicking "Firefox Translations Worker Monitoring" followed by "Create Dashboard". (A detailed tutorial on creating dashboards is out of scope for this document, but many resources are available online, and the UI is fairly intuitive.)

## Rerunning
