diff --git a/docs/img/gcp-monitoring.png b/docs/img/gcp-monitoring.png
new file mode 100644
index 000000000..fa0afe75c
Binary files /dev/null and b/docs/img/gcp-monitoring.png differ
diff --git a/docs/task-cluster.md b/docs/task-cluster.md
index c6e21a81c..ab1773017 100644
--- a/docs/task-cluster.md
+++ b/docs/task-cluster.md
@@ -57,6 +57,17 @@ We use [Taskcluster taskgraph](https://taskcluster-taskgraph.readthedocs.io/en/l
 ![All tasks](img/tc-all-tasks.png)
+## Resource monitoring
+
+CPU, GPU, RAM, and other metrics are available in GCP. The [Firefox Translations Worker Monitoring Dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/a6c8749a-75e2-490a-a7ea-628960c70ea8;startTime=2024-01-25T14:43:04Z;endTime=2024-01-25T20:43:04Z?project=fxci-production-level1-workers) is a good starting point for observing resource utilization during training. Filter this dashboard on the `name` of the instance running your training task. The name is the first dot-separated part of the `public-hostname` value, which appears near the top of the training log. For example:
+```
+[taskcluster 2024-01-24T18:43:50.869Z] "public-hostname": "translations-1-b-linux-v100-gpu-4-300g-uwfi5olorq6omun0mr1wgq.c.fxci-production-level1-workers.internal",
+```
+
+Once you have the name, use the "Add filter" button near the top of the page to limit the data shown. You should end up with a dashboard similar to this:
+![Firefox Translations Worker Monitoring Dashboard filtered to show CPU, RAM, and GPU usage of a single instance](img/gcp-monitoring.png)
+
+To build your own dashboard with different widgets, click "Firefox Translations Worker Monitoring" followed by "Create Dashboard". (A detailed tutorial on creating dashboards is out of scope for this document, but many resources are available online, and the UI is fairly intuitive.)
 ## Rerunning