Skip to content

Burn-in scripts for IB and GPU testing, forked from imbue-ai/cluster-health

License

Notifications You must be signed in to change notification settings

sfcompute/cluster-health

 
 

Repository files navigation

Training Infrastructure Scripts

This repository contains various scripts developed at Imbue for managing a large cluster of H100s, detecting and fixing hardware issues, and generally ensuring smooth model training. You can read more about our process here

The code is organized as follows:

  • gpu_stress_test contains a check that the GPUs on each machine are able to allocate large tensors and perform standard operations.
  • health_checks contains various checks we use to determine which hosts are healthy, as well as automated solutions to common issues.
  • host_validation contains tests to check that the GPUs on a given machine are able to communicate with each other (via NVLink) and with GPUs on other machines (via InfiniBand).
  • ufm_events contains a script which parses the UFM event log and other logs, checks for relevant events, and determines which network ports should be disabled.
  • ib_burn contains a script for generating a comprehensive burn-in workload for IB fabrics, aiming to exercise every available link.

About

Burn-in scripts for IB and GPU testing, forked from imbue-ai/cluster-health

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 94.8%
  • Shell 5.2%