---
tags:
  - ml
  - platform
---

# Distributed training

Training an ML model on multiple GPUs and/or servers is called distributed training. The vocabulary here is:

- node -- a single computing unit (server) participating in the job
- world size -- the total number of processes across all nodes (usually one process == one GPU)
- rank -- the index of a particular process, from 0 to world size - 1

So for training on 2 servers, each with 4 GPUs, we have:

- 2 nodes (== 2 servers)
- a world size of 2 * 4 = 8
- ranks 0 through 7 (the sketch below shows where these values come from at runtime)
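
A minimal sketch of how these values show up in code, assuming PyTorch's `torch.distributed` with processes launched by `torchrun` (which sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE` and the rendezvous environment variables for every process it spawns):

```python
# Sketch only: query rank / world size inside a process started by torchrun.
import os
import torch.distributed as dist

def main():
    # With the default env:// init method, this reads RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT from the environment set by torchrun.
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()                       # global process index: 0 .. world_size - 1
    world_size = dist.get_world_size()           # total processes across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])   # index of this process on its own node

    print(f"rank {rank} of {world_size} (local rank {local_rank})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

To reproduce the 2-node example above, each server would run something like `torchrun --nnodes=2 --nproc_per_node=4 train.py` (plus rendezvous settings so the nodes can find each other), giving a world size of 8 and ranks 0 through 7.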