Fix: adding EFA-specific setup to distributed training runner for PT-XLA #143

Merged 2 commits on Aug 15, 2022
13 changes: 13 additions & 0 deletions src/sagemaker_training/pytorch_xla.py
@@ -72,6 +72,19 @@ def _setup(self): # type: () -> None
logger.info("Starting distributed training through PT-XLA Runtime.")
self._check_compatibility()

# Set NCCL logging to info to debug customer issues
os.environ["NCCL_DEBUG"] = "info"
Member:
This may potentially conflict with the settings in the Dockerfiles. We currently add this in the deep learning containers like so:
echo NCCL_DEBUG=INFO >> /etc/nccl.conf

Reply:

Overriding the default is the intent: we want to explicitly enable debug logging for NCCL operations. This lets customers share more detailed logs with us when they hit failures with Training Compiler in distributed training scenarios, which helps us debug issues faster. In our internal benchmarks we observed competitive performance even with this debug logging turned on. SageMaker data parallel likewise enables NCCL debug logging explicitly when it is enabled.


# Use `simple` protocol to handle the out-of-order data delivery from EFA
os.environ["NCCL_PROTO"] = "simple"

# Use GPU RDMA when available (available only in p4d.24xlarge)
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "1"

# Use multiple connections per GPU to better saturate the EFA bandwidth
os.environ["OFI_NCCL_NIC_DUP_CONNS"] = str(self._num_gpus)

# Set cluster configuration for XLA runtime
os.environ["XRT_HOST_ORDINAL"] = str(self._rank)
os.environ["XRT_SHARD_WORLD_SIZE"] = str(self._num_hosts)
address = "localservice:{};{}:" + str(self.WORKER_PORT)
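For context on the NCCL_DEBUG discussion above: NCCL loads /etc/nccl.conf but, to our understanding, does not overwrite variables that are already present in the process environment, so a value exported by the runner before the workers start takes precedence over the container default. A minimal sketch of that override pattern; the helper name and the extra NCCL_DEBUG_SUBSYS value are illustrative assumptions, not part of this PR:

import os

def configure_nccl_debug_logging(level="info"):
    # Illustrative sketch only. Export NCCL settings in the parent process
    # before any worker initializes NCCL; values already in the environment
    # take precedence over the defaults written to /etc/nccl.conf.
    os.environ["NCCL_DEBUG"] = level
    # Hypothetical extra detail for init/network issues; not set by this PR.
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

configure_nccl_debug_logging()
# ...launch the distributed workers after this point so they inherit the settings...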
4 changes: 4 additions & 0 deletions test/unit/test_pytorch_xla.py
@@ -85,6 +85,10 @@ def test_setup(self, cluster, cluster_size, master, instance_type, num_gpus, *pa
)
runner._check_compatibility = lambda: None
runner._setup()
assert os.environ["NCCL_DEBUG"] == "info"
assert os.environ["NCCL_PROTO"] == "simple"
assert os.environ["FI_EFA_USE_DEVICE_RDMA"] == "1"
assert os.environ["OFI_NCCL_NIC_DUP_CONNS"] == str(num_gpus)
assert os.environ["XRT_HOST_ORDINAL"] == str(rank)
assert os.environ["XRT_SHARD_WORLD_SIZE"] == str(cluster_size)
assert os.environ["XRT_WORKERS"] == "|".join(
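For illustration, a rough sketch of how the XRT cluster variables and the pipe-joined XRT_WORKERS string asserted above might resolve; the hostnames, rank, and port below are assumed placeholder values, not taken from the PR:

import os

# Hypothetical example values -- not taken from the PR.
hosts = ["algo-1", "algo-2"]   # assumed hostnames of the training cluster
rank = 0                       # ordinal of the current host
worker_port = 7680             # assumed stand-in for WORKER_PORT

os.environ["XRT_HOST_ORDINAL"] = str(rank)            # "0"
os.environ["XRT_SHARD_WORLD_SIZE"] = str(len(hosts))  # "2"

# Each worker entry follows the localservice:<ordinal>;<host>:<port>
# template from the diff above; entries for all hosts are joined with "|".
address = "localservice:{};{}:" + str(worker_port)
os.environ["XRT_WORKERS"] = "|".join(
    address.format(i, host) for i, host in enumerate(hosts)
)
# Resulting value: "localservice:0;algo-1:7680|localservice:1;algo-2:7680"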