🐛[bug] Running Mnist Tutorial distributed causes Runtime Errors and Hanging behavior #8915

Closed
samjenks opened this issue Feb 28, 2024 · 12 comments

samjenks commented Feb 28, 2024

Describe the bug

I am attempting to test my configuration/setup by running the MNIST tutorial in distributed mode on 1 agent with 8 GPUs. I've set the experiment configuration to 8 slots_per_trial. When I do that, parts of the training loop hang and eventually error out with a watchdog error.
On deeper inspection, a runtime error is occurring: DDP expects same model across all ranks, but Rank 0 has 8 params while rank 3 has inconsistent 0 params. It repeats this error for multiple ranks, but it is always one rank with 8 params and another rank with 0 params.
Does anyone have any insight into what is going on here? I assume there isn't enough data to shard across the model, or that the model can't be correctly broken up into 8 chunks. Is there a way to prevent this from happening prior to training, or a fallback so it doesn't retry this 5 times?

This also occurs on an agent with 8 slots and an experiment configuration of 7 slots_per_trial, but it does not occur at 6 slots_per_trial.

Reproduction Steps

  1. Download mnist_pytorch.tgz from the tutorial
  2. Set slots_per_trial to 8 on an 8-GPU machine (a config sketch follows this list)
  3. Run the experiment
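For reference, a minimal sketch of the experiment config used here, reconstructed from the trial configuration that appears in the logs later in this thread (values are taken from that log; the actual distributed.yaml shipped with the tutorial may differ slightly):

name: mnist_pytorch_distributed
entrypoint: python3 -m determined.launch.torch_distributed python3 train.py
hyperparameters:
  dropout1: 0.25
  dropout2: 0.5
  learning_rate: 1
  n_filters1: 32
  n_filters2: 64
resources:
  slots_per_trial: 8
searcher:
  name: single
  metric: validation_loss
  smaller_is_better: true
  max_length:
    epochs: 5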

Expected Behavior

I expected the model to run on 8 GPUs without issue.

Screenshot

N/A

Environment


Additional Context

No response

samjenks added the bug label Feb 28, 2024
MikhailKardash (Contributor) commented:

Thanks for the report. Can you share the full stacktrace for this error?

To avoid restarting 5 times, you can configure max_restarts in the experiment config YAML file (a sketch follows): https://docs.determined.ai/latest/reference/training/experiment-config-reference.html#max-restarts
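For example, a minimal sketch of the relevant config snippet (0 here is an illustrative value that disables automatic restarts entirely; the default, as seen in the trial config below, is 5):

max_restarts: 0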


samjenks commented Feb 28, 2024

experiment_138_trial_139_logs.txt
[2024-02-27T21:05:03.626766Z] || INFO: Scheduling Trial 139 (Experiment 138) (id: 138.4cae5cf7-4ce2-4b43-965c-38c2c097a554.1)
[2024-02-27T21:05:04.023266Z] || INFO: Trial 139 (Experiment 138) was assigned to an agent
[2024-02-27T21:05:04.028090Z] fa88a9df || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.27.1
[2024-02-27T21:05:04.178866Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.194009Z] fa88a9df || INFO: copying files to container: /run/determined
[2024-02-27T21:05:04.207514Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.220384Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.269477Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.281992Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.295228Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:04.306898Z] fa88a9df || INFO: copying files to container: /
[2024-02-27T21:05:05.216634Z] fa88a9df || INFO: Resources for Trial 139 (Experiment 138) have started
[2024-02-27T21:05:09.217845Z] fa88a9df || INFO: [27] determined: detected 8 gpus
[2024-02-27T21:05:09.373238Z] fa88a9df || INFO: [27] determined: detected 8 gpus
[2024-02-27T21:05:09.373333Z] fa88a9df || INFO: [27] determined: Running task container on agent_id=Neumann, hostname=5bff62ad6d4b with visible GPUs ['GPU-667d2478-75d9-c4c4-705d-1794fd9a156a', 'GPU-5d41e3bc-02fd-01b0-76db-e2ee9858a1a4', 'GPU-054e9832-1245-8c17-12d8-0706d5e0f73d', 'GPU-3aabe71e-e052-c4ee-a2b0-2f7b6f9a5753', 'GPU-441b6928-1fc7-bc07-565f-fe13d8681a7e', 'GPU-0a1f75db-5d4b-6823-b997-6189f5bed35d', 'GPU-11b6d352-6e56-06a5-7d2e-0285ff0378d6', 'GPU-cf6d6136-fc93-3917-b9b9-a6d1c527c98f']
[2024-02-27T21:05:09.796087Z] fa88a9df || + test -f startup-hook.sh
[2024-02-27T21:05:09.796171Z] fa88a9df || + set +x
[2024-02-27T21:05:10.425153Z] fa88a9df || INFO: [8] determined: New trial runner in (container fa88a9df-a6e7-44cf-b736-2469c7501394) on agent Neumann: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/tmp", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": "determined-checkpoint", "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "python3 -m determined.launch.torch_distributed python3 train.py", "environment": {"image": {"cpu": "determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.27.1", "cuda": "determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.27.1", "rocm": "determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.27.1"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dropout1": {"type": "const", "val": 0.25}, "dropout2": {"type": "const", "val": 0.5}, "learning_rate": {"type": "const", "val": 1}, "n_filters1": {"type": "const", "val": 32}, "n_filters2": {"type": "const", "val": 64}}, "labels": ["cluster-testing"], "log_policies": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "mnist_pytorch_distributed", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": true, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 1, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "cluster-testing", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1709067903}, "resources": {"max_slots": null, "slots_per_trial": 8, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "Martians", "priority": null, "is_single_node": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "validation_loss", "name": "single", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "SkyPro", "slurm": {}, "pbs": {}}
[2024-02-27T21:05:10.425310Z] fa88a9df || INFO: [8] determined: Validating checkpoint storage ...
[2024-02-27T21:05:10.426112Z] fa88a9df || INFO: [8] determined: Launching: ['sh', '-c', 'python3 -m determined.launch.torch_distributed python3 train.py']
[2024-02-27T21:05:13.225017Z] fa88a9df || WARNING: main:
[2024-02-27T21:05:13.225186Z] fa88a9df || *****************************************
[2024-02-27T21:05:13.225226Z] fa88a9df || Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-27T21:05:13.225252Z] fa88a9df || *****************************************
[2024-02-27T21:05:18.318720Z] fa88a9df [rank=5] || INFO: [180] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 5
[2024-02-27T21:05:18.482485Z] fa88a9df [rank=7] || INFO: [186] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 7
[2024-02-27T21:05:18.483550Z] fa88a9df [rank=0] || INFO: [184] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 0
[2024-02-27T21:05:18.618555Z] fa88a9df [rank=2] || INFO: [181] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 2
[2024-02-27T21:05:18.670399Z] fa88a9df [rank=3] || INFO: [182] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 3
[2024-02-27T21:05:18.681489Z] fa88a9df [rank=1] || INFO: [183] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 1
[2024-02-27T21:05:18.682858Z] fa88a9df [rank=6] || INFO: [187] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 6
[2024-02-27T21:05:18.692192Z] fa88a9df [rank=4] || INFO: [185] torch.distributed.distributed_c10d: Added key: store_based_barrier_key:1 to store for rank: 4
[2024-02-27T21:05:18.692661Z] fa88a9df [rank=1] || INFO: [183] torch.distributed.distributed_c10d: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.693189Z] fa88a9df [rank=4] || INFO: [185] torch.distributed.distributed_c10d: Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.694085Z] fa88a9df [rank=2] || INFO: [181] torch.distributed.distributed_c10d: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.694242Z] fa88a9df [rank=6] || INFO: [187] torch.distributed.distributed_c10d: Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.697361Z] fa88a9df [rank=7] || INFO: [186] torch.distributed.distributed_c10d: Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.698787Z] fa88a9df [rank=0] || INFO: [184] torch.distributed.distributed_c10d: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.701431Z] fa88a9df [rank=5] || INFO: [180] torch.distributed.distributed_c10d: Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:05:18.702756Z] fa88a9df [rank=3] || INFO: [182] torch.distributed.distributed_c10d: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-02-27T21:35:28.345299Z] fa88a9df [rank=5] || [E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804409 milliseconds before timing out.
[2024-02-27T21:35:28.418254Z] fa88a9df [rank=7] || Traceback (most recent call last):
[2024-02-27T21:35:28.418362Z] fa88a9df [rank=7] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.418396Z] fa88a9df [rank=7] || run(local=local_training)
[2024-02-27T21:35:28.418423Z] fa88a9df [rank=7] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.418449Z] fa88a9df [rank=7] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.418490Z] fa88a9df [rank=7] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.418515Z] fa88a9df [rank=7] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.418540Z] fa88a9df [rank=7] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.418563Z] fa88a9df [rank=7] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.418588Z] fa88a9df [rank=7] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.418631Z] fa88a9df [rank=7] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.418657Z] fa88a9df [rank=7] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.418680Z] fa88a9df [rank=7] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.418704Z] fa88a9df [rank=7] || RuntimeError: DDP expects same model across all ranks, but Rank 7 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:28.419514Z] fa88a9df [rank=5] || Traceback (most recent call last):
[2024-02-27T21:35:28.419561Z] fa88a9df [rank=5] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.419584Z] fa88a9df [rank=5] || run(local=local_training)
[2024-02-27T21:35:28.419608Z] fa88a9df [rank=5] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.419630Z] fa88a9df [rank=5] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.419652Z] fa88a9df [rank=5] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.419686Z] fa88a9df [rank=5] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.419711Z] fa88a9df [rank=5] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.419732Z] fa88a9df [rank=5] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.419755Z] fa88a9df [rank=5] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.419777Z] fa88a9df [rank=5] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.419801Z] fa88a9df [rank=5] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.419823Z] fa88a9df [rank=5] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.419846Z] fa88a9df [rank=5] || RuntimeError: DDP expects same model across all ranks, but Rank 5 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:28.419868Z] fa88a9df [rank=6] || Traceback (most recent call last):
[2024-02-27T21:35:28.419903Z] fa88a9df [rank=6] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.419925Z] fa88a9df [rank=0] || Traceback (most recent call last):
[2024-02-27T21:35:28.419947Z] fa88a9df [rank=6] || run(local=local_training)
[2024-02-27T21:35:28.419968Z] fa88a9df [rank=6] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.419995Z] fa88a9df [rank=0] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.420021Z] fa88a9df [rank=6] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.420042Z] fa88a9df [rank=6] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.420069Z] fa88a9df [rank=6] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.420093Z] fa88a9df [rank=6] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.420125Z] fa88a9df [rank=6] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.420147Z] fa88a9df [rank=0] || run(local=local_training)
[2024-02-27T21:35:28.420173Z] fa88a9df [rank=6] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.420196Z] fa88a9df [rank=0] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.420219Z] fa88a9df [rank=0] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.420241Z] fa88a9df [rank=0] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.420263Z] fa88a9df [rank=0] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.420287Z] fa88a9df [rank=0] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.420309Z] fa88a9df [rank=6] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.420337Z] fa88a9df [rank=6] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.420359Z] fa88a9df [rank=0] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.420381Z] fa88a9df [rank=6] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.420404Z] fa88a9df [rank=0] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.420427Z] fa88a9df [rank=6] || RuntimeError: DDP expects same model across all ranks, but Rank 6 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:28.420459Z] fa88a9df [rank=0] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.420482Z] fa88a9df [rank=0] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.420504Z] fa88a9df [rank=0] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.420527Z] fa88a9df [rank=0] || RuntimeError: DDP expects same model across all ranks, but Rank 0 has 8 params, while rank 3 has inconsistent 0 params.
[2024-02-27T21:35:28.500176Z] fa88a9df [rank=7] || [E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804566 milliseconds before timing out.
[2024-02-27T21:35:28.513756Z] fa88a9df [rank=0] || [E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804578 milliseconds before timing out.
[2024-02-27T21:35:28.513942Z] fa88a9df [rank=5] || [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[2024-02-27T21:35:28.513992Z] fa88a9df [rank=5] || [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[2024-02-27T21:35:28.514037Z] fa88a9df [rank=5] || terminate called after throwing an instance of 'std::runtime_error'
[2024-02-27T21:35:28.514065Z] fa88a9df [rank=5] || what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804409 milliseconds before timing out.
[2024-02-27T21:35:28.514089Z] fa88a9df [rank=5] || Fatal Python error: Aborted
[2024-02-27T21:35:28.514110Z] fa88a9df [rank=5] ||
[2024-02-27T21:35:28.514139Z] fa88a9df [rank=5] || Thread 0x00007f30db02e740 (most recent call first):
[2024-02-27T21:35:28.514161Z] fa88a9df [rank=5] ||
[2024-02-27T21:35:28.642064Z] fa88a9df [rank=2] || [E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804705 milliseconds before timing out.
[2024-02-27T21:35:28.699004Z] fa88a9df [rank=3] || [E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804712 milliseconds before timing out.
[2024-02-27T21:35:28.705869Z] fa88a9df [rank=1] || [E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804775 milliseconds before timing out.
[2024-02-27T21:35:28.718657Z] fa88a9df [rank=4] || [E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804782 milliseconds before timing out.
[2024-02-27T21:35:28.726901Z] fa88a9df [rank=2] || Traceback (most recent call last):
[2024-02-27T21:35:28.727029Z] fa88a9df [rank=2] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.727090Z] fa88a9df [rank=2] || run(local=local_training)
[2024-02-27T21:35:28.727117Z] fa88a9df [rank=2] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.727142Z] fa88a9df [rank=2] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.727165Z] fa88a9df [rank=2] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.727188Z] fa88a9df [rank=2] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.727213Z] fa88a9df [rank=2] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.727235Z] fa88a9df [rank=2] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.727259Z] fa88a9df [rank=2] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.727281Z] fa88a9df [rank=2] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.727305Z] fa88a9df [rank=2] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.727328Z] fa88a9df [rank=2] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.727352Z] fa88a9df [rank=2] || RuntimeError: DDP expects same model across all ranks, but Rank 2 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:28.727473Z] fa88a9df [rank=4] || Traceback (most recent call last):
[2024-02-27T21:35:28.727536Z] fa88a9df [rank=4] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.727871Z] fa88a9df [rank=4] || run(local=local_training)
[2024-02-27T21:35:28.727907Z] fa88a9df [rank=4] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.727931Z] fa88a9df [rank=4] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.727953Z] fa88a9df [rank=4] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.727984Z] fa88a9df [rank=4] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.728008Z] fa88a9df [rank=4] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.728033Z] fa88a9df [rank=4] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.728058Z] fa88a9df [rank=4] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.728082Z] fa88a9df [rank=4] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.728108Z] fa88a9df [rank=4] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.728133Z] fa88a9df [rank=4] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.728159Z] fa88a9df [rank=4] || RuntimeError: DDP expects same model across all ranks, but Rank 4 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:28.828776Z] fa88a9df [rank=2] || [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[2024-02-27T21:35:28.828905Z] fa88a9df [rank=2] || [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[2024-02-27T21:35:28.828941Z] fa88a9df [rank=2] || terminate called after throwing an instance of 'std::runtime_error'
[2024-02-27T21:35:28.828971Z] fa88a9df [rank=2] || what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804705 milliseconds before timing out.
[2024-02-27T21:35:28.828995Z] fa88a9df [rank=2] || Fatal Python error: Aborted
[2024-02-27T21:35:28.829019Z] fa88a9df [rank=2] ||
[2024-02-27T21:35:28.829044Z] fa88a9df [rank=2] || Thread 0x00007f52d58f3740 (most recent call first):
[2024-02-27T21:35:28.829066Z] fa88a9df [rank=2] ||
[2024-02-27T21:35:28.893278Z] fa88a9df [rank=4] || [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[2024-02-27T21:35:28.893459Z] fa88a9df [rank=4] || [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[2024-02-27T21:35:28.893507Z] fa88a9df [rank=4] || terminate called after throwing an instance of 'std::runtime_error'
[2024-02-27T21:35:28.893536Z] fa88a9df [rank=4] || what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804782 milliseconds before timing out.
[2024-02-27T21:35:28.893559Z] fa88a9df [rank=4] || Fatal Python error: Aborted
[2024-02-27T21:35:28.893581Z] fa88a9df [rank=4] ||
[2024-02-27T21:35:28.893605Z] fa88a9df [rank=4] || Thread 0x00007eff6f28b740 (most recent call first):
[2024-02-27T21:35:28.893627Z] fa88a9df [rank=4] ||
[2024-02-27T21:35:28.926822Z] fa88a9df [rank=3] || Traceback (most recent call last):
[2024-02-27T21:35:28.926905Z] fa88a9df [rank=3] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.926973Z] fa88a9df [rank=3] || run(local=local_training)
[2024-02-27T21:35:28.927000Z] fa88a9df [rank=3] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.927025Z] fa88a9df [rank=3] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.927047Z] fa88a9df [rank=3] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.927072Z] fa88a9df [rank=3] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.927097Z] fa88a9df [rank=3] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.927120Z] fa88a9df [rank=3] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.927144Z] fa88a9df [rank=3] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.927175Z] fa88a9df [rank=3] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.927199Z] fa88a9df [rank=3] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.927221Z] fa88a9df [rank=3] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.927244Z] fa88a9df [rank=3] || RuntimeError: [3]: params[0] in this process with sizes [32, 1, 3, 3] appears not to match sizes of the same param in process 0.
[2024-02-27T21:35:28.958073Z] fa88a9df [rank=1] || Traceback (most recent call last):
[2024-02-27T21:35:28.958145Z] fa88a9df [rank=1] || File "/run/determined/workdir/train.py", line 126, in
[2024-02-27T21:35:28.958655Z] fa88a9df [rank=1] || run(local=local_training)
[2024-02-27T21:35:28.958684Z] fa88a9df [rank=1] || File "/run/determined/workdir/train.py", line 116, in run
[2024-02-27T21:35:28.958772Z] fa88a9df [rank=1] || trial = MNistTrial(train_context, hparams=hparams)
[2024-02-27T21:35:28.958799Z] fa88a9df [rank=1] || File "/run/determined/workdir/train.py", line 46, in init
[2024-02-27T21:35:28.958817Z] fa88a9df [rank=1] || self.model = self.context.wrap_model(model.build_model(hparams=hparams))
[2024-02-27T21:35:28.958837Z] fa88a9df [rank=1] || File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_context.py", line 299, in wrap_model
[2024-02-27T21:35:28.958994Z] fa88a9df [rank=1] || wrapped_model = self._PyTorchDistributedDataParallel(model)
[2024-02-27T21:35:28.959022Z] fa88a9df [rank=1] || File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in init
[2024-02-27T21:35:28.959194Z] fa88a9df [rank=1] || _verify_param_shape_across_processes(self.process_group, parameters)
[2024-02-27T21:35:28.959223Z] fa88a9df [rank=1] || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
[2024-02-27T21:35:28.959283Z] fa88a9df [rank=1] || return dist._verify_params_across_processes(process_group, tensors, logger)
[2024-02-27T21:35:28.959349Z] fa88a9df [rank=1] || RuntimeError: DDP expects same model across all ranks, but Rank 1 has 8 params, while rank 0 has inconsistent 0 params.
[2024-02-27T21:35:29.149213Z] fa88a9df [rank=1] || [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[2024-02-27T21:35:29.149348Z] fa88a9df [rank=1] || [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[2024-02-27T21:35:29.149395Z] fa88a9df [rank=1] || terminate called after throwing an instance of 'std::runtime_error'
[2024-02-27T21:35:29.149424Z] fa88a9df [rank=1] || what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1804775 milliseconds before timing out.
[2024-02-27T21:35:29.149448Z] fa88a9df [rank=1] || Fatal Python error: Aborted
[2024-02-27T21:35:29.149470Z] fa88a9df [rank=1] ||
[2024-02-27T21:35:29.149494Z] fa88a9df [rank=1] || Thread 0x00007f7a4fa45740 (most recent call first):
[2024-02-27T21:35:29.149516Z] fa88a9df [rank=1] ||
[2024-02-27T21:35:30.110313Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 165 closing signal SIGTERM
[2024-02-27T21:35:30.110442Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 166 closing signal SIGTERM
[2024-02-27T21:35:30.110571Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 168 closing signal SIGTERM
[2024-02-27T21:35:30.110617Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 169 closing signal SIGTERM
[2024-02-27T21:35:32.106313Z] fa88a9df || WARNING: torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
[2024-02-27T21:35:32.106517Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 165 closing signal SIGTERM
[2024-02-27T21:35:32.106571Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 166 closing signal SIGTERM
[2024-02-27T21:35:32.106606Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 168 closing signal SIGTERM
[2024-02-27T21:35:32.106630Z] fa88a9df || WARNING: torch.distributed.elastic.multiprocessing.api:Sending process 169 closing signal SIGTERM
[2024-02-27T21:35:34.295539Z] fa88a9df || Traceback (most recent call last):
[2024-02-27T21:35:34.295662Z] fa88a9df || File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[2024-02-27T21:35:34.295706Z] fa88a9df || return _run_code(code, main_globals, None,
[2024-02-27T21:35:34.295730Z] fa88a9df || File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[2024-02-27T21:35:34.295754Z] fa88a9df || exec(code, run_globals)
[2024-02-27T21:35:34.295779Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in
[2024-02-27T21:35:34.296712Z] fa88a9df || main()
[2024-02-27T21:35:34.296814Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
[2024-02-27T21:35:34.296854Z] fa88a9df || return f(*args, **kwargs)
[2024-02-27T21:35:34.296878Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
[2024-02-27T21:35:34.297808Z] fa88a9df || run(args)
[2024-02-27T21:35:34.297906Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
[2024-02-27T21:35:34.298578Z] fa88a9df || elastic_launch(
[2024-02-27T21:35:34.298676Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
[2024-02-27T21:35:34.298707Z] fa88a9df || return launch_agent(self._config, self._entrypoint, list(args))
[2024-02-27T21:35:34.298729Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
[2024-02-27T21:35:34.298760Z] fa88a9df || result = agent.run()
[2024-02-27T21:35:34.298782Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
[2024-02-27T21:35:34.298804Z] fa88a9df || result = f(*args, **kwargs)
[2024-02-27T21:35:34.298826Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
[2024-02-27T21:35:34.299310Z] fa88a9df || result = self._invoke_run(role)
[2024-02-27T21:35:34.299408Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 865, in _invoke_run
[2024-02-27T21:35:34.299714Z] fa88a9df || run_result = self._monitor_workers(self._worker_group)
[2024-02-27T21:35:34.299812Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
[2024-02-27T21:35:34.299840Z] fa88a9df || result = f(*args, **kwargs)
[2024-02-27T21:35:34.299864Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 306, in _monitor_workers
[2024-02-27T21:35:34.299896Z] fa88a9df || result = self._pcontext.wait(0)
[2024-02-27T21:35:34.299918Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 288, in wait
[2024-02-27T21:35:34.300261Z] fa88a9df || return self._poll()
[2024-02-27T21:35:34.300356Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 664, in _poll
[2024-02-27T21:35:34.300398Z] fa88a9df || self.close() # terminate all running procs
[2024-02-27T21:35:34.300421Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 331, in close
[2024-02-27T21:35:34.300471Z] fa88a9df || self._close(death_sig=death_sig, timeout=timeout)
[2024-02-27T21:35:34.300493Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 708, in _close
[2024-02-27T21:35:34.300932Z] fa88a9df || handler.proc.wait(time_to_wait)
[2024-02-27T21:35:34.301031Z] fa88a9df || File "/opt/conda/lib/python3.10/subprocess.py", line 1209, in wait
[2024-02-27T21:35:34.301482Z] fa88a9df || return self._wait(timeout=timeout)
[2024-02-27T21:35:34.301579Z] fa88a9df || File "/opt/conda/lib/python3.10/subprocess.py", line 1953, in _wait
[2024-02-27T21:35:34.302304Z] fa88a9df || time.sleep(delay)
[2024-02-27T21:35:34.302401Z] fa88a9df || File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
[2024-02-27T21:35:34.302431Z] fa88a9df || raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
[2024-02-27T21:35:34.302453Z] fa88a9df || torch.distributed.elastic.multiprocessing.api.SignalException: Process 98 got signal: 15
[2024-02-27T21:35:35.624289Z] fa88a9df || ERROR: crashed: resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)
[2024-02-27T21:35:35.628581Z] || ERROR: Trial 139 (Experiment 138) was terminated: allocation failed: resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)


ioga commented Feb 28, 2024

Images with -mpi- in them are only for enterprise HPC setups running on HPE/Cray gear.
For your case, please use a non-MPI build instead, e.g. determinedai/environments:cuda-11.8-pytorch-2.0-gpu-0.27.1.
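A hedged sketch of how this image override might look in the experiment config, mirroring the environment section visible in the trial config logged above (the tag is the one suggested in this comment):

environment:
  image:
    cuda: determinedai/environments:cuda-11.8-pytorch-2.0-gpu-0.27.1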


samjenks commented Feb 28, 2024

Thanks,

I fixed the Docker environment accordingly, but it did not seem to change anything with regard to the DDP rank problem above.


ioga commented Feb 28, 2024

  1. Have you ever run distributed training with NCCL on the same node outside Determined?
  2. Can you try setting the NCCL_DEBUG=INFO env variable on the experiment to get more logs?
  3. Any more details on your setup? On some systems it may be necessary to configure NCCL to use the proper network interfaces, e.g. NCCL_SOCKET_IFNAME=ens,eth,ib

samjenks (Author) commented:

  1. Yes, just a normal PyTorch DDP run with the NCCL backend; it worked.

  2. experiment_144_trial_145_logs.txt

  3. As for setup, this is all on one server with 8 Nvidia A40 cards. There isn't any networking being attempted yet.

Reading up on NCCL_SOCKET_IFNAME, how do I determine whether it needs to be set for the Docker containers Determined runs in? My previous distributed training attempts were via Conda envs with no containers involved.


ioga commented Feb 28, 2024

Okay. I assume that was without Docker, so in this thread we are basically troubleshooting "why does NCCL hang in Docker".

  1. You can set the env vars in the experiment config under the environment -> environment_variables section, e.g.
environment:
  environment_variables:
    - NCCL_DEBUG=INFO
    # You may need to modify this to match your network configuration.
    - NCCL_SOCKET_IFNAME=ens,eth,ib

A few more ideas to try:

  1. Try setting the recommended Nvidia settings for Docker, i.e. put something like this into your /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    },
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  2. Try setting NCCL_SHM_DISABLE=1 and/or NCCL_P2P_DISABLE=1 to make it fall back to network comms (a config sketch follows).
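A hedged sketch of how these could be added to the experiment config, reusing the environment_variables pattern shown earlier (try them one at a time to narrow down which transport is at fault):

environment:
  environment_variables:
    - NCCL_SHM_DISABLE=1
    - NCCL_P2P_DISABLE=1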


samjenks commented Feb 29, 2024

Okay. I assume that was without Docker, so in this thread we are basically troubleshooting "why does NCCL hang in Docker".

I don't know if there is any confusion here, but this thread is really "Does the basic setup of OSS Determined AI need NCCL modifications to prevent hanging when doing distributed training?"

I am a bit unclear on why the assumption is that this was without Docker.
As far as I can tell from Determined AI's documentation, it is a requirement to use the determinedai/environments images for cluster deployment. Is there any way to run a job without a containerized environment setup?

I will investigate the NCCL env variable modifications, but I find it strange that a job runs fine on 6 GPUs, fails on 7-8 GPUs, and the problem would be related to NCCL comms. If that were the case, it feels like the job would also fail on 6 GPUs.
Is there a way to escalate the logging level to see how Determined is sharding the model/data across the slots?


ioga commented Feb 29, 2024

I am a bit unclear on why the assumption is that this was without Docker.

I meant, when you said

Yes, just a normal PyTorch DDP run with the NCCL backend; it worked.

Did you run it in Docker or not? I assumed you did it without Docker, because that's how people usually do it.

I will investigate the NCCL env variable modifications, but I find it strange that a job runs fine on 6 GPUs, fails on 7-8 GPUs, and the problem would be related to NCCL comms. If that were the case, it feels like the job would also fail on 6 GPUs.

The 30-minute NCCL comms timeout is bizarre and unexpected, so I am heavily discounting the param size mismatch that comes after it.

So you are saying that when you set slots_per_trial: 6 for the same example, it doesn't hang and works fine? That is quite strange.

Is there a way to escalate the logging level to see how Determined is sharding the model/data across the slots?

My understanding is that you're running this example: https://github.com/determined-ai/determined/blob/main/examples/tutorials/mnist_pytorch/distributed.yaml and it works fine for me on 8 GPUs.

This is DDP; there's no model sharding at all, since the model is supposed to be replicated across the GPUs. I don't see how that specific issue could be caused by the training code.

That's why the NCCL shared memory settings and NCCL transport options aimed at fixing the 30-minute hang seem like a more promising path forward here for me.

Can you try running other examples, e.g. https://github.com/determined-ai/determined-examples/blob/main/computer_vision/cifar10_pytorch/distributed.yaml but with 8 slots instead of 16? (See the resources override sketch below.)
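A hedged sketch of that override, assuming you edit the example's distributed.yaml before submitting (this is the standard resources section of the experiment config):

resources:
  slots_per_trial: 8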

fails on 7-8 GPUs, and the problem would be related to NCCL comms. If that were the case, it feels like the job would also fail on 6 GPUs

We've seen such symptoms on 8-GPU servers when it ended up being a faulty GPU card that had to be replaced. These are hard to diagnose, so let's try to exhaust the other options first. For this, a possible set of tests could include disabling a couple of working cards in det and then running an experiment that uses the remaining faulty candidate cards.

samjenks (Author) commented:

Hey, thanks for taking the time to help me out with this.

I did some more thorough debugging, and the problem appears to be that two of the GPU cards do not work together. The previous 6-slot experiment worked because those two were not on the same job. I was able to replicate the NCCL hanging behavior in another 6-slot experiment where those two GPUs were used together, and a third ablation test where the problem GPUs were separated worked, replicating the first experiment.

That's why the NCCL shared memory settings and NCCL transport options aimed at fixing the 30-minute hang seem like a more promising path forward here for me.

I tested both the shared memory and transport option env variables, and each individually fixed the problem at 7-8 GPUs. The MNIST code trained to completion.

a possible set of tests could include disabling a couple of working cards in det

I just attempted something similar by removing all of the GPUs from the agent pool except the 2 problem ones, was able to confirm similar NCCL hanging behavior, and am currently attempting the inverse with 7 GPUs.

Does the fact that the shared memory and transport changes work imply faulty communication buses or cards? I guess I don't really understand why both of those env variable changes work.


ioga commented Feb 29, 2024

Okay, good to hear you've tracked it down.

Does the fact that the shared memory and transport changes work imply faulty communication buses or cards? I guess I don't really understand why both of those env variable changes work.

I hoped that if we discovered it works without shared memory and P2P, it would confirm that something is wrong with the intra-node NVLink.

I am not a hardware troubleshooting expert, so I'd refer you to your hardware vendor (or Nvidia) for that. As far as I understand, technicians usually:

  1. Try to unplug the cards and plug them back in.
  2. If that does not work, swap out the faulty card.


samjenks commented Feb 29, 2024

I closed this since my initial query seems to be solved.

But I do have a follow-up: how do I debug the intra-node NVLink? nvidia-smi nvlink -s returns what look like normal results.
It's clear that it now works without shared memory and P2P.

I am also not a hardware troubleshooter, so I'll leave that for last.

Edit: I still need to test GPU 1 and GPU 3's interaction, but this implies it may be a PIX issue (2 and 3 were the problem children).
Screenshot from 2024-02-29 16-03-37

Edit 2: For anyone who finds this at a later date, the issue is that PCIe Access Control Services (ACS) is causing the hanging behavior by routing GPU 3's communications with GPUs 1 and 2 through the CPU, slowing them to a crawl and causing the distributed training to time out. NCCL_P2P_DISABLE=1 forces the GPUs not to use the PCIe interface to talk to each other directly, which can result in a loss of performance. It appears, if there is a will, that one can turn off PCIe ACS via nccl_docs; TBD whether that directly fixes my issue.
