[bug] Couldn't initialize SMDDP on HuggingFace Training Containers #3746

Open
3 of 6 tasks
rohit901 opened this issue Mar 5, 2024 · 1 comment
rohit901 commented Mar 5, 2024

Checklist

Concise Description:
I'm using the following HuggingFace container from here

py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1

I'm using an ml.p4d.24xlarge instance (8x A100) with data-parallel mode enabled.
I'm launching my job with an accelerate script and the following accelerate SageMaker config:
config.yaml:

base_job_name: accelerate-sagemaker-1
compute_environment: AMAZON_SAGEMAKER
debug: false
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p4d.24xlarge
gpu_ids: all
iam_role_name: xxxx
mixed_precision: fp16
num_machines: 1
profile: default
py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1
use_cpu: false
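For reference, when `distributed_type: DATA_PARALLEL` is set, accelerate's SageMaker integration is expected to enable SMDDP through the estimator's `distribution` argument. A minimal sketch of that setting (the nesting follows the SageMaker SDK's documented format; how accelerate assembles it internally is an assumption on my part):

```python
# Hedged sketch: the distribution setting a SageMaker estimator receives when
# SMDDP data parallelism is enabled. The dict shape follows the SageMaker SDK
# docs; accelerate's internal plumbing is assumed, not verified.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

print(distribution["smdistributed"]["dataparallel"]["enabled"])
```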

However, my script fails with an NCCL error.
Initially, I used a different PyTorch version (2.1.0) and hit an error saying smdistributed was not found; I've described that issue here: #3627 (comment)

Now, using a different container version, I'm getting NCCL errors instead.
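For context on the "Couldn't initialize SMDDP" message in the traceback: SMDDP v1.8.0 appears to probe at import time for a private helper (`_check_for_nccl_backend`) inside `torch.distributed.distributed_c10d`, and raises if that symbol changed between PyTorch releases. A hedged, self-contained simulation of that kind of check (the function name comes from the error text; the exact probe SMDDP runs is an assumption):

```python
import types

def smddp_compat_ok(c10d):
    # Hypothetical version of SMDDP's import-time probe: it looks up a private
    # torch.distributed.distributed_c10d helper and refuses to load when the
    # symbol is absent, producing the RuntimeError seen in the logs.
    return getattr(c10d, "_check_for_nccl_backend", None) is not None

# Simulate an older torch where the private hook exists...
old_c10d = types.SimpleNamespace(_check_for_nccl_backend=lambda pg: None)
print(smddp_compat_ok(old_c10d))  # True

# ...and a newer torch where it was removed or renamed.
new_c10d = types.SimpleNamespace()
print(smddp_compat_ok(new_c10d))  # False
```

This is consistent with the "Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None." line below, which suggests a version mismatch between the container's SMDDP build and its PyTorch.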

Snippets from the logs:

[1,mpirank:0,algo-1]<stdout>:Running smdistributed.dataparallel v1.8.0
[1,mpirank:0,algo-1]<stdout>:SMDDP: Single node mode
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
[algo-1:00100] 7 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00100] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
 RuntimeError
 Couldn't initialize SMDDP.
 Expected mechanism for checking for NCCL backend has changed.
 Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
 
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
 _run_code(code, mod_globals, init_globals,
 File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
 main(args)
 File "train_vlcm_distill_lcm_wds.py", line 780, in main
 accelerator = Accelerator(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 361, in __init__
 self.state = AcceleratorState(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 549, in __init__
 PartialState(cpu, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 95, in __init__
 import smdistributed.dataparallel.torch.torch_smddp  # noqa
 File "/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 33, in <module>
 raise RuntimeError("""
 algo-1:133:133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:133:133 [0] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:133:133 [0] NCCL INFO cudaDriverVersion 12020
 NCCL version 2.19.3+cuda12.3
 algo-1:129:205 [7] NCCL INFO cudaDriverVersion 12020
 algo-1:129:205 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:129:205 [7] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:124:209 [4] NCCL INFO cudaDriverVersion 12020
 algo-1:124:209 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:124:209 [4] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:113:113 [1] NCCL INFO cudaDriverVersion 12020
 algo-1:113:113 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:113:113 [1] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:128:197 [6] NCCL INFO cudaDriverVersion 12020
 algo-1:128:197 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:116:211 [2] NCCL INFO cudaDriverVersion 12020
 algo-1:128:197 [6] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:116:211 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:116:211 [2] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:126:199 [5] NCCL INFO cudaDriverVersion 12020
 algo-1:126:199 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:122:122 [3] NCCL INFO cudaDriverVersion 12020
 algo-1:126:199 [5] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:122:122 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:122:122 [3] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:133:133 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:133:133 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:133:133 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:133:133 [0] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:133:133 [0] NCCL INFO Using non-device net plugin version 0
 algo-1:133:133 [0] NCCL INFO Using network AWS Libfabric
 algo-1:133:133 [0] NCCL INFO DMA-BUF is available on GPU device 0
 algo-1:116:211 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:116:211 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:116:211 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:113:113 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:113:113 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:113:113 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:116:211 [2] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:116:211 [2] NCCL INFO Using non-device net plugin version 0
 algo-1:116:211 [2] NCCL INFO Using network AWS Libfabric
 algo-1:113:113 [1] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:113:113 [1] NCCL INFO Using non-device net plugin version 0
 algo-1:113:113 [1] NCCL INFO Using network AWS Libfabric
 algo-1:129:205 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:129:205 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:129:205 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:126:199 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:126:199 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:126:199 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:128:197 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:128:197 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:128:197 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:129:205 [7] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:129:205 [7] NCCL INFO Using non-device net plugin version 0
 algo-1:129:205 [7] NCCL INFO Using network AWS Libfabric
 algo-1:126:199 [5] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:126:199 [5] NCCL INFO Using non-device net plugin version 0
 algo-1:126:199 [5] NCCL INFO Using network AWS Libfabric
 algo-1:128:197 [6] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:128:197 [6] NCCL INFO Using non-device net plugin version 0
 algo-1:128:197 [6] NCCL INFO Using network AWS Libfabric
 algo-1:122:122 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:122:122 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:122:122 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:122:122 [3] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:122:122 [3] NCCL INFO Using non-device net plugin version 0
 algo-1:122:122 [3] NCCL INFO Using network AWS Libfabric
 algo-1:124:209 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:124:209 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:124:209 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:124:209 [4] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:124:209 [4] NCCL INFO Using non-device net plugin version 0
 algo-1:124:209 [4] NCCL INFO Using network AWS Libfabric
 algo-1:116:211 [2] NCCL INFO DMA-BUF is available on GPU device 2
 algo-1:113:113 [1] NCCL INFO DMA-BUF is available on GPU device 1
 algo-1:129:205 [7] NCCL INFO DMA-BUF is available on GPU device 7
 algo-1:126:199 [5] NCCL INFO DMA-BUF is available on GPU device 5
 algo-1:128:197 [6] NCCL INFO DMA-BUF is available on GPU device 6
 algo-1:122:122 [3] NCCL INFO DMA-BUF is available on GPU device 3
 algo-1:124:209 [4] NCCL INFO DMA-BUF is available on GPU device 4
 algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init START
 algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init START
 algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init START
 algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init START
 algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init START
 algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init START
 algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init START
 algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init START
 algo-1:129:205 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:128:197 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:116:211 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:122:122 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:124:209 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:133:133 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:126:199 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:113:113 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:129:205 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
 algo-1:129:205 [7] NCCL INFO NVLS multicast support is not available on dev 7
 algo-1:126:199 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
 algo-1:126:199 [5] NCCL INFO NVLS multicast support is not available on dev 5
 algo-1:128:197 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
 algo-1:128:197 [6] NCCL INFO NVLS multicast support is not available on dev 6
 algo-1:113:113 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
 algo-1:113:113 [1] NCCL INFO NVLS multicast support is not available on dev 1
 algo-1:116:211 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
 algo-1:116:211 [2] NCCL INFO NVLS multicast support is not available on dev 2
 algo-1:124:209 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
 algo-1:124:209 [4] NCCL INFO NVLS multicast support is not available on dev 4
 algo-1:122:122 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
 algo-1:122:122 [3] NCCL INFO NVLS multicast support is not available on dev 3
 algo-1:133:133 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
 algo-1:133:133 [0] NCCL INFO NVLS multicast support is not available on dev 0
 algo-1:133:133 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 algo-1:113:113 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 algo-1:113:113 [1] NCCL INFO P2P Chunksize set to 524288
 algo-1:116:211 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 algo-1:116:211 [2] NCCL INFO P2P Chunksize set to 524288
 algo-1:122:122 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 algo-1:122:122 [3] NCCL INFO P2P Chunksize set to 524288
 algo-1:133:133 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 algo-1:126:199 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 algo-1:126:199 [5] NCCL INFO P2P Chunksize set to 524288
 algo-1:124:209 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 algo-1:124:209 [4] NCCL INFO P2P Chunksize set to 524288
 algo-1:128:197 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 algo-1:128:197 [6] NCCL INFO P2P Chunksize set to 524288
 algo-1:129:205 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 algo-1:129:205 [7] NCCL INFO P2P Chunksize set to 524288
 algo-1:133:133 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 algo-1:133:133 [0] NCCL INFO P2P Chunksize set to 524288
 algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Connected all rings
 algo-1:113:113 [1] NCCL INFO Connected all rings
 algo-1:133:133 [0] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Connected all rings
 algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Connected all trees
 algo-1:133:133 [0] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:133:133 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:133:133 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:113:113 [1] NCCL INFO Connected all trees
 algo-1:113:113 [1] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:113:113 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:113:113 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:116:211 [2] NCCL INFO Connected all trees
 algo-1:116:211 [2] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:116:211 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:116:211 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:129:205 [7] NCCL INFO Connected all trees
 algo-1:129:205 [7] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:129:205 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:129:205 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:122:122 [3] NCCL INFO Connected all trees
 algo-1:122:122 [3] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:122:122 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:122:122 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:124:209 [4] NCCL INFO Connected all trees
 algo-1:124:209 [4] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:124:209 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:124:209 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:128:197 [6] NCCL INFO Connected all trees
 algo-1:128:197 [6] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:126:199 [5] NCCL INFO Connected all trees
 algo-1:128:197 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:128:197 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:126:199 [5] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:126:199 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:126:199 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init COMPLETE
 Running smdistributed.dataparallel v1.8.0
 SMDDP: Single node mode
 algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
 algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/gethostname.cpython-310-x86_64-linux-gnu.so -x NCCL_PROTO=simple -x FI_EFA_USE_DEVICE_RDMA=1 smddprun /opt/conda/bin/python3.10 -m mpi4py train_vlcm_distill_lcm_wds.py --adam_weight_decay 0 --checkpointing_steps 200 --checkpoints_total_limit 10 --dataloader_num_workers 8 --ema_decay 0.95 --enable_xformers_memory_efficient_attention True --gradient_accumulation_steps 1 --gradient_checkpointing True --learning_rate 1e-06 --loss_type huber --max_train_samples 10727607 --max_train_steps 10727607 --mixed_precision fp16 --pretrained_teacher_model damo-vilab/text-to-video-ms-1.7b --resolution 512 --resume_from_checkpoint latest --seed 453645634 --train_batch_size 16 --use_8bit_adam True --validation_steps 200"
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2024-03-05 07:29:19 Uploading - Uploading generated training model
2024-03-05 07:29:19 Failed - Training job failed
Traceback (most recent call last):
  File "/home/rohit.bharadwaj/.conda/envs/LCM/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1021, in launch_command
    sagemaker_launcher(defaults, args)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 840, in sagemaker_launcher
    huggingface_estimator.fit(inputs=sagemaker_inputs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 5568, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7711, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7764, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job accelerate-sagemaker-1-2024-03-05-07-15-53-204: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
 RuntimeError
 Couldn't initialize SMDDP.
 Expected mechanism for checking for NCCL backend has changed.
 Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
 
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py",

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
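The error text above says SMDDP looks for the private helper `_check_for_nccl_backend` inside `torch.distributed.distributed_c10d` and aborts when it is missing. A minimal diagnostic sketch can confirm whether the installed PyTorch build still exposes that hook before a job is launched (the module and attribute names come from the error message; the probe itself is my own illustration, not part of SMDDP):

```python
# Probe whether a module exposes a given attribute, without assuming the
# module is importable. Useful to check if the PyTorch build inside the
# container still has the private hook that SMDDP patches.
import importlib
import importlib.util

def probe_attr(module_name, attr_name):
    """Return True/False if the attribute exists, or None if the top-level
    package is not installed at all."""
    if importlib.util.find_spec(module_name.split(".")[0]) is None:
        return None
    module = importlib.import_module(module_name)
    return getattr(module, attr_name, None) is not None

# On a PyTorch build compatible with this SMDDP release this should print
# True; on the failing build it prints False, matching the error above.
print(probe_attr("torch.distributed.distributed_c10d",
                 "_check_for_nccl_backend"))
```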

Current behavior:
SMDDP fails to initialize and the training job aborts with the NCCL/MPI errors shown in the logs above.

Expected behavior:
The training job should initialize SMDDP successfully and train the model across all 8 GPUs.

Additional context:

rohit901 commented Mar 11, 2024

I think this issue is related to PyTorch version 2.2.0 or CUDA 12.
One of my dependencies (xformers) was forcing the installation of the latest PyTorch version, which caused this issue. I hope this can be fixed.
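Since SMDDP patches PyTorch internals, it only works with the PyTorch build the container ships with, so a dependency silently upgrading `torch` breaks it. A rough pre-flight check along these lines can catch such an upgrade before the job is submitted (the 2.1.0 cutoff is an assumption inferred from the versions discussed in this thread, not a documented limit):

```python
# Hedged sketch of a version guard: compare the installed torch version
# string against an assumed maximum that this SMDDP release supports.

def version_tuple(v):
    # "2.0.0+cu118" -> (2, 0, 0); strip the local build suffix first
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

def smddp_compatible(torch_version, max_supported="2.1.0"):
    """Return True if torch_version is below the assumed supported cap."""
    return version_tuple(torch_version) < version_tuple(max_supported)

print(smddp_compatible("2.0.0+cu118"))  # container's intended build
print(smddp_compatible("2.2.0"))        # version pulled in by a dependency
```

In practice the fix is to pin `torch` to the container's version in the requirements file and install xformers with a matching pinned release, so pip cannot replace the PyTorch build that SMDDP was compiled against.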
