I'm using an ml.p4d.24xlarge instance (8x A100) with data parallel mode enabled.
I'm running my job with an accelerate script and an accelerate SageMaker config:
config.yaml:
However, my script fails with an NCCL error.
Initially, I used a different PyTorch version (2.1.0) and hit an error saying smdistributed was not found. I've described that issue here: #3627 (comment)
Now, using a different container version, I'm getting NCCL errors.
Snippets from the logs:
[1,mpirank:0,algo-1]<stdout>:Running smdistributed.dataparallel v1.8.0
[1,mpirank:0,algo-1]<stdout>:SMDDP: Single node mode
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
[algo-1:00100] 7 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00100] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR Reporting training FAILURE
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
RuntimeError
Couldn't initialize SMDDP.
Expected mechanism for checking for NCCL backend has changed.
Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
run_command_line(args)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
run_path(sys.argv[0], run_name='__main__')
File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
main(args)
File "train_vlcm_distill_lcm_wds.py", line 780, in main
accelerator = Accelerator(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 361, in __init__
self.state = AcceleratorState(
File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 549, in __init__
PartialState(cpu, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 95, in __init__
import smdistributed.dataparallel.torch.torch_smddp # noqa
File "/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 33, in <module>
raise RuntimeError("""
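The RuntimeError above says that smdistributed expected `_check_for_nccl_backend` to be defined in `distributed_c10d` and found None. A minimal sketch for confirming this inside the container, assuming the module path is `torch.distributed.distributed_c10d` (the helper name `module_has_attr` is hypothetical and only for illustration):

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True or False depending on whether the module defines the
    attribute, or None if the module itself cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (for example torch) is not installed in this environment.
        return None
    return getattr(module, attr_name, None) is not None


if __name__ == "__main__":
    # Attribute name taken from the RuntimeError message; the full module
    # path is an assumption based on "distributed_c10d" in that message.
    found = module_has_attr(
        "torch.distributed.distributed_c10d", "_check_for_nccl_backend"
    )
    print("_check_for_nccl_backend present:", found)
```

If this prints False in the training container, it would be consistent with the installed PyTorch build not exposing the symbol that this smdistributed version looks for.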
algo-1:133:133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:133:133 [0] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:133:133 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:133:133 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
algo-1:129:205 [7] NCCL INFO cudaDriverVersion 12020
algo-1:129:205 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:129:205 [7] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:129:205 [7] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:124:209 [4] NCCL INFO cudaDriverVersion 12020
algo-1:124:209 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:124:209 [4] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:113:113 [1] NCCL INFO cudaDriverVersion 12020
algo-1:113:113 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:124:209 [4] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:113:113 [1] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:128:197 [6] NCCL INFO cudaDriverVersion 12020
algo-1:128:197 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:116:211 [2] NCCL INFO cudaDriverVersion 12020
algo-1:128:197 [6] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:116:211 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:116:211 [2] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:113:113 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:126:199 [5] NCCL INFO cudaDriverVersion 12020
algo-1:126:199 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:122:122 [3] NCCL INFO cudaDriverVersion 12020
algo-1:126:199 [5] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:122:122 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:128:197 [6] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:122:122 [3] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:126:199 [5] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:133:133 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:133:133 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:133:133 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:133:133 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:133:133 [0] NCCL INFO Using non-device net plugin version 0
algo-1:133:133 [0] NCCL INFO Using network AWS Libfabric
algo-1:133:133 [0] NCCL INFO DMA-BUF is available on GPU device 0
algo-1:116:211 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:116:211 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:116:211 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:113:113 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:113:113 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:113:113 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:116:211 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:116:211 [2] NCCL INFO Using non-device net plugin version 0
algo-1:116:211 [2] NCCL INFO Using network AWS Libfabric
algo-1:113:113 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:113:113 [1] NCCL INFO Using non-device net plugin version 0
algo-1:113:113 [1] NCCL INFO Using network AWS Libfabric
algo-1:129:205 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:129:205 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:129:205 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:126:199 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:126:199 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:126:199 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:128:197 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:128:197 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:128:197 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:129:205 [7] NCCL INFO NET/OFI Selected Provider is efa
algo-1:129:205 [7] NCCL INFO Using non-device net plugin version 0
algo-1:129:205 [7] NCCL INFO Using network AWS Libfabric
algo-1:126:199 [5] NCCL INFO NET/OFI Selected Provider is efa
algo-1:126:199 [5] NCCL INFO Using non-device net plugin version 0
algo-1:126:199 [5] NCCL INFO Using network AWS Libfabric
algo-1:128:197 [6] NCCL INFO NET/OFI Selected Provider is efa
algo-1:128:197 [6] NCCL INFO Using non-device net plugin version 0
algo-1:128:197 [6] NCCL INFO Using network AWS Libfabric
algo-1:122:122 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:122:122 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:122:122 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:122:122 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:122:122 [3] NCCL INFO Using non-device net plugin version 0
algo-1:122:122 [3] NCCL INFO Using network AWS Libfabric
algo-1:124:209 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:124:209 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:124:209 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:124:209 [4] NCCL INFO NET/OFI Selected Provider is efa
algo-1:124:209 [4] NCCL INFO Using non-device net plugin version 0
algo-1:124:209 [4] NCCL INFO Using network AWS Libfabric
algo-1:116:211 [2] NCCL INFO DMA-BUF is available on GPU device 2
algo-1:113:113 [1] NCCL INFO DMA-BUF is available on GPU device 1
algo-1:129:205 [7] NCCL INFO DMA-BUF is available on GPU device 7
algo-1:126:199 [5] NCCL INFO DMA-BUF is available on GPU device 5
algo-1:128:197 [6] NCCL INFO DMA-BUF is available on GPU device 6
algo-1:122:122 [3] NCCL INFO DMA-BUF is available on GPU device 3
algo-1:124:209 [4] NCCL INFO DMA-BUF is available on GPU device 4
algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init START
algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init START
algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init START
algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init START
algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init START
algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init START
algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init START
algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init START
algo-1:129:205 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:128:197 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:116:211 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:122:122 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:124:209 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:133:133 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:126:199 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:113:113 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:129:205 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
algo-1:129:205 [7] NCCL INFO NVLS multicast support is not available on dev 7
algo-1:126:199 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
algo-1:126:199 [5] NCCL INFO NVLS multicast support is not available on dev 5
algo-1:128:197 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
algo-1:128:197 [6] NCCL INFO NVLS multicast support is not available on dev 6
algo-1:113:113 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
algo-1:113:113 [1] NCCL INFO NVLS multicast support is not available on dev 1
algo-1:116:211 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
algo-1:116:211 [2] NCCL INFO NVLS multicast support is not available on dev 2
algo-1:124:209 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
algo-1:124:209 [4] NCCL INFO NVLS multicast support is not available on dev 4
algo-1:122:122 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
algo-1:122:122 [3] NCCL INFO NVLS multicast support is not available on dev 3
algo-1:133:133 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
algo-1:133:133 [0] NCCL INFO NVLS multicast support is not available on dev 0
algo-1:133:133 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
algo-1:113:113 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
algo-1:113:113 [1] NCCL INFO P2P Chunksize set to 524288
algo-1:116:211 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
algo-1:116:211 [2] NCCL INFO P2P Chunksize set to 524288
algo-1:122:122 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
algo-1:122:122 [3] NCCL INFO P2P Chunksize set to 524288
algo-1:133:133 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
algo-1:126:199 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
algo-1:126:199 [5] NCCL INFO P2P Chunksize set to 524288
algo-1:124:209 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
algo-1:124:209 [4] NCCL INFO P2P Chunksize set to 524288
algo-1:128:197 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
algo-1:128:197 [6] NCCL INFO P2P Chunksize set to 524288
algo-1:129:205 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
algo-1:129:205 [7] NCCL INFO P2P Chunksize set to 524288
algo-1:133:133 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
algo-1:133:133 [0] NCCL INFO P2P Chunksize set to 524288
algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Connected all rings
algo-1:113:113 [1] NCCL INFO Connected all rings
algo-1:133:133 [0] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Connected all rings
algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
algo-1:133:133 [0] NCCL INFO Connected all trees
algo-1:133:133 [0] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:133:133 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:133:133 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:113:113 [1] NCCL INFO Connected all trees
algo-1:113:113 [1] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:113:113 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:113:113 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:116:211 [2] NCCL INFO Connected all trees
algo-1:116:211 [2] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:116:211 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:116:211 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:129:205 [7] NCCL INFO Connected all trees
algo-1:129:205 [7] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:129:205 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:129:205 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:122:122 [3] NCCL INFO Connected all trees
algo-1:122:122 [3] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:122:122 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:122:122 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:124:209 [4] NCCL INFO Connected all trees
algo-1:124:209 [4] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:124:209 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:124:209 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:128:197 [6] NCCL INFO Connected all trees
algo-1:128:197 [6] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:126:199 [5] NCCL INFO Connected all trees
algo-1:128:197 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:128:197 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:126:199 [5] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:126:199 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:126:199 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init COMPLETE
Running smdistributed.dataparallel v1.8.0
SMDDP: Single node mode
algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7"
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/gethostname.cpython-310-x86_64-linux-gnu.so -x NCCL_PROTO=simple -x FI_EFA_USE_DEVICE_RDMA=1 smddprun /opt/conda/bin/python3.10 -m mpi4py train_vlcm_distill_lcm_wds.py --adam_weight_decay 0 --checkpointing_steps 200 --checkpoints_total_limit 10 --dataloader_num_workers 8 --ema_decay 0.95 --enable_xformers_memory_efficient_attention True --gradient_accumulation_steps 1 --gradient_checkpointing True --learning_rate 1e-06 --loss_type huber --max_train_samples 10727607 --max_train_steps 10727607 --mixed_precision fp16 --pretrained_teacher_model damo-vilab/text-to-video-ms-1.7b --resolution 512 --resume_from_checkpoint latest --seed 453645634 --train_batch_size 16 --use_8bit_adam True --validation_steps 200"
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR Encountered exit_code 1
2024-03-05 07:29:19 Uploading - Uploading generated training model
2024-03-05 07:29:19 Failed - Training job failed
Traceback (most recent call last):
File "/home/rohit.bharadwaj/.conda/envs/LCM/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1021, in launch_command
sagemaker_launcher(defaults, args)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 840, in sagemaker_launcher
huggingface_estimator.fit(inputs=sagemaker_inputs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
return run_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 1341, in fit
self.latest_training_job.wait(logs=logs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 2677, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 5568, in logs_for_job
_logs_for_job(self, job_name, wait, poll, log_type, timeout)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7711, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7764, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job accelerate-sagemaker-1-2024-03-05-07-15-53-204: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
RuntimeError
Couldn't initialize SMDDP.
Expected mechanism for checking for NCCL backend has changed.
Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
run_command_line(args)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
run_path(sys.argv[0], run_name='__main__')
File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/conda/lib/python3.10/runpy.py",
I think this issue is related to PyTorch version 2.2.0 or CUDA 12.
One of my dependencies (xformers) was forcing installation of the latest PyTorch version, which caused this issue. I hope this can be fixed.
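A guard like the one below could surface this kind of silent upgrade before SMDDP initialization fails. It is only a sketch: `EXPECTED_TORCH`, `base_version`, `torch_matches`, and `check_torch_pin` are hypothetical names, and the expected version is assumed to be the `pytorch_version` of the container in use.

```python
from importlib import metadata

# Assumption: the torch release shipped with the container in use.
EXPECTED_TORCH = "2.0.0"


def base_version(version: str) -> str:
    """Strip a local build suffix such as '+cu118' from a version string."""
    return version.split("+", 1)[0]


def torch_matches(installed: str, expected: str = EXPECTED_TORCH) -> bool:
    """Return True when the installed torch release equals the expected one."""
    return base_version(installed) == base_version(expected)


def check_torch_pin(expected: str = EXPECTED_TORCH) -> None:
    """Raise early if a dependency replaced the container's torch build."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        raise RuntimeError("torch is not installed in this environment") from None
    if not torch_matches(installed, expected):
        raise RuntimeError(
            f"torch {installed} is installed but {expected} was expected; "
            "a dependency may have upgraded it"
        )
```

Calling `check_torch_pin()` at the top of the training script, before any distributed setup, would turn the NCCL/SMDDP failure into a direct message about the version mismatch.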
Checklist
Concise Description:
I'm using the following HuggingFace container:
py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1
I'm using the instance ml.p4d.24xlarge (8x A100) and have enabled data parallel mode.
I'm running my job using the accelerate script and an accelerate SageMaker config:
config.yaml:
However, my script fails with an NCCL error.
Initially, I used a different PyTorch version (2.1.0) and hit an error saying smdistributed was not found; I described that issue here: #3627 (comment)
Now, using a different container version, I'm getting NCCL errors.
snippets from the logs:
DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
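The image tag already records the versions the container was built with, so it can be turned into pip constraints that stop a dependency from replacing them. The sketch below assumes the tag layout seen in this one image name; `TAG_PATTERN`, `parse_image_tag`, and `constraints_from_image` are hypothetical helpers, not part of any SageMaker or accelerate API.

```python
import re

IMAGE = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-training:"
    "2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"
)

# Assumption: tags follow <torch>-transformers<ver>-gpu-<py>-<cuda>-<os>.
TAG_PATTERN = re.compile(
    r"^(?P<torch>\d+\.\d+\.\d+)"
    r"-transformers(?P<transformers>\d+\.\d+\.\d+)"
    r"-gpu-(?P<py>py\d+)-(?P<cuda>cu\d+)-"
)


def parse_image_tag(image: str) -> dict:
    """Extract the torch, transformers, Python, and CUDA versions from the tag."""
    tag = image.rsplit(":", 1)[1]
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag layout: {tag}")
    return match.groupdict()


def constraints_from_image(image: str) -> str:
    """Build the text of a pip constraints file pinning the container's versions."""
    info = parse_image_tag(image)
    return f"torch=={info['torch']}\ntransformers=={info['transformers']}\n"


if __name__ == "__main__":
    print(parse_image_tag(IMAGE))
    print(constraints_from_image(IMAGE), end="")
```

Saving the generated text as `constraints.txt` and installing extra packages with `pip install -c constraints.txt xformers` makes pip report a conflict instead of upgrading torch. Whether an xformers release compatible with torch 2.0.0 is available is not confirmed here.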
Current behavior:
shown in the logs above
Expected behavior:
Model should train on multiple GPUs.
Additional context: