
NCCL all_reduce_perf errors with 5090s #287

Open
RCS1 opened this issue Feb 19, 2025 · 9 comments

@RCS1

RCS1 commented Feb 19, 2025

The all_reduce_perf test errors out when using dual 5090 GPUs; it works fine with a single 5090.

Using NVIDIA driver 570.86.16.

1x 5090:

user@GENOA-06:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5092 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     5.63    0.00    0.00      0     0.20    0.04    0.00      0
          16             4     float     sum      -1     5.60    0.00    0.00      0     0.19    0.08    0.00      0
          32             8     float     sum      -1     5.30    0.01    0.00      0     0.18    0.17    0.00      0
          64            16     float     sum      -1     5.37    0.01    0.00      0     0.15    0.44    0.00      0
         128            32     float     sum      -1     4.61    0.03    0.00      0     0.15    0.86    0.00      0
         256            64     float     sum      -1     4.62    0.06    0.00      0     0.15    1.75    0.00      0
         512           128     float     sum      -1     4.59    0.11    0.00      0     0.15    3.46    0.00      0
        1024           256     float     sum      -1     4.60    0.22    0.00      0     0.14    7.09    0.00      0
        2048           512     float     sum      -1     6.21    0.33    0.00      0     0.15   14.03    0.00      0
        4096          1024     float     sum      -1     4.61    0.89    0.00      0     0.14   28.35    0.00      0
        8192          2048     float     sum      -1     4.57    1.79    0.00      0     0.15   56.30    0.00      0
       16384          4096     float     sum      -1     4.62    3.54    0.00      0     0.15  112.22    0.00      0
       32768          8192     float     sum      -1     4.57    7.18    0.00      0     0.15  219.18    0.00      0
       65536         16384     float     sum      -1     2.93   22.37    0.00      0     0.07  910.22    0.00      0
      131072         32768     float     sum      -1     2.91   45.07    0.00      0     0.07  1807.89    0.00      0
      262144         65536     float     sum      -1     2.95   88.98    0.00      0     0.09  2995.93    0.00      0
      524288        131072     float     sum      -1     3.10  169.24    0.00      0     0.08  6853.44    0.00      0
     1048576        262144     float     sum      -1     3.53  296.68    0.00      0     0.07  14563.56    0.00      0
     2097152        524288     float     sum      -1     3.64  576.23    0.00      0     0.07  28926.23    0.00      0
     4194304       1048576     float     sum      -1     5.30  791.69    0.00      0     0.09  48489.06    0.00      0
     8388608       2097152     float     sum      -1     9.83  853.29    0.00      0     0.08  109655.01    0.00      0
    16777216       4194304     float     sum      -1    19.44  863.04    0.00      0     0.07  226719.14    0.00      0
    33554432       8388608     float     sum      -1    41.36  811.21    0.00      0     0.07  453438.27    0.00      0
    67108864      16777216     float     sum      -1    85.69  783.15    0.00      0     0.08  818400.78    0.00      0
   134217728      33554432     float     sum      -1    175.6  764.34    0.00      0     0.08  1777718.25    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

2x 5090s:

user@GENOA-06:~/nccl-tests$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5068 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#  Rank  1 Group  0 Pid   5068 on   GENOA-06 device  1 [0x61] NVIDIA GeForce RTX 5090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
GENOA-06: Test NCCL failure common.cu:392 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. GENOA-06 pid 5068: Test failure common.cu:590
 .. GENOA-06 pid 5068: Test failure all_reduce.cu:90
 .. GENOA-06 pid 5068: Test failure common.cu:623
 .. GENOA-06 pid 5068: Test failure common.cu:1078
 .. GENOA-06 pid 5068: Test failure common.cu:891

2x 5090s with NCCL_DEBUG=INFO:

user@GENOA-06:~/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   4766 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#  Rank  1 Group  0 Pid   4766 on   GENOA-06 device  1 [0x61] NVIDIA GeForce RTX 5090
GENOA-06:4766:4766 [0] NCCL INFO Bootstrap: Using enp227s0f0:192.168.1.250<0>
GENOA-06:4766:4766 [0] NCCL INFO cudaDriverVersion 12080
GENOA-06:4766:4766 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
GENOA-06:4766:4780 [0] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856
GENOA-06:4766:4781 [1] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856
GENOA-06:4766:4780 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
GENOA-06:4766:4780 [0] NCCL INFO Failed to open libibverbs.so[.1]
GENOA-06:4766:4780 [0] NCCL INFO NET/Socket : Using [0]enp227s0f0:192.168.1.250<0> [1]enxfe1dc2e063ea:fe80::52b1:b1f3:51c0:ecca%enxfe1dc2e063ea<0>
GENOA-06:4766:4780 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
GENOA-06:4766:4780 [0] NCCL INFO Using network Socket
GENOA-06:4766:4781 [1] NCCL INFO Using network Socket
GENOA-06:4766:4781 [1] NCCL INFO ncclCommInitAll comm 0x639f4c7a4290 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x726fb0d348b9154b - Init START
GENOA-06:4766:4780 [0] NCCL INFO ncclCommInitAll comm 0x639f4c724ac0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 21000 commId 0x726fb0d348b9154b - Init START
GENOA-06:4766:4781 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
GENOA-06:4766:4781 [1] NCCL INFO Bootstrap timings total 0.001151 (create 0.000048, send 0.000163, recv 0.000423, ring 0.000026, delay 0.000000)
GENOA-06:4766:4780 [0] NCCL INFO Bootstrap timings total 0.001101 (create 0.000034, send 0.000134, recv 0.000570, ring 0.000023, delay 0.000000)
GENOA-06:4766:4781 [1] NCCL INFO Setting affinity for GPU 1 to 0fff,c0000000,00000000,00000000,0fffc000

GENOA-06:4766:4781 [1] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 4, falling back to simple order

GENOA-06:4766:4781 [1] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order
GENOA-06:4766:4780 [0] NCCL INFO Setting affinity for GPU 0 to 03fff000,00000000,00000000,000003ff,f0000000

GENOA-06:4766:4780 [0] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 4, falling back to simple order

GENOA-06:4766:4780 [0] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order
GENOA-06:4766:4780 [0] NCCL INFO comm 0x639f4c724ac0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
GENOA-06:4766:4781 [1] NCCL INFO comm 0x639f4c7a4290 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
GENOA-06:4766:4780 [0] NCCL INFO Channel 00/02 : 0 1
GENOA-06:4766:4780 [0] NCCL INFO Channel 01/02 : 0 1
GENOA-06:4766:4781 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
GENOA-06:4766:4781 [1] NCCL INFO P2P Chunksize set to 131072
GENOA-06:4766:4780 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
GENOA-06:4766:4780 [0] NCCL INFO P2P Chunksize set to 131072
GENOA-06:4766:4780 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 1
GENOA-06:4766:4785 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 29
GENOA-06:4766:4783 [0] NCCL INFO [Proxy Service] Device 0 CPU core 149
GENOA-06:4766:4784 [1] NCCL INFO [Proxy Service] Device 1 CPU core 22
GENOA-06:4766:4786 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 127
GENOA-06:4766:4780 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
GENOA-06:4766:4780 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
GENOA-06:4766:4781 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
GENOA-06:4766:4781 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
GENOA-06:4766:4780 [0] NCCL INFO CC Off, workFifoBytes 1048576
GENOA-06:4766:4780 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
GENOA-06:4766:4780 [0] NCCL INFO ncclCommInitAll comm 0x639f4c724ac0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 21000 commId 0x726fb0d348b9154b - Init COMPLETE
GENOA-06:4766:4780 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 2 total 0.29 (kernels 0.26, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
GENOA-06:4766:4781 [1] NCCL INFO ncclCommInitAll comm 0x639f4c7a4290 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x726fb0d348b9154b - Init COMPLETE
GENOA-06:4766:4781 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 2 total 0.29 (kernels 0.26, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
GENOA-06:4766:4787 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
GENOA-06:4766:4788 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
GENOA-06:4766:4787 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
GENOA-06:4766:4788 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
GENOA-06:4766:4787 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
GENOA-06:4766:4788 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1

GENOA-06:4766:4766 [1] enqueue.cc:1500 NCCL WARN Cuda failure 1 'invalid argument'
GENOA-06:4766:4766 [1] NCCL INFO group.cc:242 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:470 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:573 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:106 -> 1
GENOA-06: Test NCCL failure common.cu:392 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. GENOA-06 pid 4766: Test failure common.cu:590
 .. GENOA-06 pid 4766: Test failure all_reduce.cu:90
 .. GENOA-06 pid 4766: Test failure common.cu:623
 .. GENOA-06 pid 4766: Test failure common.cu:1078
 .. GENOA-06 pid 4766: Test failure common.cu:891

I've also tried adding NCCL_P2P_DISABLE=1, with the same results.

@kiskra-nvidia
Member

It looks to me like there is no P2P connectivity between the two GPUs, though there should be. So NCCL_P2P_DISABLE=1 probably doesn't do anything, because P2P is already effectively disabled. Have you followed the steps outlined in https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct? In particular, I recommend first running https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest.
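For reference, the sample can be built and run roughly like this (a sketch assuming the Makefile-based cuda-samples layout; newer releases of that repo use CMake, so the build step may differ):

```shell
# Build and run the p2pBandwidthLatencyTest CUDA sample.
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
```

The P2P Connectivity Matrix in its output shows directly whether each device pair can access each other's memory.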

@sjeaugey
Member

GeForce cards do not support P2P.

The CUDA failure seems to happen when we launch the kernel.

I see this line in the log:

GENOA-06:4766:4780 [0] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856

Perhaps that's the reason for the launch error?

@RCS1 could you try to remove this line (line 488 of src/include/device.h):
https://github.com/NVIDIA/nccl/blob/master/src/include/device.h#L488
and see if it fixes the issue?

@RCS1
Author

RCS1 commented Feb 20, 2025

Hi sjeaugey,

Thanks for your reply.

I'm unsure where that file is located; that doesn't seem to be a directory I have.

@RCS1
Author

RCS1 commented Feb 20, 2025

Results of p2pBandwidthLatencyTest:

user@GENOA-05:~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA Graphics Device, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA Graphics Device, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     0
     1	     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1516.99  38.24 
     1  38.92 1537.94 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 1509.66  38.20 
     1  38.66 1536.43 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1528.05  43.03 
     1  42.78 1540.10 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1525.06  43.19 
     1  42.78 1539.36 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   2.10  12.83 
     1  12.75   2.08 

   CPU     0      1 
     0   2.11   5.91 
     1   5.86   2.04 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   2.09  12.76 
     1  12.87   2.08 

   CPU     0      1 
     0   2.13   5.67 
     1   5.64   2.20 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

@RCS1
Author

RCS1 commented Feb 20, 2025

Also to note:

The NVIDIA 570 (closed) drivers do not recognize 5090s (unsure as to why), so I am currently using 570-open.

Could this be causing issues?

@AddyLaddy
Collaborator

The ncclMaxSharedMem message is likely the issue here. That should have been a WARN/exit; I've fixed it for the next release. On these systems you'll just have to reduce the amount of shared memory NCCL requests, with this change:

diff --git a/src/include/device.h b/src/include/device.h
index f4dfcf219..0763a579a 100644
--- a/src/include/device.h
+++ b/src/include/device.h
@@ -474,7 +474,7 @@ __host__ __device__ constexpr int ncclCalcUnroll(int bytePerPack, int insns, int
 
 __host__ __device__ constexpr int ncclCollUnroll(int cudaArch = NCCL_CUDA_ARCH) {
   // Our collective unroll should move to the same bytes&insns model as NVLS.
-  return cudaArch >= 800 ? 8 : 4;
+  return cudaArch >= 800 ? (cudaArch == 1200 ? 6 : 8) : 4;
 }
 
 __host__ __device__ constexpr int ncclNvlsUnrollBytes(int cudaArch = NCCL_CUDA_ARCH) { return 4*16; }

@RCS1
Author

RCS1 commented Feb 20, 2025

Is there a way I can adjust the test command to get around this and see performance? Thank you!

@AddyLaddy
Collaborator

> Is there a way I can adjust the test command to get around this and see performance? Thank you!

No, you need to modify the NCCL library in order for the CUDA kernels to work on these GPU SKUs. It will be fixed in NCCL 2.26.x.

@sjeaugey
Member

@RCS1 I was suggesting that you check out the NCCL source code, delete line 488 of src/include/device.h, then recompile NCCL and use the newly built library. You can refer to the README for instructions on how to build NCCL; it's pretty straightforward.
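The checkout-and-rebuild flow above can be sketched as follows, assuming the standard build steps from the NCCL and nccl-tests READMEs (the paths and the sm_120 gencode are illustrative; adjust to your setup):

```shell
# Fetch the NCCL source and build it with the local edit applied.
git clone https://github.com/NVIDIA/nccl.git
cd nccl
# ... edit src/include/device.h here (delete the line in question,
#     or apply the unroll patch from the earlier comment) ...
make -j src.build NVCC_GENCODE="-gencode=arch=compute_120,code=sm_120"

# Rebuild nccl-tests against the freshly built library and run with it.
cd ../nccl-tests
make NCCL_HOME=../nccl/build
LD_LIBRARY_PATH=../nccl/build/lib ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

Restricting NVCC_GENCODE to the one architecture you need keeps the compile time down.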

Now maybe my change wouldn't work, and the patch above would work better.
