
NCCL all_reduce_perf errors with 5090s #287

Open
RCS1 opened this issue Feb 19, 2025 · 9 comments

@RCS1

RCS1 commented Feb 19, 2025

The all_reduce_perf test errors out when using dual 5090 GPUs; it works fine with a single 5090.

Using NVIDIA driver 570.86.16.

1x 5090:

user@GENOA-06:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5092 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     5.63    0.00    0.00      0     0.20    0.04    0.00      0
          16             4     float     sum      -1     5.60    0.00    0.00      0     0.19    0.08    0.00      0
          32             8     float     sum      -1     5.30    0.01    0.00      0     0.18    0.17    0.00      0
          64            16     float     sum      -1     5.37    0.01    0.00      0     0.15    0.44    0.00      0
         128            32     float     sum      -1     4.61    0.03    0.00      0     0.15    0.86    0.00      0
         256            64     float     sum      -1     4.62    0.06    0.00      0     0.15    1.75    0.00      0
         512           128     float     sum      -1     4.59    0.11    0.00      0     0.15    3.46    0.00      0
        1024           256     float     sum      -1     4.60    0.22    0.00      0     0.14    7.09    0.00      0
        2048           512     float     sum      -1     6.21    0.33    0.00      0     0.15   14.03    0.00      0
        4096          1024     float     sum      -1     4.61    0.89    0.00      0     0.14   28.35    0.00      0
        8192          2048     float     sum      -1     4.57    1.79    0.00      0     0.15   56.30    0.00      0
       16384          4096     float     sum      -1     4.62    3.54    0.00      0     0.15  112.22    0.00      0
       32768          8192     float     sum      -1     4.57    7.18    0.00      0     0.15  219.18    0.00      0
       65536         16384     float     sum      -1     2.93   22.37    0.00      0     0.07  910.22    0.00      0
      131072         32768     float     sum      -1     2.91   45.07    0.00      0     0.07  1807.89    0.00      0
      262144         65536     float     sum      -1     2.95   88.98    0.00      0     0.09  2995.93    0.00      0
      524288        131072     float     sum      -1     3.10  169.24    0.00      0     0.08  6853.44    0.00      0
     1048576        262144     float     sum      -1     3.53  296.68    0.00      0     0.07  14563.56    0.00      0
     2097152        524288     float     sum      -1     3.64  576.23    0.00      0     0.07  28926.23    0.00      0
     4194304       1048576     float     sum      -1     5.30  791.69    0.00      0     0.09  48489.06    0.00      0
     8388608       2097152     float     sum      -1     9.83  853.29    0.00      0     0.08  109655.01    0.00      0
    16777216       4194304     float     sum      -1    19.44  863.04    0.00      0     0.07  226719.14    0.00      0
    33554432       8388608     float     sum      -1    41.36  811.21    0.00      0     0.07  453438.27    0.00      0
    67108864      16777216     float     sum      -1    85.69  783.15    0.00      0     0.08  818400.78    0.00      0
   134217728      33554432     float     sum      -1    175.6  764.34    0.00      0     0.08  1777718.25    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

2x 5090s:

user@GENOA-06:~/nccl-tests$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5068 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#  Rank  1 Group  0 Pid   5068 on   GENOA-06 device  1 [0x61] NVIDIA GeForce RTX 5090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
GENOA-06: Test NCCL failure common.cu:392 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. GENOA-06 pid 5068: Test failure common.cu:590
 .. GENOA-06 pid 5068: Test failure all_reduce.cu:90
 .. GENOA-06 pid 5068: Test failure common.cu:623
 .. GENOA-06 pid 5068: Test failure common.cu:1078
 .. GENOA-06 pid 5068: Test failure common.cu:891

2x 5090s with NCCL_DEBUG=INFO:

user@GENOA-06:~/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   4766 on   GENOA-06 device  0 [0x21] NVIDIA GeForce RTX 5090
#  Rank  1 Group  0 Pid   4766 on   GENOA-06 device  1 [0x61] NVIDIA GeForce RTX 5090
GENOA-06:4766:4766 [0] NCCL INFO Bootstrap: Using enp227s0f0:192.168.1.250<0>
GENOA-06:4766:4766 [0] NCCL INFO cudaDriverVersion 12080
GENOA-06:4766:4766 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
GENOA-06:4766:4780 [0] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856
GENOA-06:4766:4781 [1] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856
GENOA-06:4766:4780 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
GENOA-06:4766:4780 [0] NCCL INFO Failed to open libibverbs.so[.1]
GENOA-06:4766:4780 [0] NCCL INFO NET/Socket : Using [0]enp227s0f0:192.168.1.250<0> [1]enxfe1dc2e063ea:fe80::52b1:b1f3:51c0:ecca%enxfe1dc2e063ea<0>
GENOA-06:4766:4780 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
GENOA-06:4766:4780 [0] NCCL INFO Using network Socket
GENOA-06:4766:4781 [1] NCCL INFO Using network Socket
GENOA-06:4766:4781 [1] NCCL INFO ncclCommInitAll comm 0x639f4c7a4290 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x726fb0d348b9154b - Init START
GENOA-06:4766:4780 [0] NCCL INFO ncclCommInitAll comm 0x639f4c724ac0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 21000 commId 0x726fb0d348b9154b - Init START
GENOA-06:4766:4781 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
GENOA-06:4766:4781 [1] NCCL INFO Bootstrap timings total 0.001151 (create 0.000048, send 0.000163, recv 0.000423, ring 0.000026, delay 0.000000)
GENOA-06:4766:4780 [0] NCCL INFO Bootstrap timings total 0.001101 (create 0.000034, send 0.000134, recv 0.000570, ring 0.000023, delay 0.000000)
GENOA-06:4766:4781 [1] NCCL INFO Setting affinity for GPU 1 to 0fff,c0000000,00000000,00000000,0fffc000

GENOA-06:4766:4781 [1] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 4, falling back to simple order

GENOA-06:4766:4781 [1] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order
GENOA-06:4766:4780 [0] NCCL INFO Setting affinity for GPU 0 to 03fff000,00000000,00000000,000003ff,f0000000

GENOA-06:4766:4780 [0] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 4, falling back to simple order

GENOA-06:4766:4780 [0] graph/search.cc:1135 NCCL WARN Could not find a path for pattern 1, falling back to simple order
GENOA-06:4766:4780 [0] NCCL INFO comm 0x639f4c724ac0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
GENOA-06:4766:4781 [1] NCCL INFO comm 0x639f4c7a4290 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
GENOA-06:4766:4780 [0] NCCL INFO Channel 00/02 : 0 1
GENOA-06:4766:4780 [0] NCCL INFO Channel 01/02 : 0 1
GENOA-06:4766:4781 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
GENOA-06:4766:4781 [1] NCCL INFO P2P Chunksize set to 131072
GENOA-06:4766:4780 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
GENOA-06:4766:4780 [0] NCCL INFO P2P Chunksize set to 131072
GENOA-06:4766:4780 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 1
GENOA-06:4766:4785 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 29
GENOA-06:4766:4783 [0] NCCL INFO [Proxy Service] Device 0 CPU core 149
GENOA-06:4766:4784 [1] NCCL INFO [Proxy Service] Device 1 CPU core 22
GENOA-06:4766:4786 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 127
GENOA-06:4766:4780 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
GENOA-06:4766:4780 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
GENOA-06:4766:4781 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
GENOA-06:4766:4781 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
GENOA-06:4766:4780 [0] NCCL INFO CC Off, workFifoBytes 1048576
GENOA-06:4766:4780 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
GENOA-06:4766:4780 [0] NCCL INFO ncclCommInitAll comm 0x639f4c724ac0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 21000 commId 0x726fb0d348b9154b - Init COMPLETE
GENOA-06:4766:4780 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 2 total 0.29 (kernels 0.26, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
GENOA-06:4766:4781 [1] NCCL INFO ncclCommInitAll comm 0x639f4c7a4290 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x726fb0d348b9154b - Init COMPLETE
GENOA-06:4766:4781 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 2 total 0.29 (kernels 0.26, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
GENOA-06:4766:4787 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
GENOA-06:4766:4788 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
GENOA-06:4766:4787 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
GENOA-06:4766:4788 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
GENOA-06:4766:4787 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
GENOA-06:4766:4788 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1

GENOA-06:4766:4766 [1] enqueue.cc:1500 NCCL WARN Cuda failure 1 'invalid argument'
GENOA-06:4766:4766 [1] NCCL INFO group.cc:242 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:470 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:573 -> 1
GENOA-06:4766:4766 [1] NCCL INFO group.cc:106 -> 1
GENOA-06: Test NCCL failure common.cu:392 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. GENOA-06 pid 4766: Test failure common.cu:590
 .. GENOA-06 pid 4766: Test failure all_reduce.cu:90
 .. GENOA-06 pid 4766: Test failure common.cu:623
 .. GENOA-06 pid 4766: Test failure common.cu:1078
 .. GENOA-06 pid 4766: Test failure common.cu:891

I've also tried adding NCCL_P2P_DISABLE=1, with the same results.

@kiskra-nvidia
Member

It looks to me like there is no P2P connectivity between the two GPUs, though there should be. So NCCL_P2P_DISABLE=1 probably doesn't do anything, because P2P is already effectively disabled. Have you followed the steps outlined in https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct? In particular, I recommend first running https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest.
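For reference, the sample can be built and run roughly like this (a sketch assuming the Makefile-based cuda-samples layout; newer releases of that repo use CMake, so the build step may differ):

```shell
# Build and run the p2pBandwidthLatencyTest CUDA sample.
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
```

The P2P Connectivity Matrix in its output shows directly whether each device pair can access each other's memory.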

@sjeaugey
Member

GeForce cards do not support P2P.

The CUDA failure seems to happen when we launch the kernel.

I see this line in the log:

GENOA-06:4766:4780 [0] NCCL INFO ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856

Perhaps that's the reason for the launch error?

@RCS1 could you try to remove this line (line 488 of src/include/device.h):
https://github.com/NVIDIA/nccl/blob/master/src/include/device.h#L488
and see if it fixes the issue?

@RCS1
Author

RCS1 commented Feb 20, 2025

Hi sjeaugey,

Thanks for your reply.

I'm unsure where that file is located; that doesn't seem to be a directory I have.

@RCS1
Author

RCS1 commented Feb 20, 2025

Results of p2pBandwidthLatencyTest:

user@GENOA-05:~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA Graphics Device, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA Graphics Device, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     0
     1	     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1516.99  38.24 
     1  38.92 1537.94 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 1509.66  38.20 
     1  38.66 1536.43 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1528.05  43.03 
     1  42.78 1540.10 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1525.06  43.19 
     1  42.78 1539.36 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   2.10  12.83 
     1  12.75   2.08 

   CPU     0      1 
     0   2.11   5.91 
     1   5.86   2.04 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   2.09  12.76 
     1  12.87   2.08 

   CPU     0      1 
     0   2.13   5.67 
     1   5.64   2.20 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

@RCS1
Author

RCS1 commented Feb 20, 2025

Also to note:

The NVIDIA 570 (closed) drivers do not recognize 5090s (unsure as to why), so I am currently using 570-open.

Could this be causing issues?

@AddyLaddy
Collaborator

The ncclMaxSharedMem message is likely the issue here. That should have been a WARN/exit; I've fixed it for the next release. On these systems you'll just have to reduce the amount of shared memory NCCL requests, with this change:

diff --git a/src/include/device.h b/src/include/device.h
index f4dfcf219..0763a579a 100644
--- a/src/include/device.h
+++ b/src/include/device.h
@@ -474,7 +474,7 @@ __host__ __device__ constexpr int ncclCalcUnroll(int bytePerPack, int insns, int
 
 __host__ __device__ constexpr int ncclCollUnroll(int cudaArch = NCCL_CUDA_ARCH) {
   // Our collective unroll should move to the same bytes&insns model as NVLS.
-  return cudaArch >= 800 ? 8 : 4;
+  return cudaArch >= 800 ? (cudaArch == 1200 ? 6 : 8) : 4;
 }
 
 __host__ __device__ constexpr int ncclNvlsUnrollBytes(int cudaArch = NCCL_CUDA_ARCH) { return 4*16; }

@RCS1
Author

RCS1 commented Feb 20, 2025

Is there a way I can adjust the test command to get around this and see performance? Thank you!

@AddyLaddy
Collaborator

> Is there a way I can adjust the test command to get around this and see performance? Thank you!

No, you need to modify the NCCL library in order for the CUDA kernels to work on these GPU SKUs. It will be fixed in NCCL 2.26.x.

@sjeaugey
Member

@RCS1 I was suggesting that you check out the NCCL source code, delete line 488 of src/include/device.h, then recompile NCCL and use the newly built library. You can refer to the README for instructions on how to build NCCL; it's pretty straightforward.
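The checkout-and-rebuild flow above can be sketched as follows, assuming the standard build steps from the NCCL and nccl-tests READMEs (the paths and the sm_120 gencode are illustrative; adjust to your setup):

```shell
# Fetch the NCCL source and build it with the local edit applied.
git clone https://github.com/NVIDIA/nccl.git
cd nccl
# ... edit src/include/device.h here (delete the line in question,
#     or apply the unroll patch from the earlier comment) ...
make -j src.build NVCC_GENCODE="-gencode=arch=compute_120,code=sm_120"

# Rebuild nccl-tests against the freshly built library and run with it.
cd ../nccl-tests
make NCCL_HOME=../nccl/build
LD_LIBRARY_PATH=../nccl/build/lib ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

Restricting NVCC_GENCODE to the one architecture you need keeps the compile time down.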

Now maybe my change wouldn't work, and the patch above would work better.
