Support for MLNX OFED LTS 4.9-4.1.7.0 #2

Closed
joaquintorres opened this issue Mar 18, 2022 · 8 comments

@joaquintorres
Contributor

Hi! Thanks for your work on this script. I tested it on a more recent version and it worked like a charm.

I was wondering if it would be possible to apply the patch to the LTS version of MLNX OFED. I'm working with some really outdated/EOL hardware (ConnectX interfaces), and I could only get it to work under the LTS release and not the 5.x versions, given the drop in mlx4 support. I don't follow the full extent of what the script is doing, but if I understood correctly, what changes with each version is how CMakeLists is populated and the version of rdma-core, right? Or is there some significant difference in 4.x that I'm missing?

Some relevant outputs from my setup:

(MLNX_OFED_LINUX-4.9-4.1.7.0-rhel8.5-x86_64.iso)

# dnf check all
libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 has missing requires of libefa.so.1()(64bit)
libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 has missing requires of libefa.so.1(EFA_1.1)(64bit)
libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 has missing requires of libibverbs.so.1(IBVERBS_1.6)(64bit)
libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 has missing requires of librdmacm.so.1(RDMACM_1.2)(64bit)
mpich-ofi-gnu9-ohpc-3.4.2-3.1.ohpc.2.4.x86_64 has missing requires of libefa.so.1()(64bit)
mvapich2-gnu9-ohpc-2.3.6-4.1.ohpc.2.4.x86_64 has missing requires of libibmad.so.5()(64bit)
mvapich2-gnu9-ohpc-2.3.6-4.1.ohpc.2.4.x86_64 has missing requires of libibmad.so.5(IBMAD_1.3)(64bit)
Error: Check discovered 7 problem(s)

(Notably libibmad.so.5 is also causing some issues)
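To see at a glance which libraries are actually missing, the report above can be condensed to one line per library. This is only a sketch: the function name `missing_requires_summary` is made up for illustration, and it assumes the "has missing requires of" wording shown in the output above.

```shell
# Summarize a `dnf check` report read on stdin: print each missing library
# once, prefixed by the number of requirement lines that mention it.
missing_requires_summary() {
    awk '/has missing requires of/ {
             lib = $NF              # e.g. libefa.so.1(EFA_1.1)(64bit)
             sub(/\(.*/, "", lib)   # drop the "(...)" suffixes
             count[lib]++
         }
         END { for (lib in count) print count[lib], lib }' | sort -k2
}
```

It could be fed with something like `dnf check all 2>&1 | missing_requires_summary`.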

# dnf list | grep rdma
librdmacm.x86_64                                                                   41mlnx1-OFED.4.7.3.0.6.49417                      @System         
librdmacm-devel.x86_64                                                             41mlnx1-OFED.4.7.3.0.6.49417                      @System         
librdmacm-utils.x86_64                                                             41mlnx1-OFED.4.7.3.0.6.49417                      @System         
ucx-rdmacm.x86_64                                                                  1.8.0-1.49417                                     @System         
glusterfs-rdma.x86_64                                                              6.0-56.4.el8                                      baseos          
librdmacm.i686                                                                     35.0-1.el8                                        baseos          
rdma-core.i686                                                                     35.0-1.el8                                        baseos          
rdma-core.x86_64                                                                   35.0-1.el8                                        baseos          
rdma-core-devel.i686                                                               35.0-1.el8                                        baseos          
rdma-core-devel.x86_64                                                             35.0-1.el8                                        baseos          
ucx-rdmacm.x86_64                                                                  1.10.1-2.el8                                      appstream       
ucx-rdmacm-ohpc.x86_64                                                             1.11.2-3.2.ohpc.2.4                               OpenHPC-updates 
ucx-rdmacm-ohpc.aarch64                                                            1.11.2-3.4.ohpc.2.4                               OpenHPC-updates 

(From what I understand the first lines are the ones relevant to the script)

# lspci -nn | grep Mellanox
04:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-DI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 2.5GT/s In... (rev a0)

I can try patching things myself, but I don't want to cause further breakage without fully understanding how the patch works. I'm not sure whether installing 5.5 and applying the patch would help support my card; if possible, I'd like to avoid uninstalling something that's already working. Any help would be greatly appreciated.

Regards,
Joaquín Torres
HPC Cluster SysAdmin
e-mail: [email protected]
Laboratorio de Simulación, Modelado y Diseño Computacional
Centro Atómico Constituyentes, Comisión Nacional de Energía Atómica
Av. Gral. Paz 1499, B1650 Villa Maipú
Provincia de Buenos Aires

@viniciusferrao
Owner

viniciusferrao commented Mar 18, 2022

Hi @joaquintorres, I can take a look at MLNX OFED 4.9. The new LTS version is 5.4, by the way.

Can you provide the list of modules loaded on the system with this ConnectX card? If it's mlx4, it should be supported by this patched version. If it's mthca, this patch will not work.

Please let me know, so I can help you with that.
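For a quick answer, a small helper like the one below could classify the `lsmod` output. It is only a sketch: the function name `mlx_driver_family` is made up for illustration, and it assumes the relevant kernel modules are named `mlx4_core` and `ib_mthca`.

```shell
# Classify the Mellanox driver family from `lsmod` output read on stdin.
# Prints "mlx4", "mthca", or "none".
# Assumed module names: mlx4_core (mlx4 family), ib_mthca (mthca family).
mlx_driver_family() {
    awk '$1 == "mlx4_core" { mlx4 = 1 }
         $1 == "ib_mthca"  { mthca = 1 }
         END {
             if (mlx4)       print "mlx4"
             else if (mthca) print "mthca"
             else            print "none"
         }'
}
```

Usage would be `lsmod | mlx_driver_family`. A loaded module is not proof that it drives the card, so `lspci -k`, which lists the kernel driver in use for each device, is a useful cross-check.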

@joaquintorres
Contributor Author

It doesn't appear to be mthca.
What I get is:

# lsmod | grep mlx
mlx5_fpga_tools        16384  0
mlx5_ib               450560  0
ib_uverbs             155648  3 rdma_ucm,mlx5_ib,ib_ucm
mlx5_core            1417216  2 mlx5_fpga_tools,mlx5_ib
mlxfw                  24576  1 mlx5_core
tls                   102400  1 mlx5_core
mlx4_en               159744  0
mlx4_ib               253952  0
ib_core               417792  10 rdma_cm,ib_ipoib,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,ib_ucm
mlx4_core             401408  2 mlx4_ib,mlx4_en
mlx_compat             16384  15 rdma_cm,ib_ipoib,mlx4_core,mlx4_ib,iw_cm,mlx5_fpga_tools,ib_umad,mlx4_en,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core,ib_ucm
# lsmod | grep ib
ib_ucm                 24576  0
ib_ipoib              208896  0
ib_cm                  57344  3 rdma_cm,ib_ipoib,ib_ucm
ib_umad                28672  6
mlx5_ib               450560  0
ib_uverbs             155648  3 rdma_ucm,mlx5_ib,ib_ucm
mlx5_core            1417216  2 mlx5_fpga_tools,mlx5_ib
mlx4_ib               253952  0
ib_core               417792  10 rdma_cm,ib_ipoib,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,ib_ucm
mlx4_core             401408  2 mlx4_ib,mlx4_en
mlx_compat             16384  15 rdma_cm,ib_ipoib,mlx4_core,mlx4_ib,iw_cm,mlx5_fpga_tools,ib_umad,mlx4_en,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core,ib_ucm
libcrc32c              16384  1 xfs
libata                270336  2 ata_piix,ata_generic

Full lsmod output:

# lsmod
Module                  Size  Used by
rdma_ucm               32768  0
ib_ucm                 24576  0
rdma_cm                69632  1 rdma_ucm
iw_cm                  53248  1 rdma_cm
ib_ipoib              208896  0
ib_cm                  57344  3 rdma_cm,ib_ipoib,ib_ucm
ib_umad                28672  6
mlx5_fpga_tools        16384  0
mlx5_ib               450560  0
ib_uverbs             155648  3 rdma_ucm,mlx5_ib,ib_ucm
mlx5_core            1417216  2 mlx5_fpga_tools,mlx5_ib
mlxfw                  24576  1 mlx5_core
tls                   102400  1 mlx5_core
mlx4_en               159744  0
mlx4_ib               253952  0
ib_core               417792  10 rdma_cm,ib_ipoib,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,ib_ucm
mlx4_core             401408  2 mlx4_ib,mlx4_en
mlx_compat             16384  15 rdma_cm,ib_ipoib,mlx4_core,mlx4_ib,iw_cm,mlx5_fpga_tools,ib_umad,mlx4_en,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core,ib_ucm
knem                   53248  0
esp6_offload           16384  0
esp6                   24576  1 esp6_offload
esp4_offload           16384  0
esp4                   20480  1 esp4_offload
mst_pciconf           299008  0
mst_pci                94208  0
ext4                  761856  0
mbcache                16384  1 ext4
jbd2                  131072  1 ext4
nls_utf8               16384  1
isofs                  49152  1
loop                   40960  2
rpcsec_gss_krb5        40960  0
nfsv4                 831488  2
dns_resolver           16384  1 nfsv4
nfs                   385024  2 nfsv4
fscache               385024  1 nfs
iTCO_wdt               16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
intel_powerclamp       16384  0
gpio_ich               16384  0
coretemp               16384  0
dcdbas                 16384  0
intel_cstate           20480  0
intel_uncore          184320  0
i7core_edac            28672  0
pcspkr                 16384  0
joydev                 24576  0
ipmi_ssif              32768  0
ipmi_si                69632  0
ipmi_devintf           20480  0
wmi                    32768  0
ipmi_msghandler       110592  3 ipmi_devintf,ipmi_si,ipmi_ssif
acpi_power_meter       20480  0
lpc_ich                28672  0
nfsd                  528384  13
auth_rpcgss           135168  2 nfsd,rpcsec_gss_krb5
nfs_acl                16384  1 nfsd
lockd                 122880  2 nfsd,nfs
grace                  16384  2 nfsd,lockd
sunrpc                557056  25 nfsd,nfsv4,auth_rpcgss,lockd,rpcsec_gss_krb5,nfs_acl,nfs
xfs                  1544192  3
libcrc32c              16384  1 xfs
sd_mod                 53248  4
t10_pi                 16384  1 sd_mod
sr_mod                 28672  0
cdrom                  65536  2 isofs,sr_mod
sg                     40960  0
ata_generic            16384  0
mgag200                36864  1
drm_kms_helper        253952  3 mgag200
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
drm                   573440  4 drm_kms_helper,mgag200
ata_piix               36864  0
libata                270336  2 ata_piix,ata_generic
i2c_algo_bit           16384  1 mgag200
crc32c_intel           24576  1
bnx2                   94208  0
mptsas                 69632  2
scsi_transport_sas     45056  1 mptsas
mptscsih               45056  1 mptsas
mptbase                98304  2 mptsas,mptscsih
dm_mirror              28672  0
dm_region_hash         20480  1 dm_mirror
dm_log                 20480  2 dm_region_hash,dm_mirror
dm_mod                151552  12 dm_log,dm_mirror
fuse                  155648  5

@viniciusferrao
Owner

It seems to be an mlx4 card. Is it a ConnectX-2? The existing patch should work with this card on 5.4 and 5.5.

But if you can wait, I'm looking at MLNX OFED 4.9 right now; after all, the card is still supported on 4.9.

Just give me a few hours to work on it.

@joaquintorres
Contributor Author

I thought it might be a ConnectX-2 because some things seemed to indicate so, but from all I can gather based on the model, it's just a ConnectX.
I don't mind trying 5.x again, but even with the patch I had very poor results getting my card recognized last time (MLNX OFED installed, but no devices were recognized, the UMAD port couldn't be opened, ibstat showed no output, etc.). Maybe I made a mistake during the installation? At the time I didn't know the patch added support for mlx4, so I attributed the hardware recognition issues to 5.x. Or maybe additional steps were needed, since mlx4 support wasn't added during install.
Since I was just trying things out, I didn't properly document my previous 5.x experience, so maybe there's something I missed. I could overwrite my current install and try again. It would take me a while, since installing on this node takes fairly long. Maybe I just didn't force the installation to ignore firmware support and that ended up causing the breakage.

@viniciusferrao
Owner

viniciusferrao commented Mar 19, 2022

@joaquintorres the patch is now live. The issue was automatically closed with the commit.

I've tested it and it seems to be working as expected:

[root@localhost ~]# dnf install libfabric-ohpc
Updating Subscription Management repositories.
Last metadata expiration check: 0:17:11 ago on Sat 19 Mar 2022 12:34:01 AM -03.
Error: 
 Problem: package libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 requires libefa.so.1()(64bit), but none of the providers can be installed
  - package libfabric-ohpc-1.13.0-3.2.ohpc.2.4.x86_64 requires libefa.so.1(EFA_1.1)(64bit), but none of the providers can be installed
  - package libibverbs-32.0-4.el8.x86_64 requires rdma-core(x86-64) = 32.0-4.el8, but none of the providers can be installed
  - package libibverbs-35.0-1.el8.x86_64 requires rdma-core(x86-64) = 35.0-1.el8, but none of the providers can be installed
  - installed package mlnx-ofa_kernel-4.9-OFED.4.9.4.1.7.1.rhel8u5.x86_64 obsoletes rdma-core < 41mlnx1-1 provided by rdma-core-32.0-4.el8.x86_64
  - installed package mlnx-ofa_kernel-4.9-OFED.4.9.4.1.7.1.rhel8u5.x86_64 obsoletes rdma-core < 41mlnx1-1 provided by rdma-core-35.0-1.el8.x86_64
  - cannot install the best candidate for the job
  - problem with installed package mlnx-ofa_kernel-4.9-OFED.4.9.4.1.7.1.rhel8u5.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

After the patch:

Reinstalled:
  ibacm-50mlnx1-49418.versatushpc.x86_64                                  infiniband-diags-50mlnx1-49418.versatushpc.x86_64               
  infiniband-diags-compat-50mlnx1-49418.versatushpc.x86_64                libibumad-50mlnx1-49418.versatushpc.x86_64                      
  libibverbs-50mlnx1-49418.versatushpc.x86_64                             libibverbs-utils-50mlnx1-49418.versatushpc.x86_64               
  librdmacm-50mlnx1-49418.versatushpc.x86_64                              librdmacm-utils-50mlnx1-49418.versatushpc.x86_64                
  rdma-core-50mlnx1-49418.versatushpc.x86_64                              rdma-core-devel-50mlnx1-49418.versatushpc.x86_64                
  srp_daemon-50mlnx1-49418.versatushpc.x86_64                            

Complete!
[root@localhost PATCHED-MLNX-OFED]# dnf install libfabric-ohpc
Updating Subscription Management repositories.
Last metadata expiration check: 0:20:39 ago on Sat 19 Mar 2022 12:34:01 AM -03.
Dependencies resolved.
===========================================================================================================================================
 Package                       Architecture         Version                              Repository                                   Size
===========================================================================================================================================
Installing:
 libfabric-ohpc                x86_64               1.13.0-3.2.ohpc.2.4                  OpenHPC-updates                             749 k
Installing dependencies:
 libpsm2                       x86_64               11.2.185-1.el8                       rhel-8-for-x86_64-baseos-rpms               201 k
 ohpc-filesystem               noarch               2.1-4.1.ohpc.2.4                     OpenHPC-updates                             7.9 k

Transaction Summary
===========================================================================================================================================
Install  3 Packages

Total download size: 958 k
Installed size: 2.6 M
Is this ok [y/N]: n

Please note that you must install MLNX OFED 4.9 with --upstream-libs during mlnxofedinstall, or else you'll hit the libibmad.so.5 issue. It's kinda funny, because I had this problem years ago and didn't remember it until I found my own issue again on OpenHPC: openhpc/ohpc#1031

Which is exactly the same issue.
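Put together, the order of operations would look roughly like the dry run below. It only prints the commands instead of running them. The function name `ofed_install_plan` and the default mount point `/mnt/mlnx-ofed` are made up for illustration, and the sketch assumes the installer sits at the root of the mounted ISO and that `patch-mlnxofed.sh` is run from the checkout of this repository.

```shell
# Print (do not run) the install sequence: mount the ISO, install MLNX OFED
# with --upstream-libs, then apply the patch script.
# $1: path to the MLNX OFED ISO
# $2: mount point (optional; the default is a hypothetical example path)
ofed_install_plan() {
    iso=$1
    mnt=${2:-/mnt/mlnx-ofed}
    printf 'mkdir -p %s\n' "$mnt"
    printf 'mount -o loop,ro %s %s\n' "$iso" "$mnt"
    printf '%s/mlnxofedinstall --upstream-libs\n' "$mnt"
    printf './patch-mlnxofed.sh\n'
}
```

For example, `ofed_install_plan MLNX_OFED_LINUX-4.9-4.1.7.0-rhel8.5-x86_64.iso` prints the four steps for review before anything touches the system.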

If it still doesn't work for you, please let me know and open another issue.

Thank you.

@joaquintorres
Contributor Author

It looks like it works properly, thanks!
@viniciusferrao I had also seen your previous issue while looking for more info on the subject. Notably, the conflicts with libibmad.so.5 also seem to prevent the patch from being deployed properly if --upstream-libs was not used when installing MLNX OFED:

# ./patch-mlnxofed.sh 
Detected MLNX OFED release: 4.9-4.1.7.0
Installing required dependencies...
Last metadata expiration check: 0:41:05 ago on lun 21 mar 2022 12:10:18 -03.
Package kernel-rpm-macros-125-1.el8.noarch is already installed.
Package rpm-build-4.14.3-19.el8_5.2.x86_64 is already installed.
Package patch-2.7.6-11.el8.x86_64 is already installed.
Package pandoc-2.0.6-5.el8.2.x86_64 is already installed.
Package cmake-3.20.2-4.el8.x86_64 is already installed.
Package systemd-devel-239-51.el8_5.3.x86_64 is already installed.
Package python36-devel-3.6.8-38.module+el8.5.0+671+195e4563.x86_64 is already installed.
Package libnl3-devel-3.5.0-1.el8.x86_64 is already installed.
Package python3-Cython-0.28.1-3.el8.x86_64 is already installed.
Dependencies resolved.
================================================================================
 Package                Arch        Version                Repository      Size
================================================================================
Installing:
 perl-generators        noarch      1.10-9.el8             appstream       17 k
Upgrading:
 systemd                x86_64      239-51.el8_5.5         baseos         3.6 M
 systemd-container      x86_64      239-51.el8_5.5         baseos         752 k
 systemd-devel          x86_64      239-51.el8_5.5         baseos         388 k
 systemd-libs           x86_64      239-51.el8_5.5         baseos         1.1 M
 systemd-pam            x86_64      239-51.el8_5.5         baseos         478 k
 systemd-udev           x86_64      239-51.el8_5.5         baseos         1.6 M
Installing dependencies:
 perl-Fedora-VSP        noarch      0.001-9.el8            appstream       23 k

(Transaction details, etc.)

Downloading MLNX OFED 4.9-4.1.7.0 sources...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 53.2M  100 53.2M    0     0  3861k      0  0:00:14  0:00:14 --:--:-- 4620k

Extracting files from SRPMS...
cpio: rdma-core.spec not created: newer or same age version exists
3083 blocks

Patching MLNX OFED to add back support for MLX4 and EFA...

patching file CMakeLists.txt
Patched: CMakeLists.txt

patching file rdma-core.spec
Patched: rdma-core.spec

Building RPMS... it may take a while
Last metadata expiration check: 0:43:08 ago on lun 21 mar 2022 12:10:18 -03.
Error: 
 Problem: package infiniband-diags-50mlnx1-49418.versatushpc.x86_64 obsoletes libibmad < 50mlnx1-49418.versatushpc provided by libibmad-5.4.0.MLNX20190423.1d917ae-0.1.49417.x86_64
  - package ibsim-0.10-1.49417.x86_64 requires libibmad.so.12()(64bit), but none of the providers can be installed
  - package ibsim-0.10-1.49417.x86_64 requires libibmad.so.12(IBMAD_1.3)(64bit), but none of the providers can be installed
  - conflicting requests
  - problem with installed package ibsim-0.10-1.49417.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

I'm assuming this is related, since it's also a conflict with libibmad (although not with the same file). Maybe I'm wrong and it's caused by something else I did, and I have no idea if it's reproducible. I'm assuming you have always installed with --upstream-libs since your previous issue?
In any case, maybe it would be useful to add this solution to your README so anyone looking to fix all MLNX_OFED broken packages can find a complete solution in one place.
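A pre-flight check along these lines might catch the problem before the build starts. It is a sketch resting on an assumption drawn only from the error above: that a standalone `libibmad` package is present when MLNX OFED was installed without --upstream-libs. The function name `has_standalone_libibmad` is made up for illustration.

```shell
# Return 0 if a package list on stdin (one name-version-release.arch per
# line, as printed by `rpm -qa`) contains a standalone libibmad package.
# Assumption: such a package indicates an install without --upstream-libs.
has_standalone_libibmad() {
    grep -Eq '^libibmad-[0-9]'
}
```

It could be used as `rpm -qa | has_standalone_libibmad && echo "standalone libibmad found; consider reinstalling with --upstream-libs"`.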
Thank you so much again for such a fast fix, keep up the great work!

@viniciusferrao
Owner

@joaquintorres just to answer all the questions: yes, I always assumed --upstream-libs, because not using it on EL8 causes too much breakage and there's no point in not using it. Note that modern MLNX OFED uses it by default.

Without it, it's just a mess.

@joaquintorres
Contributor Author

It does seem like the more sane default since most systems (at least the builds I saw) will use EL8 as a base. I guess edge cases like mine of working with absurdly outdated hardware won't be as frequent or relevant. Understandable, but still a shame that perfectly good hardware can't work without going through multiple hurdles or keeping the software stack outdated.
