UCX does not include InfiniBand when building with NVHPC compilers #8397

Open

mcuma opened this issue Jul 21, 2022 · 7 comments

@mcuma

mcuma commented Jul 21, 2022

Describe the bug

We are building OpenMPI with UCX as the fabric, using the Spack package manager, on our clusters that have various generations of Mellanox InfiniBand. We specify the verbs and mlx5 providers. When building with the GNU or Intel compilers, ucx_info correctly shows these, but with NVHPC it does not. I have verified in the Spack build logs that libibverbs and libmlx5 are found by configure and linked in both the GNU/Intel and NVHPC builds, so I am perplexed as to why ucx_info does not report them as active in the NVHPC build.

The OpenMPI built atop this UCX runs TCP over IB in the NVHPC case, resulting in ~20 us latencies, while the GCC/Intel builds run natively over IB with ~1.7 us latencies.

Any thoughts on this would be appreciated. I also did the same build on our older CentOS 7 system, and the issue is present there too, so I don't think it is related to the OS or driver stack.

I'll be happy to provide more info, but first I'd be curious to hear whether you have had similar reports in the past, or whether someone has successfully built UCX with NVHPC for IB.
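
A quick way to confirm which transport UCX actually selects at runtime (rather than what ucx_info advertises) is to raise the UCX log level on the MPI run; UCX releases in this range should print the chosen endpoint configuration (ep_cfg lines) at info level. A minimal sketch, assuming ./a.out is the benchmark used later in this report:
$ mpirun -np 2 -x UCX_LOG_LEVEL=info ./a.out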

Steps to Reproduce

Spack command:
spack install [email protected]%[email protected]~pmi target=nehalem fabrics=ucx +internal-hwloc+thread_multiple schedulers=slurm +legacylaunchers ^ucx +mlx5-dv+verbs+cm+ud+dc+rc+cma ^[email protected] ^[email protected]

For the GNU build (on Rocky Linux 8), replace [email protected] with [email protected].

This results in the following configure arguments:
--with-verbs=/usr --disable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem
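
For reference, reproducing this outside of Spack should amount to running UCX's release configure wrapper with the NVHPC compilers and the same options (a minimal sketch; the prefix path is hypothetical):
$ CC=nvc CXX=nvc++ FC=nvfortran ./contrib/configure-release --prefix=$HOME/ucx-nvhpc --with-verbs=/usr --with-rc --with-ud --with-dc --with-mlx5-dv --enable-cma
$ make -j && make install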

NVHPC build:
$ ucx_info -v

UCT version=1.11.2 revision ef2bbcf

configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7 --with-verbs=/usr --enable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem

GCC build:
$ ucx_info -v

UCT version=1.11.2 revision ef2bbcf

configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-etdhrh4gzoj2nroomy7bipr2p2e3ly4l --with-verbs=/usr --enable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d | grep verbs
(nothing)

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d | grep verbs
    Transport: rc_verbs
    Transport: ud_verbs

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d | grep mlx
< failed to open memory domain mlx4_0 >

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d | grep mlx
Memory domain: mlx4_0
    Device: mlx4_0:1
    Device: mlx4_0:1

Setup and versions

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

$ uname -a
Linux notchpeak2 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -q rdma-core
rdma-core-35.0-1.el8.x86_64

$ rpm -q libibverbs
libibverbs-35.0-1.el8.x86_64

$ ibv_devinfo -vv
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.42.5000
    node_guid: 0002:c903:00a4:faa0
    sys_image_guid: 0002:c903:00a4:faa3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id: MT_1090120019
    phys_port_cnt: 2
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffe00
    max_qp: 131000
    max_qp_wr: 16351
    device_cap_flags: 0x057e9c76
        BAD_PKEY_CNTR
        BAD_QKEY_CNTR
        AUTO_PATH_MIG
        CHANGE_PHY_PORT
        UD_AV_PORT_ENFORCE
        PORT_ACTIVE_EVENT
        SYS_IMAGE_GUID
        RC_RNR_NAK_GEN
        MEM_WINDOW
        UD_IP_CSUM
        XRC
        MEM_MGT_EXTENSIONS
        MEM_WINDOW_TYPE_2B
        RAW_IP_CSUM
        Unknown flags: 0x488000
    max_sge: 32
    max_sge_rd: 30
    max_cq: 65408
    max_cqe: 4194303
    max_mr: 524032
    max_pd: 32764
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 2096000
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 8192
    max_mcast_qp_attach: 248
    max_total_mcast_qp_attach: 2031616
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 65472
    max_srq_wr: 16383
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 15
    general_odp_caps:
    rc_odp_caps:
        NO SUPPORT
    uc_odp_caps:
        NO SUPPORT
    ud_odp_caps:
        NO SUPPORT
    xrc_odp_caps:
        NO SUPPORT
    completion timestamp_mask: 0x0000ffffffffffff
    hca_core_clock: 427000kHZ
    device_cap_flags_ex: 0x57E9C76
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    tag matching not supported

cq moderation caps:
	max_cq_count:	65535
	max_cq_period:	65535 us

num_comp_vectors:		64
	port:	1
		state:			PORT_ACTIVE (4)
		max_mtu:		4096 (5)
		active_mtu:		4096 (5)
		sm_lid:			1
		port_lid:		513
		port_lmc:		0x00
		link_layer:		InfiniBand
		max_msg_sz:		0x40000000
		port_cap_flags:		0x02594868
		port_cap_flags2:	0x0000
		max_vl_num:		8 (4)
		bad_pkey_cntr:		0x0
		qkey_viol_cntr:		0x0
		sm_sl:			0
		pkey_tbl_len:		128
		gid_tbl_len:		128
		subnet_timeout:		18
		init_type_reply:	0
		active_width:		4X (2)
		active_speed:		14.0 Gbps (16)
		phys_state:		LINK_UP (5)
		GID[  0]:		fe80:0000:0000:0000:0002:c903:00a4:faa1

	port:	2
		state:			PORT_DOWN (1)
		max_mtu:		4096 (5)
		active_mtu:		4096 (5)
		sm_lid:			0
		port_lid:		0
		port_lmc:		0x00
		link_layer:		InfiniBand
		max_msg_sz:		0x40000000
		port_cap_flags:		0x02594868
		port_cap_flags2:	0x0000
		max_vl_num:		8 (4)
		bad_pkey_cntr:		0x0
		qkey_viol_cntr:		0x0
		sm_sl:			0
		pkey_tbl_len:		128
		gid_tbl_len:		128
		subnet_timeout:		0
		init_type_reply:	0
		active_width:		4X (2)
		active_speed:		2.5 Gbps (1)
		phys_state:		POLLING (2)
		GID[  0]:		fe80:0000:0000:0000:0002:c903:00a4:faa2

Additional information (depending on the issue)

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d

Memory domain: posix
    Component: posix
        allocate: unlimited
        remote key: 24 bytes
        rkey_ptr is supported

    Transport: posix
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 12179.00 MB/sec
            latency: 80 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 100
            am_bcopy: <= 8256
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: sysv
    Component: sysv
        allocate: unlimited
        remote key: 12 bytes
        rkey_ptr is supported

    Transport: sysv
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 12179.00 MB/sec
            latency: 80 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 100
            am_bcopy: <= 8256
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: self
    Component: self
        register: unlimited, cost: 0 nsec
        remote key: 0 bytes

    Transport: self
        Device: memory0
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 6911.00 MB/sec
            latency: 0 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 8K
            am_bcopy: <= 8K
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 0 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: tcp
    Component: tcp
        register: unlimited, cost: 0 nsec
        remote key: 0 bytes

    Transport: tcp
        Device: eth0
        System device:
        capabilities:
            bandwidth: 113.16/ppn + 0.00 MB/sec
            latency: 5776 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: lo
        System device:
        capabilities:
            bandwidth: 11.91/ppn + 0.00 MB/sec
            latency: 10960 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 18 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: eth0.26
        System device:
        capabilities:
            bandwidth: 113.16/ppn + 0.00 MB/sec
            latency: 5776 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 0
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: ib0
        System device:
        capabilities:
            bandwidth: 6239.81/ppn + 0.00 MB/sec
            latency: 5210 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

Connection manager: tcp
    max_conn_priv: 2064 bytes

< failed to open memory domain mlx4_0 >

Connection manager: rdmacm
    max_conn_priv: 54 bytes

Memory domain: cma
    Component: cma
        register: unlimited, cost: 9 nsec

    Transport: cma
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 11145.00 MB/sec
            latency: 80 nsec
            overhead: 400 nsec
            put_zcopy: unlimited, up to 16 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 1
            get_zcopy: unlimited, up to 16 iov
            get_opt_zcopy_align: <= 1
            get_align_mtu: <= 1
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 4 bytes
            error handling: peer failure, ep_check

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d

Memory domain: posix
    Component: posix
        allocate: unlimited
        remote key: 24 bytes
        rkey_ptr is supported

    Transport: posix
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 12179.00 MB/sec
            latency: 80 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 100
            am_bcopy: <= 8256
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: sysv
    Component: sysv
        allocate: unlimited
        remote key: 12 bytes
        rkey_ptr is supported

    Transport: sysv
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 12179.00 MB/sec
            latency: 80 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 100
            am_bcopy: <= 8256
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: self
    Component: self
        register: unlimited, cost: 0 nsec
        remote key: 0 bytes

    Transport: self
        Device: memory0
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 6911.00 MB/sec
            latency: 0 nsec
            overhead: 10 nsec
            put_short: <= 4294967295
            put_bcopy: unlimited
            get_bcopy: unlimited
            am_short: <= 8K
            am_bcopy: <= 8K
            domain: cpu
            atomic_add: 32, 64 bit
            atomic_and: 32, 64 bit
            atomic_or: 32, 64 bit
            atomic_xor: 32, 64 bit
            atomic_fadd: 32, 64 bit
            atomic_fand: 32, 64 bit
            atomic_for: 32, 64 bit
            atomic_fxor: 32, 64 bit
            atomic_swap: 32, 64 bit
            atomic_cswap: 32, 64 bit
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 0 bytes
            iface address: 8 bytes
            error handling: ep_check

Memory domain: tcp
    Component: tcp
        register: unlimited, cost: 0 nsec
        remote key: 0 bytes

    Transport: tcp
        Device: eth0
        System device:
        capabilities:
            bandwidth: 113.16/ppn + 0.00 MB/sec
            latency: 5776 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: lo
        System device:
        capabilities:
            bandwidth: 11.91/ppn + 0.00 MB/sec
            latency: 10960 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 18 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: eth0.26
        System device:
        capabilities:
            bandwidth: 113.16/ppn + 0.00 MB/sec
            latency: 5776 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 0
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

    Transport: tcp
        Device: ib0
        System device:
        capabilities:
            bandwidth: 6239.81/ppn + 0.00 MB/sec
            latency: 5210 nsec
            overhead: 50000 nsec
            put_zcopy: <= 18446744073709551590, up to 6 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 0
            am_short: <= 8K
            am_bcopy: <= 8K
            am_zcopy: <= 64K, up to 6 iov
            am_opt_zcopy_align: <= 1
            am_align_mtu: <= 0
            am header: <= 8037
            connection: to ep, to iface
            device priority: 1
            device num paths: 1
            max eps: 256
            device address: 6 bytes
            iface address: 2 bytes
            ep address: 10 bytes
            error handling: peer failure, ep_check, keepalive

Connection manager: tcp
    max_conn_priv: 2064 bytes

Memory domain: mlx4_0
    Component: ib
        register: unlimited, cost: 180 nsec
        remote key: 8 bytes
        local memory handle is required for zcopy

    Transport: rc_verbs
        Device: mlx4_0:1
        System device: 0000:42:00.0 (0)
        capabilities:
            bandwidth: 6433.22/ppn + 0.00 MB/sec
            latency: 700 + 1.000 * N nsec
            overhead: 75 nsec
            put_short: <= 88
            put_bcopy: <= 8256
            put_zcopy: <= 1G, up to 6 iov
            put_opt_zcopy_align: <= 512
            put_align_mtu: <= 2K
            get_bcopy: <= 8256
            get_zcopy: 65..1G, up to 6 iov
            get_opt_zcopy_align: <= 512
            get_align_mtu: <= 2K
            am_short: <= 87
            am_bcopy: <= 8255
            am_zcopy: <= 8255, up to 5 iov
            am_opt_zcopy_align: <= 512
            am_align_mtu: <= 2K
            am header: <= 127
            domain: device
            atomic_add: 64 bit
            atomic_fadd: 64 bit
            atomic_cswap: 64 bit
            connection: to ep
            device priority: 10
            device num paths: 1
            max eps: 256
            device address: 4 bytes
            ep address: 4 bytes
            error handling: peer failure, ep_check

    Transport: ud_verbs
        Device: mlx4_0:1
        System device: 0000:42:00.0 (0)
        capabilities:
            bandwidth: 6433.22/ppn + 0.00 MB/sec
            latency: 730 nsec
            overhead: 105 nsec
            am_short: <= 172
            am_bcopy: <= 4088
            am_zcopy: <= 4088, up to 8 iov
            am_opt_zcopy_align: <= 512
            am_align_mtu: <= 4K
            am header: <= 3952
            connection: to ep, to iface
            device priority: 10
            device num paths: 1
            max eps: inf
            device address: 4 bytes
            iface address: 3 bytes
            ep address: 6 bytes
            error handling: peer failure, ep_check

Connection manager: rdmacm
    max_conn_priv: 54 bytes

Memory domain: cma
    Component: cma
        register: unlimited, cost: 9 nsec

    Transport: cma
        Device: memory
        System device:
        capabilities:
            bandwidth: 0.00/ppn + 11145.00 MB/sec
            latency: 80 nsec
            overhead: 400 nsec
            put_zcopy: unlimited, up to 16 iov
            put_opt_zcopy_align: <= 1
            put_align_mtu: <= 1
            get_zcopy: unlimited, up to 16 iov
            get_opt_zcopy_align: <= 1
            get_align_mtu: <= 1
            connection: to iface
            device priority: 0
            device num paths: 1
            max eps: inf
            device address: 8 bytes
            iface address: 4 bytes
            error handling: peer failure, ep_check

Configure output excerpts for mlx and verbs:
NVHPC:
$ grep mlx spack-build-02-configure-out.txt
checking infiniband/mlx5_hw.h usability... no
checking infiniband/mlx5_hw.h presence... no
checking for infiniband/mlx5_hw.h... no
checking for mlx5dv_query_device in -lmlx5-rdmav2... no
checking for mlx5dv_query_device in -lmlx5... yes
checking for infiniband/mlx5dv.h... yes
checking whether mlx5dv_init_obj is declared... yes
checking whether mlx5dv_create_qp is declared... yes
checking whether mlx5dv_is_supported is declared... yes
checking whether mlx5dv_devx_subscribe_devx_event is declared... yes
checking for struct mlx5dv_cq.cq_uar... yes
configure: Compiling with mlx5 bare-metal support
checking for struct mlx5_wqe_av.base... no
checking for struct mlx5_grh_av.rmac... no
checking for struct mlx5_cqe64.ib_stride_index... no
$ grep verbs spack-build-02-configure-out.txt
configure: Compiling with verbs support from /usr
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking infiniband/verbs_exp.h usability... no
checking infiniband/verbs_exp.h presence... no
checking for infiniband/verbs_exp.h... no

GCC:
$ grep mlx spack-build-02-configure-out.txt
checking infiniband/mlx5_hw.h usability... no
checking infiniband/mlx5_hw.h presence... no
checking for infiniband/mlx5_hw.h... no
checking for mlx5dv_query_device in -lmlx5-rdmav2... no
checking for mlx5dv_query_device in -lmlx5... yes
checking for infiniband/mlx5dv.h... yes
checking whether mlx5dv_init_obj is declared... yes
checking whether mlx5dv_create_qp is declared... yes
checking whether mlx5dv_is_supported is declared... yes
checking whether mlx5dv_devx_subscribe_devx_event is declared... yes
checking for struct mlx5dv_cq.cq_uar... yes
configure: Compiling with mlx5 bare-metal support
checking for struct mlx5_wqe_av.base... no
checking for struct mlx5_grh_av.rmac... no
checking for struct mlx5_cqe64.ib_stride_index... no
$ grep verbs spack-build-02-configure-out.txt
configure: Compiling with verbs support from /usr
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking infiniband/verbs_exp.h usability... no
checking infiniband/verbs_exp.h presence... no
checking for infiniband/verbs_exp.h... no

@mcuma mcuma added the Bug label Jul 21, 2022
@hoopoepg
Contributor

hi @mcuma,
thank you for the bug report

could you provide the complete output from the configure script, plus the config.h and config.log files, from the UCX built in the NVHPC environment?

is it possible to build a debug version of UCX (with logging enabled) in the NVHPC environment and provide the output of the command
UCX_LOG_LEVEL=debug ucx_info -d

thank you

@tonycurtis
Contributor

FWIW I've seen the nvidia/pgi compilers "intrude" on the environment. When the module was loaded, even ibv_devinfo was failing. It could be LD_LIBRARY_PATH or something similar that is causing the problem. Maybe ldd/lddtree the executable to see if unexpected libraries are being picked up? There are also various versions of the nvidia compilers that include or exclude a built-in (Open)MPI.
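
For example (module name assumed here; lddtree comes from the pax-utils package):
$ module load nvhpc
$ ldd $(which ibv_devinfo)
$ lddtree <ucx-prefix>/bin/ucx_info
An NVHPC-provided libibverbs or libmlx5 showing up in place of the system copies would point to the compiler environment shadowing the RDMA stack.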

@mcuma
Author

mcuma commented Jul 29, 2022

Hi @hoopoepg,

thanks for your reply and the debug suggestion. I am not sure how to attach files to this ticket, so let me put them at a public link: https://home.chpc.utah.edu/~mcuma/debug/ucx/

In particular, the spack_build directory contains the requested files from the Spack build, debug_build contains the configure and ucx_info output from a build with the --enable-debug option, and nodebug_build contains the same output from the build without debug.

Interestingly, the build with --enable-debug includes the verbs provider, while the build without debug does not. The difference seems to be the optimization flag: --enable-debug forces -O0, while the non-debug build uses -O3.

Adding the --enable-compiler-opt=1 or --enable-compiler-opt=0 configure option to force -O1 or -O0 also results in a correct IB build.
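
In Spack terms, the same workaround can presumably be expressed by lowering the optimization level via compiler flags on the UCX spec, something like (variants abbreviated from the full spec above):
$ spack install ucx +mlx5-dv+verbs+cm+ud+dc+rc+cma cflags="-O1" %nvhpc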

Would you like me to open a ticket with the NVHPC group to address this?

Though even with --enable-compiler-opt=1, the OpenMPI built with this UCX complains when running:
$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 -np 2 ./a.out
Process 0 on notch308
STARTING LATENCY AND BANDWIDTH BENCHMARK
Process 1 on notch309
[notch308:2480858] pml_ucx.c:908 Error: mca_pml_ucx_send_nbr failed: -2, No resources are available to initiate the operation
[notch308:2480858] *** An error occurred in MPI_Barrier
[notch308:2480858] *** reported by process [1877147649,0]
[notch308:2480858] *** on communicator MPI_COMM_WORLD
[notch308:2480858] *** MPI_ERR_OTHER: known error not in list
[notch308:2480858] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[notch308:2480858] *** and potentially your MPI job)
[notch309:2758566] pml_ucx.c:908 Error: mca_pml_ucx_send_nbr failed: -2, No resources are available to initiate the operation
[notch308:2480848] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[notch308:2480848] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

It seems like the device is still not set up correctly. Any thoughts on this are appreciated.

Thanks.

@hoopoepg
Contributor

as I can see from the logs, all builds successfully detected the verbs library and enabled basic IB support.

could you run the command
ldd -r <PATH-WHERE-UCX-INSTALLED>/lib/ucx/libuct_ib.so
on the UCX build which can't detect IB devices, and check if all dependencies are available?

> It seems like the device is still not set up correctly. Any thoughts on this are appreciated.

yes, it could be the reason, but as I can see from your bug report (ibv_devinfo), the devices are configured properly.
what is the output of the command
ulimit -a ?

@mcuma
Author

mcuma commented Jul 30, 2022

Hi Sergey,

ldd seems to find the IB libraries correctly:
$ ldd -r /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7/lib/ucx/libuct_ib.so
linux-vdso.so.1 (0x00007ffd3045e000)
libibverbs.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libibverbs.so.1 (0x00007f9bef67f000)
libmlx5.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libmlx5.so.1 (0x00007f9bef42c000)
libuct.so.0 => /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7/lib/libuct.so.0 (0x00007f9bef1ef000)
....

The IB drivers should be OK; we have no issues with UCX built with the GNU and Intel compilers, and IB works fine with MVAPICH2 and Intel MPI as well.

Here's the ulimit -a output; is anything suspicious there? We raised a few limits in the past to accommodate MPI buffers, etc.
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 384899
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 65535
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I am wondering if you have had reports like this in the past, or if you have a platform where you could try to reproduce what we're doing. It seems not too many sites are using NVHPC with MPI/IB, so I don't have any contacts to cross-check this with.

Also, the OpenMPI supplied with the NVHPC suite does not have IB support built in, and neither does the UCX shipped with the Rocky Linux 8 that we run - at least "ucx_info -d" does not show it.

Thanks.

@mcuma
Author

mcuma commented Aug 1, 2022

One more tidbit I found: on our older cluster that still runs CentOS 7, UCX builds correctly with NVHPC (i.e., it includes the verbs providers), but OpenMPI still crashes with the "No resources available" error.

I ended up with a workaround: build UCX with the stock OS gcc, then build OpenMPI against this UCX, which results in correct OpenMPI behavior.
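
In Spack terms, this amounts to pinning a different compiler on the dependency node of the DAG, roughly (a sketch; versions elided, variants as in the original spec):
$ spack install openmpi %nvhpc fabrics=ucx ^ucx %gcc +mlx5-dv+verbs+cm+ud+dc+rc+cma
Since UCX exposes a plain C ABI, an NVHPC-built OpenMPI links fine against a gcc-built UCX.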

So I am good for now, but I am still wondering whether the issue is local to our OS setup, or simply unreported because few other sites build UCX with NVHPC and verbs.

@hoopoepg
Contributor

hoopoepg commented Aug 2, 2022

hi,
glad to see you have a workaround for the issue.
we will try to reproduce it locally (unfortunately we can't reproduce it so far) and try to identify the root cause

thank you for your help
