got active message id 0, but no handler installed #9819
Comments
@duqi-hnu what is the output of "lscpu"?
@yosefe A screenshot of using lscpu
try with UCX_TLS=rc_verbs,
@duqi-hnu seems the CPU vendor and model are unknown, what kind of CPU is it?
@yosefe This is an ARMv8 CPU
rc_v is using different, apparently "stronger", memory barriers, from rdma-core
diff --git a/src/ucs/arch/aarch64/cpu.h b/src/ucs/arch/aarch64/cpu.h
index 9f45e9d40b..4b07c57784 100644
--- a/src/ucs/arch/aarch64/cpu.h
+++ b/src/ucs/arch/aarch64/cpu.h
@@ -53,8 +53,8 @@ BEGIN_C_DECLS
* of DSB. The barrier used for synchronization of access between write back
* and device mapped memory (PCIe BAR).
*/
-#define ucs_memory_bus_store_fence() ucs_aarch64_dmb(oshst)
-#define ucs_memory_bus_load_fence() ucs_aarch64_dmb(oshld)
+#define ucs_memory_bus_store_fence() ucs_aarch64_dmb(st)
+#define ucs_memory_bus_load_fence() ucs_aarch64_dmb(ld)
/* The macro is used to flush all pending stores from write combining buffer.
* Micro-arch "auto" flush the stores once WC buffer is full or on a timeout.
@@ -62,16 +62,16 @@ BEGIN_C_DECLS
* rdma subsystem patch:
* https://patchwork.kernel.org/project/linux-rdma/patch/[email protected]/
*/
-#define ucs_memory_bus_cacheline_wc_flush() ucs_aarch64_dgh()
+#define ucs_memory_bus_cacheline_wc_flush() ucs_memory_bus_store_fence()
#define ucs_memory_cpu_fence() ucs_aarch64_dmb(ish)
-#define ucs_memory_cpu_store_fence() ucs_aarch64_dmb(ishst)
-#define ucs_memory_cpu_load_fence() ucs_aarch64_dmb(ishld)
+#define ucs_memory_cpu_store_fence() ucs_aarch64_dmb(st)
+#define ucs_memory_cpu_load_fence() ucs_aarch64_dmb(ld)
/* The macro is used to serialize stores to Normal NC or Device memory
* (see Arm Spec, B2.7.2)
*/
-#define ucs_memory_cpu_wc_fence() ucs_aarch64_dmb(oshst)
+#define ucs_memory_cpu_wc_fence() ucs_aarch64_dmb(st)
/*
@duqi-hnu is the test always passing with UCX_TLS=rc_v?
@yosefe when we use UCX_TLS=all, the probability of an error is greater than UCX_TLS=rc_v.
I see. But did it ever fail with UCX_TLS=rc_v, at least once?
@yosefe yes, UCX_TLS=rc_v just has a low probability of failure
It seems like an issue with the specific CPU implementation. rdma-core uses ARM-spec compliant memory barriers and
@yosefe If I want to use a different transport protocol, do I need to modify the code to conform to the memory barrier semantics of the ARM specification?
It seems like the specific CPU you are using is not implementing memory barriers according to the ARM specification. You can try using different memory barriers (modify the attached patch accordingly) together with UCX_TLS=all (with UCX_TLS=rc_v, the barriers are part of rdma-core).
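To make the barrier experiment suggested above more concrete, here is a minimal sketch of how the two barrier flavors from the patch could be switched at build time and used in a simple publish/consume pattern. All `demo_*` names and the `DEMO_STRONG_BARRIERS` switch are hypothetical and are not part of UCX or rdma-core. On non-ARM hosts the sketch falls back to compiler atomic fences, so it only illustrates where the barriers go, not the actual hardware behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical build-time switch (not a UCX option):
 * 1 selects the full-system barriers from the patch (dmb st / dmb ld),
 * 0 selects the outer-shareable ones used before it (dmb oshst / dmb oshld). */
#ifndef DEMO_STRONG_BARRIERS
#define DEMO_STRONG_BARRIERS 1
#endif

#if defined(__aarch64__)
#define demo_dmb(_op) __asm__ volatile("dmb " #_op ::: "memory")
#if DEMO_STRONG_BARRIERS
#define demo_store_fence() demo_dmb(st)
#define demo_load_fence()  demo_dmb(ld)
#else
#define demo_store_fence() demo_dmb(oshst)
#define demo_load_fence()  demo_dmb(oshld)
#endif
#else
/* Fallback so the sketch also builds on non-ARM machines. */
#define demo_store_fence() __atomic_thread_fence(__ATOMIC_RELEASE)
#define demo_load_fence()  __atomic_thread_fence(__ATOMIC_ACQUIRE)
#endif

/* A one-slot mailbox: the payload must become visible before the flag. */
typedef struct {
    uint8_t           payload[64];
    volatile uint32_t ready;
} demo_slot_t;

/* Writer side: copy the payload, fence, then raise the flag. */
static int demo_publish(demo_slot_t *slot, const void *data, size_t len)
{
    if (len > sizeof(slot->payload)) {
        return -1;
    }
    memcpy(slot->payload, data, len);
    demo_store_fence(); /* order the payload stores before the flag store */
    slot->ready = 1;
    return 0;
}

/* Reader side: check the flag, fence, then read the payload.
 * Returns 1 if a message was consumed, 0 if the slot was empty. */
static int demo_consume(demo_slot_t *slot, void *out, size_t len)
{
    if (len > sizeof(slot->payload) || !slot->ready) {
        return 0;
    }
    demo_load_fence(); /* order the flag load before the payload loads */
    memcpy(out, slot->payload, len);
    slot->ready = 0;
    return 1;
}
```

Building once with `-DDEMO_STRONG_BARRIERS=0` and once with `-DDEMO_STRONG_BARRIERS=1` mirrors the before and after sides of the diff. If only the stronger variant is reliable on a given CPU, that would be consistent with the suspicion raised in this thread.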
Describe the bug
On the ARMv8 platform, with mpich-4.2.0 and ucx-1.16.0.rc5, running the IMB-MPI pingping benchmark between two nodes produces the warning "got active message id 0, but no handler installed", and the job hangs.
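For readers unfamiliar with the warning, the following is a simplified model of active-message dispatch, assuming a per-interface table of handlers indexed by message id. It is not UCX's implementation; every `demo_*` name is hypothetical. It only shows how a receiver that reads an id with no registered handler (for example a stale, zeroed header) would end up printing a message like the one above and dropping the data.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_AM_ID_MAX 32

/* Hypothetical handler signature: user argument, payload, payload length. */
typedef int (*demo_am_handler_t)(void *arg, const void *data, size_t length);

typedef struct {
    demo_am_handler_t cb;
    void             *arg;
} demo_am_entry_t;

typedef struct {
    demo_am_entry_t handlers[DEMO_AM_ID_MAX];
    unsigned        dropped; /* messages that arrived with no handler */
} demo_iface_t;

/* Register a handler for one message id. */
static int demo_set_handler(demo_iface_t *iface, uint8_t id,
                            demo_am_handler_t cb, void *arg)
{
    if (id >= DEMO_AM_ID_MAX) {
        return -1;
    }
    iface->handlers[id].cb  = cb;
    iface->handlers[id].arg = arg;
    return 0;
}

/* Dispatch a received message by the id found in its header. */
static int demo_dispatch(demo_iface_t *iface, uint8_t id,
                         const void *data, size_t length)
{
    if ((id >= DEMO_AM_ID_MAX) || (iface->handlers[id].cb == NULL)) {
        fprintf(stderr,
                "got active message id %u, but no handler installed\n",
                (unsigned)id);
        iface->dropped++;
        return -1;
    }
    return iface->handlers[id].cb(iface->handlers[id].arg, data, length);
}

/* Example handler: accumulate the number of received bytes. */
static int demo_count_bytes(void *arg, const void *data, size_t length)
{
    (void)data;
    *(size_t *)arg += length;
    return 0;
}
```

In this model a dropped message is never delivered to the upper layer, so a peer waiting for it would block indefinitely, which matches the hang described here.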
Steps to Reproduce
Setup and versions
VERSION="22.03 (LTS-SP3)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP3)"
ANSI_COLOR="0;31"
ibstat command
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.3502
Hardware version: 0
Node GUID: 0x1070fd03000246e1
System image GUID: 0x1070fd03000246e0
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 16
LMC: 0
SM lid: 1
Capability mask: 0xa651e848
Port GUID: 0x1070fd03000246e1
Link layer: InfiniBand
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
Memory domain: self
Component: self
register: unlimited, cost: 0 nsec
remote key: 0 bytes
rkey_ptr is supported
memory types: host (access,reg_nonblock,reg,cache)
Transport: self
Device: memory
Type: loopback
System device:
capabilities:
bandwidth: 0.00/ppn + 19360.00 MB/sec
latency: 0 nsec
overhead: 10 nsec
put_short: <= 4294967295
put_bcopy: unlimited
get_bcopy: unlimited
am_short: <= 8K
am_bcopy: <= 8K
domain: cpu
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to iface
device priority: 0
device num paths: 1
max eps: inf
device address: 0 bytes
iface address: 8 bytes
error handling: ep_check
Memory domain: tcp
Component: tcp
register: unlimited, cost: 0 nsec
remote key: 0 bytes
memory types: host (access,reg_nonblock,reg,cache)
Transport: tcp
Device: enP4p2s0f0
Type: network
System device: enP4p2s0f0 (0)
capabilities:
bandwidth: 113.16/ppn + 0.00 MB/sec
latency: 5776 nsec
overhead: 50000 nsec
put_zcopy: <= 18446744073709551590, up to 6 iov
put_opt_zcopy_align: <= 1
put_align_mtu: <= 0
am_short: <= 8K
am_bcopy: <= 8K
am_zcopy: <= 64K, up to 6 iov
am_opt_zcopy_align: <= 1
am_align_mtu: <= 0
am header: <= 8037
connection: to ep, to iface
device priority: 0
device num paths: 1
max eps: 256
device address: 6 bytes
iface address: 2 bytes
ep address: 10 bytes
error handling: peer failure, ep_check, keepalive
Transport: tcp
Device: ibp1s0f1
Type: network
System device: ibp1s0f1 (1)
capabilities:
bandwidth: 2200.00/ppn + 0.00 MB/sec
latency: 5206 nsec
overhead: 50000 nsec
put_zcopy: <= 18446744073709551590, up to 6 iov
put_opt_zcopy_align: <= 1
put_align_mtu: <= 0
am_short: <= 8K
am_bcopy: <= 8K
am_zcopy: <= 64K, up to 6 iov
am_opt_zcopy_align: <= 1
am_align_mtu: <= 0
am header: <= 8037
connection: to ep, to iface
device priority: 1
device num paths: 1
max eps: 256
device address: 6 bytes
iface address: 2 bytes
ep address: 10 bytes
error handling: peer failure, ep_check, keepalive
Transport: tcp
Device: lo
Type: network
System device:
capabilities:
bandwidth: 11.91/ppn + 0.00 MB/sec
latency: 10960 nsec
overhead: 50000 nsec
put_zcopy: <= 18446744073709551590, up to 6 iov
put_opt_zcopy_align: <= 1
put_align_mtu: <= 0
am_short: <= 8K
am_bcopy: <= 8K
am_zcopy: <= 64K, up to 6 iov
am_opt_zcopy_align: <= 1
am_align_mtu: <= 0
am header: <= 8037
connection: to ep, to iface
device priority: 1
device num paths: 1
max eps: 256
device address: 18 bytes
iface address: 2 bytes
ep address: 10 bytes
error handling: peer failure, ep_check, keepalive
Connection manager: tcp
max_conn_priv: 2064 bytes
Memory domain: sysv
Component: sysv
allocate: unlimited
remote key: 12 bytes
rkey_ptr is supported
memory types: host (access,alloc,cache)
Transport: sysv
Device: memory
Type: intra-node
System device:
capabilities:
bandwidth: 0.00/ppn + 15360.00 MB/sec
latency: 80 nsec
overhead: 10 nsec
put_short: <= 4294967295
put_bcopy: unlimited
get_bcopy: unlimited
am_short: <= 100
am_bcopy: <= 8256
domain: cpu
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to iface
device priority: 0
device num paths: 1
max eps: inf
device address: 8 bytes
iface address: 8 bytes
error handling: ep_check
Memory domain: posix
Component: posix
allocate: <= 264133672K
remote key: 24 bytes
rkey_ptr is supported
memory types: host (access,alloc,cache)
Transport: posix
Device: memory
Type: intra-node
System device:
capabilities:
bandwidth: 0.00/ppn + 15360.00 MB/sec
latency: 80 nsec
overhead: 10 nsec
put_short: <= 4294967295
put_bcopy: unlimited
get_bcopy: unlimited
am_short: <= 100
am_bcopy: <= 8256
domain: cpu
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to iface
device priority: 0
device num paths: 1
max eps: inf
device address: 8 bytes
iface address: 8 bytes
error handling: ep_check
Memory domain: mlx5_0
Component: ib
register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
remote key: 8 bytes
local memory handle is required for zcopy
memory types: host (access,reg,cache)
< no supported devices found >
Memory domain: mlx5_1
Component: ib
register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
remote key: 8 bytes
local memory handle is required for zcopy
memory types: host (access,reg,cache)
Transport: dc_mlx5
Device: mlx5_1:1
Type: network
System device: mlx5_1 (1)
capabilities:
bandwidth: 11794.23/ppn + 0.00 MB/sec
latency: 660 nsec
overhead: 40 nsec
put_short: <= 2K
put_bcopy: <= 8256
put_zcopy: <= 1G, up to 11 iov
put_opt_zcopy_align: <= 512
put_align_mtu: <= 4K
get_bcopy: <= 8256
get_zcopy: 65..1G, up to 11 iov
get_opt_zcopy_align: <= 512
get_align_mtu: <= 4K
am_short: <= 2046
am_bcopy: <= 8254
am_zcopy: <= 8254, up to 3 iov
am_opt_zcopy_align: <= 512
am_align_mtu: <= 4K
am header: <= 138
domain: device
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to iface
device priority: 38
device num paths: 1
max eps: inf
device address: 3 bytes
iface address: 7 bytes
error handling: buffer (zcopy), remote access, peer failure, ep_check
Transport: rc_verbs
Device: mlx5_1:1
Type: network
System device: mlx5_1 (1)
capabilities:
bandwidth: 11794.23/ppn + 0.00 MB/sec
latency: 600 + 1.000 * N nsec
overhead: 75 nsec
put_short: <= 124
put_bcopy: <= 8256
put_zcopy: <= 1G, up to 5 iov
put_opt_zcopy_align: <= 512
put_align_mtu: <= 4K
get_bcopy: <= 8256
get_zcopy: 65..1G, up to 5 iov
get_opt_zcopy_align: <= 512
get_align_mtu: <= 4K
am_short: <= 123
am_bcopy: <= 8255
am_zcopy: <= 8255, up to 4 iov
am_opt_zcopy_align: <= 512
am_align_mtu: <= 4K
am header: <= 127
domain: device
atomic_add: 64 bit
atomic_fadd: 64 bit
atomic_cswap: 64 bit
connection: to ep
device priority: 38
device num paths: 1
max eps: 256
device address: 3 bytes
ep address: 7 bytes
error handling: peer failure, ep_check
Transport: rc_mlx5
Device: mlx5_1:1
Type: network
System device: mlx5_1 (1)
capabilities:
bandwidth: 11794.23/ppn + 0.00 MB/sec
latency: 600 + 1.000 * N nsec
overhead: 40 nsec
put_short: <= 2K
put_bcopy: <= 8256
put_zcopy: <= 1G, up to 14 iov
put_opt_zcopy_align: <= 512
put_align_mtu: <= 4K
get_bcopy: <= 8256
get_zcopy: 65..1G, up to 14 iov
get_opt_zcopy_align: <= 512
get_align_mtu: <= 4K
am_short: <= 2046
am_bcopy: <= 8254
am_zcopy: <= 8254, up to 3 iov
am_opt_zcopy_align: <= 512
am_align_mtu: <= 4K
am header: <= 186
domain: device
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to ep
device priority: 38
device num paths: 1
max eps: 256
device address: 3 bytes
ep address: 10 bytes
error handling: buffer (zcopy), remote access, peer failure, ep_check
Transport: ud_verbs
Device: mlx5_1:1
Type: network
System device: mlx5_1 (1)
capabilities:
bandwidth: 11794.23/ppn + 0.00 MB/sec
latency: 630 nsec
overhead: 105 nsec
am_short: <= 116
am_bcopy: <= 4088
am_zcopy: <= 4088, up to 5 iov
am_opt_zcopy_align: <= 512
am_align_mtu: <= 4K
am header: <= 3992
connection: to ep, to iface
device priority: 38
device num paths: 1
max eps: inf
device address: 3 bytes
iface address: 3 bytes
ep address: 6 bytes
error handling: peer failure, ep_check
Transport: ud_mlx5
Device: mlx5_1:1
Type: network
System device: mlx5_1 (1)
capabilities:
bandwidth: 11794.23/ppn + 0.00 MB/sec
latency: 630 nsec
overhead: 80 nsec
am_short: <= 180
am_bcopy: <= 4088
am_zcopy: <= 4088, up to 3 iov
am_opt_zcopy_align: <= 512
am_align_mtu: <= 4K
am header: <= 132
connection: to ep, to iface
device priority: 38
device num paths: 1
max eps: inf
device address: 3 bytes
iface address: 3 bytes
ep address: 6 bytes
error handling: peer failure, ep_check
Connection manager: rdmacm
max_conn_priv: 54 bytes
Memory domain: cma
Component: cma
register: unlimited, cost: 9 nsec
memory types: host (access,reg_nonblock,reg,cache)
Transport: cma
Device: memory
Type: intra-node
System device:
capabilities:
bandwidth: 0.00/ppn + 11145.00 MB/sec
latency: 80 nsec
overhead: 2000 nsec
put_zcopy: unlimited, up to 16 iov
put_opt_zcopy_align: <= 1
put_align_mtu: <= 1
get_zcopy: unlimited, up to 16 iov
get_opt_zcopy_align: <= 1
get_align_mtu: <= 1
connection: to iface
device priority: 0
device num paths: 1
max eps: inf
device address: 8 bytes
iface address: 4 bytes
error handling: peer failure, ep_check
Memory domain: xpmem
Component: xpmem
register: unlimited, cost: 60 nsec
remote key: 24 bytes
rkey_ptr is supported
memory types: host (access,alloc,reg_nonblock,reg,cache)
Transport: xpmem
Device: memory
Type: intra-node
System device:
capabilities:
bandwidth: 0.00/ppn + 15360.00 MB/sec
latency: 80 nsec
overhead: 10 nsec
put_short: <= 4294967295
put_bcopy: unlimited
get_bcopy: unlimited
am_short: <= 100
am_bcopy: <= 8256
domain: cpu
atomic_add: 32, 64 bit
atomic_and: 32, 64 bit
atomic_or: 32, 64 bit
atomic_xor: 32, 64 bit
atomic_fadd: 32, 64 bit
atomic_fand: 32, 64 bit
atomic_for: 32, 64 bit
atomic_fxor: 32, 64 bit
atomic_swap: 32, 64 bit
atomic_cswap: 32, 64 bit
connection: to iface
device priority: 0
device num paths: 1
max eps: inf
device address: 8 bytes
iface address: 16 bytes
error handling: none