got active message id 0, but no handler installed #9819

Closed
duqi-hnu opened this issue Apr 13, 2024 · 13 comments

@duqi-hnu

Describe the bug

On an ARMv8 platform with mpich-4.2.0 and ucx-1.16.0.rc5, running the IMB-MPI1 pingping test between two nodes prints the warning "got active message id 0, but no handler installed", and the job hangs.
image

Steps to Reproduce

  • mpirun -f nodelist -n 2 ./IMB-MPI1 pingping
  • UCX version used 1.16.0
  • Any UCX environment variables used

Setup and versions

  • OS version (Linux distro) + CPU architecture (aarch64)
    • NAME="openEuler"
      VERSION="22.03 (LTS-SP3)"
      ID="openEuler"
      VERSION_ID="22.03"
      PRETTY_NAME="openEuler 22.03 (LTS-SP3)"
      ANSI_COLOR="0;31"
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • OFED-internal-23.07-0.5.1
    • HW information from ibstat command
    • CA 'mlx5_1'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.35.3502
      Hardware version: 0
      Node GUID: 0x1070fd03000246e1
      System image GUID: 0x1070fd03000246e0
      Port 1:
      State: Active
      Physical state: LinkUp
      Rate: 100
      Base lid: 16
      LMC: 0
      SM lid: 1
      Capability mask: 0xa651e848
      Port GUID: 0x1070fd03000246e1
      Link layer: InfiniBand

Additional information (depending on the issue)

  • MPICH version: 4.2.0
  • Output of ucx_info -d to show transports and devices recognized by UCX

Memory domain: self

Component: self

register: unlimited, cost: 0 nsec

remote key: 0 bytes

rkey_ptr is supported

memory types: host (access,reg_nonblock,reg,cache)

Transport: self

Device: memory

Type: loopback

System device:

capabilities:

bandwidth: 0.00/ppn + 19360.00 MB/sec

latency: 0 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 8K

am_bcopy: <= 8K

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 0 bytes

iface address: 8 bytes

error handling: ep_check

Memory domain: tcp

Component: tcp

register: unlimited, cost: 0 nsec

remote key: 0 bytes

memory types: host (access,reg_nonblock,reg,cache)

Transport: tcp

Device: enP4p2s0f0

Type: network

System device: enP4p2s0f0 (0)

capabilities:

bandwidth: 113.16/ppn + 0.00 MB/sec

latency: 5776 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 0

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

Transport: tcp

Device: ibp1s0f1

Type: network

System device: ibp1s0f1 (1)

capabilities:

bandwidth: 2200.00/ppn + 0.00 MB/sec

latency: 5206 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

Transport: tcp

Device: lo

Type: network

System device:

capabilities:

bandwidth: 11.91/ppn + 0.00 MB/sec

latency: 10960 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 18 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

Connection manager: tcp

max_conn_priv: 2064 bytes

Memory domain: sysv

Component: sysv

allocate: unlimited

remote key: 12 bytes

rkey_ptr is supported

memory types: host (access,alloc,cache)

Transport: sysv

Device: memory

Type: intra-node

System device:

capabilities:

bandwidth: 0.00/ppn + 15360.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

Memory domain: posix

Component: posix

allocate: <= 264133672K

remote key: 24 bytes

rkey_ptr is supported

memory types: host (access,alloc,cache)

Transport: posix

Device: memory

Type: intra-node

System device:

capabilities:

bandwidth: 0.00/ppn + 15360.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

Memory domain: mlx5_0

Component: ib

register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec

remote key: 8 bytes

local memory handle is required for zcopy

memory types: host (access,reg,cache)

< no supported devices found >

Memory domain: mlx5_1

Component: ib

register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec

remote key: 8 bytes

local memory handle is required for zcopy

memory types: host (access,reg,cache)

Transport: dc_mlx5

Device: mlx5_1:1

Type: network

System device: mlx5_1 (1)

capabilities:

bandwidth: 11794.23/ppn + 0.00 MB/sec

latency: 660 nsec

overhead: 40 nsec

put_short: <= 2K

put_bcopy: <= 8256

put_zcopy: <= 1G, up to 11 iov

put_opt_zcopy_align: <= 512

put_align_mtu: <= 4K

get_bcopy: <= 8256

get_zcopy: 65..1G, up to 11 iov

get_opt_zcopy_align: <= 512

get_align_mtu: <= 4K

am_short: <= 2046

am_bcopy: <= 8254

am_zcopy: <= 8254, up to 3 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 138

domain: device

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 38

device num paths: 1

max eps: inf

device address: 3 bytes

iface address: 7 bytes

error handling: buffer (zcopy), remote access, peer failure, ep_check

Transport: rc_verbs

Device: mlx5_1:1

Type: network

System device: mlx5_1 (1)

capabilities:

bandwidth: 11794.23/ppn + 0.00 MB/sec

latency: 600 + 1.000 * N nsec

overhead: 75 nsec

put_short: <= 124

put_bcopy: <= 8256

put_zcopy: <= 1G, up to 5 iov

put_opt_zcopy_align: <= 512

put_align_mtu: <= 4K

get_bcopy: <= 8256

get_zcopy: 65..1G, up to 5 iov

get_opt_zcopy_align: <= 512

get_align_mtu: <= 4K

am_short: <= 123

am_bcopy: <= 8255

am_zcopy: <= 8255, up to 4 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 127

domain: device

atomic_add: 64 bit

atomic_fadd: 64 bit

atomic_cswap: 64 bit

connection: to ep

device priority: 38

device num paths: 1

max eps: 256

device address: 3 bytes

ep address: 7 bytes

error handling: peer failure, ep_check

Transport: rc_mlx5

Device: mlx5_1:1

Type: network

System device: mlx5_1 (1)

capabilities:

bandwidth: 11794.23/ppn + 0.00 MB/sec

latency: 600 + 1.000 * N nsec

overhead: 40 nsec

put_short: <= 2K

put_bcopy: <= 8256

put_zcopy: <= 1G, up to 14 iov

put_opt_zcopy_align: <= 512

put_align_mtu: <= 4K

get_bcopy: <= 8256

get_zcopy: 65..1G, up to 14 iov

get_opt_zcopy_align: <= 512

get_align_mtu: <= 4K

am_short: <= 2046

am_bcopy: <= 8254

am_zcopy: <= 8254, up to 3 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 186

domain: device

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to ep

device priority: 38

device num paths: 1

max eps: 256

device address: 3 bytes

ep address: 10 bytes

error handling: buffer (zcopy), remote access, peer failure, ep_check

Transport: ud_verbs

Device: mlx5_1:1

Type: network

System device: mlx5_1 (1)

capabilities:

bandwidth: 11794.23/ppn + 0.00 MB/sec

latency: 630 nsec

overhead: 105 nsec

am_short: <= 116

am_bcopy: <= 4088

am_zcopy: <= 4088, up to 5 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 3992

connection: to ep, to iface

device priority: 38

device num paths: 1

max eps: inf

device address: 3 bytes

iface address: 3 bytes

ep address: 6 bytes

error handling: peer failure, ep_check

Transport: ud_mlx5

Device: mlx5_1:1

Type: network

System device: mlx5_1 (1)

capabilities:

bandwidth: 11794.23/ppn + 0.00 MB/sec

latency: 630 nsec

overhead: 80 nsec

am_short: <= 180

am_bcopy: <= 4088

am_zcopy: <= 4088, up to 3 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 132

connection: to ep, to iface

device priority: 38

device num paths: 1

max eps: inf

device address: 3 bytes

iface address: 3 bytes

ep address: 6 bytes

error handling: peer failure, ep_check

Connection manager: rdmacm

max_conn_priv: 54 bytes

Memory domain: cma

Component: cma

register: unlimited, cost: 9 nsec

memory types: host (access,reg_nonblock,reg,cache)

Transport: cma

Device: memory

Type: intra-node

System device:

capabilities:

bandwidth: 0.00/ppn + 11145.00 MB/sec

latency: 80 nsec

overhead: 2000 nsec

put_zcopy: unlimited, up to 16 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 1

get_zcopy: unlimited, up to 16 iov

get_opt_zcopy_align: <= 1

get_align_mtu: <= 1

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 4 bytes

error handling: peer failure, ep_check

Memory domain: xpmem

Component: xpmem

register: unlimited, cost: 60 nsec

remote key: 24 bytes

rkey_ptr is supported

memory types: host (access,alloc,reg_nonblock,reg,cache)

Transport: xpmem

Device: memory

Type: intra-node

System device:

capabilities:

bandwidth: 0.00/ppn + 15360.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 16 bytes

error handling: none

@duqi-hnu duqi-hnu added the Bug label Apr 13, 2024
@yosefe
Contributor

yosefe commented Apr 13, 2024

@duqi-hnu What is the output of "lscpu"?
It's possible that some of the memory fences we use are not suited to this platform.
Can you please try with UCX_TLS=rc_verbs?

@duqi-hnu
Author

@yosefe Here is a screenshot of the lscpu output:
image

I also tried with UCX_TLS=rc_verbs:
UCX_TLS=rc_verbs mpirun -f nodelist -n 2 ./IMB-MPI1 pingping
image

@yosefe
Contributor

yosefe commented Apr 14, 2024

@duqi-hnu It seems the CPU vendor and model are unknown. What kind of CPU is it?
Can you try with UCX_TLS=rc_v?

@duqi-hnu
Author

@yosefe This is an ARMv8 CPU.
When I run "UCX_TLS=rc_v mpirun -f nodelist -n 2 ./IMB-MPI1 pingping", the job runs properly. Why?

@yosefe
Contributor

yosefe commented Apr 14, 2024

rc_v uses different, apparently "stronger", memory barriers that come from rdma-core.
Can you please try the following UCX patch:

diff --git a/src/ucs/arch/aarch64/cpu.h b/src/ucs/arch/aarch64/cpu.h
index 9f45e9d40b..4b07c57784 100644
--- a/src/ucs/arch/aarch64/cpu.h
+++ b/src/ucs/arch/aarch64/cpu.h
@@ -53,8 +53,8 @@ BEGIN_C_DECLS
  * of DSB. The barrier used for synchronization of access between write back
  * and device mapped memory (PCIe BAR).
  */
-#define ucs_memory_bus_store_fence()  ucs_aarch64_dmb(oshst)
-#define ucs_memory_bus_load_fence()   ucs_aarch64_dmb(oshld)
+#define ucs_memory_bus_store_fence()  ucs_aarch64_dmb(st)
+#define ucs_memory_bus_load_fence()   ucs_aarch64_dmb(ld)

 /* The macro is used to flush all pending stores from write combining buffer.
  * Micro-arch "auto" flush the stores once WC buffer is full or on a timeout.
@@ -62,16 +62,16 @@ BEGIN_C_DECLS
  * rdma subsystem patch:
  * https://patchwork.kernel.org/project/linux-rdma/patch/[email protected]/
  */
-#define ucs_memory_bus_cacheline_wc_flush()     ucs_aarch64_dgh()
+#define ucs_memory_bus_cacheline_wc_flush()     ucs_memory_bus_store_fence()

 #define ucs_memory_cpu_fence()        ucs_aarch64_dmb(ish)
-#define ucs_memory_cpu_store_fence()  ucs_aarch64_dmb(ishst)
-#define ucs_memory_cpu_load_fence()   ucs_aarch64_dmb(ishld)
+#define ucs_memory_cpu_store_fence()  ucs_aarch64_dmb(st)
+#define ucs_memory_cpu_load_fence()   ucs_aarch64_dmb(ld)

 /* The macro is used to serialize stores to Normal NC or Device memory
  * (see Arm Spec, B2.7.2)
  */
-#define ucs_memory_cpu_wc_fence()     ucs_aarch64_dmb(oshst)
+#define ucs_memory_cpu_wc_fence()     ucs_aarch64_dmb(st)


 /*
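
For context, the fences touched by this patch protect a publish/consume pattern: one side writes a payload and then marks it valid, and the other side checks the valid flag before reading the payload. The following is only a minimal sketch of that pattern, not UCX code. `slot_t`, `slot_publish` and `slot_consume` are hypothetical names, and portable C11 fences stand in for the `dmb` instructions in the patch (they are not guaranteed to compile to the same instruction).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue slot, for illustration only (not a UCX structure). */
typedef struct {
    uint64_t    payload; /* data written by the producer */
    atomic_uint valid;   /* set to 1 once payload is complete */
} slot_t;

/* Producer: write the payload, fence, then publish the valid flag.
 * The release fence plays the role of a store fence: the payload store
 * must become visible before the flag store. */
static void slot_publish(slot_t *s, uint64_t value)
{
    s->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->valid, 1, memory_order_relaxed);
}

/* Consumer: check the flag, fence, then read the payload.
 * The acquire fence plays the role of a load fence: the payload load
 * must not be satisfied before the flag load.
 * Returns 1 and fills *out if a value was available, 0 otherwise. */
static int slot_consume(slot_t *s, uint64_t *out)
{
    if (atomic_load_explicit(&s->valid, memory_order_relaxed) == 0) {
        return 0;
    }
    atomic_thread_fence(memory_order_acquire);
    *out = s->payload;
    return 1;
}
```

If either fence is weaker than what the hardware needs, the consumer can observe the flag before the payload, which is the kind of stale read that could plausibly surface as an unexpected active message id.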

@duqi-hnu
Author

@yosefe With this patch the situation has improved: the probability of a hang has dropped to about 10%, but the problem is still not fundamentally solved.

@yosefe
Contributor

yosefe commented Apr 15, 2024

@duqi-hnu Does the test always pass with UCX_TLS=rc_v?
Can we consult the CPU vendor about which memory barrier is the right one to use?

@duqi-hnu
Author

@yosefe When we use UCX_TLS=all, the probability of an error is greater than with UCX_TLS=rc_v.

@yosefe
Contributor

yosefe commented Apr 15, 2024

> @yosefe when we use UCX_TLS=all, the probability of an error is greater than UCX_TLS=rc_v.

I see. But did it ever fail with UCX_TLS=rc_v, at least once?

@duqi-hnu
Author

@yosefe Yes, it also fails with UCX_TLS=rc_v, just with a low probability.

@yosefe
Contributor

yosefe commented Apr 15, 2024

It seems like an issue with the specific CPU implementation. rdma-core uses ARM-spec compliant memory barriers and UCX_TLS=rc_v uses rdma-core for posting WQE and polling CQ.

@duqi-hnu
Author

@yosefe If I want to use a different transport protocol, do I need to modify the code so that it conforms to the memory barrier semantics of the ARM specification?

@yosefe
Contributor

yosefe commented Apr 15, 2024

It seems the specific CPU you are using does not implement memory barriers according to the ARM specification. You can try different memory barriers (modify the attached patch accordingly) and use UCX_TLS=all (with UCX_TLS=rc_v, the barriers are part of rdma-core).
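
One possible way to experiment with barrier strength is a compile-time switch around the fence macros, so that two builds can be compared on the same hardware. The sketch below is an assumption about how such an experiment could be organized, not part of UCX: `EXPERIMENT_FULL_SYSTEM_BARRIERS` and every `exp_*` name are hypothetical. The `dmb` options mirror the patch: `oshst`/`oshld` limit the barrier to the outer shareable domain, while `st`/`ld` apply to the full system. On non-AArch64 builds the macros fall back to C11 fences so the code still compiles.

```c
#include <stdatomic.h>

/* Hypothetical knob, e.g. -DEXPERIMENT_FULL_SYSTEM_BARRIERS=1.
 * This is not a UCX build option. */
#ifndef EXPERIMENT_FULL_SYSTEM_BARRIERS
#define EXPERIMENT_FULL_SYSTEM_BARRIERS 0
#endif

#if defined(__aarch64__)
#  define exp_dmb(_opt) __asm__ volatile ("dmb " #_opt ::: "memory")
#  if EXPERIMENT_FULL_SYSTEM_BARRIERS
#    define exp_store_fence() exp_dmb(st)    /* full system, stores */
#    define exp_load_fence()  exp_dmb(ld)    /* full system, loads  */
#  else
#    define exp_store_fence() exp_dmb(oshst) /* outer shareable, stores */
#    define exp_load_fence()  exp_dmb(oshld) /* outer shareable, loads  */
#  endif
#else
/* Portable stand-ins for other architectures. */
#  define exp_store_fence() atomic_thread_fence(memory_order_release)
#  define exp_load_fence()  atomic_thread_fence(memory_order_acquire)
#endif

/* Write a descriptor, fence, then bump a doorbell counter.
 * Both locations are ordinary memory in this sketch.
 * Returns the new doorbell value. */
static unsigned exp_post(unsigned *desc, volatile unsigned *doorbell,
                         unsigned value)
{
    *desc = value;
    exp_store_fence();
    *doorbell = *doorbell + 1;
    return *doorbell;
}

/* Check the doorbell, fence, then read the descriptor.
 * Returns 0 if nothing has been posted yet. */
static unsigned exp_poll(const volatile unsigned *doorbell,
                         const unsigned *desc)
{
    if (*doorbell == 0) {
        return 0;
    }
    exp_load_fence();
    return *desc;
}
```

Building once with the flag set to 0 and once with it set to 1, then comparing hang rates under UCX_TLS=all, would show whether widening the shareability domain alone changes the behavior.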
