Assertions in ud_ep.c fail sporadically #8620

Open · ioanhadade opened this issue Oct 12, 2022 · 8 comments

ioanhadade commented Oct 12, 2022

Describe the bug

Every now and then, we hit an assertion failure in UCX which leads to an application abort. These assertion failures seem to be raised either while the application is in an MPI_Wait call or in an MPI_Recv. Example stack traces for both cases are shown below:

Abort in MPI_Wait

/usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_ctl() ]
...
762 ucs_assert_always(ctl->type == UCT_UD_PACKET_CREP);
763
764 if (uct_ud_ep_is_connected(ep)) {
==> 765 ucs_assertv_always(ep->dest_ep_id == ctl->conn_rep.src_ep_id,
766 "ep=%p [id=%d dest_ep_id=%d flags=0x%x] "
767 "crep [neth->dest=%d dst_ep_id=%d src_ep_id=%d]",
768 ep, ep->ep_id, ep->dest_ep_id, ep->path_index, ep->flags,


==== backtrace (tid: 190902) ====
0 0x0000000000059f16 uct_ud_ep_rx_ctl() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c:765
1 0x0000000000059f16 uct_ud_ep_process_rx() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c:893
2 0x0000000000062c91 uct_ud_mlx5_iface_poll_rx() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/accel/ud_mlx5.c:510
3 0x0000000000062c91 uct_ud_mlx5_iface_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/accel/ud_mlx5.c:559
4 0x000000000003645b ucs_callbackq_dispatch() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucs/datastruct/callbackq.h:211
5 0x000000000003645b uct_worker_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/api/uct.h:2592
6 0x000000000003645b ucp_worker_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/core/ucp_worker.c:2455
7 0x000000000002f324 opal_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/opal/runtime/opal_progress.c:231
8 0x0000000000035f9d ompi_sync_wait_mt() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/opal/threads/wait_sync.c:85
9 0x0000000000054a56 ompi_request_default_wait_all() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/request/req_wait.c:243
10 0x0000000000083063 PMPI_Waitall() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mpi/c/profile/pwaitall.c:80
11 0x000000000005a60e ompi_waitall_f() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:10

Abort in MPI_Recv

[ac1-2033:471587:0:471587] ud_ep.c:762 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed

/usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_ctl() ]
...
759 uct_ud_ctl_hdr_t *ctl = (uct_ud_ctl_hdr_t*)(neth + 1);
760
761 ucs_trace_func("");
==> 762 ucs_assert_always(ctl->type == UCT_UD_PACKET_CREP);
763
764 if (uct_ud_ep_is_connected(ep)) {
765 ucs_assertv_always(ep->dest_ep_id == ctl->conn_rep.src_ep_id,

==== backtrace (tid: 471587) ====
0 0x0000000000059eb4 uct_ud_ep_rx_ctl() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c:762
1 0x0000000000059eb4 uct_ud_ep_rx_ctl() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c:764
2 0x0000000000059eb4 uct_ud_ep_process_rx() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/base/ud_ep.c:893
3 0x0000000000062c91 uct_ud_mlx5_iface_poll_rx() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/accel/ud_mlx5.c:510
4 0x0000000000062c91 uct_ud_mlx5_iface_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/ib/ud/accel/ud_mlx5.c:559
5 0x000000000003645b ucs_callbackq_dispatch() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucs/datastruct/callbackq.h:211
6 0x000000000003645b uct_worker_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/uct/api/uct.h:2592
7 0x000000000003645b ucp_worker_progress() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/core/ucp_worker.c:2455
8 0x000000000000450b mca_pml_ucx_recv() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:644
9 0x000000000007c2bf PMPI_Recv() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mpi/c/profile/precv.c:82
10 0x00000000000580d3 ompi_recv_f() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mpi/fortran/mpif-h/profile/precv_f.c:85

We have looked more closely at the application stack traces as well and noticed that these failures occur in code regions where we post a number of MPI_Isends and only later post the matching receives and the wait. What puzzles us more is why we seem to be going down the UD code path at all, given that we run with the environment listed under Steps to Reproduce below. We have started modifying these code sections so that the receives are posted before the sends, as is typically recommended anyway, but our code base is very large, so there may still be places where the former pattern remains; the sketch below shows the ordering we are moving to.
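(Hypothetical illustration only, not our actual model code:)

#include <mpi.h>

/* Hypothetical halo-style exchange: post all receives first, then the
 * sends, then wait on everything, so incoming messages find a posted
 * receive instead of landing on the unexpected queue. The buffer layout
 * and peer list are made up for the example. */
static void exchange(double *sendbuf, double *recvbuf, int count,
                     const int *peers, int npeers, MPI_Comm comm)
{
    MPI_Request reqs[2 * npeers];

    for (int i = 0; i < npeers; i++) {   /* receives first */
        MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE, peers[i],
                  0, comm, &reqs[i]);
    }
    for (int i = 0; i < npeers; i++) {   /* then the sends */
        MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE, peers[i],
                  0, comm, &reqs[npeers + i]);
    }
    MPI_Waitall(2 * npeers, reqs, MPI_STATUSES_IGNORE);
}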

Steps to Reproduce

ucx_info -v
UCT version=1.11.1 revision c58db6b
configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --prefix=/usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/ucx/ec-hpcx-ompi

export UCX_TLS=rc_x,self,sm
export OMPI_MCA_coll_hcoll=0

Setup and versions

uname -a
Linux ab6-100.bullx 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Tue Sep 7 07:07:31 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"
Red Hat Enterprise Linux release 8.4 (Ootpa)
Red Hat Enterprise Linux release 8.4 (Ootpa)

ibv_devinfo -vv
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.32.1010
node_guid: 0800:3803:00ba:9f50
sys_image_guid: 0800:3803:00ba:9f50
vendor_id: 0x119f
vendor_part_id: 4123
hw_ver: 0x0
board_id: BL_12001641
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0xe97e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
UD_IP_CSUM
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
MANAGED_FLOW_STEERING
Unknown flags: 0xC8480000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 8388608
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 156250kHZ
device_cap_flags_ex: 0x30000051E97E1C36
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 0
rss_caps:
max_rwq_indirection_tables: 0
max_rwq_indirection_table_size: 0
rx_hash_function: 0x0
rx_hash_fields_mask: 0x0
max_wq_type_rq: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
max_rndv_hdr_size: 64
max_num_tags: 127
max_ops: 32768
max_sge: 1
flags:
IBV_TM_CAP_RC

cq moderation caps:
	max_cq_count:	65535
	max_cq_period:	4095 us

maximum available device memory:	262144Bytes

num_comp_vectors:		63
	port:	1
		state:			PORT_ACTIVE (4)
		max_mtu:		4096 (5)
		active_mtu:		4096 (5)
		sm_lid:			115
		port_lid:		966
		port_lmc:		0x00
		link_layer:		InfiniBand
		max_msg_sz:		0x40000000
		port_cap_flags:		0x2251e848
		port_cap_flags2:	0x0032
		max_vl_num:		4 (3)
		bad_pkey_cntr:		0x0
		qkey_viol_cntr:		0x0
		sm_sl:			0
		pkey_tbl_len:		128
		gid_tbl_len:		8
		subnet_timeout:		18
		init_type_reply:	0
		active_width:		4X (2)
		active_speed:		50.0 Gbps (64)
		phys_state:		LINK_UP (5)
		GID[  0]:		fe80:0000:0000:0000:0800:3803:00ba:9f50

ofed_info -s
OFED-internal-5.4-1.0.3:

@dmitrygx (Member)

@ioanhadade is it possible to get a reproducer? Or could you try UCX master or the latest UCX release - https://github.com/openucx/ucx/releases/tag/v1.13.1?

@ioanhadade (Author)

A reproducer will be hard to provide, as the failure occurs when we run our entire model, and in various specific places. We can definitely try upgrading to UCX 1.13, but we were wondering whether you could briefly explain what this failed assertion means. At a high level, it looks to us as though the issue appears when we have a high number of outstanding isends for which a receive has not yet been posted.

@dmitrygx (Member)

It means that UCX's UD transport received an unexpected control packet during its connection handshake: either a CREQ ("connection request") arrived where the expected CREP ("connection reply") should have (the ud_ep.c:762 assert in your MPI_Recv trace), or a CREP arrived from an unexpected source endpoint (the ud_ep.c:765 assert in your MPI_Wait trace).
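For anyone hitting this later, here is a simplified paraphrase of the check that fires, based on the ud_ep.c excerpts quoted above (names abbreviated; not the literal UCX source):

/* Simplified paraphrase of uct_ud_ep_rx_ctl(); see ud_ep.c for the real
 * code. CREQ/CREP form the UD wireup handshake: the initiator sends a
 * CREQ carrying its endpoint id, and the target answers with a CREP
 * carrying its own. */
static void rx_ctl(uct_ud_ep_t *ep, uct_ud_ctl_hdr_t *ctl)
{
    /* ud_ep.c:762 - only a CREP is expected on this path; any other
     * control packet type trips the abort seen in the MPI_Recv trace */
    ucs_assert_always(ctl->type == UCT_UD_PACKET_CREP);

    if (uct_ud_ep_is_connected(ep)) {
        /* ud_ep.c:765 - the CREP must come from the peer endpoint that
         * was recorded at connect time; a mismatch trips the abort seen
         * in the MPI_Wait trace */
        ucs_assert_always(ep->dest_ep_id == ctl->conn_rep.src_ep_id);
    }
    /* ... process the reply ... */
}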

@ioanhadade (Author)

Ah, OK. I guess UD is used for these control packets irrespective of whether you request rc in UCX_TLS, right?

@dmitrygx (Member)

> Ah, OK. I guess UD is used for these control packets irrespective of whether you request rc in UCX_TLS, right?

If you specify UCX_TLS=rc, then UD is used during the wireup procedure as an auxiliary transport, and the UD QP is closed afterwards.

@ioanhadade (Author)

Is there a way to disable this auxiliary transport and force UCX to only create RC QPs?

We will try and move to UCX 1.13 and see if the problem goes away.

@dmitrygx (Member)

> Is there a way to disable this auxiliary transport and force UCX to only create RC QPs?

No, that is not possible: the UD QPs have to be created along with the RC QPs, but the UD QPs won't be used during communication.
However, you could try the DC transport, which doesn't create UD QPs; please set UCX_TLS=dc.
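For example, mirroring the transports in your current settings (the sm/self entries are kept from your environment above; dc_x is the accelerated mlx5 variant, analogous to rc_x):

export UCX_TLS=dc_x,self,sm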

> We will try and move to UCX 1.13 and see if the problem goes away.

great

@ioanhadade (Author)

Thanks for this. We prefer to use rc (rc_x for our adapter), as it is much more performant than dc (dc_x) for our workloads (up to 128 nodes).
