ch4/ofi: libfabric psm3 causing segfaults #5987

Closed
hzhou opened this issue May 4, 2022 · 1 comment

hzhou commented May 4, 2022

We have seen this error before, ref. #5499 (comment). Now that #5864 has re-enabled psm3, it is showing up again in the nightly tests with the noweak config.

+ mpichversion

mpichversion:50857 terminated with signal 11 at PC=7f7332a5decc SP=7fff2f25d608.  Backtrace:
/lib64/libc.so.6(cfree+0x1c)[0x7f7332a5decc]
/lib64/ld-linux-x86-64.so.2(+0x1003a)[0x7f7337b7503a]
/lib64/libc.so.6(+0x39c99)[0x7f7332a11c99]
/lib64/libc.so.6(+0x39ce7)[0x7f7332a11ce7]
/lib64/libc.so.6(__libc_start_main+0xfc)[0x7f73329fa50c]
mpichversion[0x400ee7]
MPICH Version:    	4.1a1
MPICH Release date:	Sat Apr 30 00:48:02 CDT 2022
MPICH Device:    	ch4:ofi
MPICH configure: 	--prefix=/var/lib/jenkins-slave/workspace/mpich-main-special-tests/compiler/gnu/jenkins_configure/noweak/label/centos64/netmod/ch4-ofi/mpich-main/_inst --with-device=ch4:ofi --with-libfabric=embedded --disable-mlx --disable-weak-symbols --enable-large-tests --with-wrapper-dl-type=rpath
MPICH CC: 	gcc -std=gnu99    -O2
MPICH CXX: 	g++   -O2
MPICH F77: 	gfortran   -O2
MPICH FC: 	gfortran   -O2
MPICH Custom Information: 	
Build step 'Run with timeout' marked build as failure

The error is in the psm3 destructor, where it frees memory allocated during init. Somehow, the destructor is being called twice.

Strangely, this time the failure occurs only with --disable-weak-symbols.
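
For illustration only, here is a minimal sketch of the failure mode described above: a library destructor that frees init-time memory and is safe to run more than once. This is not the psm3 code and not the fix that was applied. The names demo_state, demo_free_count, demo_init and demo_fini are hypothetical, and the constructor/destructor attributes assume GCC or Clang.

```c
#include <stdlib.h>

/* Hypothetical stand-in for state a provider allocates during init. */
char *demo_state = NULL;
/* Counts how many times the buffer was actually released. */
int demo_free_count = 0;

/* Runs before main() when built with GCC or Clang. Idempotent, so an
 * explicit second call does not leak the first allocation. */
__attribute__((constructor))
void demo_init(void)
{
    if (demo_state == NULL)
        demo_state = malloc(64);
}

/* Runs at exit. If it were invoked twice without the NULL guard, the
 * second free() would act on an already-freed pointer, which is the
 * kind of crash seen in cfree in the backtrace above. */
__attribute__((destructor))
void demo_fini(void)
{
    if (demo_state == NULL)
        return;              /* already torn down, nothing to do */
    free(demo_state);
    demo_state = NULL;       /* makes any later call a no-op */
    demo_free_count++;
}
```

Calling demo_fini() twice by hand releases the buffer only once, and the automatic call at exit then does nothing. A guard like this only hides a double invocation; it does not explain why the destructor runs twice.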


hzhou commented May 16, 2022

This is due to namespace collisions, which libfabric has fixed upstream. It is now fixed by #6006.

hzhou closed this as completed May 16, 2022