prov/psm3: symbol collisions with either psm or psm2 #7757

Closed
hzhou opened this issue May 12, 2022 · 4 comments

@hzhou
Contributor

hzhou commented May 12, 2022

Describe the bug
We have been fighting these segfaults for a long time. Depending on the configuration and where the testing is done, we hit the following:

+ mpichversion

mpichversion:197053 terminated with signal 11 at PC=7fb96300fecc SP=7ffce054e3b8.  Backtrace:
/lib64/libc.so.6(cfree+0x1c)[0x7fb96300fecc]
/lib64/ld-linux-x86-64.so.2(+0x1003a)[0x7fb96816b03a]
/lib64/libc.so.6(+0x39c99)[0x7fb962fc3c99]
/lib64/libc.so.6(+0x39ce7)[0x7fb962fc3ce7]
/lib64/libc.so.6(__libc_start_main+0xfc)[0x7fb962fac50c]
mpichversion[0x400ee7]
MPICH Version:    	4.1a1
MPICH Release date:	Thu May 12 00:48:01 CDT 2022
MPICH Device:    	ch4:ofi
MPICH configure: 	--prefix=/var/lib/jenkins-slave/workspace/mpich-main-special-tests/compiler/gnu/jenkins_configure/noweak/label/centos64/netmod/ch4-ofi/mpich-main/_inst --with-device=ch4:ofi --with-libfabric=embedded --disable-mlx --disable-weak-symbols --enable-large-tests --with-wrapper-dl-type=rpath
MPICH CC: 	gcc -std=gnu99    -O2
MPICH CXX: 	g++   -O2
MPICH F77: 	gfortran   -O2
MPICH FC: 	gfortran   -O2
MPICH Custom Information: 	
Build step 'Run with timeout' marked build as failure

The backtrace shows the crash is inside an atexit handler in psm3, but I can't figure out why it segfaults (in one of the free() calls).

Today, while experimenting with it manually, I hit this (the first line is a print I added):

$ ./cpi
psmi_verno_isinteroperable: verno=110, PSMI_VERNO_GET_MAJOR(verno)=1, PSM2_VERNO_MAJOR=3, compare=300, psmi_verno = 300
pmrs-gpu-240-02.cels.anl.gov.221137psmi_verno_isinteroperable() not updated for current version!

cpi:221137 terminated with signal 6 at PC=7f0f51adc387 SP=7ffed8123e28.  Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x7f0f51adc387]
/lib64/libc.so.6(abort+0x148)[0x7f0f51adda78]
/lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7f0f50ec2b4a]
/lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f0f50ec2fb1]
/lib64/libpsm_infinipath.so.1(__psm_init+0x24a)[0x7f0f50ec97fa]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(+0xb2b685)[0x7f0f5299f685]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(fi_getinfo+0x203)[0x7f0f528e7c43]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_find_provider+0x95)[0x7f0f52413675]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_init_local+0x158)[0x7f0f523e6958]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPID_Init+0x290)[0x7f0f52396d80]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPII_Init_thread+0x1f1)[0x7f0f522f9fd1]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIR_Init_impl+0x56)[0x7f0f522faa76]
/home/zhouh/temp/mpich-main/_inst/lib/libmpi.so.0(MPI_Init+0x1e)[0x7f0f530d916e]
./cpi[0x400a19]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0f51ac8555]
./cpi[0x4008f9]

That check is called from __psm2_init in prov/psm3/psm3/psm.c, and for hours I couldn't understand how it could fail, until it hit me that the process is not actually running the __psm2_init in psm3 but must be running the one in libpsm2.
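A quick way to spot this kind of mix-up is to check which shared object each PSM-looking frame in the backtrace resolves into. In the backtrace above, the `__psm_init` frame lives in `/lib64/libpsm_infinipath.so.1` rather than in the MPICH library that embeds psm3. The following is only a minimal sketch of that check, not part of libfabric or MPICH: the helper name `psm_frames_outside`, the frame regex, and the "symbol contains psm" heuristic are all assumptions made for illustration.

```python
import re

# Matches glibc-style backtrace frames such as
#   /lib64/libpsm_infinipath.so.1(__psm_init+0x24a)[0x7f0f50ec97fa]
#   /lib64/libc.so.6(+0x39c99)[0x7f0f51ac8555]
#   ./cpi[0x400a19]
FRAME_RE = re.compile(
    r"^(?P<obj>[^\s(\[]+)"                            # path of the binary or shared object
    r"(?:\((?P<sym>[^+)]*)(?:\+0x[0-9a-fA-F]+)?\))?"  # optional (symbol+offset)
    r"\[0x[0-9a-fA-F]+\]$"                            # absolute address
)


def psm_frames_outside(backtrace, expected_object):
    """Return (object, symbol) pairs for frames whose symbol name contains
    'psm' but whose object is not the one expected to provide it.

    Hypothetical helper: the 'psm' substring test is a heuristic.
    """
    hits = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line (banner, blank line, ...)
        symbol = match.group("sym") or ""
        if "psm" in symbol.lower() and match.group("obj") != expected_object:
            hits.append((match.group("obj"), symbol))
    return hits


# A few frames copied from the backtrace in this report.
SAMPLE = """\
/lib64/libc.so.6(abort+0x148)[0x7f0f51adda78]
/lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f0f50ec2fb1]
/lib64/libpsm_infinipath.so.1(__psm_init+0x24a)[0x7f0f50ec97fa]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(fi_getinfo+0x203)[0x7f0f528e7c43]
./cpi[0x400a19]
"""

# The embedded psm3 provider is expected to live inside this library.
EXPECTED = "/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0"

foreign = psm_frames_outside(SAMPLE, EXPECTED)
print(foreign)
```

Any frame this reports means a PSM entry point was resolved from a system library instead of the embedded provider, which is the collision described here.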

I understand there is code history and there is marketing, but do we have to keep the messy names inside the psm3 code? Can we rename all the symbols with a prefix such as psm3_ to avoid collisions?

Currently our only workaround is to pass --disable-psm2 --disable-psm.

@hzhou hzhou added the bug label May 12, 2022
@acgoldma
Contributor

Can you try updating to the latest ofi/psm3? We renamed a bunch of symbols in the most recent release.

@hzhou
Contributor Author

hzhou commented May 13, 2022

> Can you try updating to the latest ofi/psm3? We renamed a bunch of symbols in the most recent release.

Sounds good! Do you have a commit hash/PR for the renaming updates?

@hzhou
Contributor Author

hzhou commented May 13, 2022

Found the PR -- #7521

We have confirmed that upgrading to v1.15.0 fixes the issue (pmodels/mpich#6006).

@hzhou hzhou closed this as completed May 13, 2022
@acgoldma
Contributor

Thank you for verifying.
