Describe the bug
We have been fighting these segfaults for so long. Depending on the configuration and where the testing is done, we hit the following:
The backtrace shows the crash is inside an at_exit handler in psm3, but I can't figure out how it is segfaulting (in one of the free calls).
Today, while playing with it manually, I hit this (the first line is a print I added):
$ ./cpi
psmi_verno_isinteroperable: verno=110, PSMI_VERNO_GET_MAJOR(verno)=1, PSM2_VERNO_MAJOR=3, compare=300, psmi_verno = 300
pmrs-gpu-240-02.cels.anl.gov.221137psmi_verno_isinteroperable() not updated for current version!
cpi:221137 terminated with signal 6 at PC=7f0f51adc387 SP=7ffed8123e28. Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x7f0f51adc387]
/lib64/libc.so.6(abort+0x148)[0x7f0f51adda78]
/lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7f0f50ec2b4a]
/lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f0f50ec2fb1]
/lib64/libpsm_infinipath.so.1(__psm_init+0x24a)[0x7f0f50ec97fa]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(+0xb2b685)[0x7f0f5299f685]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(fi_getinfo+0x203)[0x7f0f528e7c43]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_find_provider+0x95)[0x7f0f52413675]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_init_local+0x158)[0x7f0f523e6958]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPID_Init+0x290)[0x7f0f52396d80]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPII_Init_thread+0x1f1)[0x7f0f522f9fd1]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIR_Init_impl+0x56)[0x7f0f522faa76]
/home/zhouh/temp/mpich-main/_inst/lib/libmpi.so.0(MPI_Init+0x1e)[0x7f0f530d916e]
./cpi[0x400a19]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0f51ac8555]
./cpi[0x4008f9]
psmi_verno_isinteroperable is called in __psm2_init in prov/psm3/psm3/psm.c, and for hours I couldn't understand how that happened, until it hit me that it is not actually running the __psm2_init in psm3, but must be running the one in libpsm2.
I understand there is code history and there is marketing, but do we have to keep the messy names inside the psm3 code? Can we rename all the symbols with a prefix such as psm3_ to avoid collisions?
Currently we have no solution other than to pass in --disable-psm2 --disable-psm.
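For anyone who wants to confirm this kind of collision, here is a minimal sketch that compares the exported symbol names of two libraries. It assumes the input is text in the format printed by `nm -D --defined-only` (address, type letter, name). The helper names `exported_symbols` and `colliding_symbols` and both sample listings are hypothetical and only for illustration; they are not real output from psm3 or libpsm2.

```python
def exported_symbols(nm_output: str) -> set:
    """Collect globally defined symbol names from `nm -D --defined-only` text."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as: <address> <type> <name>; skip anything else.
        if len(parts) != 3:
            continue
        _addr, sym_type, name = parts
        # T/D/B/R are global text/data/bss/read-only data; W/V are weak symbols.
        if sym_type in {"T", "D", "B", "R", "W", "V"}:
            # Drop an optional @VERSION or @@VERSION suffix.
            names.add(name.split("@")[0])
    return names


def colliding_symbols(nm_a: str, nm_b: str) -> list:
    """Return the sorted names that both listings define."""
    return sorted(exported_symbols(nm_a) & exported_symbols(nm_b))


# Hypothetical listings, NOT real output from either library.
PSM3_NM = """\
00000000000b2b68 T __psm2_init
00000000000b2c10 T psmi_verno_isinteroperable
00000000000b3000 T psm3_only_helper
"""
LIBPSM2_NM = """\
0000000000019b4a T __psm2_init
0000000000019fb1 T psmi_verno_isinteroperable
000000000001a000 T __psm2_finalize
"""

print(colliding_symbols(PSM3_NM, LIBPSM2_NM))
```

Any name reported by both listings is a candidate for the dynamic linker binding a call to the wrong library. With a unique psm3_ prefix on every symbol, the intersection would be empty.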