modules: update libfabric to mpich/v1.12.1 #5344

Closed
wants to merge 1 commit into from

hzhou
Contributor

@hzhou hzhou commented Jun 8, 2021

Pull Request Description

This PR upgrades the embedded libfabric to v1.12.1, released on 04/01/2021. The upgrade includes the fix mentioned in #5332 (comment).

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2106_libfabric branch 2 times, most recently from a753a89 to 6a21e4a Compare June 8, 2021 14:53
@hzhou
Contributor Author

hzhou commented Jun 8, 2021

test:mpich/ch4/ofi

+ mpichversion
*** Error in `mpichversion': double free or corruption (fasttop): 0x00000000008b1eb0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81679)[0x7f146cf3e679]
/lib64/ld-linux-x86-64.so.2(+0x1003a)[0x7f1471a1603a]
/lib64/libc.so.6(+0x39c99)[0x7f146cef6c99]
/lib64/libc.so.6(+0x39ce7)[0x7f146cef6ce7]
/lib64/libc.so.6(__libc_start_main+0xfc)[0x7f146cedf50c]
mpichversion[0x400ea7]
======= Memory map: ========
...

MPICH Version:    	4.0a1
MPICH Release date:	unreleased development copy
MPICH Device:    	ch4:ofi
MPICH configure: 	--prefix=/var/lib/jenkins-slave/workspace/mpich-review-ch4-ofi/jenkins_configure/default/label/centos64_review/_inst --with-device=ch4:ofi --disable-ft-tests --with-libfabric=embedded --disable-mlx --enable-nemesis-dbg-localoddeven --enable-large-tests --enable-collalgo-tests --disable-perftest --with-wrapper-dl-type=rpath
MPICH CC: 	gcc -std=gnu99    -O2
MPICH CXX: 	g++   -O2
MPICH F77: 	gfortran   -O2
MPICH FC: 	gfortran   -O2
MPICH Custom Information: 	

[0] Program received signal SIGABRT, Aborted.
[0] 0x00007ffff4de9387 in raise () from /lib64/libc.so.6
[0] #0  0x00007ffff4de9387 in raise () from /lib64/libc.so.6
[0] #1  0x00007ffff4deaa78 in abort () from /lib64/libc.so.6
[0] #2  0x00007ffff4e2bed7 in __libc_message () from /lib64/libc.so.6
[0] #3  0x00007ffff4e34299 in _int_free () from /lib64/libc.so.6
[0] #4  0x00007ffff7deb07a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
[0] #5  0x00007ffff4decce9 in __run_exit_handlers () from /lib64/libc.so.6
[0] #6  0x00007ffff4decd37 in exit () from /lib64/libc.so.6
[0] #7  0x00007ffff4dd555c in __libc_start_main () from /lib64/libc.so.6
[0] #8  0x0000000000400969 in _start ()

The double free occurs on __hfi_mylabel in prov/psm3/psm3/opa/opa_debug.c. The label is allocated in a constructor using strdup and freed in a destructor. I don't see how it is being freed twice.
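
For reference, the pattern described above can be sketched as follows. This is a minimal illustration, not the psm3 source: `my_label`, `label_init`, `label_fini`, `label_get`, and the string `"host:rank0"` are all hypothetical names, and the constructor/destructor attributes are a GCC/Clang extension. The sketch does not explain why the free runs twice here. It only shows a defensive form in which the destructor resets the pointer, so that a second invocation in the same copy of the library becomes a harmless `free(NULL)`.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() under strict -std modes */
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the label discussed above; not the psm3 code. */
static char *my_label = NULL;

/* Runs before main() (GCC/Clang extension). */
__attribute__((constructor)) void label_init(void)
{
    if (my_label == NULL)
        my_label = strdup("host:rank0");
}

/* Runs at exit or library unload. Resetting the pointer makes a repeated
 * call safe, because free(NULL) is defined to do nothing. */
__attribute__((destructor)) void label_fini(void)
{
    free(my_label);
    my_label = NULL;
}

/* Returns the label, or an empty string if it is not (or no longer) set. */
const char *label_get(void)
{
    return my_label ? my_label : "";
}
```

Calling `label_fini()` more than once is safe in this form. It would not help if two separate copies of the code each held their own pointer to the same allocation.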

@hzhou hzhou force-pushed the 2106_libfabric branch from 6a21e4a to e401bba Compare June 8, 2021 23:54
@hzhou
Contributor Author

hzhou commented Jun 8, 2021

test:mpich/ch4/ofi

@hzhou hzhou force-pushed the 2106_libfabric branch from e401bba to f62cbe7 Compare June 9, 2021 01:37
@hzhou
Contributor Author

hzhou commented Jun 9, 2021

test:mpich/ch4/ofi

psm3 is super weird. Here is what I changed -- https://github.com/pmodels/libfabric/blob/697f2577740b8a7b0986b28d7eea0086d9aa7463/prov/psm3/psm3/psm.c#L442-L447
Yet,

++ /var/lib/jenkins-slave/workspace/mpich-review-ch4-ofi/jenkins_configure/default/label/centos64_review/_inst/bin/mpiexec -n 2 /var/lib/jenkins-slave/workspace/mpich-review-ch4-ofi/jenkins_configure/default/label/centos64_review/examples/cpi
Required minimum FI_VERSION: 0, current version: 1000c
pmrs-centos64-240-05.cels.anl.gov:rank0: psmi_verno_isinteroperable() not updated for current version!
pmrs-centos64-240-05.cels.anl.gov:rank1: psmi_verno_isinteroperable() not updated for current version!

lt-cpi:143731 terminated with signal 6 at PC=7f4ef17a2337 SP=7ffe942c4dd8.  Backtrace:
...

Where are my prints?
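
As a side note on reading the log above: `current version: 1000c` looks like a libfabric packed API version in hexadecimal, with the major number in the upper 16 bits and the minor number in the lower 16 bits. That would decode to 1.12, which matches the v1.12.1 upgrade. A minimal sketch of the decoding, assuming that encoding (the helper names are hypothetical and only mirror what the `FI_VERSION`, `FI_MAJOR`, and `FI_MINOR` macros in `<rdma/fabric.h>` do):

```c
#include <stdint.h>

/* Hypothetical helpers mirroring libfabric's version packing:
 * major in the upper 16 bits, minor in the lower 16 bits. */
uint32_t version_pack(uint32_t major, uint32_t minor)
{
    return (major << 16) | (minor & 0xFFFFu);
}

/* Extract the major number from a packed version. */
uint32_t version_major(uint32_t version)
{
    return version >> 16;
}

/* Extract the minor number from a packed version. */
uint32_t version_minor(uint32_t version)
{
    return version & 0xFFFFu;
}
```

With these helpers, `version_major(0x1000c)` gives 1 and `version_minor(0x1000c)` gives 12. This value is separate from the PSM version that `psmi_verno_isinteroperable()` checks.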

@hzhou hzhou marked this pull request as draft June 9, 2021 12:53
@hzhou
Contributor Author

hzhou commented Aug 17, 2021

test:mpich/ch4/ofi

@hzhou
Contributor Author

hzhou commented Aug 17, 2021

Closed. We'll jump to v1.13.0 -- #5499

@hzhou hzhou closed this Aug 17, 2021
@hzhou hzhou mentioned this pull request Aug 17, 2021
4 tasks
@hzhou hzhou mentioned this pull request Apr 5, 2022
4 tasks
@hzhou hzhou deleted the 2106_libfabric branch November 26, 2024 23:26