ch4/ofi: p_red times out with psm3 #5975

Closed · nitbhat opened this issue Apr 27, 2022 · 5 comments · Fixed by #5997

nitbhat commented Apr 27, 2022

I've seen these psm3 timeouts for some other tests too. This was seen during the per-commit testing of the PR that adds the psm3 capability set (#5864).

Running tests in ./coll [00:00:22]
Unexpected output in p_red: libfabric:82619:psm3:av:psmx3_epid_to_epaddr():231<warn> psm2_ep_connect returned error Operation timed out, remote epid=a00018a03.Try setting FI_PSM3_CONN_TIMEOUT to a larger value (current: 10 seconds).
Unexpected output in p_red: 
Unexpected output in p_red: p_red:82619 terminated with signal 6 at PC=7f8f0380d387 SP=7ffc621dcd88.  Backtrace:
Unexpected output in p_red: /lib64/libc.so.6(gsignal+0x37)[0x7f8f0380d387]
Unexpected output in p_red: /lib64/libc.so.6(abort+0x148)[0x7f8f0380ea78]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x22f4d17)[0x7f8f05e99d17]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x22f4e67)[0x7f8f05e99e67]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x2307f7f)[0x7f8f05eacf7f]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x345049)[0x7f8f03eea049]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x34edbd)[0x7f8f03ef3dbd]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x3038b3)[0x7f8f03ea88b3]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x303ddf)[0x7f8f03ea8ddf]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x3039e7)[0x7f8f03ea89e7]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x303ddf)[0x7f8f03ea8ddf]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x305369)[0x7f8f03eaa369]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x306f85)[0x7f8f03eabf85]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x40661c)[0x7f8f03fab61c]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x3b4498)[0x7f8f03f59498]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(+0x3b4902)[0x7f8f03f59902]
Unexpected output in p_red: /var/lib/jenkins-slave/workspace/mpich-review-custom/_inst/lib/libmpi.so.0(PMPI_Wait+0x27e)[0x7f8f03da831e]
Unexpected output in p_red: ./p_red[0x401a7a]
Unexpected output in p_red: /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8f037f9555]
Unexpected output in p_red: ./p_red[0x401ba0]
Unexpected output in p_red: 
Unexpected output in p_red: ===================================================================================
Unexpected output in p_red: =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
Unexpected output in p_red: =   PID 82620 RUNNING AT pmrs-gpu-240-02.cels.anl.gov
Unexpected output in p_red: =   EXIT CODE: 9
Unexpected output in p_red: =   CLEANING UP REMAINING PROCESSES
Unexpected output in p_red: =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
Unexpected output in p_red: ===================================================================================
Unexpected output in p_red: YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
Unexpected output in p_red: This typically refers to a problem with your application.
Unexpected output in p_red: Please see the FAQ page for debugging suggestions
Program p_red exited without No Errors
Running tests in ./comm [00:36:26]
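
For context, p_red exercises a persistent reduce, and the abort above happens inside MPI_Wait. A minimal sketch of the pattern such a test follows, using the MPI 4.0 persistent-collective API (illustrative only -- the buffer, count, op, and root are placeholders, not the actual source of p_red.c):

/* Sketch of a persistent reduce (MPI 4.0 API); not the actual p_red.c.
 * The failure above aborts inside the MPI_Wait on this request. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val, sum = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank;

    /* Build the persistent schedule once, then start/wait on it. */
    MPI_Reduce_init(&val, &sum, 1, MPI_INT, MPI_SUM, 0 /* root */,
                    MPI_COMM_WORLD, MPI_INFO_NULL, &req);
    MPI_Start(&req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Request_free(&req);
    if (rank == 0)
        printf("sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}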
hzhou commented May 6, 2022

The failing test was:

summary_junit_xml.1180 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree MPIR_CVAR_IREDUCE_TREE_TYPE=kary MPIR_CVAR_IREDUCE_TREE_KVAL=3 MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE=4096

In particular, the following tests passed:

193 - ./coll/p_red 4    1.2 sec Passed
1177 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=sched_smp 1.1 sec Passed
1178 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=sched_binomial    1.2 sec Passed
1179 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=sched_reduce_scatter_gather       1.1 sec Passed
1181 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree MPIR_CVAR_IREDUCE_TREE_TYPE=knomial_1 MPIR_CVAR_IREDUCE_TREE_KVAL=3 MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE=4096      1.2 sec Passed
1182 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree MPIR_CVAR_IREDUCE_TREE_TYPE=knomial_2 MPIR_CVAR_IREDUCE_TREE_KVAL=3 MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE=4096      1.2 sec Passed
1183 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_ring MPIR_CVAR_IREDUCE_RING_CHUNK_SIZE=4096   1.2 sec Passed

hzhou commented May 6, 2022

The failing test was:

summary_junit_xml.1180 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree MPIR_CVAR_IREDUCE_TREE_TYPE=kary MPIR_CVAR_IREDUCE_TREE_KVAL=3 MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE=4096

np=4 passes, but np=5 fails.

Also, KVAL=2 passes, but KVAL=3 fails.

The error is from process [1].

hzhou commented May 6, 2022

Backtrace:

[1] #0  0x00007ffff4d05387 in raise () from /lib64/libc.so.6
[1] #1  0x00007ffff4d06a78 in abort () from /lib64/libc.so.6
[1] #2  0x00007ffff7541937 in psmx3_epid_to_epaddr ()
[1]    from /home/zhouh/MPI/lib/libmpi.so.0
[1] #3  0x00007ffff7541a87 in psmx3_av_query_sep ()
[1]    from /home/zhouh/MPI/lib/libmpi.so.0
[1] #4  0x00007ffff7554b9f in psmx3_tagged_recv_no_flag_directed ()
[1]    from /home/zhouh/MPI/lib/libmpi.so.0
[1] #5  0x00007ffff5459e5f in fi_trecv (context=<optimized out>,
[1]     ignore=31525197391593472, tag=2147483935, src_addr=<optimized out>,
[1]     desc=0x0, len=8, buf=0x705090, ep=<optimized out>)
[1]     at mymake/libfabric/include/rdma/fi_tagged.h:91
[1] #6  MPIDI_OFI_do_irecv (context_offset=1, mode=0, flags=0,
[1]     request=<optimized out>, vni_dst=0, vni_src=<optimized out>,
[1]     addr=<optimized out>, comm=0x7ffff7cd5f80 <MPIDI_OFI_global>,
[1]     tag=<optimized out>, rank=<optimized out>, datatype=<optimized out>,
[1]     count=<optimized out>, buf=<optimized out>)
[1]     at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:248
[1] #7  MPIDI_NM_mpi_irecv (buf=buf@entry=0x705090, count=count@entry=2,
[1]     datatype=datatype@entry=1275069445, rank=rank@entry=4, tag=tag@entry=287,
[1]     comm=comm@entry=0x7ffff7cc88a0 <MPIR_Comm_builtin>, addr=<optimized out>,
[1]     request=<optimized out>, partner=<optimized out>, attr=1)
[1]     at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:345
[1] #8  0x00007ffff5463b79 in MPIDI_irecv (attr=1, req=0x7053b0, av=0x6a9f70,
[1]     comm=0x7ffff7cc88a0 <MPIR_Comm_builtin>, tag=287, rank=4,
[1]     datatype=1275069445, count=2, buf=0x705090)
[1]     at ./src/mpid/ch4/src/ch4_recv.h:128
[1] #9  MPID_Irecv.constprop.0 () at ./src/mpid/ch4/src/ch4_recv.h:311
[1] #10 0x00007ffff546c102 in MPIC_Irecv (buf=0x705090, count=2,
[1]     datatype=1275069445, source=4, tag=287, comm_ptr=<optimized out>,
[1]     request_ptr=0x7053b0) at src/mpi/coll/helper_fns.c:577
[1] #11 0x00007ffff550b73b in vtx_issue (vtxid=5, vtxp=0x705358,
[1]     sched=sched@entry=0x6afe90)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:74
[1] #12 0x00007ffff550bc3e in vtx_record_completion (remove=0, sched=0x6afe90,
[1]     vtxp=0x7052d0) at src/mpi/coll/transports/gentran/gentran_utils.c:314
[1] #13 vtx_issue (vtxid=<optimized out>, vtxp=0x7052d0,
[1]     sched=sched@entry=0x6afe90)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:183
[1] #14 0x00007ffff550c81e in vtx_record_completion (remove=0, sched=0x6afe90,
[1]     vtxp=0x705248) at src/mpi/coll/transports/gentran/gentran_utils.c:314
[1] #15 vtx_issue (vtxid=<optimized out>, vtxp=0x705248,
[1]     sched=sched@entry=0x6afe90)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:93
[1] #16 0x00007ffff550bc3e in vtx_record_completion (remove=0, sched=0x6afe90,
[1]     vtxp=0x7051c0) at src/mpi/coll/transports/gentran/gentran_utils.c:314
[1] #17 vtx_issue (vtxid=vtxid@entry=2, vtxp=vtxp@entry=0x7051c0,
[1]     sched=sched@entry=0x6afe90)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:183
[1] #18 0x00007ffff550e23d in vtx_record_completion (remove=1, sched=0x6afe90,
[1]     vtxp=0x705138) at src/mpi/coll/transports/gentran/gentran_utils.c:314
[1] #19 MPII_Genutil_sched_poke (sched=<optimized out>,
[1]     is_complete=is_complete@entry=0x7fffffffd448,
[1]     made_progress=made_progress@entry=0x7fffffffd44c)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:561
[1] #20 0x00007ffff550f75d in MPII_Genutil_progress_hook (
[1]     made_progress=<optimized out>)
[1]     at src/mpi/coll/transports/gentran/gentran_utils.c:354
[1] #21 0x00007ffff56f3d7c in MPIR_Progress_hook_exec_all (
[1]     made_progress=made_progress@entry=0x7fffffffd50c)
[1]     at src/util/mpir_progress_hook.c:29
[1] #22 0x00007ffff55b321a in MPIDI_progress_test (
[1]     state=state@entry=0x7fffffffd590, wait=wait@entry=1)
[1]     at ./src/mpid/ch4/src/ch4_progress.h:138
[1] #23 0x00007ffff55b83c8 in MPID_Progress_wait (state=0x7fffffffd590)
[1]     at ./src/mpid/ch4/src/ch4_progress.h:324
[1] #24 MPIR_Wait_state (request_ptr=0x7ffff7cd3a80 <MPIR_Request_direct>,
[1]     status=<optimized out>, state=0x7fffffffd590)
[1]     at src/mpi/request/request_impl.c:883
[1] #25 0x00007ffff55b8650 in MPID_Wait (
[1]     request_ptr=request_ptr@entry=0x7ffff7cd3a80 <MPIR_Request_direct>,
[1]     status=status@entry=0x1) at ./src/mpid/ch4/src/ch4_wait.h:129
[1] #26 0x00007ffff55b8958 in MPIR_Wait () at src/mpi/request/request_impl.c:945
[1] #27 0x00007ffff53ee5b3 in internal_Wait (status=0x1, request=0x7fffffffd76c)
[1]     at src/binding/c/c_binding.c:62778
[1] #28 PMPI_Wait (request=request@entry=0x7fffffffd76c, status=status@entry=0x1)
[1]     at src/binding/c/c_binding.c:62824
[1] #29 0x0000000000401b7c in main (argc=<optimized out>, argv=<optimized out>)
[1]     at p_red.c:66

hzhou commented May 6, 2022

KARY = 3 --

         P1 - P4
        /
P0 ----- P2
        \
         P3

EDIT: The backtrace above was taken with root = 1. It is P0 that crashes, and the reason is that P1 has finalized and gone away.
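
For reference, a tiny sketch of that layout, assuming the conventional kary rule where relative node i has children i*k+1 .. i*k+k (illustrative only, not MPICH's MPIR_Treealgo code):

/* Illustrative kary tree layout: relative node i has children
 * i*k+1 .. i*k+k. Not MPICH's actual tree-building code. */
#include <stdio.h>

int main(void)
{
    int n = 5, k = 3, root = 0;
    for (int i = 0; i < n; i++) {
        int rank = (i + root) % n;      /* relative rank -> absolute */
        printf("P%d:", rank);
        for (int j = i * k + 1; j <= i * k + k && j < n; j++)
            printf(" child P%d", (j + root) % n);
        printf("\n");
    }
    return 0;
}

With n=5 and k=3 this prints the tree in the diagram (P0 -> {P1, P2, P3}, P1 -> {P4}); with n=4 or k=2 the shape differs, which lines up with the pass/fail split noted earlier.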

hzhou commented May 6, 2022

Just did a test run -- looks very much like a compiler optimization bug:

[image: test run results]

I think it all points to the default gcc compiler, which I believe is gcc-4.8.
