Check upstream master branches for PMIx/PRRTE #12906
Conversation
Force-pushed from 380433e to 14750c2
Till we figure out what got busted in upstream pmix/prrte combo. See what's happening with open-mpi/ompi#12906 Signed-off-by: Howard Pritchard <[email protected]>
add fetch depth 0 Till we figure out what got busted in upstream pmix/prrte combo. See what's happening with open-mpi/ompi#12906 Signed-off-by: Howard Pritchard <[email protected]>
Repoint submodules. Disable han and hcoll components to avoid bug when testing singleton comm_spawn. Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
For the life of me, I cannot figure out this Jenkins console. Makes zero sense. Claims it failed but there are "no logs" available as to why? I assume it is yet another startup failure - but how do I re-trigger it?
bot:ompi:retest
Signed-off-by: Ralph Castain <[email protected]>
Yo @rhc54 Per our discussion today, here's a C code equivalent of the mpi4py test:

#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int size, rank;
    int color, key = 0;
    int local_leader, remote_leader;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < size / 2) {
        color = 0;
        local_leader = 0;
        // EDIT: Per comments later in the thread, even though
        // remote_leader is not zero in this case in the original
        // Python test, it appears to need to be 0 for this C
        // perhaps-not-entirely-correctly-translated-from-Python
        // test...?
        //remote_leader = size / 2;
        remote_leader = 0;
    } else {
        color = 1;
        local_leader = 0;
        remote_leader = 0;
    }

    int tag = 17;
    MPI_Comm intracomm, intercomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &intracomm);
    MPI_Intercomm_create(intracomm, local_leader,
                         MPI_COMM_WORLD, remote_leader, tag, &intercomm);

    MPI_Group lgroup, rgroup;
    MPI_Comm_group(intercomm, &lgroup);
    MPI_Comm_remote_group(intercomm, &rgroup);

    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Comm intercomm2;
    printf("Calling MPI_Intercomm_create_from_groups()\n");
    MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                     rgroup, remote_leader,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);
    printf("Done!\n");

    MPI_Finalize();
    return 0;
}

For me, this fails and hangs on my Mac:
Hilarious - I get a completely different failure signature, and it comes from the MPI layer (no error reports from PRRTE or PMIx):

$ mpirun -np 4 ./intercomm_from_group
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
[rhc-node01:76677] ompi_group_dense_lookup: invalid peer index (2)
[rhc-node01:76677] *** Process received signal ***
[rhc-node01:76677] Signal: Segmentation fault (11)
[rhc-node01:76677] Signal code: Address not mapped (1)
[rhc-node01:76677] Failing at address: 0x48
[rhc-node01:76677] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff884f47a0]
[rhc-node01:76677] [ 1] /opt/hpc/external/ompi/lib/libmpi.so.0(ompi_intercomm_create_from_groups+0x1c0)[0xffff87e66308]
[rhc-node01:76677] [ 2] /opt/hpc/external/ompi/lib/libmpi.so.0(PMPI_Intercomm_create_from_groups+0x1d4)[0xffff87f19274]
[rhc-node01:76677] [ 3] ./intercomm_from_group[0x400b88]
[rhc-node01:76677] [ 4] /lib64/libc.so.6(+0x27300)[0xffff87c69300]
[rhc-node01:76677] [ 5] /lib64/libc.so.6(__libc_start_main+0x98)[0xffff87c693d8]
[rhc-node01:76677] [ 6] ./intercomm_from_group[0x400970]
[rhc-node01:76677] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 76677 on node rhc-node01 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

and looking at it with gdb:

(gdb) where
#0 ompi_intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag", info=0x2cc67ce0,
errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at communicator/comm.c:1779
#1 0x0000ffff87f19274 in PMPI_Intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag",
info=0x2cc67ce0, errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at intercomm_create_from_groups.c:85
#2 0x0000000000400b88 in main ()
(gdb) print leader_procs
$1 = (ompi_proc_t **) 0x2cc6bb00
(gdb) print leader_procs[0]
$2 = (ompi_proc_t *) 0x2cbe39e0
(gdb) print leader_procs[0]->super.proc
There is no member named proc.
(gdb) print leader_procs[0]->super.proc_name
$3 = {jobid = 2092761089, vpid = 0}
(gdb) print leader_procs[1]->super.proc_name
Cannot access memory at address 0x48

indicating that this line:

leader_procs[1] = ompi_group_get_proc_ptr (remote_group, remote_leader, true);

returned trash. Could be slight differences in PMIx/PRRTE hashes.
The code is incorrect: on a 4-rank run, the remote_leader for the second group is out of range. Adding

MPI_Group_size(rgroup, &remote_leader);
remote_leader--;

before the call to MPI_Intercomm_create_from_groups fixes it.
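To spell out why that helps: on 4 ranks each half of MPI_COMM_WORLD has 2 processes, so rgroup has only 2 members and the valid leader indices are 0 and 1. The original remote_leader = size / 2 == 2 indexed past the end of the remote group, which is what ompi_group_get_proc_ptr() returned trash for. The suggested fix in context (a sketch slotting into the test posted above):

/* remote_leader must be a rank WITHIN rgroup, not a rank in
   MPI_COMM_WORLD, so derive a valid index from the group size itself. */
MPI_Group_size(rgroup, &remote_leader);   /* rgroup has size/2 == 2 members */
remote_leader--;                          /* last valid index within rgroup: 1 */
MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                 rgroup, remote_leader,
                                 "the tag", info,
                                 MPI_ERRORS_ABORT, &intercomm2);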
Thanks @bosilca - that fixed the segfault. Now it just hangs, but hopefully that's a bug I can do something about.
Looks like the intercomm_create_from_group failure is caused by the underlying code passing PMIx different group IDs from the participants. Using Jeff's provided example, I'm seeing "the tag-OMPIi-[[19550,1],0]" and "the tag-OMPIi-[[19550,1],2]" - so the two groups don't match and things hang. Haven't dug deeper to see where the mistake was made.
Of course, I am assuming that there shouldn't be two disjoint PMIx groups being constructed, each with two procs in it - is that assumption correct?
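For reference, the two IDs above differ only in the trailing process name, which suggests the group ID is derived from the user-supplied tag plus each side's notion of the leader. A hypothetical reconstruction - grp_id, tag, and leader_name are illustrative names, not the actual OMPI internals:

/* Hypothetical sketch of how the observed IDs could be formed:
   "the tag" + "-OMPIi-" + leader process name. PMIx group construction
   rendezvouses on this string, so if the two halves compute different
   leader names - [[19550,1],0] vs [[19550,1],2] above - they create two
   disjoint groups and each side blocks waiting for peers that never join. */
char grp_id[128];
snprintf(grp_id, sizeof(grp_id), "%s-OMPIi-%s", tag, leader_name);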
Okay, this test is not correct.

remote_leader in both cases needs to be 0.

Your C version of the Python code is incorrect.
Ok, perhaps I translated it from Python incorrectly. In the original Python test, it's definitely not 0 in both cases. But perhaps I missed some other part of the setup...? Shrug.
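Putting the corrections together, the leader setup that matches the discussion above would look like this (a sketch against the test posted earlier, everything else unchanged; the from_groups_* names are illustrative):

/* For MPI_Intercomm_create_from_groups() the leaders are ranks WITHIN
   lgroup/rgroup, not ranks in MPI_COMM_WORLD. Each half made rank 0 of
   its own intracomm the leader, so the remote group's leader is also
   its rank 0 - hence 0 in both cases, per the comments above. */
int from_groups_local_leader  = 0;
int from_groups_remote_leader = 0;
MPI_Intercomm_create_from_groups(lgroup, from_groups_local_leader,
                                 rgroup, from_groups_remote_leader,
                                 "the tag", info,
                                 MPI_ERRORS_ABORT, &intercomm2);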
Trying to beef up param checking. I'll probably do that in a separate PR.
The MPI_Comm_create_from_group and especially the MPI_Intercomm_create_from_groups functions are recent additions to the standard (MPI 4.0) and users may get confused easily trying to use them. So better parameter checking is needed. Related to open-mpi#12906 where an incorrect code example showed up. Signed-off-by: Howard Pritchard <[email protected]>
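As an illustration of the kind of check the commit describes, here is a user-level guard - not the actual OMPI implementation, and lgroup/rgroup etc. refer to the test above - that would have caught the out-of-range leader:

/* Hypothetical sketch: validate the leader arguments against the group
   sizes before calling MPI_Intercomm_create_from_groups(). The library's
   own parameter checking would live inside OMPI, not in user code. */
int lsize, rsize;
MPI_Group_size(lgroup, &lsize);
MPI_Group_size(rgroup, &rsize);
if (local_leader < 0 || local_leader >= lsize ||
    remote_leader < 0 || remote_leader >= rsize) {
    fprintf(stderr, "leader rank out of range for its group\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}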
The MPI_Comm_create_from_group and especially the MPI_Intercomm_create_from_groups functions are recent additions to the standard (MPI 4.0) and users may get confused easily trying to use them. So better parameter checking is needed. Related to open-mpi#12906 where an incorrect code example showed up. Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit a0486e0)
Closing this for now - will reopen when upstream is complete