
Check upstream master branches for PMIx/PRRTE #12906

Closed
rhc54 wants to merge 3 commits from the topic/chk branch

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Nov 3, 2024

Repoint submodules

@rhc54 rhc54 marked this pull request as draft November 3, 2024 14:24
@rhc54 rhc54 added the test and mpi4py-all (Run the optional mpi4py CI tests) labels and removed the Target: main label Nov 3, 2024
@rhc54 rhc54 force-pushed the topic/chk branch 2 times, most recently from 380433e to 14750c2 Compare November 12, 2024 20:00
hppritcha added a commit to hppritcha/prrte that referenced this pull request Nov 13, 2024
Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/prrte that referenced this pull request Nov 13, 2024
add fetch depth 0

Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
Repoint submodules. Disable han and hcoll components
to avoid bug when testing singleton comm_spawn.

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Nov 25, 2024

For the life of me, I cannot figure out this Jenkins console. It makes zero sense: it claims the run failed, but there are "no logs" available as to why. I assume it is yet another startup failure - but how do I re-trigger it?

@hppritcha
Member

bot:ompi:retest

Signed-off-by: Ralph Castain <[email protected]>
@jsquyres
Member

jsquyres commented Dec 2, 2024

Yo @rhc54, per our discussion today, here's a C code equivalent of the mpi4py testCreateFromGroups test:

#include <stdio.h>
#include <mpi.h>

int main()
{
    int size, rank;
    int color, key=0;
    int local_leader, remote_leader;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank < size / 2) {
        color = 0;
        local_leader = 0;
        // EDIT: Per comments later in the thread, even though
        // remote_leader is not zero in this case in the original
        // Python test, it appears to need to be 0 for this C
        // perhaps-not-entirely-correctly-translated-from-Python
        // test...?
        //remote_leader = size / 2;
        remote_leader = 0;
    } else {
        color = 1;
        local_leader = 0;
        remote_leader = 0;
    }

    int tag = 17;
    MPI_Comm intracomm, intercomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &intracomm);
    MPI_Intercomm_create(intracomm, local_leader,
                         MPI_COMM_WORLD, remote_leader, tag, &intercomm);

    MPI_Group lgroup, rgroup;
    MPI_Comm_group(intercomm, &lgroup);
    MPI_Comm_remote_group(intercomm, &rgroup);

    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Comm intercomm2;
    printf("Calling MPI_Intercomm_create_from_groups()\n");
    MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                     rgroup, remote_leader,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);

    printf("Done!\n");
    MPI_Finalize();
    return 0;
}

For me, this fails and hangs on my Mac:

$ mpicc mpi4py-test-create-from-groups.c -o a.out && mpirun -np 4 a.out 
Calling ic cfromgroups
Calling ic cfromgroups
Calling ic cfromgroups
Calling ic cfromgroups
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 1137
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 1090
[hostname:53112] PRTE ERROR: Not found in file grpcomm_direct_group.c at line 124
...hang...

@rhc54
Contributor Author

rhc54 commented Dec 2, 2024

Hilarious - I get a completely different failure signature, and it comes from the MPI layer (no error reports from PRRTE or PMIx):

$ mpirun -np 4 ./intercomm_from_group
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
[rhc-node01:76677] ompi_group_dense_lookup: invalid peer index (2)
[rhc-node01:76677] *** Process received signal ***
[rhc-node01:76677] Signal: Segmentation fault (11)
[rhc-node01:76677] Signal code: Address not mapped (1)
[rhc-node01:76677] Failing at address: 0x48
[rhc-node01:76677] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff884f47a0]
[rhc-node01:76677] [ 1] /opt/hpc/external/ompi/lib/libmpi.so.0(ompi_intercomm_create_from_groups+0x1c0)[0xffff87e66308]
[rhc-node01:76677] [ 2] /opt/hpc/external/ompi/lib/libmpi.so.0(PMPI_Intercomm_create_from_groups+0x1d4)[0xffff87f19274]
[rhc-node01:76677] [ 3] ./intercomm_from_group[0x400b88]
[rhc-node01:76677] [ 4] /lib64/libc.so.6(+0x27300)[0xffff87c69300]
[rhc-node01:76677] [ 5] /lib64/libc.so.6(__libc_start_main+0x98)[0xffff87c693d8]
[rhc-node01:76677] [ 6] ./intercomm_from_group[0x400970]
[rhc-node01:76677] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 76677 on node rhc-node01 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

and looking at it with gdb:

(gdb) where
#0  ompi_intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag", info=0x2cc67ce0,
    errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at communicator/comm.c:1779
#1  0x0000ffff87f19274 in PMPI_Intercomm_create_from_groups (local_group=0x2cc5ed80, local_leader=0, remote_group=0x2cc613f0, remote_leader=2, tag=0x400bf8 "the tag",
    info=0x2cc67ce0, errhandler=0x420078 <ompi_mpi_errors_abort>, newintercomm=0xfffff9f19b20) at intercomm_create_from_groups.c:85
#2  0x0000000000400b88 in main ()
(gdb) print leader_procs
$1 = (ompi_proc_t **) 0x2cc6bb00
(gdb) print leader_procs[0]
$2 = (ompi_proc_t *) 0x2cbe39e0
(gdb) print leader_procs[0]->super.proc
There is no member named proc.
(gdb) print leader_procs[0]->super.proc_name
$3 = {jobid = 2092761089, vpid = 0}
(gdb) print leader_procs[1]->super.proc_name
Cannot access memory at address 0x48

indicating that this line:

        leader_procs[1] = ompi_group_get_proc_ptr (remote_group, remote_leader, true);

returned trash. Could be slight differences in PMIx/PRRTE hashes.

@bosilca
Member

bosilca commented Dec 2, 2024

The code is incorrect: on a 4-rank run, the remote_leader for the MPI_Intercomm_create_from_groups call cannot be 2, because there are only two processes in the remote group. Assuming the intent was to let the last rank in the remote_group be the leader, you need to add

MPI_Group_size(rgroup, &remote_leader);
remote_leader--;

before the call to MPI_Intercomm_create_from_groups.
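
For anyone following along, here is that suggestion slotted into the test above - a sketch only, reusing the lgroup, rgroup, info, and intercomm2 variables already declared there:

    /* Sketch of the suggestion above: make the last rank of the remote
     * group the leader instead of hard-coding a world rank. */
    int rsize;
    MPI_Group_size(rgroup, &rsize);
    remote_leader = rsize - 1;          /* last rank within rgroup */

    MPI_Intercomm_create_from_groups(lgroup, local_leader,
                                     rgroup, remote_leader,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);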

@rhc54
Contributor Author

rhc54 commented Dec 2, 2024

Thanks @bosilca - that fixed the segfault. Now it just hangs, but hopefully that's a bug I can do something about.

@rhc54
Contributor Author

rhc54 commented Dec 2, 2024

Looks like the intercomm_create_from_group failure is caused by the underlying code passing PMIx different group IDs from the two participants. Using Jeff's provided example, I'm seeing "the tag-OMPIi-[[19550,1],0]" and "the tag-OMPIi-[[19550,1],2]" - so the two groups don't match and things hang. I haven't dug deeper to see where the mistake was made.

@rhc54
Contributor Author

rhc54 commented Dec 2, 2024

Of course, I am assuming that there shouldn't be two disjoint PMIx groups being constructed, each with two procs in it - is that assumption correct?

@hppritcha
Member

Okay, this test is not correct.

@hppritcha
Member

The remote leader in both cases needs to be 0.
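
A possible reading of this comment (an assumption, not a verified fix): MPI_Intercomm_create() takes the remote leader's rank in the peer communicator (MPI_COMM_WORLD), while MPI_Intercomm_create_from_groups() takes leader ranks relative to the two groups, and in this test both group leaders sit at group rank 0. A sketch under that assumption:

    /* Assumed reading of the comment above: world ranks go to
     * MPI_Intercomm_create(), group-relative ranks go to
     * MPI_Intercomm_create_from_groups(), and both group leaders
     * are group rank 0 here. */
    int world_remote_leader = (rank < size / 2) ? size / 2 : 0;
    MPI_Intercomm_create(intracomm, 0 /* leader's rank in intracomm */,
                         MPI_COMM_WORLD, world_remote_leader, tag, &intercomm);

    MPI_Intercomm_create_from_groups(lgroup, 0 /* leader's rank in lgroup */,
                                     rgroup, 0 /* leader's rank in rgroup */,
                                     "the tag", info,
                                     MPI_ERRORS_ABORT, &intercomm2);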

@hppritcha
Member

hpritchard@er-head:~/ompi-er2/examples> (fix_for_issue10895)!mpicc
mpicc -o test test.c
hpritchard@er-head:~/ompi-er2/examples> (fix_for_issue10895)mpirun -np 4 ./test
Hey the remote group size is 2 but i''m putting in this for remote leader! 2
Hey the remote group size is 2 but i''m putting in this for remote leader! 0
Hey the remote group size is 2 but i''m putting in this for remote leader! 2
Hey the remote group size is 2 but i''m putting in this for remote leader! 0
[er-head.usrc:3071320] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 0
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
Calling MPI_Intercomm_create_from_groups()
[er-head.usrc:3071317] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071319] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071318] calling PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head.usrc:3071320] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 4294967295
[er-head.usrc:3071319] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],2] size 2 ninfo 2 cid_base 4294967295
[er-head.usrc:3071317] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 4294967294
[er-head.usrc:3071318] PMIx_Group_construct - tag the tag-OMPIi-[[59914,1],0] size 2 ninfo 2 cid_base 4294967294
[er-head.usrc:3071320] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967295
[er-head.usrc:3071318] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967294
[er-head.usrc:3071317] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967294
[er-head.usrc:3071319] PMIx_Get PMIX_GROUP_LOCAL_CID 6 for cid_base 4294967295
[er-head.usrc:3071317] ompi_group_dense_lookup: invalid peer index (2)
[er-head.usrc:3071319] calling PMIx_Group_construct - tag the tag-OMPIi-LC-[[59914,1],0] size 2 ninfo 2 cid_base 0
[er-head:3071317:0:3071317] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x48)
==== backtrace (tid:3071317) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x000000000006c880 ompi_intercomm_create_from_groups()  /home/hpritchard/ompi-er2/ompi/communicator/comm.c:1779
 2 0x000000000010dfdd PMPI_Intercomm_create_from_groups()  /home/hpritchard/ompi-er2/ompi/mpi/c/intercomm_create_from_groups.c:85
 3 0x0000000000400c5d main()  ???:0
 4 0x000000000003ad85 __libc_start_main()  ???:0
 5 0x0000000000400a3e _start()  ???:0
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.

@hppritcha
Member

Your C version of the Python code is incorrect.

@jsquyres
Member

jsquyres commented Dec 4, 2024

okay this test is not correct.

Ok, perhaps I translated it from Python incorrectly. In the original Python test, it's definitely not 0 in both cases. But perhaps I missed some other part of the setup...? Shrug.

@hppritcha
Member

Trying to beef up param checking. I'll probably do that in a separate PR.
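
The checking referred to here is presumably something like validating the leader arguments against the group sizes. A hypothetical caller-side version of such a check (illustrative only, not the actual Open MPI parameter-checking code):

    /* Hypothetical caller-side sanity check; the real checks belong inside
     * the library and may look different. */
    int lsize, rsize;
    MPI_Group_size(lgroup, &lsize);
    MPI_Group_size(rgroup, &rsize);
    if (local_leader < 0 || local_leader >= lsize ||
        remote_leader < 0 || remote_leader >= rsize) {
        fprintf(stderr, "leader rank out of range for its group\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }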

hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 4, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
brennan-carson pushed a commit to uofl-capstone-open-mpi/prrte that referenced this pull request Dec 5, 2024
add fetch depth 0

Till we figure out what got busted in upstream pmix/prrte combo.
See what's happening with

open-mpi/ompi#12906

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 10, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit a0486e0)
hppritcha added a commit to hppritcha/ompi that referenced this pull request Dec 16, 2024
The MPI_Comm_create_from_group and especially the
MPI_Intercomm_create_from_groups functions are recent additions
to the standard (MPI 4.0) and users may get confused easily
trying to use them.

So better parameter checking is needed.

Related to open-mpi#12906 where an incorrect code example showed up.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit a0486e0)
@rhc54
Contributor Author

rhc54 commented Dec 16, 2024

Closing this for now - will reopen when upstream is complete

@rhc54 rhc54 closed this Dec 16, 2024
@rhc54 rhc54 deleted the topic/chk branch December 16, 2024 20:32