
MPI_Comm_create_from_group works incorrectly when used with non-overlapping groups but same stringtag argument #10895

Closed
hppritcha opened this issue Oct 5, 2022 · 18 comments


@hppritcha
Member

It turns out that the way MPI_Comm_create_from_group uses the PMIx_Group_construct method doesn't correctly handle the case where multiple sets of MPI processes invoke MPI_Comm_create_from_group with different, non-overlapping MPI groups but the same stringtag argument.

The wording in the MPI-4 standard does not indicate that this usage is incorrect:

If a nonempty group is specified, then all MPI processes in that group must call
the function and each of these MPI processes must provide the same arguments, including
a group that contains the same members with the same ordering, and identical stringtag
value.

@dalcinl

@hppritcha
Member Author

I'm marking this as a blocker because the fix will likely involve changing a parameter in the mpi.h.in header file, which in any case currently specifies an incorrect maximum allowed stringtag length.

@rhc54
Copy link
Contributor

rhc54 commented Oct 5, 2022

So you are interpreting that statement as only requiring that all processes calling the function must pass the same stringtag, but not that the stringtag must be unique - yes? That strikes me as an oversight in the standard that needs to be addressed as it frankly makes no sense. Why "tag" a group if it isn't unique? What possible purpose could the tag serve in that case?

@hppritcha
Member Author

[Screenshot: "Screen Shot 2022-10-05 at 11 41 19 AM"]

@bosilca
Member

bosilca commented Oct 5, 2022

The stringtag is not about identifying the group; otherwise the MPI standard would be standardizing how external software has to name its groups. Instead, the stringtag is an identifier for the operation on this particular group. To understand why this is necessary, one must look at the API of the call and notice that it does not take a base communicator, which means there is no dedicated communication channel for this operation; i.e., the only way the processes can order concurrent calls to this function is by uniquely identifying them with the tuple of group and tag. Without taking both into account, you will either prevent concurrent calls on overlapping groups or allow unordered calls with the same group.
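As a rough illustration of this point, a matching service could key each pending construct operation on the (membership, tag) tuple. This is only a sketch in Python with made-up process names, not Open MPI's implementation:

```python
# Illustrative sketch (not Open MPI code): match concurrent group-construct
# operations on the tuple of ordered membership plus caller-supplied tag.
# Process names here are hypothetical.

def rendezvous_key(group_members, stringtag):
    """Key identifying one construct operation.

    Two non-overlapping groups using the same tag still get distinct
    keys, and the same group can run two concurrent operations as long
    as the tags differ.
    """
    return (tuple(group_members), stringtag)
```

With only the tag as the key, the two non-overlapping groups in this issue would collide; with only the group, two concurrent operations on the same group could not be ordered.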

@hppritcha
Member Author

This goes back to the figure above about process sets. We have a blurb in the MPI-4 standard, chapter 11, that states:

If two MPI processes get the same process set name, then the intersection of the two 
process sets shall either be the empty set or identical to the union of the two process sets.

So the MPI://SELF process set is an example of the former case in the above statement, so I could see how a programmer might interpret that to mean there's no reason for the stringtag to be unique if the group handle argument to MPI_Comm_create_from_group came from non-overlapping process sets.

At any rate, I plan to fix this by appending the vpid of the first proc in the group to the supplied stringtag before handing it over to PMIx_Group_construct. Thank goodness we retained the verbiage about restrictions on the group argument concerning the order of processes in the group.
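A minimal sketch of that proposed workaround (the helper name and vpid representation here are hypothetical, not the actual Open MPI code):

```python
# Sketch of the proposed fix (names hypothetical): derive the PMIx group
# id by appending the vpid of the first proc in the group to the
# user-supplied stringtag, so non-overlapping groups that happen to use
# the same stringtag no longer collide in PMIx_Group_construct.

def pmix_group_id(stringtag, group_vpids):
    """group_vpids: the group's vpids in the (required) consistent order."""
    return f"{stringtag}:{group_vpids[0]}"
```

Because the ordering of the group argument is required to be consistent across callers, every participant derives the same id, while disjoint groups derive different ones.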

@bosilca
Member

bosilca commented Oct 5, 2022

Thank goodness we retained the verbiage about restrictions on the group argument concerning order of processes in the group.

What would have been the order of processes in the new communicator otherwise?

@rhc54
Contributor

rhc54 commented Oct 5, 2022

The stringtag is not about identifying the group,

I see - thanks for clarifying!

At any rate, I plan to fix this by appending the vpid of the first proc in the group to the supplied stringtag

I grok what you are saying, but that still doesn't result in a unique PMIx group name, since the proc might be involved in multiple MPI_Comm_create_from_group calls and there is no guarantee that the user will only put that proc in the first position once, is there?

@rhc54
Contributor

rhc54 commented Oct 5, 2022

A couple of responses/questions; I'll post them separately.

What would have been the order of processes in the new communicator otherwise ?

I have to agree with this question. If a collection of procs calls this function using arbitrary orderings, then it isn't clear to me what the resulting communicator "rank" should be for the participants. If we follow typical MPI logic, it would be the position of the proc in the provided array - but that would differ across the collection. So it would seem that the ordering must be the same if you want to end up with a usable MPI communicator, yes?
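A tiny sketch of that rank-assignment logic (illustrative only, not Open MPI code): if a proc's rank is taken to be its index in the supplied group array, every participant must pass the same ordering to agree on ranks:

```python
# Illustrative sketch (not Open MPI code): if a proc's rank in the new
# communicator is its position in the supplied group array, all callers
# must pass the members in the same order to agree on the result.

def ranks_from_group(group_members):
    """Map each process id to its rank, i.e. its index in the ordering."""
    return {proc: rank for rank, proc in enumerate(group_members)}
```

Two participants passing ["a", "b", "c"] and ["b", "a", "c"] would disagree on every rank but "c", which is exactly why a consistent ordering is needed for a usable communicator.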

@rhc54
Contributor

rhc54 commented Oct 5, 2022

@hppritcha I wonder if we are getting caught in a confusion caused by overloading of the "group" term. Are you equating a given MPI communicator with a specific PMIx group? If so, that has some implications that might not be acceptable to the MPI community. The biggest difference is that we have unique string "names" for groups that help us distinguish between them, even if they have the same membership. You lack that in your communicators, and I'm not sure how you'd add it.

If two MPI processes get the same process set name, then the intersection of the two
process sets shall either be the empty set or identical to the union of the two process sets.

If this refers to, or attempts to correlate with, PMIx process sets, then it wouldn't be correct. Because of dynamic process set support, a process (in PMIx parlance) doesn't belong to just one process set. So two processes could find themselves members of the same process set, yet that set could be completely orthogonal to any other set to which they belong.

Again, it may be that the terminology unfortunately overlaps and causes confusion. I'm only commenting out of concern regarding what PMIx can and cannot support, wanting to ensure that we do support whatever you are attempting to do.

@hppritcha
Member Author

@rhc54 I'm getting sidetracked today; I'll answer your questions here tomorrow. We really are only using PMIx groups as a rendezvous mechanism, not as something for tagging comms, groups, etc.

@rhc54
Contributor

rhc54 commented Oct 5, 2022

Gotcha - will look forward to your answers when you have time. FWIW, I attempted to clear up some terminology confusion here (#10886 (comment)) that might be relevant in this topic as well.

@hppritcha
Member Author

@rhc54 the only connection that the MPI communicator generated by MPI_Comm_create_from_group has to a PMIx group is the PMIX_GROUP_CONTEXT_ID we got back from the PMIx_Group_construct call. We use the PMIx group constructor/destructor much like in the PMIx example group_lcl_cid.c, so by the time we return from the call, the PMIx group has been destructed. I think as long as we're using PMIx groups in this way, we shouldn't have a problem.

Note we do have an advice to users that reads thus:

Advice to users. The stringtag argument is used to distinguish concurrent communicator
construction operations issued by different entities. As such, it is important
to ensure that this argument is unique for each concurrent call to
MPI_COMM_CREATE_FROM_GROUP. Reverse domain name notation convention [1]
is one approach to constructing unique stringtag arguments. See also example 11.10.
(End of advice to users.)

We may want to beef this up some for MPI 4.1.

@rhc54
Contributor

rhc54 commented Oct 6, 2022

Ah, okay - that should work then, so long as you don't have two concurrent calls that include the same vpid in the first position, as that would result in conflicting group IDs - or is that an illegal scenario under MPI? Still noodling over possible solutions. I wonder if OMPI couldn't just generate a unique tag internally, maybe using some thread-safe internal counter that you integrate into the tag - e.g., "create-from-group-N" for the Nth call?
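A sketch of the suggested counter scheme (hypothetical helper, not OMPI code); note it only yields a matching rendezvous string if every participant has made the same number of prior calls, a caveat the discussion returns to below:

```python
# Sketch of the suggested internal-counter scheme (hypothetical helper):
# each call draws the next value from a thread-safe counter and folds it
# into the tag, e.g. "create-from-group-N" for the Nth call. This only
# produces a matching rendezvous string across procs if all participants
# have reached the same count.
import itertools
import threading

_counter = itertools.count()
_lock = threading.Lock()

def next_internal_tag():
    """Return a process-local, monotonically increasing tag string."""
    with _lock:
        n = next(_counter)
    return f"create-from-group-{n}"
```

The lock makes the read-and-increment atomic under concurrent threads; the harder problem, raised later in the thread, is that the counter value is only local to each process.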

@hppritcha
Member Author

That approach may be the way to go. We do that already for the MPI_Intercomm_create_from_groups method.

@hppritcha
Member Author

Got clarification from @rhc54 about the potential max length of the grp string argument to PMIx_Group_construct. It seems to be a mistake in some older versions of the PMIx standard. Removing the blocker status here, as there will be no need to update mpi.h.in.

@dalcinl
Contributor

dalcinl commented Oct 6, 2022

@hppritcha Sorry, I missed this issue; I was traveling. BTW, my GH username is @dalcinl, with a final "L" (as in Lima), not "I" (as in India). Let me know once you have a PR to try out your fix.

@rhc54
Contributor

rhc54 commented Oct 7, 2022

@hppritcha I've been thinking about it some more, and the indexing approach to the stringtag passed to PMIx_Group_construct isn't going to work. The problem is that all the procs must call your MPI function, and they all must pass the same string tag to PMIx_Group_construct so we can properly perform the rendezvous. The index isn't guaranteed to be the same across all procs, so that doesn't work.

What I would suggest is that you use the procs themselves as the tag. For example, suppose that you have ranks 0,1,6 from nspace foo and ranks 2,8 from nspace bar executing your comm_create call. Then a tag like this would work:

cfg:foo:016:bar:28

I only added the colons for clarity - there's no reason to include them. Remember, the tag doesn't have to be parsable - it just has to be unique. I put a prefix identifying the MPI operation in front to separate this from the same procs participating in some other MPI operation.

Note that this also assumes the ordering of procs in the call is consistent across the procs calling the function - but I think MPI is going to require that anyway, per the above discussion. It also assumes that the same procs cannot engage in multiple concurrent calls to the same MPI operation - not sure if that is correct, as perhaps different threads within the same procs could execute it?
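The tag format sketched above could be built like this (illustrative Python; the "cfg" prefix and the colon separators are just for readability, as noted, and the helper name is hypothetical):

```python
# Sketch of the membership-derived tag suggested above: a prefix naming
# the MPI operation, then each nspace followed by the participating
# ranks from that nspace. The tag need not be parsable, only unique and
# identical across all callers passing the same ordered membership.

def membership_tag(op_prefix, members):
    """members: ordered list of (nspace, [ranks]) pairs."""
    parts = [op_prefix]
    for nspace, ranks in members:
        parts.append(nspace)
        parts.append("".join(str(r) for r in ranks))
    return ":".join(parts)
```

For ranks 0,1,6 from nspace foo and ranks 2,8 from nspace bar, this yields the "cfg:foo:016:bar:28" form from the comment above. Since the ordering is required to be consistent, every participant derives the same string with no extra communication.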

hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 3, 2022
Looks like there may be at least one place in ompi where being able to
compute a SHA of a user-supplied string plus internally generated
data could be useful, so add a function to do this to the opal util
toolbox.

Related to open-mpi#10895

This code is derived from a sha256 implementation at

https://github.com/B-Con/crypto-algorithms.git

Signed-off-by: Howard Pritchard <[email protected]>
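The idea in this commit can be sketched as follows (a hashlib-based Python illustration; the actual opal util function name and its exact inputs are not shown in this thread):

```python
# Sketch of the commit's idea (actual opal function unknown here): hash a
# user-supplied string together with internally generated data to get a
# fixed-length, collision-resistant discriminator, which also keeps the
# derived PMIx group id within any maximum-length limit.
import hashlib

def tag_digest(user_stringtag, internal_data):
    """Return a 64-hex-char SHA-256 digest of the tag plus internal data."""
    h = hashlib.sha256()
    h.update(user_stringtag.encode("utf-8"))
    h.update(internal_data.encode("utf-8"))
    return h.hexdigest()
```

The digest is deterministic, so all participants supplying the same tag and the same internally generated data derive the same string, while different internal data yields a different one.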
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 3, 2022
@jsquyres jsquyres added this to the v5.0.0 milestone Nov 4, 2022
hppritcha added a commit to hppritcha/ompi that referenced this issue Jan 18, 2023
(cherry picked from commit 14d7ca6)
@jsquyres jsquyres modified the milestones: v5.0.0, v5.0.1 Oct 30, 2023
@janjust janjust modified the milestones: v5.0.1, v5.0.2 Jan 8, 2024
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
@jsquyres jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 3, 2024
to avoid potential race conditions between successive calls
to MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups
when using the same tag argument value.

The PMIx group constructor grp string argument has different semantics
from the tag requirements for these MPI constructors, so use
discriminators to avoid potential race conditions using PMIx group
ops.

Related to open-mpi#10895

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 3, 2024
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 3, 2024
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 10, 2024
(cherry picked from commit 46ff698)
@hppritcha
Copy link
Member Author

Closed via #12960.
