sessions:deworldify behavior of pmix pset lookup #10886
PMIX_INFO_CREATE(query.qualifiers, 2);
PMIX_INFO_LOAD(&query.qualifiers[0], PMIX_PSET_NAME, pset_name, PMIX_STRING);
PMIX_INFO_LOAD(&query.qualifiers[1], PMIX_QUERY_REFRESH_CACHE, &refresh, PMIX_BOOL);
goto fn_try_again;
I don't see the backend implementation in PRRTE for this query, but I will add it ASAP. FWIW, I don't believe you will need to refresh the cache, but I will check and report back once I have it implemented.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
could someone from nvidia
It worked for me. Attached raw log.
1fc662b to e45025a
There seems to be a related case in the
The condition could be changed to something like this:
I think this is where we are getting into terminology confusion. In PMIx, a "process set" is immutable and assigned for each application at time of execution, assuming the user provides a string "process set" name. Dynamic collections of procs are called "groups" in PMIx, and one can assemble any arbitrary collection of procs into a group. Unfortunately, MPI chose to define "process set" to be the equivalent of a PMIx "group". Querying PMIx "pset_names" is only going to return the currently executing PMIx "process sets". This can change in a persistent DVM as more jobs are submitted, but it does not change due to any dynamic creation of MPI "process sets". For that info, you would have to query the current set of PMIx groups.
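To make the terminology concrete, here is a toy model of the distinction described above. It does not use the PMIx API at all: `toy_runtime_t` and every `toy_*` function are hypothetical names invented for illustration. The only point is that the pset-name query reflects launched jobs, while dynamically assembled collections live in a separate place.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NAMES 8

/* Toy runtime state (hypothetical, not a PMIx structure):
 * psets are recorded once, when a job is launched;
 * groups are assembled later from arbitrary processes. */
typedef struct {
    const char *psets[TOY_MAX_NAMES];
    size_t npsets;
    const char *groups[TOY_MAX_NAMES];
    size_t ngroups;
} toy_runtime_t;

/* Launching a job with a user-supplied pset name registers that pset. */
int toy_launch_job(toy_runtime_t *rt, const char *pset_name)
{
    assert(rt != NULL);
    if (rt->npsets == TOY_MAX_NAMES) {
        return -1;
    }
    rt->psets[rt->npsets++] = pset_name;
    return 0;
}

/* Creating a dynamic collection registers a group, never a pset. */
int toy_create_group(toy_runtime_t *rt, const char *group_name)
{
    assert(rt != NULL);
    if (rt->ngroups == TOY_MAX_NAMES) {
        return -1;
    }
    rt->groups[rt->ngroups++] = group_name;
    return 0;
}

/* Stand-in for a "pset names" query: sees launched psets only. */
size_t toy_query_pset_names(const toy_runtime_t *rt)
{
    return rt->npsets;
}

/* Stand-in for a "groups" query: sees dynamic collections only. */
size_t toy_query_groups(const toy_runtime_t *rt)
{
    return rt->ngroups;
}
```

In this model, calling `toy_create_group` never changes what `toy_query_pset_names` reports. Only another `toy_launch_job` does, which mirrors the persistent-DVM case mentioned above.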
e45025a to d591211
bot:aws:retest
It turns out that the existing ompi_instance_group_pmix_pset implementation assumes an MPI_COMM_WORLD type of model. This prevents the ability to use more dynamically generated process sets, possibly using an external agent. Switch to using the pmix pset membership query to find new pset membership.
Related to open-mpi#10862
Related to openpmix/prrte#1906
prrte changes in above referenced PR are necessary for creating groups/communicators from psets defined by --pset option on the mpirun command line.
Signed-off-by: Howard Pritchard <[email protected]>
d591211 to 541a17b
Can't speak to the OMPI code portions, but the PMIx queries look okay to me. Have you checked about the cache refresh yet? I don't remember if it was necessary or not. Thanks for cleaning up the query on the PRRTE side of things!
@wenduwan please review when you have a chance
Running tests...
@wenduwan how did testing go?
Test passed a long time ago.. I forgot to come back.
@wenduwan now that 5.0.2rc1 is out could you review this PR?
I don't understand this change and its context, but it's so old that I think it's a good idea to merge and finally run it through MTT etc.
There was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy, but soon thereafter it was re-released in PMIx_Query_info, causing the failure PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed. Related to open-mpi/ompi#12217 Related to open-mpi/ompi#10886 Signed-off-by: Howard Pritchard <[email protected]>
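The double release described in this commit message can be sketched with a toy reference-counted object. This is a simplified model, not PMIx's actual object system and not the actual fix: `toy_caddy_t`, `TOY_MAGIC_ID`, and all `toy_*` functions are hypothetical. The object lives on the stack so that the second release check is well defined, and the magic-value check returns an error where a debug build would abort with an assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a debug-build magic value. */
#define TOY_MAGIC_ID UINT64_C(0xdeafbeeddeafbeed)

typedef struct {
    uint64_t magic;   /* TOY_MAGIC_ID while the object is alive */
    int refcount;
} toy_caddy_t;

void toy_construct(toy_caddy_t *c)
{
    assert(c != NULL);
    c->magic = TOY_MAGIC_ID;
    c->refcount = 1;
}

/* Returns 0 on success, -1 if the object was already destructed.
 * A debug build with a magic-id assertion would abort instead. */
int toy_release(toy_caddy_t *c)
{
    if (c->magic != TOY_MAGIC_ID) {
        return -1;
    }
    if (--c->refcount == 0) {
        c->magic = 0;  /* mark as destructed */
    }
    return 0;
}

/* Hypothetical helper: on one path it releases the caddy itself.
 * Returns 1 if it released the caddy, 0 otherwise. */
int toy_parse(toy_caddy_t *c, int take_release_path)
{
    if (take_release_path) {
        (void) toy_release(c);
        return 1;
    }
    return 0;
}

/* Buggy caller: always releases, even if the helper already did. */
int toy_query_buggy(int take_release_path)
{
    toy_caddy_t c;
    toy_construct(&c);
    (void) toy_parse(&c, take_release_path);
    return toy_release(&c);  /* second release on that path */
}

/* One possible repair: release only if the helper did not. */
int toy_query_fixed(int take_release_path)
{
    toy_caddy_t c;
    toy_construct(&c);
    if (toy_parse(&c, take_release_path)) {
        return 0;            /* helper already gave up the reference */
    }
    return toy_release(&c);
}
```

Calling `toy_query_buggy(1)` returns -1 because the second release sees a cleared magic value, while `toy_query_fixed(1)` returns 0. Making exactly one party responsible for the final release is the general remedy; which side the real change adjusted is not stated here.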
There was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy, but soon thereafter it was re-released in PMIx_Query_info, causing the failure PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed. Related to open-mpi/ompi#12217 Related to open-mpi/ompi#10886 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 4baeb9f)
It turns out that the existing ompi_instance_group_pmix_pset implementation assumes an MPI_COMM_WORLD type of model.
This prevents the ability to use more dynamically generated process sets, possibly using an external agent.
Switch to using the pmix pset membership query to find new pset membership.
Related to #10862
Signed-off-by: Howard Pritchard [email protected]