Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sessions:deworldify behavior of pmix pset lookup #10886

Merged
merged 1 commit into from
Jan 26, 2024

Conversation

hppritcha
Copy link
Member

It turns out that the existing ompi_instance_group_pmix_pset implementation assumes an MPI_COMM_WORLD type of model.

This prevents the ability to use more dynamically generated process sets, possibly using an external agent.

Swith to using the pmix pset membership query to find new pset membership.

Related to #10862

Signed-off-by: Howard Pritchard [email protected]

@hppritcha hppritcha marked this pull request as draft October 3, 2022 21:28
PMIX_INFO_CREATE(query.qualifiers, 2);
PMIX_INFO_LOAD(&query.qualifiers[0], PMIX_PSET_NAME, pset_name, PMIX_STRING);
PMIX_INFO_LOAD(&query.qualifiers[1], PMIX_QUERY_REFRESH_CACHE, &refresh, PMIX_BOOL);
goto fn_try_again;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the backend implementation in PRRTE for this query, but I will add it ASAP. FWIW, I don't believe you will need to refresh the cache, but I will check and report back once I have it implemented.

@hppritcha
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hppritcha
Copy link
Member Author

could someone from nvidia
@janjust post what's going on with mellanox ci. I try to access the logs but azure gets borked up and never brings up the logs.

@janjust
Copy link
Contributor

janjust commented Oct 4, 2022

It worked for me. Attached raw log.
raw_azp_log.txt

@hppritcha hppritcha force-pushed the dworldify_pset_lookup branch from 1fc662b to e45025a Compare October 4, 2022 21:19
@HawkmoonEternal
Copy link

There seems to be a related case in the ompi_instance_get_nth_pset function.
The current implementation only refreshes the process set names once in the beginning, thus not allowing for the discovery of dynamically added process sets later on:

 int ompi_instance_get_nth_pset (ompi_instance_t *instance, int n, int *len, char *pset_name)
{
    if (NULL == ompi_mpi_instance_pmix_psets && n >= ompi_instance_builtin_count) {
        ompi_instance_refresh_pmix_psets (PMIX_QUERY_PSET_NAMES);
    }
...

The condition could be changed to something like this:

if (NULL == ompi_mpi_instance_pmix_psets || (size_t) n >= (ompi_instance_builtin_count + ompi_mpi_instance_num_pmix_psets)) {

@rhc54
Copy link
Contributor

rhc54 commented Oct 5, 2022

I think this is where we are getting into terminology confusion. In PMIx, a "process set" is immutable and assigned for each application at time of execution, assuming the user provides a string "process set" name. Dynamic collections of procs are called "groups" in PMIx, and one can assemble any arbitrary collection of procs into a group.

Unfortunately, MPI chose to define "process set" to be the equivalent of a PMIx "group". Querying PMIx "pset_names" is only going to return the currently executing PMIx "process sets". This can change in a persistent DVM as more jobs are submitted, but it change due to any dynamic creation of MPI "process sets". For that info, you would have to query the current set of PMIx groups.

@hppritcha
Copy link
Member Author

bot:aws:retest

It turns out that the existing ompi_instance_group_pmix_pset
implementation assumes an MPI_COMM_WORLD type of model.

This prevents the ability to use more dynamically generated process
sets, possibly using an external agent.

Switch to using the pmix pset membership query to find new pset
membership.

Related to open-mpi#10862
Related to openpmix/prrte#1906

prrte changes in above referenced PR are necessary for creating
groups/communicators from psets defined by --pset option on the
mpirun command line.

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha hppritcha force-pushed the dworldify_pset_lookup branch from d591211 to 541a17b Compare January 11, 2024 22:53
@hppritcha hppritcha requested review from rhc54 and wenduwan January 12, 2024 15:14
@rhc54
Copy link
Contributor

rhc54 commented Jan 12, 2024

Can't speak to the OMPI code portions, but the PMIx queries look okay to me. Have you checked about the cache refresh yet? I don't remember if it was necessary or not.

Thanks for cleaning up the query on the PRRTE side of things!

@hppritcha
Copy link
Member Author

@wenduwan please review when you have a chance

@hppritcha hppritcha removed the request for review from rhc54 January 23, 2024 22:03
@wenduwan
Copy link
Contributor

Running tests...

@hppritcha
Copy link
Member Author

@wenduwan how did testing go?

@wenduwan
Copy link
Contributor

Test passed a long time ago.. I forgot to come back.

@hppritcha
Copy link
Member Author

@wenduwan now that 5.0.2rc1 is out could you review this PR?

@wenduwan
Copy link
Contributor

I don't understand this change and its context, but it's so old that I think it's a good idea to merge and finally run it through MTT etc.

@hppritcha hppritcha merged commit 3ae723f into open-mpi:main Jan 26, 2024
10 checks passed
hppritcha added a commit to hppritcha/pmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 4baeb9f)
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Feb 5, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Feb 5, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy,
but soon thereafter it was re-relesed in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 4baeb9f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants