singleton mode with v4.x fails when MCA binding policy is "numa" on AMD hardware #11097

Closed

jngrad opened this issue Nov 21, 2022 · 13 comments

@jngrad commented Nov 21, 2022

Background information

What version of Open MPI are you using?

v4.1.2

Describe how Open MPI was installed

From Ubuntu 22.04 package manager: libopenmpi-dev 4.1.2-2ubuntu1 and hwloc 2.7.0-2.

Also reproducible when building from sources in a Docker container:

dpkg-buildpackage commands (click to expand)
echo "deb-src http://archive.ubuntu.com/ubuntu/ jammy universe" >> /etc/apt/sources.list
apt-get update
apt-get install -y fakeroot
apt-get source libopenmpi3
apt-get build-dep -y libopenmpi3
cd openmpi-4.1.2/
dpkg-buildpackage -rfakeroot -b
cp -r debian /local/openmpi-debian-patched # copy to mounted folder to use on host machine

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04
  • Computer hardware: reproducible on the following CPUs:
    • AMD Ryzen Threadripper 1950X 16-core processor with hyperthreading enabled
    • AMD EPYC 7351P 16-core processor with hyperthreading enabled
  • Network type: not relevant

Details of the problem

On AMD Ryzen and AMD EPYC, the MCA binding policy "numa" fails to set the processor affinity and generates a fatal error when running the executable in singleton mode. Running the executable with mpiexec -n 1 works around the error.

MWE:

#include <mpi.h>
int main() {
  MPI_Init(NULL, NULL);
  MPI_Finalize();
}

Error message:

$ mpicxx mwe.cc
$ OMPI_MCA_hwloc_base_binding_policy="l3cache" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="none" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="core" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="numa" ./a.out ; echo $?
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  Setting processor affinity failed failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[coyote10:741448] Local abort before MPI_INIT completed completed successfully,
but am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
1
$ OMPI_MCA_hwloc_base_binding_policy="numa" mpiexec -n 1 ./a.out ; echo $?
0

The issue also existed in v4.0.3, but there it could be worked around with a binary patch that replaced the value HWLOC_OBJ_NODE=0xd with 0xc at orte/mca/ess/base/ess_base_fns.c#L242 in the libopen-rte.so file. This is no longer possible in v4.1.2.
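
For illustration, here is a minimal hwloc-only sketch of the kind of NUMA self-binding that fails here. It is not the ORTE code path itself, just my guess at the underlying hwloc calls; it should build with something like g++ sketch.cc $(pkg-config --cflags --libs hwloc).

#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  // In hwloc 2.x, NUMA nodes are memory objects attached to the CPU tree,
  // but they still carry the cpuset of the cores local to them.
  hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);
  if (node == nullptr) {
    std::fprintf(stderr, "no NUMA node object found in the topology\n");
  } else if (hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_PROCESS) < 0) {
    // This is the step that corresponds to "Setting processor affinity failed".
    std::perror("hwloc_set_cpubind");
  }

  hwloc_topology_destroy(topo);
  return 0;
}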

@edgargabriel
Member

I will have a look this week

@edgargabriel
Member

@jngrad first, I want to confirm that I can reproduce the issue with v4.1.x. I have, however, two questions before deciding on the path forward on this.

First, I think the issue is not present in the upcoming 5.0.x branch. It is a bit difficult to evaluate since the options are not necessarily directly applicable between 4.1.x and 5.0.x, but running the following command line works for me with v5.0.x

OMPI_MCA_rmaps_default_mapping_policy="numa" ./hello_world_mpi

Would you have a way to compile and test Open MPI v5.0.x and see whether this works for you and does what you would expect it to do?

Second, I would like to understand a bit more about the use case that you are targeting: what is the goal in applying mapping/binding to a sequential singleton process? In my mind, process mapping/binding is important for communication between processes; it is not entirely clear to me what the benefit is for a sequential process, and I would like to understand this a bit better.

@rhc54
Contributor

rhc54 commented Nov 29, 2022

OMPI_MCA_rmaps_default_mapping_policy="numa" ./hello_world_mpi

I believe that won't do anything - the correct MCA param is PRTE_MCA_... and not OMPI_MCA_.

However, it will definitely work either way for a singleton as the singleton no longer spins off a copy of mpirun to support it until it calls MPI_Comm_spawn...and then the mapper setting only affects how the spawned procs are placed.

@edgargabriel
Member

@rhc54 I tried both. When using mpirun combined with setting rmaps_base_verbose, I saw the same impact in terms of the output generated, whether using the OMPI or the PRTE version of the variable name.

@rhc54
Contributor

rhc54 commented Nov 29, 2022

Sounds like someone set up the OMPI personality to "remap" the name if it detects that it is a PRTE framework (might even have been me - been a long time). 🤷‍♂️

Regardless, it won't have any impact on the problem you are pursuing (at least for OMPI v5).

@edgargabriel
Member

So, bottom line: the problem does not exist in 5.0?

@rhc54
Contributor

rhc54 commented Nov 29, 2022

Shouldn't, no - if they call "comm_spawn", then we will fork/exec a copy of "mpirun", so that might make it appear again. Hopefully, the default binding policy won't be a problem (it is different in v5), but you might want to check it.

@jngrad
Author

jngrad commented Nov 29, 2022

@jngrad first, I want to confirm that I can reproduce the issue with v4.1.x. I have, however, two questions before deciding on the path forward on this.

Thank you for looking into this.

First, I think the issue is not present in the upcoming 5.0.x branch. It is a bit difficult to evaluate since the options are not necessarily directly applicable between 4.1.x and 5.0.x, but running the following command line works for me with v5.0.x

OMPI_MCA_rmaps_default_mapping_policy="numa" ./hello_world_mpi

Would you have a way to compile and test Open MPI v5.0.x and see whether this works for you and does what you would expect it to do?

Open MPI v5.0.x seems to work as expected. I was able to reproduce the bug with v4.1.2 in a Fedora 36 Docker container. I then built Open MPI v5.0.x (9c2418e) there and ran the MWE with both the OMPI_MCA_... and the PRTE_MCA_... environment variables, and could not reproduce the bug on either EPYC or Ryzen.

Second, I would like to understand a bit more about the use case that you are targeting: what is the goal in applying mapping/binding to a sequential singleton process? In my mind, process mapping/binding is important for communication between processes; it is not entirely clear to me what the benefit is for a sequential process, and I would like to understand this a bit better.

Our application is written as a Python package that uses Cython to bind to an MPI-parallel C++ core. Singleton mode is used to get an interactive Python prompt. At some point, all our NUMA workstations were configured to set the environment variables OMPI_MCA_hwloc_base_binding_policy and OMPI_MCA_rmaps_base_mapping_policy to numa to improve the performance of our simulation software as well as of our app, and our app was then affected by this regression on AMD workstations. I considered configuring the app launcher to set the binding policy to none when running in singleton mode, but colleagues pointed out that binding to NUMA domains prevents the process from migrating out of its original domain, which would otherwise trigger unnecessary fetches to repopulate the CPU caches in the new domain.
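
For what it is worth, a hypothetical launcher-side sketch of that "bind to none in singleton mode" idea could look like the following. It assumes that the absence of OMPI_COMM_WORLD_SIZE (exported by mpiexec) is a good-enough singleton indicator; MCA parameters are read from the environment during MPI_Init, so the override has to happen before that call.

#include <mpi.h>
#include <cstdlib>

int main(int argc, char **argv) {
  // Hypothetical workaround, not our actual launcher code: if no
  // mpiexec-provided environment is detected, assume singleton mode and
  // disable binding so the process does not try to self-bind to a NUMA domain.
  if (std::getenv("OMPI_COMM_WORLD_SIZE") == nullptr) {
    setenv("OMPI_MCA_hwloc_base_binding_policy", "none", 1);
  }
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}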

Our users never reported this issue to us in 2 years, so our small team is probably the only one that ran into this use case. At the moment, we have reconfigured our workstations to bind to l3cache, which is quite close to numa for the affected architectures (they have two L3 caches per NUMA domain).

@rhc54
Contributor

rhc54 commented Nov 30, 2022

Open MPI v5.0.x seems to work as expected. I was able to reproduce the bug with v4.1.2 in a Fedora 36 Docker container. I then built Open MPI v5.0.x (9c2418e) there and ran the MWE with both the OMPI_MCA_... and the PRTE_MCA_... environment variables, and could not reproduce the bug on either EPYC or Ryzen.

Just to clarify: the reason those envars didn't cause a problem is that OMPI v5 is ignoring them for the singleton. Prior OMPI release series attempt to self-bind the singleton according to the directive in the envar - OMPI v5 does not.

This is an important point, so let me state it clearly - singleton procs are not bound by OMPI v5. Only the child jobs created by calling MPI_Comm_spawn will be bound, and the envar can influence that policy.

Please also note that binding to "NUMA" for these advanced multi-chip packages is problematic and likely not your best choice. NUMA domains aren't that clearly defined any more, and binding to "package" (or "socket" in older OMPI's) is generally recommended in its place. @edgargabriel can probably give you more info and/or get you better direction.

@qkoziol
Contributor

qkoziol commented Jun 5, 2023

This is on the "to do" list for issue #10480, which I am planning to close in favor of breaking it up into individual issues. For now, I'll assign this issue the same tags as #10480, but I don't understand what action needs to be taken here. Can @edgargabriel, @jngrad, or @rhc54 please comment to indicate whether any action is necessary before closing this issue, and if so, what?

@qkoziol
Contributor

qkoziol commented Jun 5, 2023

Please remove those labels if they are not appropriate here.

@jngrad
Author

jngrad commented Jun 6, 2023

@qkoziol From my side: we configured our job scheduler to use l3cache instead of numa, and modified our app to emit a non-fatal warning that explains this issue when the app binds to numa with Open MPI 4.x on an affected CPU.
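
As a purely hypothetical sketch (not our actual app code, and the CPU detection is omitted), such a check could be placed before MPI_Init roughly as follows; OPEN_MPI and OMPI_MAJOR_VERSION are macros provided by Open MPI's mpi.h.

#include <mpi.h>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main(int argc, char **argv) {
#if defined(OPEN_MPI) && OMPI_MAJOR_VERSION < 5
  // Warn instead of letting MPI_Init abort when the problematic combination
  // of the "numa" binding policy and an Open MPI 4.x runtime is detected.
  const char *policy = std::getenv("OMPI_MCA_hwloc_base_binding_policy");
  if (policy != nullptr && std::strcmp(policy, "numa") == 0) {
    std::cerr << "warning: the 'numa' binding policy can abort singleton runs "
                 "with Open MPI 4.x on some AMD CPUs (see issue #11097); "
                 "consider 'l3cache' or launching via mpiexec\n";
  }
#endif
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}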

@edgargabriel
Member

I would suggest closing this ticket. We can always reopen it if the issue persists.
