trouble running openmpi+pmix in rootless podman-hpc container #12146
I would consider this error message:

```
[nid200052:1125877] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81
[nid200053:2173502] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81
```

to indicate a failure. A simple "hello" might work, but something is wrong.
I notice you didn't include
Not surprising - I'd guess that the problem lies in getting the PMIx socket connect across the container boundary. Note that we do have others who run containers that have root as the user, even under Slurm, so we know that it can be done. I don't know the reason for this particular error. One thing I find curious:
What is this
Hi @rhc54, Thanks for your quick reply.
Noted, that makes sense.
For the test with the failure, it does include the
This is the pmix version that I found that comes as an Ubuntu jammy package. I don't really understand why they call the directory pmix2. I'm glad you have seen cases where running this as root should work. Do you have any advice for how I should go about troubleshooting this?
I immediately get suspicious when I see that this is PMIx "4.1.2-2ubuntu1" - it sounds like they have modified the release. I would strongly advise against using any software that has been modified by the packager. I'd suggest downloading a copy of 4.2.7 and building it locally, or just use the copy that is embedded in OMPI. I especially recommend that due to the "UNSUPPORTED TYPE" error. Something is broken between your Slurm and OMPI PMIx connections. I'd start by trying to understand what that might be. Do you know what version of PMIx your Slurm is using? If you download and build OMPI outside the container, are you able to
Thanks @rhc54. Got it, thanks for that advice about packaging. I'll work on building and testing with my own pmix. We do have a version of OMPI on Perlmutter that @rgayatri23 and @hppritcha built. We also have pmix v4 support in Slurm:
Testing with slurm+openmpi+mpi4py outside a container, I do actually see a similar UNSUPPORTED TYPE error.
For comparison, I also see a similar OUT OF RESOURCE error when I force mpirun to launch across 2 nodes, but notably no UNSUPPORTED TYPE error.
Do you think this could mean there's some issue in our slurm and/or pmix installation?
Yeah, something isn't right. Let me do a little digging tonight to see what those errors might mean. We need to get you running cleanly outside the container before introducing the container into the mix. Just to be sure:
Thanks @rhc54. I agree, that sounds like a good course of action. Yes, the mpirun comes from the openmpi module on Perlmutter. I didn't build it myself, but I think Rahul and Howard did. Here's the top part of
Here's the mpirun:
Yes, I built the mpi4py package on top of this openmpi:
Since I don't see pmix mentioned in
Yes - the configure line shows (because it doesn't explicitly include a
Sorry to keep nagging with questions: what happens if you run a C "hello" version? In other words, take the python and mpi4py out of the equation?
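For reference, a minimal C "hello" of the kind being suggested might look like the sketch below (my own illustration, not code taken from this thread); it takes Python and mpi4py out of the equation and can be launched with either srun --mpi=pmix or mpirun:

```c
/* Minimal MPI hello world used to rule out python/mpi4py.
 * Build: mpicc -o mpihello mpihello.c
 * Run:   srun --mpi=pmix ./mpihello   (or: mpirun -np 4 ./mpihello) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from processor %s, rank %d out of %d processors\n",
           name, rank, size);
    MPI_Finalize();
    return 0;
}
```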
No problem, I appreciate your help. Sure, here's srun:
and mpirun:
Hi @lastephey and @rhc54, thanks for looking into the issue.
Can you please provide the configuration line for that 5.0.0 release? FWIW: that error message ordinarily indicates that there is a disconnect between the PMIx code in
What is the PMIx version used by SLURM on the host? Note your
What if you build a container with
That's essentially what we have done in our non-container, bare metal installation. Here is the configure command:

```
./configure CC=cc FC=ftn CXX=CC CFLAGS="--cray-bypass-pkgconfig" CXXFLAGS="--cray-bypass-pkgconfig" FCFLAGS="--cray-bypass-pkgconfig" LDFLAGS="--cray-bypass-pkgconfig" --enable-orterun-prefix-by-default --prefix=${ompi_install_dir} --with-cuda=$CUDA_HOME --with-cuda-libdir=/usr/lib64 --with-ucx=no --with-verbs=no --enable-mpi-java --with-ofi --enable-mpi1-compatibility --with-pmix=internal --disable-sphinx
```
Thanks, what are the SLURM and PMIx versions running on the host (running SLES if I understand correctly)?
returns?
As expected, I cannot reproduce this problem. It is undoubtedly due to an issue of PMIx version confusion, most likely being caused by some kind of
The problem is caused by confusion over whether or not the messaging buffer between the PMIx server and client has been packed in "debug mode" (i.e., where it contains explicit information on the data type being packed at each step) vs "non-debug mode" (where it doesn't contain the data type info). This causes the unpacking procedure to mistake actual data for the data type, and things go haywire. In this case, the error message is caused by the unpack function interpreting a value as the length of a string, and that value is enormous (because it isn't really the packed string length). We have handshake code that can detect which mode the other side's messaging buffer is in, but something in your environment is causing that to fail. If you do a ..., you should see PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_FULLY_DESC in the output.
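To make the "debug mode" vs "non-debug mode" distinction concrete, here is a toy illustration of the packing difference (this is not the actual PMIx bfrops code, and the type-tag value is invented): a receiver that expects type-described data will misread a raw buffer, treating ordinary data as a type tag or string length.

```c
/* Illustration only: "fully described" buffers carry a type tag before
 * each value, "non described" buffers carry only the raw bytes.  If the
 * unpacker expects the described layout but receives the raw one, it
 * reads data as if it were a type tag or a length, producing the kind
 * of garbage values behind "UNSUPPORTED TYPE" / "OUT-OF-RESOURCE". */
#include <stdint.h>
#include <string.h>

enum { TYPE_INT32 = 7 };   /* hypothetical type tag */

/* Pack an int32 with a leading type byte ("debug"/described mode). */
static size_t pack_described(uint8_t *buf, int32_t val)
{
    buf[0] = TYPE_INT32;
    memcpy(buf + 1, &val, sizeof(val));
    return 1 + sizeof(val);
}

/* Pack just the raw value ("non-debug"/non-described mode). */
static size_t pack_raw(uint8_t *buf, int32_t val)
{
    memcpy(buf, &val, sizeof(val));
    return sizeof(val);
}

int main(void)
{
    uint8_t a[8], b[8];
    pack_described(a, 42);  /* a[0] is the type tag, a[1..4] the value    */
    pack_raw(b, 42);        /* b[0..3] is the value; a described-mode     */
                            /* unpacker would misread b[0] as a type tag  */
    return 0;
}
```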
To answer your question @ggouaillardet,
@rhc54 thanks for looking into this and trying to reproduce.
We do see the
If I understood you correctly, we'll need to track down the pmix version difference as you mentioned. Is there an easy way to determine either which pmix version ompi was built with, or which pmix version is currently in use?
To add to @lastephey 's question, for the configure option of
If you look at the output from your last run, you'll see that the server puts its version in the environment of the proc: PMIX_VERSION=4.2.3
I can post a little program that will get the client's version and print it out. PMIx guarantees interoperability, so the difference there isn't the issue. The problem is that the client thinks the server is using one buffer type, when it is actually using the other. The question is: "why"?
You might try running
But we do still handshake to deal with potential buffer type differences at runtime, so it is puzzling. Kind of fishing in the dark right now to see if something pops up. I suppose if you can/want to grant me access to the machine, I can poke at it a bit for you. Up to you - I honestly don't know how many more question/answer rounds it will take to try and make sense of this.
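A sketch of what such a "print the versions" program could look like (my own guess, not the program offered above), using the standard PMIx client API:

```c
/* Prints the PMIx library version the client is linked against, plus the
 * PMIX_VERSION value the server placed in the environment. */
#include <stdio.h>
#include <stdlib.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;
    const char *srv = getenv("PMIX_VERSION");

    printf("Client PMIx library version: %s\n", PMIx_Get_version());
    printf("Server PMIX_VERSION from environment: %s\n", srv ? srv : "not set");

    /* Connect to the local PMIx server (e.g., the Slurm pmix plugin) */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    PMIx_Finalize(NULL, 0);
    return 0;
}
```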
Here's the output from mpirun:
And just in case it's useful, here's the mpirun output inside a test container:
Hmmm...well, that all looks okay. Just for grins, let's try pushing
Actually, I have to eat my words. The PMIx version shown above when executing
So I honestly have no idea how
Hi @rhc54, I see, I think that makes sense. We'll work on that. Yes, I believe 4.2.3 is the system pmix. Perlmutter has been down on and off since yesterday afternoon, so it might take a bit before we can do more testing. I did test with
@rgayatri23 and I did some more testing today - I'll try to summarize what we did. First, I should clarify something that had both Rahul and me confused - his 5.0.0 build used
So in terms of OpenMPI being built with one pmix and maybe using another, I am not sure about that. I tested with Rahul's 5.0.0 build which used its own internal PMIx located at
I checked and my test application was linked to this PMIx.
I also used
Here's with srun + openmpi 5 internal pmix:
Here's with mpirun 5.0 + internal pmix:
Given that the mpirun tests are "clean" (i.e. don't have any of the warnings we showed earlier), do you think we can infer anything? Maybe building with
To contrast, here's mpirun from 5.0rc12 (i.e. --with-pmix=external):
Let's please drop the rc12 build - it's old, there were many changes made before official release, etc. Can we just focus on a real official release?
Please do not use
This shows we now have identified a working combination. We can now step forward with this combination. Let's ask what happens if you simply
Sure, here are some tests using 5.0.0.
Testing with mpirun looks clean:
Testing with srun shows the same issue we reported earlier:
I think slurm is using
I'm afraid I simply cannot reproduce those results using PMIx v4.2.3 for the server and v4.2.6 for the client. I'm also unable to reproduce it when the client uses PMIx v5.0 or head of the master branch. Everything works just fine. That said, I do see a code path that might get to that point when running under Slurm (which probably does not provide the apps with their binding info). I'll try to explore that next.
@wenduwan @hppritcha I believe the problem here is that the
I'm afraid I cannot debug it further as I can't reproduce it on any machine available to me. Can someone perhaps trace down the OPAL code to see where the thread goes wrong?
I think I may have tracked this down to a "free" that wasn't followed by setting the free'd field to NULL.
@lastephey If I give you a diff for OMPI v5.0.0, would you folks be able to apply it and recompile so you can test it?
@rhc54 , Yes we can compile and test it if you can give us a patch.
Hmmm I was NOT able to reproduce the issue on AWS but I can see it is a latent bug. IIRC Ralph pushed a fix in https://github.com/openpmix/openpmix/commits/master. @rgayatri23 could you give it a try?
The following patch:

```diff
diff --git a/src/hwloc/pmix_hwloc.c b/src/hwloc/pmix_hwloc.c
index b485036d..40d5b40e 100644
--- a/src/hwloc/pmix_hwloc.c
+++ b/src/hwloc/pmix_hwloc.c
@@ -1016,6 +1016,7 @@ pmix_status_t pmix_hwloc_get_cpuset(pmix_cpuset_t *cpuset, pmix_bind_envelope_t
     }
     if (0 != rc) {
         hwloc_bitmap_free(cpuset->bitmap);
+        cpuset->bitmap = NULL;
         return PMIX_ERR_NOT_FOUND;
     }
     if (NULL == cpuset->source) {
```

needs to be applied to the
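For readers following along, a standalone illustration (not OMPI or PMIx code) of the failure mode the one-line change guards against: a freed field that is not reset to NULL can later be freed again, or dereferenced, by another code path.

```c
/* Why resetting a freed field to NULL matters: cleanup code that checks
 * "if (NULL != ptr)" is only safe if earlier frees left no stale pointer. */
#include <stdlib.h>

struct cpuset_like {
    void *bitmap;
};

static void release(struct cpuset_like *c)
{
    if (NULL != c->bitmap) {
        free(c->bitmap);
        c->bitmap = NULL;   /* same idea as the one-line patch above */
    }
}

int main(void)
{
    struct cpuset_like c = { .bitmap = malloc(16) };
    release(&c);
    release(&c);   /* safe only because bitmap was reset to NULL */
    return 0;
}
```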
Thanks @rhc54.
Either one should be fine - that diff was from 5.0.0
Thanks very much @rhc54 for trying to reproduce and giving us a possible patch!
Ok sorry I was confused. Did not read the entire message. The patch is for openpmix.
Ignore my previous message. Just realized what you meant.
It looks like the issue persists even with the patch:

```
rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld> srun -n4 --mpi=pmix ./mpihello.ex
[nid200472:383798] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19263286.1/shared_mem_cuda_pool.nid200472 could be created.
[nid200472:383798] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200473:373446] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19263286.1/shared_mem_cuda_pool.nid200473 could be created.
[nid200473:373446] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
Lrank from MPI = 0Hello from processor nid200472, rank = 0 out of 4 processors
************************************************************************************
Lrank from MPI = 1Hello from processor nid200472, rank = 1 out of 4 processors
************************************************************************************
Lrank from MPI = 2Hello from processor nid200473, rank = 2 out of 4 processors
************************************************************************************
Lrank from MPI = 3Hello from processor nid200473, rank = 3 out of 4 processors
************************************************************************************
rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld>
```

Works fine with mpirun:

```
rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld> mpirun -np 4 ./mpihello.ex
Lrank from MPI = 0Hello from processor nid200472, rank = 0 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 1 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 2 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 3 out of 4 processors
```
No, that's a different error output from elsewhere in the code. Looks to me like you hit an error trying to create a shared memory backing file, which then falls into a bunch of other problems. Afraid I don't know anything about the CUDA support to know where
That's the first part about the CUDA error (which I think I know how to resolve). But the 2nd part still shows the following error
Or was the patch to resolve the following error
There are multiple errors so it's a bit confusing. Sorry about that.
may want to try running with
and see if the warning messages change.
Thanks @hppritcha, this solved one of the issues (UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105):

```
rgayatri@nid200484:/pscratch/sd/r/rgayatri/HelloWorld> PMIX_DEBUG=hello srun -N1 --ntasks-per-node=2 --mpi=pmix --gpus-per-task=1 --gpu-bind=none ./mpihello.ex
[nid200484:1228192] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19266816.3/shared_mem_cuda_pool.nid200484 could be created.
[nid200484:1228192] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
Lrank from MPI = 0Hello from processor nid200484, rank = 0 out of 2 processors
************************************************************************************
Lrank from MPI = 1Hello from processor nid200484, rank = 1 out of 2 processors
************************************************************************************
```

So now we only need to understand the cuda issue. My tricks did not work on resolving it.
I think the following issue #11831 is trying to address a similar situation to what I am observing with the cuda issue.
How much space is in /tmp ? If you could rebuild Open MPI with
we might get more info about why the creation of the file for shared memory is not succeeding.
Here is the relevant information that I saw with debug enabled:

```
[nid200305:346908] shmem: mmap: shmem_ds_resetting
[nid200305:346908] shmem: mmap: backing store base directory: /tmp/spmix_appdir_73349_19269759.0/shared_mem_cuda_pool.nid200305
[nid200305:346908] WARNING: opal_path_df failure!
[nid200305:346908] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19269759.0/shared_mem_cuda_pool.nid200305 could be created.
[nid200305:346908] shmem: mmap: shmem_ds_resetting
[nid200305:346908] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200305:346911] shmem: mmap: shmem_ds_resetting
[nid200305:346911] shmem: mmap: backing store base directory: /dev/shm/sm_segment.nid200305.73349.4fb90000.3
[nid200305:346911] shmem: mmap: create successful (id: 59, size: 16777216, name: /dev/shm/sm_segment.nid200305.73349.4fb90000.3)
[nid200305:346911] shmem: mmap: attach successful (id: 59, size: 16777216, name: /dev/shm/sm_segment.nid200305.73349.4fb90000.3)
[nid200305:346909] shmem: mmap: shmem_ds_resetting
```
opal_path_df is trying to stat /tmp/spmix_appdir_73349_19269759.0/ and getting an error.
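If it helps to check that directory by hand: opal_path_df reports available space, and (assuming it behaves like a typical df-style check built on statvfs) a tiny standalone probe such as the following sketch can show whether the underlying call fails for that path. This is a diagnostic sketch, not OMPI code.

```c
/* Standalone check of what statvfs() reports for a directory, e.g. the
 * /tmp/spmix_appdir_... path from the warning above. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/tmp";
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");   /* a failure here would explain the opal_path_df warning */
        return 1;
    }
    printf("%s: %llu MB available\n", path,
           (unsigned long long)vfs.f_bavail * vfs.f_frsize / (1024ULL * 1024ULL));
    return 0;
}
```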
That makes no sense - that envar just controls debugging output. It has no influence over the pack/unpack system. I suspect all it did was tell the code "don't tell me about errors". The value of the envar is used to set the verbosity level - passing nothing but a string (e.g., "hello") just means that
What the error message is saying is that we were unable to complete the allgather of connection information across the procs. So your simple "hello" might work, but a real application will almost certainly fail. I suspect it has something to do with the problems in setting up the backing store as that directory/file name is one of the things we pass.
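A small aside on why PMIX_DEBUG=hello would effectively silence the debug output: if the envar's value is converted with something like atoi() (an assumption on my part about how the verbosity level is parsed), any non-numeric string yields 0, i.e. verbosity off.

```c
/* Assumption: a verbosity envar parsed numerically treats non-numeric
 * strings as level 0.  atoi("hello") returns 0; atoi("5") returns 5. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("atoi(\"hello\") = %d\n", atoi("hello"));  /* 0 -> no debug output */
    printf("atoi(\"5\")     = %d\n", atoi("5"));      /* 5 -> verbose output  */
    return 0;
}
```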
I'm getting a little lost here and want to see if I can reproduce on perlmutter.
This is my configure line and using

```
./configure CC=cc FC=ftn CXX=CC CFLAGS=--cray-bypass-pkgconfig CXXFLAGS=--cray-bypass-pkgconfig FCFLAGS=--cray-bypass-pkgconfig LDFLAGS=--cray-bypass-pkgconfig --enable-orterun-prefix-by-default --prefix=/pscratch/sd/r/rgayatri/software/gnu/openmpi/5.0.0 --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7 --with-cuda-libdir=/usr/lib64 --with-ucx=no --with-verbs=no --enable-mpi-java --with-ofi --enable-mpi1-compatibility --with-pmix=internal --disable-sphinx --enable-debug
```
My best bet would be to manually instrument
Are these runs being done on a GPU partition or CPU? I built ompi 4.1.6 without --enable-debug and with PMIx internal, and when running on the CPU partition I see this:
On the GPU side I see this:
I recall we poked around with this previously and determined that the PIDs here were coming from the transient slurmstepd that subsequently exec'd the real application.
You might check to ensure that you have the same Slurm running on all the nodes, and that each slurmd is in fact getting the same PMIx lib. If that error is coming from the slurmd, then I'll bet you that the slurmd on another node is picking up a debug PMIx lib, while this slurmd is using a non-debug one. We handle such cross-over between slurmd and application proc - but not between slurmd's. They must be using the same library, including same debug setting.
Hmm... I suspect for this NERSC system that the same slurm is indeed running on all the nodes. I also noticed NERSC has several SLURM_PMIX envariables set by default. Any idea why these are set? (question for the NERSC SLURM specialist). I did some more careful testing of 4.1.6, making absolutely sure I was using the PMIx that NERSC has used for building the PMIx plugin. I stopped using cuda and used the --disable-mca-dso config option to pull in the libpmix.so into the executable (the two don't mix). See below:
To be more specific, I was using the pmix stuff from the NERSC PMIX RPM - pmix-4.2.3-2_nersc.x86_64
I also noticed with my hello world program that the PMIX ERROR messages being emitted by what seems to be the slurmstepd daemons are not deterministic:
What I think is we have at least two problems here, possibly three.
So NERSC updated PMIx to 4.2.7 in the recent maintenance and I built OpenMPI/5.0.0 using the newer PMIx.
I'm just catching up on this -- has the discussion moved from v4.1.x to v5.0.x? I'm guessing that if there are fixes that are needed on the run-time side of things, it will be significantly easier to get them in a v5.0.x-related release (e.g., for PMIx and/or PRRTE).
Background information
Dear OpenMPI developers,
I'm going to describe an issue that @hppritcha kindly helped me troubleshoot during SC a few weeks ago. It may actually be a Slurm issue rather than an OpenMPI issue, but I wanted to share the information I have with you.
Background: we are trying to get support working for OpenMPI using podman-hpc, a relatively new container runtime at NERSC. By default, podman-hpc runs in rootless mode, where the user inside the container appears to be root. Our current methodology is to use either srun or mpirun to launch the MPI/PMI wireup outside the container, and then have this connect to OpenMPI installed with PMI support inside the container. Although I've done tests both with and without Slurm support, I'll focus on the Slurm case here to keep things simple. However, I am happy to provide more information about mpirun launch if you would like it.
The situation: we are able to make this setup work using PMI2, but it is not currently working using PMIx. However, we have observed that when we run the container as the user (i.e. using --userns=keep-id) rather than in rootless mode, our OpenMPI+PMIx test does succeed. Since running with --userns=keep-id can be substantially slower, we would really like to enable containers running as root.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
We are running with OpenMPI version 4.1.6 as suggested by @hppritcha. For this test we have built a single container image with OpenMPI built with both PMI2 and PMIx. We toggle between PMI2 and PMIx support in our tests.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
We developed a Containerfile recipe for this build:
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A
Please describe the system on which you are running
Details of the problem
We are running the same hello world mpi4py test with both OpenMPI PMI2 and OpenMPI PMIx. Additionally, we are running the same test both with and without --userns=keep-id. All tests succeed except the rootless PMIx test (i.e. PMIx without --userns=keep-id). We are using a PMI2 and PMIx helper module.

Running the pmi2 test succeeds:
Running the pmix + userns=keep-id test succeeds:
Running pmix test fails:
It's not clear to me if this is a Slurm issue or an OpenMPI/PMIx issue. I haven't been able to make this work with mpirun either, but I left out those details here since this issue is already quite long. @hppritcha suggested we may want to file an issue with Slurm support, but I wanted to share this with you all first before we go down that route.
Thanks very much for your help,
Laurie Stephey
cc @rgayatri23