User's PMIx call fails after MPI_Init + UCX/HCOLL #6982
Comments
Thanks @jsquyres!

@artpol84 @karasevb @jladd-mlnx Ping.
I have noticed some weird issues with what I suppose is the memory allocator used by the UCX PML. In my case the issue arises if I dlopen CUDA (via cuBLAS) after an MPI_Init where the UCX PML is loaded but not enabled (so it is dlclosed). The issue manifests as a segfault deep inside dlopen while allocating some string. If I prevent the UCX PML from loading, then the code works just fine.
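A minimal sketch of the trigger sequence described here, assuming `libcublas.so` as the dlopened library (the exact library and flags in the original scenario may differ):

```c
/* Sketch of the scenario described above: dlopen a CUDA library after
 * MPI_Init has loaded (and then dlclosed) the UCX PML. */
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);  /* UCX PML loaded but not selected -> dlclosed */

    /* With the UCX PML's memory hooks left behind, this reportedly
     * segfaults deep inside dlopen while allocating a string. */
    void *h = dlopen("libcublas.so", RTLD_NOW | RTLD_GLOBAL);
    if (!h) fprintf(stderr, "dlopen failed: %s\n", dlerror());

    MPI_Finalize();
    return 0;
}
```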
@angainor could you try to run with this added:
@karasevb unfortunately, still the same:
@bosilca disabling the patcher may help in your case.
@angainor I'll try to repro next week.
@angainor, I tried to reproduce, with the following results. First, I had to fix the return value; without the fix I was getting error messages about procs exiting with non-zero status:

```diff
--- repro.c	2019-12-17 11:23:36.216701105 -0800
+++ repro_new.c	2019-12-17 11:34:33.000156999 -0800
@@ -96,5 +96,6 @@
     }
     if(myproc.rank == 0) printf("PMIx finalized\n");
+    return 0;
     // MPI_Finalize();
 }
```

I was using HPCX, which is based on ompi-v4.0.x (af04a9d), and running on 2 nodes:
For me, the patched reproducer passes successfully (note that I'm explicitly requesting the UCX PML):
Is there anything I am missing? P.S. In order to build the repro I had to explicitly link it with OMPI's PMIx:
otherwise, I was getting an error:
How are you building your repro? Could it be that a different PMIx version is being used? It doesn't look like it; otherwise, the presence or absence of UCX wouldn't matter. I noticed that you provide LD_LIBRARY_PATH in your command invocation; why do you need it?
@artpol84 I'll be able to look at it in more detail tomorrow, but some thoughts:

- You have to run on different compute nodes. If you start the ranks on the same node, it works.
- I was building OpenMPI against an external PMIx version, which was also used to link the reproducing program. I did not use the internal PMIx, also for the reasons you've described.
Thanks, @angainor.

Yeah, I noticed that, and that's why I explicitly said above that I was using 2 nodes.

OK, this could be a key difference. I'll try to build my own OMPI and link it against an external PMIx.

I see, I probably should try to run without Slurm as well.
@artpol84 The same happens when I run from within Slurm, running as:
@artpol84 I checked again running on the same node with Slurm, and also in this case things work. FYR, this is the list of modules I load:

I have custom versions of UCX and PMIx, but I also compile OpenMPI against HPCX and hcoll. Maybe that matters.
@rhc54 @artpol84 I believe I've found the simplest use case that breaks things, and it seems it is not related to UCX/OpenMPI. I can reproduce the behavior using PMIx code only. The problem is triggered by the following sequence of PMIx pseudo-calls:
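In sketch form (a hedged reconstruction, assuming the standard PMIx 2.x client API; `user.key` is an illustrative name, and the fence stands in for the modex fence MPI_Init performs internally):

```c
PMIx_Init(&myproc, NULL, 0);
PMIx_Fence(NULL, 0, NULL, 0);                 /* like MPI_Init's modex   */
PMIx_Put(PMIX_GLOBAL, "user.key", &val);      /* posted after the fence  */
PMIx_Commit();
PMIx_Get(&peer, "user.key", NULL, 0, &ret);   /* -46 on a remote node    */
PMIx_Finalize(NULL, 0);
```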
If I do a `PMIx_Get` for that key, the call fails. That is of course a problem :) To make my code work I have to add an explicit second `PMIx_Fence` after the `PMIx_Commit`.
The code sequence you show cannot work, as it is missing a required call.
@rhc54 No, the call is there.
Maybe you could provide us with a complete test code - I confess I'm getting confused.
Of course! Lines 132-133 break things; if the second `PMIx_Fence` is commented out, the `PMIx_Get` fails.

The complete code without any MPI calls:
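A minimal sketch of such a pure-PMIx reproducer, assuming the PMIx 2.x client API; key names, fence placement, and line positions are illustrative and may differ from the original:

```c
/* Sketch of the reproducer described above (no MPI calls). Run under
 * mpirun with one process per node on at least two nodes. */
#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(int argc, char **argv)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *ret;
    pmix_status_t rc;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) return 1;

    /* First exchange: stands in for the modex fence inside MPI_Init. */
    PMIx_Fence(NULL, 0, NULL, 0);

    /* Post a user key only after that first fence. */
    val.type = PMIX_STRING;
    val.data.string = "hello";
    PMIx_Put(PMIX_GLOBAL, "user.key", &val);
    PMIx_Commit();

    /* Reported workaround: with this second, explicit fence the remote
     * get below succeeds; without it, it fails with -46
     * (presumably PMIX_ERR_NOT_FOUND). */
    /* PMIx_Fence(NULL, 0, NULL, 0); */

    /* Rank 0 fetches the key posted by rank 1 on the other node. */
    if (0 == myproc.rank) {
        PMIX_PROC_CONSTRUCT(&peer);
        strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
        peer.rank = 1;
        rc = PMIx_Get(&peer, "user.key", NULL, 0, &ret);
        printf("PMIx_Get: %s\n", PMIx_Error_string(rc));
    }

    PMIx_Finalize(NULL, 0);
    if (0 == myproc.rank) printf("PMIx finalized\n");
    return 0;
}
```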
And you are running this using mpirun within OMPI (I see master and v4.0.x cited above), yes? I assume you built OMPI against the same external version of PMIx you used to compile your program - what version was that? I see v2.1.4 cited above, but also something about using the OMPI internal PMIx, which would not be supported/possible.
Right now I am using OpenMPI 4.0.2 to run the test. It is compiled against an external PMIx.
Kewl - thx! Let me poke into it a bit.
Hmmm...well, I tested it against PMIx master with PRRTE master, and it works fine. I added your code to the "test" area, but I can remove it if you like - up to you. I wanted to keep it for further tests, to ensure this continues to work. You'll find it here: openpmix/prrte#294. Please take a look and see if I am doing something wrong/different. From what I saw, it appears the issue is in the ORTE code, not PMIx - and that PRRTE is doing it correctly. We are getting ready to replace ORTE with PRRTE, but that will not be released until OMPI v5 this summer.
@rhc54 Sure, please use the code as you see fit. Just to make sure: did you comment out the second `PMIx_Fence`? I can look at PRRTE tomorrow. I remember it did not work with the original code reported in this issue. But then I called
Yes - in fact, I removed all fences from the code path and it still worked.
@rhc54 I tested with clean clones of prrte and pmix master. To reproduce the error you have to run the test with the ranks on different compute nodes.
Okay, I might fiddle a bit with the reproducer, as it only fails if I have two nodes and run one process per node. I suspect it is finding a problem in the PMIx server's code path when a host asks for modex info - the code itself only involves ranks 0 and 1, and so if those two ranks are on the same node it works just fine. I'll take a look at it today.
@rhc54 Thanks! Yes, the ranks need to run on different compute nodes. So it seems that it is a problem with how the servers exchange the data between them.
Not precisely. I think the problem is that the two host daemons are sending requests to the other side, and those requests are being passed down into the respective PMIx server libraries. The problem is that those libraries are (a) looking for the requested key and not finding it, and then (b) checking to see if we have already received data from the target process. If (b) is true, then they immediately respond with "not found" instead of caching the request until the data appears. Problem is that the remote client will "hang" forever in PMIx_Get if the target proc never posts the data. This makes use of the

I'll try to have something for you to play with later in the day.
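For reference, PMIx exposes standard attributes that let a client bound such a wait itself. A sketch using the `PMIX_TIMEOUT` info key, reusing `peer` from the sketch above (`PMIX_TIMEOUT` is a standard key, but whether it is the attribute being discussed here is an assumption):

```c
/* Client-side guard: ask PMIx_Get to give up after a timeout instead of
 * waiting forever for the target to post the key. The key name
 * "user.key" is illustrative. */
pmix_info_t info;
pmix_value_t *ret;
int timeout = 10;   /* seconds */

PMIX_INFO_CONSTRUCT(&info);
PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout, PMIX_INT);
pmix_status_t rc = PMIx_Get(&peer, "user.key", &info, 1, &ret);
PMIX_INFO_DESTRUCT(&info);
```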
@artpol84 FYI As discussed with @rhc54 in an issue reported in the PMIx repo, it seems something in the OpenMPI runtime 'breaks' the PMIx infrastructure so that it is not possible to distribute the user's keys if a `PMIx_Put` + `PMIx_Commit` call is made after the `MPI_Init` call. That is, `PMIx_Get` fails on the clients with error -46. If the user's code sets the custom key before `MPI_Init`, then the code works as expected.

What's puzzling is that I only observe this problem when the UCX PML and HCOLL are enabled. I compile the code attached at the end of this post against OMPI master + its internal PMIx, but I see the same behavior for OMPI 4.0.1 + PMIx 2.1.4:
If I turn off UCX and HCOLL, things work as expected:
Here is the reproducing code. To compile it, one needs to pass the include and link paths of the PMIx installation used by OpenMPI. I'd appreciate any insight.