MPI Connect/accept broken except when from within a single mpirun #3458
Comments
Fixed in master by checking for ompi-server presence (if launched by mpirun), or availability of publish/lookup support if direct launched, and outputting a friendly show-help message if not. Operation of ompi-server was also repaired for v3.0. Backports to the 2.x series are not planned.
(apologies in advance if I should have opened a new issue instead) @rhc54 Thanks very much for looking into this - I was one of the ones hoping to use this feature. Unfortunately it still seems to be giving an error (though a different one this time):
I've attached a relatively simple reproducer, which for me gives the above errors on today's master (1f799af). A different test, using connect/accept within a single mpirun instance, still works fine.
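The attached reproducer itself is not shown in the thread. As a rough, minimal sketch of this kind of cross-mpirun connect/accept test (assuming the server's port string is simply handed to the client on the command line, which is an illustrative convention rather than the original test), it might look something like this:

```c
/* Minimal connect/accept sketch. Run each side under its own mpirun, e.g.
 *   mpirun -n 1 ./cross_connect server            # prints the port string
 *   mpirun -n 1 ./cross_connect "<port string>"   # connects from a separate mpirun
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    if (argc < 2) {
        fprintf(stderr, "usage: %s server | <port string>\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);   /* pass this string to the client side */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Close_port(port);
    } else {
        /* client: argv[1] is the port string printed by the server */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

Within a single mpirun the same calls succeed; it is only the cross-mpirun case that hits the error discussed here.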
I've gone back and looked at where this stands, and found that I had fixed ompi-server, but there was still some work left to resolve the cross-mpirun connect issue. I've taken it as far as I have time right now and will commit those changes. However, it won't fix the problem and so we won't port it to a release branch. There remains an issue over how the callbacks are flowing at the end of the connect operation. An object is apparently being released at an incorrect time. I'm not sure who will be picking this up. Sorry I can't be of more help.
@hppritcha Just a reminder - this is still hanging around.
FWIW: the connect/disconnect support was never implemented in ORTE for the v2.x series.
Perhaps this is a stupid question, as I am not very familiar with GitHub. Is this actively being worked on? I am running into this problem as well.
Not at the moment; it is considered a low priority, I'm afraid, and we don't have anyone focused on it.
Thanks for the response. If anyone reading this is interested, I have written a reusable workaround for this problem. If anyone shows interest I'll clean it up and put it in a public repo. (It's not a source modification, it's a separate .h file.)
Yes, I am interested. We had to disable some functionality when running on OpenMPI because of this. How did you get around it?
I have the code up here (https://github.com/derangedhk417/mpi_controller). It's just a basic wrapper around some POSIX shared memory functions. It makes use of semaphores to handle synchronization between the controller and the child. I haven't exactly made this super user friendly, but it should do the trick. I'll try to add some documentation in the next few hours. Notes:
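The mpi_controller repository linked above is the actual workaround; purely as a rough illustration of the idea it describes (a POSIX shared-memory region plus a named semaphore for controller/child synchronization), a minimal sketch might look like the following. The object names "/mc_shm" and "/mc_sem" and the one-way handshake are invented for this example.

```c
/* Sketch only: the controller writes a message into shared memory, the child
 * waits on a named semaphore and reads it.
 * Build on Linux with: cc demo.c -o demo -lrt -pthread
 * Run the controller as "./demo controller" and the child as "./demo". */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHM_NAME "/mc_shm"   /* invented names for the example */
#define SEM_NAME "/mc_sem"
#define SHM_SIZE 4096

int main(int argc, char **argv)
{
    int is_controller = (argc > 1 && strcmp(argv[1], "controller") == 0);

    /* Both sides create/open the same shared-memory object and semaphore. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SHM_SIZE) != 0) return 1;
    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    sem_t *ready = sem_open(SEM_NAME, O_CREAT, 0600, 0);
    if (buf == MAP_FAILED || ready == SEM_FAILED) return 1;

    if (is_controller) {
        snprintf(buf, SHM_SIZE, "hello from controller (pid %d)", (int)getpid());
        sem_post(ready);            /* signal the child that the data is ready */
    } else {
        sem_wait(ready);            /* block until the controller posts */
        printf("child received: %s\n", buf);
        shm_unlink(SHM_NAME);       /* clean up once the exchange is done */
        sem_unlink(SEM_NAME);
    }

    munmap(buf, SHM_SIZE);
    close(fd);
    sem_close(ready);
    return 0;
}
```

The real wrapper presumably layers more structure on top (message framing, multiple children, error handling); this only shows the core shm-plus-semaphore mechanism.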
Another option was brought to my attention today. If you know that one of the mpirun executions will always be running, then you can point the other mpiruns to it as the "ompi-server" like this:

```
$ mpirun -n 3 --report-uri myuri.txt myapp &
$ mpirun -n 2 --ompi-server file:myuri.txt myotherapp
```

This makes the first mpirun act as the global server. I'm not sure it will solve the problem, but it might be worth trying.
@rhc54 This hasn't worked for me, at least, unfortunately :( Have there been any updates on this?
Not really - the developer community judged it not worth fixing and so it has sat idle. Based on current plans, it will be fixed in this year's v5.0 release - but not likely before then. Note that you can optionally execute your OMPI job against the PMIx Reference RTE (PRRTE). I believe this is working in that environment. See https://pmix.org/support/how-to/running-apps-under-psrvr/ for info.
@rhc54 I wanted to let you know that support for these APIs is important to us in Dask. See dask/dask-mpi#25. Our use case is that we need a way to create MPI processes from already existing processes (without launching a new process) and build up a communicator among these processes.
This issue is also a blocker for our use of OpenMPI with our MPI job manager (mpi_jm), which we use to increase job utilization on large supercomputers for sub-nuclear physics simulations (https://arxiv.org/pdf/1810.01609.pdf). This has forced us to use MVAPICH, which compared to OpenMPI (or Spectrum MPI) results in reduced performance, but correctness is godliness in comparison. (We here being CalLat, a collaboration of physicists centred at LLNL and LBNL, using Summit, Sierra, Titan, etc.)
Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.
Okay, you guys - the fix is here: #6439. Once it gets thru CI I'll post a PR to backport it to the release branches.
I can't thank you enough for this! Thank you thank you thank you!
@rhc54 Can this issue be closed?
MPI_Comm_connect/MPI_Comm_accept in 4.0.2 still do not work except when called from within a single mpirun. We're stuck at 1.6.5 and cannot upgrade to any of the latest Open MPI releases. Please help fix this. The error message from the slave process calling MPI_Comm_connect under 4.0.2:

```
The user has called an operation involving MPI_Comm_connect and/or MPI_Accept
Please ensure the tool is running, and provide each mpirun with the MCA
```
@q2luo I'm not sure how to respond to your request. The error message you show indicates that the mpirun starting the slave process was not given the URI of the ompi-server. Cross-mpirun operations require the support of ompi-server as a rendezvous point. You might want to try it again, ensuring you follow the required steps. If that doesn't work, please post exactly what you did to encounter the problem.
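For readers following along, the rendezvous path being described uses the MPI name-service calls, which in Open MPI route through ompi-server when the two sides live under different mpiruns. A minimal sketch (the service name "my_service" is invented here, and it assumes an ompi-server is running and each mpirun was pointed at its URI, e.g. via --ompi-server file:uri.txt as shown earlier in the thread) might look like:

```c
/* Sketch of publish/lookup rendezvous between two separately launched jobs.
 * Start the "server" side first so the name is published before the lookup. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my_service", MPI_INFO_NULL, port);  /* registered via ompi-server */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("my_service", MPI_INFO_NULL, port);   /* resolved via ompi-server */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

If the mpirun starting either side was not given the ompi-server URI, this lookup/connect step is where the show-help message quoted above appears.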
@rhc54 I have been paying attention to the threads related to this same issue since 2015. I tried many different OpenMPI releases; the last working release is 1.6.5, and all releases 1.7.1 or higher have the same problem. I also tried pointing each mpirun at an ompi-server URI, but with no success. Your May 5, 2017 description at the beginning of this thread describes the issue very well. In fact, the OpenMPI release "list of changes" file also documents it as a known issue in the 3.0 section: "- MPI_Connect/accept between applications started by different mpirun". We use OpenMPI in the following way; the example below assumes using 8 hosts from LSF:
The above 2 steps realize the same goal as "mpirun -n 8". "mpirun -n 8" works fine for all OpenMPI releases, but the semiconductor industry doesn't allow this usage due to IT policies. Thanks and regards.
Look, I'm happy to help, but you have to provide enough information so I can do so. I need to know how you actually are starting all these programs. Do you have ompi-server running somewhere that all the hosts can reach over TCP? What was the cmd line to start the programs on each host? I don't know who you mean by "semiconductor industry", but I know of at least one company in that collective that doesn't have this issue 😄 This appears to be a pretty extreme use case, so it isn't surprising that it might uncover some problems.
Each application is started with "mpirun -n 1" on a host acquired by LSF. I tried in-house starting an ompi-server and pointing each individual mpirun to it, but connect/accept still fails. On the other hand, even if it worked, it would be impractical to use because it would require company IT to start and maintain a central ompi-server. Yes, all hosts can reach each other over TCP: the SSH-based approach via "mpirun -n 8" works, and a single LSF bsub command with "mpirun -n 8" also works.
So let me summarize. You tried doing this with a central ompi-server in a lab setup and it didn't work. Regardless, you cannot use a central ompi-server in your production environment. I can take a look to ensure that ompi-server is still working on v4.0.2. However, without ompi-server, there is no way this configuration can work on your production system. The very old v1.6 series certainly would work, but it involves a runtime that doesn't scale to today's cluster sizes - so going back to that approach isn't an option. On the positive side, you might get IBM to add PMIx integration to LSF - in which case, you won't need ompi-server any more. Might be your best bet.
@rhc54 Thanks for your explanation. Even if adding a PMIx hook to LSF works, the same problem will still be faced on RTDA, SGE/UGE, and other grids. The v1.6 series has some serious issues, such as memory corruption and network interface card recognition problems; all of those issues are fixed in the latest 3.x and 4.x releases, from my testing. Our application normally needs up to 256 hosts, each with physical memory of at least 512GB and up to 3TB. It's normally impossible to acquire 64 such big-memory machines instantly, so "mpirun -n 64" will almost never succeed (unless IT sets aside 64 hosts dedicated to one job/person). Instead, 64 hosts are normally obtained sequentially by 64 independent grid commands, and the time to acquire all 64 machines can span from minutes to hours. I wonder how connect/accept works in the default mode of "mpirun -n 32": is ompi-server not used, or is the public connect/accept API not used in this default mode? Thanks and regards.
The closest MPI comes to really supporting your use-case of the "rolling start" is the MPI Sessions work proposed for v4 of the standard. In the meantime, what I would do is:
This will allow proper wireup of your connect/accept logic. From your description of your scheduler, it shouldn't cause you any additional delays in getting the desired resources. You might even get your IT folks to set up a "high priority" queue for the secondary submission.
Thank you for taking the time to submit an issue!
Background information
Multiple users have reported that MPI connect/accept no longer works when executed between two applications started by separate cmd lines. This includes passing the "port" on the cmd line, and use of ompi-server as the go-between.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Sadly, this goes back to the 2.x series and continues thru 3.x to master.
Details of the problem
When we switched to PMIx for our wireup, the "port" no longer represents a typical TCP URI. It instead contains info PMIx needs for publish/lookup to rendezvous. Fixing the problem requires a little thought as application procs no longer have access to the OOB, and we'd rather not revert back to doing so.