
MPI library used is the one compiled with the application and not the one used for mpirun. #11269

Closed
wckzhang opened this issue Jan 5, 2023 · 35 comments

Comments

@wckzhang
Contributor

wckzhang commented Jan 5, 2023

Moved ticket from openpmix/openpmix#2898

[ec2-user@ip-10-0-0-28 ~]$ ~/ompi5install/bin/mpirun -np 1 ldd ~/osu4/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
<snip>
	libmpi.so.40 => /opt/amazon/openmpi/lib64/libmpi.so.40 (0x00007f6e0e757000)
<snip>
	libopen-pal.so.40 => /opt/amazon/openmpi/lib64/libopen-pal.so.40 (0x00007f6e0d358000)
<snip>
[ec2-user@ip-10-0-0-28 ~]$ ~/ompi5install/bin/mpirun -np 1 ldd ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
<snip>
	libmpi.so.80 => /home/ec2-user/ompi5install/lib/libmpi.so.80 (0x00007f325d42a000)
<snip>
	libopen-pal.so.80 => /home/ec2-user/ompi5install/lib/libopen-pal.so.80 (0x00007f325c08d000)
<snip>

Summary:
I compiled OMB against both Open MPI 5.0.x and 4.1.4. Since no ABI break should have occurred, I expected the 5.0.x mpirun to be able to launch the 4.1.4-compiled OMB. However, this did not work. The root cause is that the MPI library used is the one the application was compiled against. I'm not convinced this is the behavior that we want to have.

@rhc54
Contributor

rhc54 commented Jan 6, 2023

I think the problem is that prterun picks up the various "prefix" settings you push into the environment in your mpirun wrapper and is passing them down to the app. Since both of us use PMIx, that's where the intersection generates a problem.

The fix likely lies in the PRRTE project - basically, omitting the "prefix" settings from those passed down to the application. Unfortunately, you might have a case where that needs to be done, so perhaps the correct answer is to define a new "application" prefix that is passed specifically to the app and ignored by PRRTE? Or maybe the other way around, if that helps with backward compatibility.

@ggouaillardet
Contributor

You must also make sure Open MPI was built with --enable-new-dtags; otherwise the rpath recorded at build time takes precedence over $LD_LIBRARY_PATH.

@jsquyres
Member

jsquyres commented Jan 8, 2023

@bwbarrett
Member

William should go verify that we did default to using runpath, but I'm pretty sure we did.

The larger issue is that when we rewrote the mpirun executable to deal with PRRTE, we didn't appear to add any of the code to have PRRTE set LD_LIBRARY_PATH for applications when prefix is set (either explicitly or, in this case, because the full path to mpirun was given), which means that the linker is falling back to the runpath.

@wckzhang
Contributor Author

Looks like it's defaulted to rpath:

[ec2-user@ip-10-0-0-28 lib]$ readelf -d ./libmpi.so | egrep -i 'rpath|runpath'
 0x000000000000000f (RPATH)              Library rpath: [/home/ec2-user/ompi5install/lib]
[ec2-user@ip-10-0-0-28 lib]$ readelf -d /opt/amazon/openmpi/lib64/libmpi.so | egrep -i 'rpath|runpath'
 0x000000000000000f (RPATH)              Library rpath: [/opt/amazon/openmpi/lib64]

@bwbarrett
Member

That's... not awesome. Can you include your config.log?

@wckzhang
Contributor Author

From the configure command in config.log:

  $ ./configure --prefix=/home/ec2-user/ompi5install --with-libfabric=/opt/amazon/efa/

Do you need the entire log?

@bwbarrett
Member

Yes, or at least the part where it figures out runpath vs. rpath. And probably throw in mpicc -showme as well.

@wckzhang
Contributor Author

[ec2-user@ip-10-0-0-28 ~]$ ./ompi5install/bin/mpicc -showme
gcc -I/home/ec2-user/ompi5install/include -pthread -L/home/ec2-user/ompi5install/lib -Wl,-rpath -Wl,/home/ec2-user/ompi5install/lib -Wl,--enable-new-dtags -lmpi

config.log

@wckzhang
Contributor Author

@rhc54 Hi Ralph, reading our mpirun shim code, it looks like we set the local libdir as an env used by prrte:

    /*
     * set environment variable for our install location
     * used within the OMPI prrte schizo component
     */

    setenv("OMPI_LIBDIR_LOC", opal_install_dirs.libdir, 1);

I can confirm that this libdir is the libdir of the mpirun, but it looks like LD_LIBRARY_PATH does not match on the compute nodes. Is PRRTE supposed to be using this env to set LD_LIBRARY_PATH or is there another way to propagate this information to PRRTE?
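For reference, the prepend semantics being asked about can be sketched in shell (the directory value below is hypothetical; in reality the propagation would happen inside PRRTE, in C, just before exec'ing the app):

```shell
# Hypothetical illustration of the desired propagation: prepend the mpirun
# install's libdir (carried in OMPI_LIBDIR_LOC) onto the app's LD_LIBRARY_PATH,
# preserving any value that was already set.
OMPI_LIBDIR_LOC=/home/ec2-user/ompi5install/lib
LD_LIBRARY_PATH="${OMPI_LIBDIR_LOC}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

The `${VAR:+...}` expansion avoids a trailing `:` when LD_LIBRARY_PATH was previously unset (a bare trailing colon would add the current directory to the search path).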

@bwbarrett
Member

I think that was supposed to be added (by OMPI developers, not Ralph) to the schizo framework. Maybe it never was?

@rhc54
Contributor

rhc54 commented Jan 20, 2023

I see it there:

$ ack OMPI_LIBDIR_LOC ~/pmix/prrte
/Users/rhc/pmix/prrte/src/mca/schizo/ompi/schizo_ompi.c
289:    ompi_install_dirs_libdir = getenv("OMPI_LIBDIR_LOC");
$

and then I see it used to deal with jar files:

251:static char *ompi_install_dirs_libdir = NULL;
263:        asprintf(&str, fmt, app->app.argv[index], ompi_install_dirs_libdir, jarfile);
289:    ompi_install_dirs_libdir = getenv("OMPI_LIBDIR_LOC");
290:    if (NULL == ompi_install_dirs_libdir) {
310:            if (NULL == strstr(app->app.argv[i], ompi_install_dirs_libdir)) {
313:                    asprintf(&value, "-Djava.library.path=%s%s", dptr, ompi_install_dirs_libdir);
315:                    asprintf(&value, "-Djava.library.path=%s:%s", dptr, ompi_install_dirs_libdir);
326:        asprintf(&value, "-Djava.library.path=%s", ompi_install_dirs_libdir);
342:            value = pmix_os_path(false, ompi_install_dirs_libdir, "mpi.jar", NULL);
364:                value = pmix_os_path(false, ompi_install_dirs_libdir, "mpi.jar", NULL);
388:            value = pmix_os_path(false, ompi_install_dirs_libdir, "mpi.jar", NULL);
$

but nothing else.

@wckzhang
Contributor Author

It looks like this was added purely for java support:

    prrte schizo: add an env variable for OMPI libdir
    
    Adding support for java in the prrte ompi schizo requires that
    the component know where the OMPI libs (libmpi_java.so) are installed.

I'll see if this can be extended to also set LD_LIBRARY_PATH. I see PRRTE has some code for this in plm_ssh_module.c; should the ompi schizo component be using the pass_libpath MCA var to propagate this?

@rhc54
Contributor

rhc54 commented Jan 20, 2023

You don't want to use an MCA param - we are discouraging those in PRRTE. Since this is a setting solely for the app, you probably want to simply add a pmix_envar_t to the app description that directs PRRTE to prepend the value to the existing envar. I'll take care of it and point you to it as an example of how to do these things.

@rhc54
Contributor

rhc54 commented Jan 21, 2023

You know, just thinking about it, I don't think this will solve the problem you encountered. What this will do is set the library path so that whatever PRRTE launches will "see" the OMPI libraries associated with the prterun being used to launch the job (assuming, of course, that prterun comes from the OMPI build).

What you wanted was to ensure that the OMPI job being launched used the MPI libraries it was built against - and not the ones associated with the build of prterun.

So this seems to actually accomplish the opposite of what you seek - yes?

@wckzhang
Contributor Author

I think the idea was that if you are using a specific mpirun to launch a job, you would want that mpirun's MPI library to be used. @jsquyres @bwbarrett, do you have opinions on what the exact behavior we are looking for should be?

@rhc54
Contributor

rhc54 commented Jan 21, 2023

I personally don't care which direction you go, but please note that there was a lengthy debate about this before settling on the current behavior (I forget if it was on some issue ticket or the devel mailing list). I believe @devreal was one of those advocating for the current behavior (don't have mpirun dictate the MPI library)?

I'd rather not get caught in a yo-yo situation, so perhaps you folks should hammer out what you want? Maybe controlled by some kind of option?

@gpaulsen
Member

In today's call, @wckzhang agreed to update this issue with a consensus of what is desired for v5.0.0. Thanks!

@wckzhang
Contributor Author

We talked about this issue in the OMPI weekly telecon today and here's the summary of our consensus:

We do not think the current behavior is intuitive. If the user specifies --prefix (or specifies the absolute path), we think that the LD_LIBRARY_PATH on the remote node should be set to include the mpirun's libmpi before invoking the application but after invoking the prted.

This is the current documentation of prefix behavior, but it should be changed:

--prefix <dir>: Prefix directory that will be used to set the PATH and LD_LIBRARY_PATH on the remote node before invoking Open MPI or the target process. See the [Remote Execution](https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man1/mpirun.1.html#man1-mpirun-remote-execution) section, below.

@wckzhang
Contributor Author

@rhc54 I just realized that the behavior of setting LD_LIBRARY_PATH differs depending on whether mpirun runs on my head node or on a compute node. When I launch mpirun on the head node, it does populate my LD_LIBRARY_PATH:

[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/mpirun -np 2 --hostfile ~/hostfile  ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
.
.
The env value is: /home/ec2-user/ompi5install/lib:/home/ec2-user/ompi5install/lib: for rank: 0
The env value is: /home/ec2-user/ompi5install/lib:/home/ec2-user/ompi5install/lib: for rank: 1

But when launching mpirun on a compute node:

[ec2-user@c5n-dy-c5n18xlarge-1 ~]$ ~/ompi5install/bin/mpirun -np 2 --hostfile ~/hostfile  ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
.
.
The env value is: (null) for rank: 0
The env value is: (null) for rank: 1

That behavior seems a bit unexpected; is this intentional behavior on PRRTE's end?

@rhc54
Contributor

rhc54 commented Jan 24, 2023

I'm unaware of any reason for that difference, nor what in the code would cause it. I'd take a hard look at the environment on each of those nodes to see if that envar is already set there, or being set by some .bashrc script.

@bwbarrett
Member

Is one launching using slurm and another ssh?

@wckzhang
Contributor Author

What I've realized from digging into this is that my Open MPI installations' libmpi.so are all configured with RPATH, not RUNPATH. This happens without my specifying either behavior, so it represents the default.

When I explicitly specify "LDFLAGS=-Wl,--enable-new-dtags", it properly uses RUNPATH, and the linker does support runpath:

configure:215369: checking if linker supports RUNPATH
configure:215382: gcc -o conftest -O3 -DNDEBUG  -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -Wshadow -Werror-implicit-function-declaration -fno-strict-aliasing -pedantic -Wall -Wformat-truncation=0 -finline-functions -mcx16 -iquote$(top_srcdir)    -Wl,--enable-new-dtags conftest.c -lpthread -lrt -lm -lutil    >&5
conftest.c:660:1: warning: function declaration isn't a prototype [-Wstrict-prototypes]
 main ()
 ^~~~
configure:215382: $? = 0
configure:215385: result: yes (-Wl,--enable-new-dtags)

I am currently digging into why the default is not RUNPATH.

@wckzhang
Contributor Author

It looks like the wrapper compilers are doing the right thing and adding --enable-new-dtags. So the problem appears to be only that libmpi.so itself is not being linked with --enable-new-dtags.

[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/mpicc --showme
gcc -I/home/ec2-user/ompi5install/include -pthread -L/home/ec2-user/ompi5install/lib -Wl,-rpath -Wl,/home/ec2-user/ompi5install/lib -Wl,--enable-new-dtags -lmpi
[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/ompi_info --all | grep wrapper
 Fort f08 using wrappers: yes
            MCA pml base: parameter "pml_wrapper" (current value: "", data source: default, level: 9 dev/all, type: string, synonym of: pml_base_wrapper)
            MCA pml base: parameter "pml_base_wrapper" (current value: "", data source: default, level: 9 dev/all, type: string, synonyms: pml_wrapper)
[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/ompi_info --all --parsable | grep wrapper:extra
[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/ompi_info --all --parsable | grep wrapper
compiler:fortran:08_wrappers:yes
option:wrapper:cflags:-pthread
option:wrapper:cxxflags:-pthread 
option:wrapper:fcflags:
option:wrapper:ldflags:-L${libdir}  -Wl,-rpath -Wl,${libdir} -Wl,--enable-new-dtags
option:wrapper:libs:-lmpi
mca:pml:base:param:pml_wrapper:value:
[ec2-user@ip-10-0-0-28 ompi]$ readelf -d ~/ompi5install/lib/libmpi.so | egrep -i 'rpath|runpath'
 0x000000000000000f (RPATH)              Library rpath: [/home/ec2-user/ompi5install/lib]

@wzamazon
Contributor

Isn't this the behavior we want?

If libmpi.so used RUNPATH, then it could happen that libmpi.so from ompi5 tries to use mca_mtl_ofi.so from Open MPI 4. Is that what we want?

I thought we want the executable (like osu_latency) to use RUNPATH, so that it will use the libmpi.so on LD_LIBRARY_PATH. It seems that mpicc is doing that.

@bwbarrett
Member

I agree with Wei - the intended behavior is that the libraries produced by Open MPI (libmpi.so, libopen-pal.so, etc.) are rpath'ed. I think we even wanted the MPI tools (mpirun, mpicc, etc.) to be rpath'ed. We have strong feelings about our library dependencies for our tools and libraries.

But the user's application (the thing compiled with mpicc) should be runpath'ed if runpath is supported. So reading your last update, I'm confused: if you look at the executable you're trying to run, does it report RUNPATH or RPATH?

@wckzhang
Contributor Author

The behavior I see is:

When I compile libmpi without RUNPATH and then change LD_LIBRARY_PATH, the path of libmpi shown by ldd <osu_binary> does not follow LD_LIBRARY_PATH. However, when I do compile libmpi with RUNPATH, it does change. This behavior seems tied to the way libmpi was built: when I change LD_LIBRARY_PATH for the libfabric libraries, the resolution changes regardless of how I compiled Open MPI.

Now to me, this doesn't seem like the behavior we want, because libmpi does not follow LD_LIBRARY_PATH, but maybe my understanding is flawed.

@wckzhang
Contributor Author

I'm going to write some simplified libraries and applications and try to nail down the exact behavior of RUNPATH/RPATH due to the confusion.

@wckzhang
Contributor Author

wckzhang commented Jan 26, 2023

I think the source of confusion is that RUNPATH was in fact honored and LD_LIBRARY_PATH was searched, but the SONAME being searched for did not match the one present on LD_LIBRARY_PATH (libmpi.so.40 vs. libmpi.so.80), so the loader fell through to the RUNPATH. Should be simple enough to clear up.

@wckzhang
Contributor Author

That indeed was the issue. If I compile two different installations of ompi 5.0.x with the default behavior, there is no issue switching between the two (readelf confirms that libmpi.so was built with RPATH). So the root cause is that the loader was searching for a different version of the libmpi library than the one on the search path.

As this is actually the behavior we want, and the wrapper compilers do give applications RUNPATH behavior, I think everything works and there isn't a problem.

I have also understood why this behavior occurs:

[ec2-user@ip-10-0-0-28 ompi]$ ~/tmpompi5install/bin/mpirun -np 1 ldd ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency | grep libmpi
	libmpi.so.80 => /home/ec2-user/ompi5install/lib/libmpi.so.80 (0x00007f1e39dac000)
[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/mpirun -np 1 ldd ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency | grep libmpi
	libmpi.so.80 => /home/ec2-user/ompi5install/lib/libmpi.so.80 (0x00007f7314258000)

This may be another bug, but it appears LD_LIBRARY_PATH is NOT set when you're running locally.
When I forced it to run on a remote node:

[ec2-user@ip-10-0-0-28 ompi]$ ~/tmpompi5install/bin/mpirun -n 1 --hostfile ~/hostfile ldd ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency | grep libmpi
	libmpi.so.80 => /home/ec2-user/tmpompi5install/lib/libmpi.so.80 (0x00007f7ac2a22000)
[ec2-user@ip-10-0-0-28 ompi]$ ~/ompi5install/bin/mpirun -n 1 --hostfile ~/hostfile ldd ~/osu5/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency | grep libmpi
	libmpi.so.80 => /home/ec2-user/ompi5install/lib/libmpi.so.80 (0x00007f674f183000)

It worked as expected, which was the main source of confusion.

I also confirmed that running the binary itself switched between the libraries.

To sum it up:

  1. 4.1.x compiles libmpi.so.40 and 5.0.x compiles libmpi.so.80. An application compiled against libmpi.so.80 will only search for libmpi.so.80+ as Open MPI is forward compatible, this is intended behavior.
  2. Open MPI compiles libmpi.so with RPATH behavior because it does not want its dependencies messed with, this is intended behavior.
  3. Open MPI compiles applications with its wrappers which includes RUNPATH behavior, this is intended.
  4. When launching mpirun, the prefix value is ignored when run locally - this may be unintended behavior.

@wckzhang
Contributor Author

I don't think that prefix value being ignored locally is unintended behavior either based on the man page:

--prefix <dir>: Prefix directory that will be used to set the PATH and LD_LIBRARY_PATH on the remote node before invoking Open MPI or the target process. See the [Remote Execution](https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man1/mpirun.1.html#man1-mpirun-remote-execution) section, below.

It specifically calls out "remote node", so I don't see any issues here (though we should fix the prefix docs, since there are some inaccuracies: "before invoking Open MPI" is incorrect; it's before invoking the application). I will resolve this issue.

@wzamazon
Contributor

4.1.x compiles libmpi.so.40 and 5.0.x compiles libmpi.so.80. An application compiled against libmpi.so.80 will only search for libmpi.so.80+ as Open MPI is forward compatible, this is intended behavior.

What about the following scenario:

application compiled against libmpi.so.40, libmpi.so.80 is in LD_LIBRARY_PATH.

Would the intention be for the application to use libmpi.so.80? Is that the case?

@wckzhang
Contributor Author

I tested this just now:

[ec2-user@ip-10-0-0-28 pt2pt]$ export LD_LIBRARY_PATH=/home/ec2-user/ompi5install/lib
[ec2-user@ip-10-0-0-28 pt2pt]$ ldd osu_bw | grep libmpi
	libmpi.so.40 => /opt/amazon/openmpi/lib64/libmpi.so.40 (0x00007f477a477000)
.
.
<Additional debug>
[ec2-user@ip-10-0-0-28 pt2pt]$ LD_DEBUG=libs ldd ./osu_latency > tmp.txt

<tmp.txt output>
     28417:     find library=libmpi.so.40 [0]; searching
     28417:      search path=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1:/home/ec2-user/ompi5install/lib/tls/haswell/x86_64:/home/ec2-user/ompi5install/lib/tls/haswell:/home/ec2-user/ompi5install/lib/tls/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/tls/avx512_1:/home/ec2-user/ompi5install/lib/tls/x86_64:/home/ec2-user/ompi5install/lib/tls:/home/ec2-user/ompi5install/lib/haswell/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/haswell/avx512_1:/home/ec2-user/ompi5install/lib/haswell/x86_64:/home/ec2-user/ompi5install/lib/haswell:/home/ec2-user/ompi5install/lib/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/avx512_1:/home/ec2-user/ompi5install/lib/x86_64:/home/ec2-user/ompi5install/lib            (LD_LIBRARY_PATH)
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/libmpi.so.40
     28417:      search path=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/x86_64:/opt/amazon/openmpi/lib64/tls/haswell/avx512_1:/opt/amazon/openmpi/lib64/tls/haswell/x86_64:/opt/amazon/openmpi/lib64/tls/haswell:/opt/amazon/openmpi/lib64/tls/avx512_1/x86_64:/opt/amazon/openmpi/lib64/tls/avx512_1:/opt/amazon/openmpi/lib64/tls/x86_64:/opt/amazon/openmpi/lib64/tls:/opt/amazon/openmpi/lib64/haswell/avx512_1/x86_64:/opt/amazon/openmpi/lib64/haswell/avx512_1:/opt/amazon/openmpi/lib64/haswell/x86_64:/opt/amazon/openmpi/lib64/haswell:/opt/amazon/openmpi/lib64/avx512_1/x86_64:/opt/amazon/openmpi/lib64/avx512_1:/opt/amazon/openmpi/lib64/x86_64:/opt/amazon/openmpi/lib64            (RUNPATH from file ./osu_latency)
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/libmpi.so.40

It does indeed search the LD_LIBRARY_PATH first, but it is ONLY looking for libmpi.so.40 and not libmpi.so.80.

@lrbison
Contributor

lrbison commented Jan 26, 2023

I take it this means that an application compiled against 4.1.x will not be able to run if there is only 5.0.x available on the system. Is that intended?

@rhc54
Contributor

rhc54 commented Jan 26, 2023

I don't think that prefix value being ignored locally is unintended behavior either based on the man page:

It depends on what you want that prefix to mean. It seems to be getting a little confused in all this discussion, so perhaps you need to split things?

There is the prefix that you want to pass to PRRTE and PMIx, which tells us where to find those executables and libraries. This prefix is ignored by prterun itself because there is no way to alter it once we have started execution; you have to have set PATH and LD_LIBRARY_PATH before starting the tool.

There is the prefix that tells us where to find the MPI libraries. This is something we can set prior to exec'ing the application process.

IF you are using the internal prterun, then those two will be the same thing. However, IF you are using an external prterun, then they will be very different.

You need a way to tell prterun and prun how to set the LD_LIBRARY_PATH for your MPI app. This "prefix" should not be ignored when run locally.
