Applications compiled with 4.1.4 will not use libmpi from 5.0.x #11347

Closed
wckzhang opened this issue Jan 26, 2023 · 3 comments

@wckzhang
Contributor

This continues #11269, but the problem has diverged enough that I think it warrants a new issue.

Even though Open MPI 5.0.x has no ABI break relative to 4.1.x and is forward compatible, an application only looks for the specific libmpi version it was linked against. For 4.1.4 that is libmpi.so.40, while Open MPI 5.0.x currently installs libmpi.so.80. As a result, when the application searches for libmpi, it never selects the libmpi.so.80 that belongs to Open MPI 5.0.x.

[ec2-user@ip-10-0-0-28 pt2pt]$ export LD_LIBRARY_PATH=/home/ec2-user/ompi5install/lib
[ec2-user@ip-10-0-0-28 pt2pt]$ ls /home/ec2-user/ompi5install/lib/ | grep libmpi.so
libmpi.so
libmpi.so.80
libmpi.so.80.0.0
[ec2-user@ip-10-0-0-28 pt2pt]$ ldd osu_bw | grep libmpi
	libmpi.so.40 => /opt/amazon/openmpi/lib64/libmpi.so.40 (0x00007f477a477000)
.
.
<Additional debug>
[ec2-user@ip-10-0-0-28 pt2pt]$ LD_DEBUG=libs ldd ./osu_latency > tmp.txt

<tmp.txt output>
     28417:     find library=libmpi.so.40 [0]; searching
     28417:      search path=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1:/home/ec2-user/ompi5install/lib/tls/haswell/x86_64:/home/ec2-user/ompi5install/lib/tls/haswell:/home/ec2-user/ompi5install/lib/tls/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/tls/avx512_1:/home/ec2-user/ompi5install/lib/tls/x86_64:/home/ec2-user/ompi5install/lib/tls:/home/ec2-user/ompi5install/lib/haswell/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/haswell/avx512_1:/home/ec2-user/ompi5install/lib/haswell/x86_64:/home/ec2-user/ompi5install/lib/haswell:/home/ec2-user/ompi5install/lib/avx512_1/x86_64:/home/ec2-user/ompi5install/lib/avx512_1:/home/ec2-user/ompi5install/lib/x86_64:/home/ec2-user/ompi5install/lib            (LD_LIBRARY_PATH)
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/haswell/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/tls/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/haswell/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/avx512_1/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/x86_64/libmpi.so.40
     28417:       trying file=/home/ec2-user/ompi5install/lib/libmpi.so.40
     28417:      search path=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/x86_64:/opt/amazon/openmpi/lib64/tls/haswell/avx512_1:/opt/amazon/openmpi/lib64/tls/haswell/x86_64:/opt/amazon/openmpi/lib64/tls/haswell:/opt/amazon/openmpi/lib64/tls/avx512_1/x86_64:/opt/amazon/openmpi/lib64/tls/avx512_1:/opt/amazon/openmpi/lib64/tls/x86_64:/opt/amazon/openmpi/lib64/tls:/opt/amazon/openmpi/lib64/haswell/avx512_1/x86_64:/opt/amazon/openmpi/lib64/haswell/avx512_1:/opt/amazon/openmpi/lib64/haswell/x86_64:/opt/amazon/openmpi/lib64/haswell:/opt/amazon/openmpi/lib64/avx512_1/x86_64:/opt/amazon/openmpi/lib64/avx512_1:/opt/amazon/openmpi/lib64/x86_64:/opt/amazon/openmpi/lib64            (RUNPATH from file ./osu_latency)
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/haswell/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/tls/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/haswell/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/avx512_1/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/avx512_1/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/x86_64/libmpi.so.40
     28417:       trying file=/opt/amazon/openmpi/lib64/libmpi.so.40
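
The lookup in the log above can be approximated with a short Python sketch. It is a simplified model, not the dynamic linker's actual implementation: it ignores the hardware-capability subdirectories (tls/haswell/...) and the system cache, and the names `resolve_needed` and `demo` are hypothetical. It only illustrates that the search matches the exact file name libmpi.so.40, so a directory containing only libmpi.so.80 is skipped even though it comes first in LD_LIBRARY_PATH.

```python
import os
import tempfile


def resolve_needed(needed, ld_library_path, runpath):
    """Return the first file named exactly `needed`, or None.

    Directories are tried in the order the log shows: LD_LIBRARY_PATH
    entries first, then the executable's RUNPATH entries.
    """
    for directory in list(ld_library_path) + list(runpath):
        candidate = os.path.join(directory, needed)
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Recreate the two library directories in a scratch location."""
    root = tempfile.mkdtemp()
    new_lib = os.path.join(root, "ompi5install", "lib")
    old_lib = os.path.join(root, "openmpi", "lib64")
    os.makedirs(new_lib)
    os.makedirs(old_lib)
    # Empty placeholder files stand in for the real shared objects.
    for path in (os.path.join(new_lib, "libmpi.so.80"),
                 os.path.join(old_lib, "libmpi.so.40")):
        open(path, "w").close()
    # The application was linked against 4.1.4, so it asks for libmpi.so.40.
    found = resolve_needed("libmpi.so.40", [new_lib], [old_lib])
    return os.path.relpath(found, root)


if __name__ == "__main__":
    print(demo())
```

In this model the 5.0.x directory is searched first but contributes nothing, and the match comes from the RUNPATH directory, which mirrors the ldd result shown earlier.
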
@wckzhang
Contributor Author

From what I've read, refusing to select a library with a different major version is correct linking behavior (Source: https://dcreager.net/shared-library-versions/):

When you compile some other code that depends on this library, the build chain will find the shared library file using the non-versioned libfoo.so filename, extract the library's SONAME from that file, and record that as the dependency. So, at runtime, when /usr/bin/foo is loaded, the dynamic linker will see these NEEDED entries, and look for a library file called libfoo.so.1.

That matches what we see with libmpi.so.40 and libmpi.so.80. What I am not convinced of is that we should have bumped the major version at all. What I am reading indicates that the major version should be bumped only when a backward-incompatible change has been introduced. If we have only added new functions without breaking old ones, then bumping the minor version seems more appropriate (Source: https://www.baeldung.com/linux/shared-object-filenames):

Y represents the minor version.  A backward compatible change increments the minor version. For example, adding an isolated feature like a new function. Also note, if a Z number is not present then this may also represent bug fixes.

@bwbarrett
Member

The Libtool versions in the VERSION file for 5.0.x are set incorrectly. The current was set to 80 (which is reasonable, as the current for v4.1.x is 70 and we usually bump by 10 between minor release branches). But the revision and age were both set to zero, which declares that there is no backwards compatibility.
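
As a rough illustration of why age matters, the sketch below shows how Libtool's current:revision:age triple maps to file names on Linux, where the SONAME major number is current minus age. The function name `linux_library_names` is hypothetical, and other platforms use different naming schemes.

```python
def linux_library_names(name, current, revision, age):
    """Map a Libtool current:revision:age triple to Linux file names.

    Returns (soname, real_name). The soname is what gets recorded as a
    NEEDED entry in applications linked against the library.
    """
    major = current - age
    soname = f"lib{name}.so.{major}"
    real_name = f"{soname}.{age}.{revision}"
    return soname, real_name


if __name__ == "__main__":
    # The 5.0.x values described above: current=80, revision=0, age=0.
    print(linux_library_names("mpi", 80, 0, 0))
    # -> ('libmpi.so.80', 'libmpi.so.80.0.0')
```

With age set to zero the major number equals current, which is consistent with the libmpi.so.80 and libmpi.so.80.0.0 files in the directory listing at the top of this issue.
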

Zeroing the revision is correct, since we've added interfaces since 4.1.x. But because we very carefully did not change or remove any interfaces, age should have been set to v4.1.x's age + 10 (because we incremented current by 10).
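
The update rule described here follows the general Libtool convention and can be sketched as below. The function names are hypothetical. The v4.1.x age of 30 is not stated in this thread; it is inferred from current 70 and the libmpi.so.40 SONAME, assuming the Linux convention that the major number is current minus age. The starting revision is a placeholder, since it is reset to zero either way.

```python
def after_adding_interfaces(current, revision, age, step=1):
    """Interfaces were only added: bump current and age, reset revision."""
    return current + step, 0, age + step


def after_breaking_interfaces(current, revision, age, step=1):
    """Interfaces were removed or changed: bump current, reset the rest."""
    return current + step, 0, 0


def soname_major(current, age):
    """On Linux, the SONAME major number is current minus age."""
    return current - age


if __name__ == "__main__":
    # Assumed v4.1.x starting point: current=70, age=30 (inferred),
    # revision=0 as a placeholder.
    old = (70, 0, 30)

    compatible = after_adding_interfaces(*old, step=10)
    print(compatible, soname_major(compatible[0], compatible[2]))
    # -> (80, 0, 40) 40

    breaking = after_breaking_interfaces(*old, step=10)
    print(breaking, soname_major(breaking[0], breaking[2]))
    # -> (80, 0, 0) 80
```

Under these assumptions, bumping age together with current keeps the major number at 40, so an application that records libmpi.so.40 would still find the newer library, whereas resetting age to zero produces libmpi.so.80.
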

@wckzhang
Contributor Author

This is now resolved, since #11365 has been merged.
