PGI F08 symbol Issue on POWER #3075

Closed
jjhursey opened this issue Mar 1, 2017 · 14 comments

@jjhursey
Member

jjhursey commented Mar 1, 2017

MTT found this issue with Open MPI using the PGI compiler (16.10 on ppc64le) and the F08 module.

As you can see below the use mpi module is fine, but the use mpi_f08 module is not.

shell$  mpirun -np 2 ./ring_usempi
Process 0 sending 10 to  1 tag 201 ( 2 processes in ring)
Process 0 sent to  1
Process 0 decremented value:  9
Process 0 decremented value:  8
Process 0 decremented value:  7
Process 0 decremented value:  6
Process 0 decremented value:  5
Process 0 decremented value:  4
Process 0 decremented value:  3
Process 0 decremented value:  2
Process 0 decremented value:  1
Process 0 decremented value:  0
Process  0 exiting
Process  1 exiting
shell$ mpirun -np 2 ./ring_usempif08
Process 0 sending 10 to  1 tag 201 ( 2 processes in ring)
[mpi10:137972] *** An error occurred in MPI_Recv
[mpi10:137972] *** reported by process [4003397633,1]
[mpi10:137972] *** on communicator MPI_COMM_WORLD
[mpi10:137972] *** MPI_ERR_TYPE: invalid datatype
[mpi10:137972] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mpi10:137972] ***    and potentially your MPI job)
[mpi10:137959] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[mpi10:137959] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Attaching a debugger shows that the datatype being passed into MPI_Recv is 0 instead of 7. The same is true for MPI_Send, which also triggers an error; MPI_Recv just won the race to report it here.

The ring_usempif08 program has a weak symbol for the datatype:

shell$ nm ring_usempif08 | grep _integer
00000000100200b8 V ompi_f08_mpi_integer

Which if not resolved probably defaults to 0 causing the issue.
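For readers less familiar with `nm` output, here is a small sketch that decodes the type letter in a line like the one above. The helper name `describe_nm_line` is hypothetical, and the letter meanings are paraphrased from the GNU `nm` documentation (`V`/`v` mark a weak object). It only labels the symbol type; it does not show whether the linker resolved the symbol correctly.

```shell
#!/bin/sh
# Hypothetical helper: describe the symbol-type letter in one line of `nm` output.
describe_nm_line() {
    # Intentional word splitting: "<address> <type> <name>", or
    # "<type> <name>" for symbols that have no address column.
    set -- $1
    if [ "$#" -eq 3 ]; then
        shift    # drop the address column
    fi
    sym_type=$1
    sym_name=$2
    case "$sym_type" in
        V|v) desc="weak object" ;;
        W|w) desc="weak symbol" ;;
        U)   desc="undefined" ;;
        D|d) desc="initialized data" ;;
        B|b) desc="uninitialized data (BSS)" ;;
        T|t) desc="text (code)" ;;
        *)   desc="other (see the nm manual)" ;;
    esac
    printf '%s: %s\n' "$sym_name" "$desc"
}

# The line reported above for the failing executable.
describe_nm_line "00000000100200b8 V ompi_f08_mpi_integer"
```

On a real build the input would come from something like `nm ring_usempif08 | grep _integer`, passed one line at a time.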

The issue is in master and the release branches. The Fortran marshaling code is complex so I'm filing a bug to get some more visibility on the problem.

@jjhursey jjhursey added the bug label Mar 1, 2017
@jjhursey jjhursey self-assigned this Mar 1, 2017
@jsquyres
Member

jsquyres commented Mar 1, 2017

@jjhursey and I chatted about this issue this afternoon; he got a bit of an education on Fortran. 😄

We narrowed the issue down a bit (the _integer symbol is a red herring). Short version:

  • MPI_INTEGER is a BIND(C) fortran declaration with name ompi_f08_mpi_integer.
  • This symbol -- ompi_f08_mpi_integer -- is instantiated properly in ompi/mpi/use-mpi-f08/constants.c, and properly exists in the created libmpi_usempif08.so.
  • The compiler is creating a weak symbol in main for MPI_INTEGER of the name ompi_f08_mpi_integer.
    • For those of you sports fans following along at home: yes, that's the same name as the actual symbol created by constants.c.
  • The compiler (or linker?) somehow fails to join these two, resulting in some kind of bogus instance of ompi_f08_mpi_integer from main()'s point of view.
  • The wrong value for the datatype (and communicator!) therefore gets passed in to MPI_SEND and MPI_RECV.
  • Chaos ensues. 🆘

@jjhursey's next step is to talk to the PGI compiler team.

@jsquyres jsquyres changed the title PGI F08 Symbol Issue PGI F08 symbol Issue on POWER Mar 1, 2017
@jjhursey
Member Author

jjhursey commented Mar 2, 2017

I was able to create a small reproducer for this (possible) compiler bug. It works with GNU and XL compilers, but fails with PGI. I'll work on getting that reproducer to the PGI group and see what they have to say.

Thanks @jsquyres for the Fortran (and ompi fortran build) refresher. Makes me love C even more 😄

@jjhursey
Member Author

jjhursey commented Mar 2, 2017

Here is a link to the reproducer:

shell$ make gnu
rm -f *.o *.mod *.so app
gcc  -g -c constants.c -MD -fpic -DPIC -o constants.o
gfortran  -g -c my_lib.f90 -o my_lib.o
gfortran  -g -shared -fpic  -o libmy_lib.so constants.o my_lib.o
gfortran  -g app.f90 -o app -L. -lmy_lib
shell$ ./app 
 MY_INTEGER        is            7
 MY_INTEGER%MY_VAL is            7
shell$ make xl
rm -f *.o *.mod *.so app
xlc  -g -c constants.c -MD -fpic -DPIC -o constants.o
/opt/ibm/xlC/13.1.2/bin/.orig/xlc: warning: 1501-269 fpic is not supported on this Operating System platform.  Option fpic will be ignored.
xlf  -g -c my_lib.f90 -o my_lib.o
** my_lib   === End of Compilation 1 ===
1501-510  Compilation successful for file my_lib.f90.
xlf  -g -G  -o libmy_lib.so constants.o my_lib.o
xlf  -g app.f90 -o app -L. -lmy_lib
** app   === End of Compilation 1 ===
1501-510  Compilation successful for file app.f90.
shell$ ./app 
 MY_INTEGER        is  7
 MY_INTEGER%MY_VAL is  7
shell$ make pgi
rm -f *.o *.mod *.so app
pgcc  -g -c constants.c -MD -fpic -DPIC -o constants.o
pgfortran  -g -c my_lib.f90 -o my_lib.o
pgfortran  -g -shared -fpic  -o libmy_lib.so constants.o my_lib.o
pgfortran  -g app.f90 -o app -L. -lmy_lib
shell$ ./app 
 MY_INTEGER        is             0
 MY_INTEGER%MY_VAL is             0
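To compare compilers without eyeballing the output, the check can be scripted. This is only a sketch: the function name `check_reproducer_output` is hypothetical, and the expected value 7 is taken from the GNU and XL runs above. It reads the program's output on stdin and reports whether every `MY_INTEGER` line ends in the expected value.

```shell
#!/bin/sh
# Hypothetical helper: verify that each "MY_INTEGER... is <n>" line on stdin
# ends with the expected value passed as the first argument.
check_reproducer_output() {
    expected=$1
    awk -v want="$expected" '
        /MY_INTEGER/ { seen++; if ($NF != want) bad++ }
        END {
            if (seen == 0) {
                print "no MY_INTEGER lines found"; exit 2
            } else if (bad > 0) {
                print "FAIL: " bad " of " seen " lines differ from " want; exit 1
            } else {
                print "PASS: " seen " lines match " want
            }
        }'
}

# Demo with the output of the GNU build shown above.
printf ' MY_INTEGER        is            7\n MY_INTEGER%%MY_VAL is            7\n' \
    | check_reproducer_output 7
```

With the reproducer built, the equivalent call would be `./app | check_reproducer_output 7`; the function returns a non-zero status when the values differ, as they do in the PGI run.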

@jjhursey
Member Author

jjhursey commented Mar 7, 2017

PGI confirmed this is an issue on both of their linuxpower and linux86-64 builds. They filed the issue as TPR 23919.

@jsquyres
Member

jsquyres commented Mar 7, 2017

@jjhursey Your Fortran mastery achievement (level 2) has been unlocked.

image

@gpaulsen
Member

That image is epic.

@hppritcha hppritcha added this to the v2.0.4 milestone Jun 1, 2017
@hppritcha
Member

Should we just update the compiler section of the NEWS and close this issue? The picture is cute, but COBOL would probably be more relevant.

@jjhursey
Member Author

I think we should add a note to the NEWS or README, whichever you think is most appropriate.

I'd like to keep this issue open until we have a final solution from the providers. There is nothing OMPI needs to do, but this ticket will give the providers a point of reference when we follow up.

I received this email from the PGI folks regarding the problem.

We have determined that the problem you have reported is a result of
a bad dynamic link. The error will go away if you use an older version of binutils (ld),
like 2.25. It explains why it works on some linux systems here, and not on others.

We reported the issue to gnu binutils people

I've asked them for a reference to that discussion. I'll post a link if there is one.

@jjhursey
Member Author

Here is the binutils Bugzilla link:

It looks like it'll be fixed in the 2.28 release.
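Since the behavior appears to depend on the linker version, it may help to record that version alongside any reproducer results. The sketch below uses a hypothetical helper, `ld_major_minor`, that pulls the major.minor number from the first line of `ld --version` output. It deliberately does not decide whether a version is affected, because this thread does not establish the exact range.

```shell
#!/bin/sh
# Hypothetical helper: print the major.minor version found on the first line
# of `ld --version` output, read from stdin.
ld_major_minor() {
    sed -n '1s/.*[^0-9.]\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Demo with a banner line in the format GNU ld prints on a RHEL-style system.
printf 'GNU ld version 2.30-54.el7\n' | ld_major_minor
```

On a real system the call would be `ld --version | ld_major_minor`.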

@jjhursey
Member Author

See #2606 (comment) for some language for the README

jjhursey added a commit to jjhursey/ompi that referenced this issue Jul 12, 2017
 * Related to Issue open-mpi#2606 and Issue open-mpi#3075
 * The core problem in those two issues is related to a regression in
   ld upstream. Add a note in the README about this issue.

Signed-off-by: Joshua Hursey <[email protected]>
jjhursey added a commit to jjhursey/ompi that referenced this issue Jul 13, 2017
 * Related to Issue open-mpi#2606 and Issue open-mpi#3075
 * The core problem in those two issues is related to a regression in
   ld upstream. Add a note in the README about this issue.

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit 1c6a253)
Signed-off-by: Joshua Hursey <[email protected]>
@goduck777

goduck777 commented Nov 23, 2019

I encountered this issue on my cluster (OpenPOWER) using pgfortran. The problem seems to persist with newer versions of pgfortran and ld.

Here is the result using the test case.

pgi-bug $ld --version
GNU ld version 2.30-54.el7
Copyright (C) 2018 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

pgi-bug $pgfortran --version
pgfortran 19.9-0 linuxpower target on Linuxpower
PGI Compilers and Tools
Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.

pgi-bug $make pgi
rm -f *.o *.mod *.so app
pgcc  -g -c constants.c -MD -fpic -DPIC -o constants.o
pgfortran  -g -c my_lib.f90 -o my_lib.o
pgfortran  -g -shared -fpic  -o libmy_lib.so constants.o my_lib.o
pgfortran  -g app.f90 -o app -L. -lmy_lib

pgi-bug $./app
 MY_INTEGER        is             0
 MY_INTEGER%MY_VAL is             0

pgi-bug $gfortran --version
GNU Fortran (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

pgi-bug $make gnu
rm -f *.o *.mod *.so app
gcc  -g -c constants.c -MD -fpic -DPIC -o constants.o
gfortran  -g -c my_lib.f90 -o my_lib.o
gfortran  -g -shared -fpic  -o libmy_lib.so constants.o my_lib.o
gfortran  -g app.f90 -o app -L. -lmy_lib

pgi-bug $./app
 MY_INTEGER        is            7
 MY_INTEGER%MY_VAL is            7

@ggouaillardet
Contributor

This is very likely a known issue, fixed in master and soon to be merged into the v4.0.x release branch. Which version of Open MPI are you running? Can you give the master branch a try?

@goduck777

I have tested the code in the master branch and can confirm that the error is fixed there.

@jjhursey
Member Author

I can also confirm that this is fixed.

shell$ pgcc  --version

pgcc (aka nvc) 20.9-0 linuxpower target on Linuxpower 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
shell$ pgfortran --version

pgfortran (aka nvfortran) 20.9-0 linuxpower target on Linuxpower 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
shell$ ./configure  --prefix=/tmp/install/ompi-master-pgi CC=pgcc CXX=pgc++ FC=pgfortran --enable-mpi-fortran
make -j 20 > /dev/null
make -j 20 install > /dev/null
shell$  cd examples
shell$ make ring_usempif08 ring_usempi
mpifort -g  ring_usempif08.f90  -o ring_usempif08
mpifort -g  ring_usempi.f90  -o ring_usempi
shell$ mpirun -np 2 ./ring_usempi
Process 0 sending 10 to  1 tag 201 ( 2 processes in ring)
Process 0 sent to  1
Process 0 decremented value:  9
Process 0 decremented value:  8
Process 0 decremented value:  7
Process 0 decremented value:  6
Process 0 decremented value:  5
Process 0 decremented value:  4
Process 0 decremented value:  3
Process 0 decremented value:  2
Process 0 decremented value:  1
Process 0 decremented value:  0
Process  0 exiting
Process  1 exiting
shell$ mpirun -np 2 ./ring_usempif08
Process 0 sending 10 to  1 tag 201 ( 2 processes in ring)
Process 0 sent to  1
Process 0 decremented value:  9
Process 0 decremented value:  8
Process 0 decremented value:  7
Process 0 decremented value:  6
Process 0 decremented value:  5
Process 0 decremented value:  4
Process 0 decremented value:  3
Process 0 decremented value:  2
Process 0 decremented value:  1
Process 0 decremented value:  0
Process  0 exiting
Process  1 exiting
