
bug/jenkins: F08 failures on osx #4374

Closed
hzhou opened this issue Mar 6, 2020 · 5 comments · Fixed by #5682

Comments

hzhou (Contributor) commented Mar 6, 2020

Test Name | Duration | Age
summary_junit_xml. - ./f08/pt2pt/statusesf08 1 | 72 ms | 2
summary_junit_xml. - ./f08/coll/vw_inplacef08 4 | 0.1 sec | 2
summary_junit_xml. - ./f08/coll/red_scat_blockf08 4 | 89 ms | 2
summary_junit_xml. - ./f08/coll/nonblocking_inpf08 4 | 83 ms | 2
summary_junit_xml. - ./f08/datatype/structf 2 | 0.11 sec | 2
summary_junit_xml. - ./f08/rma/aintf08 2 | 73 ms | 2
summary_junit_xml. - ./f08/topo/dgraph_unwgtf90 4 | 90 ms
raffenet changed the title from "bug/jenkins: F08 failures for ch3/tcp intel build on osx" to "bug/jenkins: F08 failures for intel build on osx" on Nov 5, 2020
hzhou (Contributor, Author) commented Apr 4, 2021

The gcc-10 build on osx fails only dgraph_unwgtf90:

not ok  - ./f08/topo/dgraph_unwgtf90 4
  ---
  Directory: ./f08/topo
  File: dgraph_unwgtf90
  Num-procs: 4
  Timeout: 180
  Date: "Sat Apr  3 20:22:30 2021"
  ...
## Test output (expected 'No Errors'):
## Fatal error in internal_Dist_graph_create: Invalid argument, error stack:
## internal_Dist_graph_create(122): MPI_Dist_graph_create(MPI_COMM_WORLD, n=1, sources=0x7ffee2d4f308, degrees=0x7ffee2d4f32c, destinations=0x7ffee2d4f320, weights=0x0, MPI_INFO_NULL, reorder=1, comm_dist_graph=0x7ffee2d4f31c) failed
## internal_Dist_graph_create(93).: Null pointer in parameter weights
## Fatal error in internal_Dist_graph_create: Invalid argument, error stack:
## internal_Dist_graph_create(122): MPI_Dist_graph_create(MPI_COMM_WORLD, n=1, sources=0x7ffeebded308, degrees=0x7ffeebded32c, destinations=0x7ffeebded320, weights=0x0, MPI_INFO_NULL, reorder=1, comm_dist_graph=0x7ffeebded31c) failed
## internal_Dist_graph_create(93).: Null pointer in parameter weights
## Fatal error in internal_Dist_graph_create: Invalid argument, error stack:
## internal_Dist_graph_create(122): MPI_Dist_graph_create(MPI_COMM_WORLD, n=1, sources=0x7ffeed96b308, degrees=0x7ffeed96b32c, destinations=0x7ffeed96b320, weights=0x0, MPI_INFO_NULL, reorder=1, comm_dist_graph=0x7ffeed96b31c) failed
## internal_Dist_graph_create(93).: Null pointer in parameter weights
## Fatal error in internal_Dist_graph_create: Invalid argument, error stack:
## internal_Dist_graph_create(122): MPI_Dist_graph_create(MPI_COMM_WORLD, n=1, sources=0x7ffee14eb308, degrees=0x7ffee14eb32c, destinations=0x7ffee14eb320, weights=0x0, MPI_INFO_NULL, reorder=1, comm_dist_graph=0x7ffee14eb31c) failed
## internal_Dist_graph_create(93).: Null pointer in parameter weights

Looks like an issue with MPI_UNWEIGHTED or MPI_WEIGHTS_EMPTY.
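
For reference, a minimal C sketch of the same call pattern (the real test, dgraph_unwgtf90, is Fortran; the ring topology here is just illustrative) looks like this. Passing MPI_UNWEIGHTED is valid, and the weights=0x0 in the error stack above means a null pointer reached the C layer instead of that sentinel:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm dgraph;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes one edge to the next rank (n=1, as in the
     * error stack) and passes the MPI_UNWEIGHTED sentinel for weights. */
    int sources[1]      = { rank };
    int degrees[1]      = { 1 };
    int destinations[1] = { (rank + 1) % size };

    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations,
                          MPI_UNWEIGHTED, MPI_INFO_NULL, 1, &dgraph);

    MPI_Comm_free(&dgraph);
    MPI_Finalize();
    return 0;
}
```

Only the f08 tests fail in these runs, so the MPI_UNWEIGHTED sentinel is apparently being lost somewhere on the way down through the F08 binding.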

hzhou changed the title from "bug/jenkins: F08 failures for intel build on osx" to "bug/jenkins: F08 failures on osx" on Nov 17, 2021
hzhou (Contributor, Author) commented Nov 17, 2021

These are external global variable linkage issues. libmpifort.dylib and libpmpi.dylib each contain a definition of e.g. MPIR_C_MPI_UNWEIGHTED, and the duplicate is not resolved (or is resolved incorrectly) during dynamic linking.

On linux:

~/work/pull_requests/mpich-main$ nm _inst/lib/libmpi.so |grep WEIGHT
00000000028110d0 B MPIR_C_MPI_UNWEIGHTED
0000000002811080 B MPIR_C_MPI_WEIGHTS_EMPTY
00000000027ac430 D MPI_UNWEIGHTED
00000000027ac428 D MPI_WEIGHTS_EMPTY
~/work/pull_requests/mpich-main$ nm _inst/lib/libmpifort.so |grep WEIGHT
00000000000af940 B MPIR_C_MPI_UNWEIGHTED
00000000000af8a0 B MPIR_C_MPI_WEIGHTS_EMPTY
00000000000af718 B MPIR_F_MPI_UNWEIGHTED
00000000000af6f8 B MPIR_F_MPI_WEIGHTS_EMPTY
0000000000046f20 t MPIR_IS_UNWEIGHTED
                 U MPI_UNWEIGHTED
                 U MPI_WEIGHTS_EMPTY

On osx:

[~/hzhou/mpich-main] nm _inst/lib/libpmpi.0.dylib |grep WEIGHT
00000000027ce460 S _MPIR_C_MPI_UNWEIGHTED
00000000027ce458 S _MPIR_C_MPI_WEIGHTS_EMPTY
00000000027ce408 S _MPIR_F_MPI_UNWEIGHTED
00000000027ce3e8 S _MPIR_F_MPI_WEIGHTS_EMPTY
00000000027c5bf0 S _MPI_UNWEIGHTED
00000000027c5be8 S _MPI_WEIGHTS_EMPTY
[~/hzhou/mpich-main] nm _inst/lib/libmpifort.0.dylib |grep WEIGHT
000000000008b368 S _MPIR_C_MPI_UNWEIGHTED
000000000008b370 S _MPIR_C_MPI_WEIGHTS_EMPTY
000000000008b208 S _MPIR_F_MPI_UNWEIGHTED
000000000008b1e8 S _MPIR_F_MPI_WEIGHTS_EMPTY
                 U _MPI_UNWEIGHTED
                 U _MPI_WEIGHTS_EMPTY
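
To make the failure mode concrete, here is a self-contained sketch of what two unmerged copies of such a global do. All names are hypothetical stand-ins (sentinel_copy_* plays the role of MPIR_C_MPI_UNWEIGHTED); in the real build the two copies live in libpmpi.dylib and libmpifort.dylib rather than in a single file:

```c
#include <stdio.h>

/* Stand-in for the MPI_UNWEIGHTED address constant. */
static int sentinel_storage;
#define FAKE_MPI_UNWEIGHTED (&sentinel_storage)

/* "libpmpi" copy of the global: written during initialization. */
int *sentinel_copy_in_libpmpi = NULL;

/* "libmpifort" copy: if the dynamic linker keeps this as a separate
 * definition instead of resolving both references to a single copy,
 * it never gets initialized. */
int *sentinel_copy_in_libmpifort = NULL;

static void fake_mpi_init(void)
{
    /* Only the C library's copy gets pointed at the sentinel. */
    sentinel_copy_in_libpmpi = FAKE_MPI_UNWEIGHTED;
}

static int *weights_seen_by_c_layer(void)
{
    /* The Fortran binding reads its own copy, so the C layer receives
     * NULL instead of the sentinel -- hence "Null pointer in parameter
     * weights" from MPI_Dist_graph_create. */
    return sentinel_copy_in_libmpifort;
}

int main(void)
{
    fake_mpi_init();
    printf("weights handed to C: %p (expected %p)\n",
           (void *) weights_seen_by_c_layer(), (void *) FAKE_MPI_UNWEIGHTED);
    return 0;
}
```

Presumably ELF symbol interposition on Linux resolves both libraries' references to a single copy, while the macOS two-level namespace lets each dylib bind to its own definition, which would be consistent with the link-order sensitivity described below.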

hzhou (Contributor, Author) commented Nov 17, 2021

Currently:

mpifort -show
gfortran-10 ... -L.../_inst/lib -lmpifort -lmpi -lpmpi

If we swap the order to -lmpi -lpmpi -lmpifort, the test passes.

This whole Fortran/C interoperability scheme is a fragile hack! For our purposes we really wish Fortran never defined the symbol in the first place, but of course other usage patterns prefer otherwise. What Fortran needs is an external declaration just like C's. The only way, and the simple way, to get interoperability is to implement the C concepts in Fortran; anything else is just hacking around them. On that note, what if Fortran allowed type casting? A whole chunk of this complexity could be avoided.

hzhou (Contributor, Author) commented Dec 16, 2021

> If we swap the order to -lmpi -lpmpi -lmpifort, the test passes.

@raffenet The correct link order is to link the higher-layer library before the lower one. Since libmpifort.so sits on top of libmpi.so, -lmpifort -lmpi -lpmpi is the correct order.

The reversed order is really a hack and is not guaranteed to work.

raffenet (Contributor) commented

> If we swap the order to -lmpi -lpmpi -lmpifort, the test passes.
>
> @raffenet The correct link order is to link the higher-layer library before the lower one. Since libmpifort.so sits on top of libmpi.so, -lmpifort -lmpi -lpmpi is the correct order.
>
> The reversed order is really a hack and is not guaranteed to work.

OK, got it. I agree the reverse order is not something we should use.
