mvapich2 test 31 (coarray_navier_stokes) failed #312

Closed
LaHaine opened this issue Jan 16, 2017 · 10 comments
@LaHaine
Contributor

LaHaine commented Jan 16, 2017

This is on CentOS 7.3. I managed to build and test OpenCoarrays successfully using gcc 6.1.0 from devtoolset-6 and the included mpich. I then switched to mvapich2-2.2, compiled with the same gcc, and now one test is failing:

[pax10] /batch/test/opencoarrays/prerequisites/builds/opencoarrays/1.8.3 > mpiexec -np 2 /batch/test/opencoarrays/prerequisites/builds/opencoarrays/1.8.3/src/tests/integration/pde_solvers/navier-stokes/coarray_navier_stokes
Assertion failed in file src/mpid/ch3/channels/mrail/src/rdma/ch3_win_fns.c at line 368: node_comm_ptr != NULL
Assertion failed in file src/mpid/ch3/channels/mrail/src/rdma/ch3_win_fns.c at line 368: node_comm_ptr != NULL
[cli_0]: aborting job:
internal ABORT - process 0
[cli_1]: aborting job:
internal ABORT - process 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16830 RUNNING AT pax10
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

This might also be a bug in mvapich2.

@rouson
Member

rouson commented Jan 16, 2017

We have decided to disable and delete this test due to a lack of portability. It relies on a binary FFT library that was written in assembly language. It's safe to ignore the failure. Thanks for reporting it. I'll remove the test from our test suite shortly.

@zbeekman
Collaborator

Hi @LaHaine,

Thanks for reporting this. I'm tempted to dismiss this test failure outright, because the NS tests use some pre-compiled (or maybe written-in-assembly?) FFT libraries that are usually the cause of all sorts of issues, as @rouson noted. (See, for example, #297.) However, given the specific nature of the error, it appears on first inspection to be unrelated to the FFT libraries. Further research shows that mvapich2-2.2 is based on MPICH 3.1.4, which tests fine.

This error appears to be an assertion in mvapich2 having to do with RDMA and MPI windows... I'm wondering whether it would be worthwhile for someone like @afanfa, who has deep expertise in both the library internals and MPI-3, to take a quick look at this.

Also, it's too bad that this doesn't generate a backtrace; one would be instrumental in localizing the source of this issue in the OpenCoarrays library, if it is indeed a legitimate bug there.

@zbeekman
Collaborator

@rouson: I am in the process of disabling the test. I want to keep a recipe to build it, but remove it from the "all" target so that it must be requested explicitly, and also remove it from the tests that run automatically.

@LaHaine
Contributor Author

LaHaine commented Jan 18, 2017

@zbeekman: That would be best. BTW it also crashes for me with openmpi 1.10.4:

mpirun -np 2 ./src/tests/integration/pde_solvers/navier-stokes/coarray_navier_stokes
nx = 128   ny = 128   nz = 128
viscos =   0.000      shear =   0.000
b11 b22 b33 b12 =   1.000  1.000  1.000  0.000
nsteps =      5       output_step =      1
----------------- running on    2 images -------------------
message size (MB) =   8.00
 OS provides random number generator

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          pax10 (PID 3288)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
#0  0x7F520989EC47
#1  0x7F520989DE40
#2  0x7F5208D9F24F
#3  0x7F5208EB953E
#0  0x7FC88A06FC47
#1  0x7FC88A06EE40
#2  0x7FC88957024F
#3  0x7FC88968A53E
#4  0x4072C8 in transpose_x_y.3766 at coarray-shear_coll.F90:?
#4  0x4072C8 in transpose_x_y.3766 at coarray-shear_coll.F90:?
#5  0x40A8A7 in solve_navier_stokes_
#6  0x40CDD9 in MAIN__ at coarray-shear_coll.F90:?
#5  0x40A8A7 in solve_navier_stokes_
#6  0x40CDD9 in MAIN__ at coarray-shear_coll.F90:?
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3289 on node pax10 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
[pax10.zeuthen.desy.de:03286] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[pax10.zeuthen.desy.de:03286] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@zbeekman
Collaborator

zbeekman commented Jan 18, 2017

@LaHaine The OpenMPI error is much more helpful! Would it be possible to rebuild OpenCoarrays with the following CMake flag: -DCMAKE_BUILD_TYPE=Debug (or -DCMAKE_BUILD_TYPE=RelWithDebInfo if the first doesn't trigger the error) and post the resulting backtrace with line numbers? That would be very helpful!

Also, if you know how to set the MCA parameter "orte_base_help_aggregate" to 0 to see all help/error messages, that would be helpful too!

I think I need to run all the tests through valgrind --leak-check=full and valgrind --tool=helgrind.

@LaHaine
Contributor Author

LaHaine commented Jan 18, 2017

Oh, right, I forgot the additional MCA parameter:

[pax10] /batch/test/opencoarrays/prerequisites/builds/opencoarrays/1.8.3 % /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/bin/mpirun --mca orte_base_help_aggregate 0 --mca plm \^tm --mca ras \^tm  -np 2 ./src/tests/integration/pde_solvers/navier-stokes/coarray_navier_stokes
nx = 128   ny = 128   nz = 128
viscos =   0.000      shear =   0.000
b11 b22 b33 b12 =   1.000  1.000  1.000  0.000
nsteps =      5       output_step =      1
----------------- running on    2 images -------------------
message size (MB) =   8.00
 OS provides random number generator

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          pax10 (PID 18005)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          pax10 (PID 18006)
  MPI_COMM_WORLD rank: 1

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
#0  0x7F6FC94FDC47
#1  0x7F6FC94FCE40
#2  0x7F6FC89FE24F
#0  0x7F63FAB79C47
#1  0x7F63FAB78E40
#2  0x7F63FA07A24F
#3  0x4034AC in __run_size_MOD_copy3 at coarray-shear_coll.F90:190
#4  0x40EA2F in transpose_x_y.3766 at coarray-shear_coll.F90:429 (discriminator 1)
#5  0x406424 in solve_navier_stokes_ at coarray-shear_coll.F90:349
#3  0x4034AC in __run_size_MOD_copy3 at coarray-shear_coll.F90:190
#4  0x40EA2F in transpose_x_y.3766 at coarray-shear_coll.F90:429 (discriminator 1)
#5  0x406424 in solve_navier_stokes_ at coarray-shear_coll.F90:349
#6  0x40F888 in MAIN__ at coarray-shear_coll.F90:253
#6  0x40F888 in MAIN__ at coarray-shear_coll.F90:253
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 18006 on node pax10 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

@zbeekman
Collaborator

zbeekman commented Feb 8, 2017

I have a strong hunch that this is due either to a) the gfortran runtime library having problems with the random number intrinsics and thread safety, or, less likely, b) calls like those to system_clock. I did some experiments with setting the random seed and harvesting PRNs, trying to get a different seed on each image, but I always got the same random numbers on all images... I'll try to localize this at some point if I can find the time. If you feel up to it, @LaHaine, you could try removing the random number generation (replacing it with some other signal, maybe a sine wave or something like that) and also removing the calls to system_clock() to see if either of those resolves this issue.
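The substitution I have in mind looks roughly like this; a minimal sketch only, since I don't have the test's source in front of me (the program and array names here are hypothetical, not taken from coarray-shear_coll.F90):

```fortran
! Hypothetical sketch: initialize the field with a deterministic sine
! wave instead of random_number(), so the gfortran RNG (and the
! system_clock()-based seeding) can be ruled out as the trigger.
program deterministic_init
  implicit none
  integer, parameter :: nx = 128
  real, parameter :: pi = 3.1415926535897932
  real :: u(nx)
  integer :: i
  ! Instead of:  call random_number(u)
  do i = 1, nx
    u(i) = sin(2.0 * pi * real(i - 1) / real(nx))  ! deterministic signal
  end do
  print *, 'u(1) =', u(1)
end program deterministic_init
```

If the test still crashes with a deterministic initial field like this, the RNG and system_clock() are off the hook and the problem is elsewhere (e.g. the transpose/communication code in the backtrace).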

@rouson
Member

rouson commented Feb 9, 2017

On an orthogonal note, @afanfa and I are experiencing the exact opposite problem of what @zbeekman reported earlier: we are getting different PRN sequences even when we pass the same seed in serial code. We observe this behavior with a gfortran 7.0.0 build dated 20170108 and with a more recent 7.0.1 build, but we get the expected behavior (the same sequence) with gfortran 6.3.0. Some problems seem to have been introduced into the gfortran random number generator last year. I'm attempting to isolate the issue and report the bug to the gfortran developers.

@rouson
Member

rouson commented Feb 9, 2017

We figured out our issue. I don't know whether it affects the case discussed in this thread, but the behavior of random_seed changed between gfortran 6.3.0 and 7.1.0. On a related note, Fortran 2015 introduces a new random_init() intrinsic subroutine that I expect will be very useful for both reproducibility and thread safety, so I recommend reading about it in the draft Fortran 2015 standard.
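For reference, that draft intrinsic was ultimately published in the Fortran 2018 standard; a minimal sketch of its use, assuming a compiler and coarray runtime that support it:

```fortran
! Sketch of the random_init intrinsic (Fortran 2018):
!   repeatable=.true.     -> the same sequence on every program run
!   image_distinct=.true. -> a different sequence on each image
program seed_per_image
  implicit none
  real :: r
  call random_init(repeatable=.true., image_distinct=.true.)
  call random_number(r)
  print *, 'image', this_image(), 'drew', r
end program seed_per_image
```

This needs to be compiled with coarray support enabled (e.g. gfortran with -fcoarray=single, or -fcoarray=lib against OpenCoarrays) for this_image() to resolve.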

@zbeekman
Collaborator

@LaHaine We're going to close this issue, since we're a bit perplexed by it and the test contains some odd assembly code. We've removed the test and think that the issue may lie in mvapich2 or in the compiler intrinsics discussed above. There is an MVAPICH mailing list that you could try emailing for more information about the failed assertion: http://mvapich.cse.ohio-state.edu/mailinglists/. If you hear anything insightful that indicates an error in OpenCoarrays, please let us know and we can reopen the issue. Right now there is no easy way to localize the problem, or even to reproduce and test it.

Thanks
