
some python tests fail silently on some architectures with OpenMPI 2.X #3333

Closed
reinaual opened this issue Nov 21, 2019 · 2 comments · Fixed by #3335

reinaual (Contributor) commented Nov 21, 2019

Tests that don't have MAX_NUM_PROC specified are reported as passed on my local machine, although they were never executed. This happens, for example, with make -j8 check_python ARGS='-R experimental_decorator -V', which results in:

...
    Start 32: experimental_decorator

32: Test command: /usr/bin/mpiexec "-n" "13" "/home/areinauer/Documents/hiwi/espresso/build/pypresso" "/home/areinauer/Documents/hiwi/espresso/build/testsuite/python/experimental_decorator.py"
32: Test timeout computed to be: 300
32: --------------------------------------------------------------------------
32: A request was made to bind to that would result in binding more
32: processes than cpus on a resource:
32: 
32:    Bind to:     NUMA
32:    Node:        ibex
32:    #processes:  6
32:    #cpus:       6
32: 
32: You can override this protection by adding the "overload-allowed"
32: option to your binding directive.
32: --------------------------------------------------------------------------
1/1 Test #32: experimental_decorator ...........   Passed    0.11 sec

The following tests passed:
	experimental_decorator

100% tests passed, 0 tests failed out of 1
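The "Passed" verdict despite the binding failure comes down to the exit status: ctest judges a test solely by the runner's return code. A minimal Python sketch of the mechanism (the child command here is a stand-in for the mpiexec run, not the actual test):

```python
import subprocess
import sys

# Stand-in for the behaviour in the log above: the child prints a warning
# but still exits with status 0, so a harness that only inspects the exit
# code reports the test as passed even though the test body never ran.
result = subprocess.run(
    [sys.executable, "-c", "print('binding warning'); raise SystemExit(0)"],
    capture_output=True,
    text=True,
)
print(result.returncode)  # 0: ctest would count this as "Passed"
```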
jngrad (Member) commented Nov 21, 2019

Notes on @reinaual's environment:

  • the CPU is hyperthreaded to expose 24 logical cores, mpiexec -n 12 doesn't produce the warning
  • with mpiexec -n 12 an espresso runtime error is raised
  • with mpiexec -n 4 a different espresso runtime error is raised
  • with mpiexec -n 1 and without MPI the test runs fine
  • uses mpiexec (OpenRTE) 2.1.1

The binding warning probably causes mpiexec to exit with error code 0, or produces a signal that isn't caught by ctest, which means those tests fail silently. This can be resolved by simply capping $NP to 4.

By default, OpenMPI should allow spawning more threads than cores. I cannot reproduce the warning on my machine. We need to find out which mpirun flag can be used to detect whether this binding safety is enabled, so that we can introduce a guard to prevent ctest from running if there are not enough cores for the value of $NP. @mkuron any idea? I've looked for the documentation of --bind-to core:overload-allowed but couldn't find anything helpful.
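A guard of the kind proposed here would compare the requested rank count against the cores the machine exposes. The helper below is a hypothetical sketch of that check, not ESPResSo's actual build logic; the numbers are taken from the log above:

```python
def should_skip(requested_ranks, available_cores):
    """Hypothetical guard: skip (rather than oversubscribe) when the
    machine has fewer cores than the MPI ranks the test asks for."""
    return available_cores < requested_ranks

# The situation from the log: 13 ranks requested, but the binding policy
# only found 6 cpus on the resource.
print(should_skip(13, 6))  # True  -> skip instead of passing silently
print(should_skip(4, 6))   # False -> safe to run, matching the $NP=4 cap
```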

mkuron (Member) commented Nov 21, 2019

Please don't change --bind-to. Just use -oversubscribe. That will run with as many processes as you specify, even if that puts multiple processes on the same core. Then you don't need any guards etc.
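mkuron's suggestion would change the generated test command along these lines. This is a sketch of the invocation only, not the actual CMake template; the rank count is copied from the log above:

```python
# --oversubscribe tells OpenMPI to allow more ranks than cores instead of
# aborting with the binding warning. Insert it before the rank count.
base_cmd = ["mpiexec", "-n", "13"]
oversubscribed_cmd = base_cmd[:1] + ["--oversubscribe"] + base_cmd[1:]
print(" ".join(oversubscribed_cmd))  # mpiexec --oversubscribe -n 13
```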

@jngrad jngrad changed the title local false pass in python testsuite some python tests fail silently on some architectures with OpenMPI 2.X Nov 21, 2019
@jngrad jngrad self-assigned this Nov 21, 2019
@jngrad jngrad added this to the Espresso 4.1.2 milestone Nov 21, 2019
@bors bors bot closed this as completed in d588ea8 Nov 22, 2019