Kokkos: Timing-based test Kokkos_CoreUnitTest_Default_MPI_1 randomly failing in CUDA PR builds #11940
Comments
Timing-based tests are a problem when machines are stressed by running as many tests at a time as possible, and they tend to fail randomly in such situations. One can therefore argue that no timing-based tests should run as part of the Trilinos PR builds, and the same argument applies to this single unit test. I will add these disables to PR #11863 so that they can be approved as part of that PR.
So what is interesting about this is that according to: Trilinos/packages/framework/ini-files/config-specs.ini Lines 358 to 359 in f9519b8
Therefore, I will fix the names of the Kokkos tests in that file and add that as a commit to PR #11863. Any objections?
@bartlettroscoe thanks for updating the file, this will also resolve the
These changes should now be in Trilinos develop.
Indeed. Closing as complete.
@trilinos/tpetra, @trilinos/framework, @sebrowne, @ndellingwood
Description
As shown in this query, the test:
Kokkos_CoreUnitTest_Default_MPI_1
appears to be failing randomly in the CUDA build:
rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables
When the test fails, it shows:
Looking at the code in kokkos/core/unit_test/TestSharedSpace.cpp, that test fails when the checked condition evaluates to false. So this looks to be a timing-based test (and a timing-based test in a CUDA build, no less, where contention for the GPU is a real problem). I verified that this test fails randomly even with the native Kokkos build system as part of reproducing the failure in #11863 (comment).
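To illustrate why a check like this is fragile under load, here is a minimal C++ sketch of the general shape of a timing-based assertion. It is not the code from TestSharedSpace.cpp: the helper names `time_seconds` and `timing_check_passes` and the `tolerance` parameter are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: wall-clock seconds taken by one call to `work`.
// The measurement includes any time spent waiting for a contended
// CPU or GPU, not just the time spent doing the work itself.
double time_seconds(const std::function<void()>& work) {
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical pass/fail rule of a timing-based test: the measured path
// must finish within `tolerance` times the duration of a reference path.
bool timing_check_passes(double measured_seconds, double reference_seconds,
                         double tolerance) {
  assert(tolerance > 0.0);
  return measured_seconds <= tolerance * reference_seconds;
}
```

With a tolerance of 1.5, a measured time of 1.0 s against a reference of 1.0 s passes, while 2.0 s against the same reference fails. If contention from other tests sharing the machine stretches only one of the two measurements, the check can flip from pass to fail even though the code under test has not changed.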
NOTE: As shown in this query, this test has run only 9 times in this build since 2023-05-01. (This is because the Kokkos tests only get run when a global file is changed or when Kokkos is changed.) Before PR #11863, the last time this test ran in this build was on 2023-05-16. Perhaps something changed in how these tests are run that is causing this test to fail?