Kokkos: Timing-based test Kokkos_CoreUnitTest_Default_MPI_1 randomly failing in CUDA PR builds #11940

Closed
bartlettroscoe opened this issue Jun 1, 2023 · 6 comments
Labels
impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Framework Issues that fall under the Trilinos Framework Product Area pkg: Kokkos type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jun 1, 2023

@trilinos/tpetra, @trilinos/framework, @sebrowne, @ndellingwood

Description

As shown in this query:

[image: query results]

the test:

  • Kokkos_CoreUnitTest_Default_MPI_1

appears to be randomly failing in the CUDA build:

  • rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables

When the test fails, it shows:

[ RUN      ] defaultdevicetype.shared_space
Page size as reported by os: 4096 bytes 
Allocating 100 pages of memory in SharedSpace.
Behavior found: 
SharedSpace is as fast as local space on repeated access: 0, we expect true 

Please look at the following timings. The first access in a different ExecutionSpace is not evaluated for the test. As we expect the memory to migrate during the first access it might have a higher cycle count than subsequent accesses, depending on your hardware. If the cycles are more than 1.5 times the cycles for pure local memory access, we assume a page migration happened.

################SHARED SPACE####################
DeviceExecutionSpace timings of run 0:
TimingResult contains 10 results:
Duration of loop 0 is 945 clock cycles. Migration assumed.
Duration of loop 1 is 270 clock cycles. 
Duration of loop 2 is 268 clock cycles. 
Duration of loop 3 is 268 clock cycles. 
Duration of loop 4 is 268 clock cycles. 
Duration of loop 5 is 268 clock cycles. 
Duration of loop 6 is 268 clock cycles. 
Duration of loop 7 is 268 clock cycles. 
Duration of loop 8 is 268 clock cycles. 
Duration of loop 9 is 268 clock cycles. 
HostExecutionSpace timings of run 0:
TimingResult contains 10 results:
Duration of loop 0 is 21 clock cycles. 
Duration of loop 1 is 16 clock cycles. 
Duration of loop 2 is 16 clock cycles. 
Duration of loop 3 is 16 clock cycles. 
Duration of loop 4 is 16 clock cycles. 
Duration of loop 5 is 15 clock cycles. 
Duration of loop 6 is 16 clock cycles. 
Duration of loop 7 is 16 clock cycles. 
Duration of loop 8 is 16 clock cycles. 
Duration of loop 9 is 16 clock cycles. 
DeviceExecutionSpace timings of run 1:
TimingResult contains 10 results:
Duration of loop 0 is 438468 clock cycles. Migration assumed.
Duration of loop 1 is 279 clock cycles. 
Duration of loop 2 is 297 clock cycles. 
Duration of loop 3 is 449 clock cycles. Migration assumed.
Duration of loop 4 is 268 clock cycles. 
Duration of loop 5 is 268 clock cycles. 
Duration of loop 6 is 275 clock cycles. 
Duration of loop 7 is 469 clock cycles. Migration assumed.
Duration of loop 8 is 276 clock cycles. 
Duration of loop 9 is 651 clock cycles. Migration assumed.
HostExecutionSpace timings of run 1:
TimingResult contains 10 results:
Duration of loop 0 is 19 clock cycles. 
Duration of loop 1 is 16 clock cycles. 
Duration of loop 2 is 16 clock cycles. 
Duration of loop 3 is 16 clock cycles. 
Duration of loop 4 is 16 clock cycles. 
Duration of loop 5 is 16 clock cycles. 
Duration of loop 6 is 16 clock cycles. 
Duration of loop 7 is 16 clock cycles. 
Duration of loop 8 is 16 clock cycles. 
Duration of loop 9 is 16 clock cycles. 
DeviceExecutionSpace timings of run 2:
TimingResult contains 10 results:
Duration of loop 0 is 445564 clock cycles. Migration assumed.
Duration of loop 1 is 279 clock cycles. 
Duration of loop 2 is 275 clock cycles. 
Duration of loop 3 is 274 clock cycles. 
Duration of loop 4 is 275 clock cycles. 
Duration of loop 5 is 274 clock cycles. 
Duration of loop 6 is 275 clock cycles. 
Duration of loop 7 is 275 clock cycles. 
Duration of loop 8 is 277 clock cycles. 
Duration of loop 9 is 277 clock cycles. 
HostExecutionSpace timings of run 2:
TimingResult contains 10 results:
Duration of loop 0 is 20 clock cycles. 
Duration of loop 1 is 16 clock cycles. 
Duration of loop 2 is 16 clock cycles. 
Duration of loop 3 is 16 clock cycles. 
Duration of loop 4 is 16 clock cycles. 
Duration of loop 5 is 16 clock cycles. 
Duration of loop 6 is 16 clock cycles. 
Duration of loop 7 is 16 clock cycles. 
Duration of loop 8 is 16 clock cycles. 
Duration of loop 9 is 16 clock cycles. 
################LOCAL SPACE####################
TimingResult contains 10 results:
Duration of loop 0 is 326 clock cycles. 
Duration of loop 1 is 269 clock cycles. 
Duration of loop 2 is 269 clock cycles. 
Duration of loop 3 is 267 clock cycles. 
Duration of loop 4 is 268 clock cycles. 
Duration of loop 5 is 268 clock cycles. 
Duration of loop 6 is 268 clock cycles. 
Duration of loop 7 is 268 clock cycles. 
Duration of loop 8 is 268 clock cycles. 
Duration of loop 9 is 268 clock cycles. 
TimingResult contains 10 results:
Duration of loop 0 is 16 clock cycles. 
Duration of loop 1 is 16 clock cycles. 
Duration of loop 2 is 16 clock cycles. 
Duration of loop 3 is 16 clock cycles. 
Duration of loop 4 is 16 clock cycles. 
Duration of loop 5 is 16 clock cycles. 
Duration of loop 6 is 16 clock cycles. 
Duration of loop 7 is 16 clock cycles. 
Duration of loop 8 is 16 clock cycles. 
Duration of loop 9 is 16 clock cycles. 
../Trilinos/packages/kokkos/core/unit_test/TestSharedSpace.cpp:216: Failure
Value of: passed
  Actual: false
Expected: true
[  FAILED  ] defaultdevicetype.shared_space (69 ms)

Looking at the code in kokkos/core/unit_test/TestSharedSpace.cpp, that test fails when:

bool passed = (fastAsLocalOnRepeatedAccess);

evaluates to false. So this looks to be a timing-based test (and a timing-based test in a CUDA build, no less, where contention for the GPU is a real problem).
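To make the failure mode concrete, here is a minimal sketch of the rule stated in the test output (repeated accesses that take more than 1.5 times the local-memory cycle count are treated as page migrations). This is not the code from TestSharedSpace.cpp: the helper name `migration_on_repeated_access` is hypothetical, and using the steady-state value of 268 cycles from the LOCAL SPACE section as the baseline is an assumption, since the issue does not show how the real test derives it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (NOT the Kokkos implementation): returns true when any
// repeated access (loop index >= 1) takes more than `factor` times the local
// baseline. Loop 0 is skipped because the first access is expected to migrate.
bool migration_on_repeated_access(const std::vector<std::uint64_t>& cycles,
                                  std::uint64_t local_baseline,
                                  double factor = 1.5) {
  assert(factor > 0.0);
  const double threshold = factor * static_cast<double>(local_baseline);
  for (std::size_t i = 1; i < cycles.size(); ++i) {
    if (static_cast<double>(cycles[i]) > threshold) return true;
  }
  return false;
}

// DeviceExecutionSpace timings copied from the failing log above.
const std::vector<std::uint64_t> device_run1 = {438468, 279, 297, 449, 268,
                                                268,    275, 469, 276, 651};
const std::vector<std::uint64_t> device_run2 = {445564, 279, 275, 274, 275,
                                                274,    275, 275, 277, 277};

// Assumed baseline: the steady-state device value in the LOCAL SPACE section.
const std::uint64_t assumed_local_baseline = 268;
```

With a 268-cycle baseline the threshold is 402 cycles. Loops 3, 7 and 9 of run 1 (449, 469 and 651 cycles) exceed it, while every repeated access in run 2 stays below it, so a few noisy iterations on a contended GPU are enough to flip the result.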

I verified that this test fails randomly even with the native Kokkos build system as part of reproducing the failure in #11863 (comment).

NOTE: As shown in this query:

[image: query results]

this test has run only 9 times in this build since 2023-05-01. (This is because the Kokkos tests are run only when a global file or Kokkos itself is changed.) Before PR #11863, the last time this test ran in this build was 2023-05-16. Perhaps something changed in how these tests are run that is causing this test to fail?

@bartlettroscoe
Member Author

Timing-based tests tend to fail randomly when machines are stressed by running as many tests at a time as possible. One can therefore argue that no timing-based tests should be run as part of the Trilinos PR builds.

By that argument, the single unit test defaultdevicetype.shared_space should be disabled in Trilinos PR non-performance testing (i.e. when PERFORMANCE is missing from Trilinos_TEST_CATEGORIES), and the test Kokkos_CoreUnitTest_CudaTimingBased should not be run in Trilinos non-performance testing either.

I will add these disables to PR #11863 so that they can be approved as part of that PR.

@bartlettroscoe
Member Author

Interestingly, according to these existing settings:

opt-set-cmake-var KokkosCore_UnitTest_Default_MPI_1_EXTRA_ARGS STRING : --gtest_filter=-*defaultdevicetype.shared_space
opt-set-cmake-var KokkosCore_UnitTest_Default_EXTRA_ARGS STRING : --gtest_filter=-*defaultdevicetype.shared_space

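the unit test defaultdevicetype.shared_space was already being disabled. What happened is that my refactoring scripts in PR #11808 did not apply the renaming to these options: they still use the old KokkosCore_UnitTest_ prefix, while the test is now named Kokkos_CoreUnitTest_Default_MPI_1, so the filter no longer reaches the test.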
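For reference, a `--gtest_filter` value that starts with `-` runs everything except the tests matching the patterns after the dash. The sketch below is a simplified re-implementation of that matching rule for illustration only, not GoogleTest's own code: it handles `*`, `?`, `:`-separated patterns and a single `-` section. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Glob match supporting '*' (any sequence) and '?' (any single character).
bool glob_match(const std::string& pat, const std::string& s,
                std::size_t p = 0, std::size_t i = 0) {
  if (p == pat.size()) return i == s.size();
  if (pat[p] == '*') {
    // '*' matches nothing here, or consumes one more character of s.
    return glob_match(pat, s, p + 1, i) ||
           (i < s.size() && glob_match(pat, s, p, i + 1));
  }
  return i < s.size() && (pat[p] == '?' || pat[p] == s[i]) &&
         glob_match(pat, s, p + 1, i + 1);
}

// True if `name` matches any pattern in a ':'-separated list.
bool matches_any(const std::string& patterns, const std::string& name) {
  std::size_t start = 0;
  while (start <= patterns.size()) {
    std::size_t end = patterns.find(':', start);
    if (end == std::string::npos) end = patterns.size();
    if (glob_match(patterns.substr(start, end - start), name)) return true;
    start = end + 1;
  }
  return false;
}

// Simplified filter rule: positive patterns, then '-' and negative patterns.
// An empty positive part means "match everything".
bool would_run(const std::string& filter, const std::string& full_name) {
  const std::size_t dash = filter.find('-');
  std::string positive = filter.substr(0, dash);
  const std::string negative =
      dash == std::string::npos ? std::string() : filter.substr(dash + 1);
  if (positive.empty()) positive = "*";
  if (!matches_any(positive, full_name)) return false;
  return negative.empty() || !matches_any(negative, full_name);
}
```

Under this model, `would_run("-*defaultdevicetype.shared_space", "defaultdevicetype.shared_space")` is false, while any other test name (for example a made-up `defaultdevicetype.some_other_test`) still runs. The filter itself is therefore fine; it only needs to be attached to the right test.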

Therefore, I will fix the names of the Kokkos tests in that file and add that as a commit to PR #11863.

Any objections?

@ndellingwood
Contributor

Therefore, I will fix the names of the Kokkos tests in that file and add that as a commit to PR #11863.

@bartlettroscoe thanks for updating the file. This will also resolve the Kokkos_CoreUnitTest_CudaTimingBased failures, as there are several occurrences of the old KokkosCore_UnitTest_CudaTimingBased_MPI_1_DISABLE naming in the .ini file, along with other Kokkos tests that need the same renaming treatment.
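The reason stale names fail silently can be modelled as a plain name-keyed lookup: an option keyed by the old test name is simply never found for the renamed test, and nothing reports an error. The sketch below is a hypothetical model, not the actual CMake logic. `extra_args_for` and the two option tables are made up for illustration, and the corrected key shown is assumed to be the new test name plus `_EXTRA_ARGS`; the contents of the actual fix are not shown in this thread.

```cpp
#include <map>
#include <string>

using Options = std::map<std::string, std::string>;

// Hypothetical model: extra arguments are looked up by "<test name>_EXTRA_ARGS".
// Returns an empty string when no option is set for that test.
std::string extra_args_for(const Options& opts, const std::string& test_name) {
  const Options::const_iterator it = opts.find(test_name + "_EXTRA_ARGS");
  return it == opts.end() ? std::string() : it->second;
}

// Option keyed by the old test name, as quoted earlier in this thread.
Options stale_options() {
  Options opts;
  opts["KokkosCore_UnitTest_Default_MPI_1_EXTRA_ARGS"] =
      "--gtest_filter=-*defaultdevicetype.shared_space";
  return opts;
}

// ASSUMED corrected key: new test name plus "_EXTRA_ARGS".
Options renamed_options() {
  Options opts;
  opts["Kokkos_CoreUnitTest_Default_MPI_1_EXTRA_ARGS"] =
      "--gtest_filter=-*defaultdevicetype.shared_space";
  return opts;
}
```

Looking up `Kokkos_CoreUnitTest_Default_MPI_1` in the stale table yields no arguments, so the test runs unfiltered; with the renamed key the filter is found again. The same reasoning applies to the `_DISABLE` variables mentioned above.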

@bartlettroscoe
Member Author

FYI: This is fixed in commit 59c2e4d of PR #11863. Once that PR is merged, this issue can be closed.

@ndellingwood
Contributor

These changes should now be in Trilinos develop

@bartlettroscoe
Member Author

These changes should now be in Trilinos develop

Indeed. Closing as complete.
