KokkosCore_UnitTest_Cuda_MPI_1 failing in ATDM Trilinos 'waterman', 'ats2'/'vortex', 'ride', 'sems-rhel7' CUDA 'opt' builds starting 2020-02-04 #6799
Comments
@bartlettroscoe that test |
The test:
failed in the build:
yesterday as shown here showing:
@ndellingwood, is this another timing problem? Should we expect to be seeing more random failures like this? |
@bartlettroscoe I don't think that test ( |
I could not reproduce the failure of that test in either a Kokkos VOTD develop branch or a Trilinos VOTD develop branch, and I see nothing in the test that depends on unreliable pass/fail criteria that would cause randomness in the results. @crtrott can running multiple tests on a GPU using |
I tried something simple to see if I could reproduce the failure: I launched a job on Waterman that ran the test 10000 times in a Kokkos build and in a Trilinos build (using the ATDM environment configuration provided earlier), but saw no occurrences of the failure. |
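The repeat-run check described above can be sketched roughly as follows. This is a minimal sketch, not the script that was actually used: the helper name `count_failures`, the default timeout, and whatever test command is passed in are all assumptions made for illustration.

```python
import subprocess
import sys
from typing import List, Sequence


def count_failures(cmd: Sequence[str], repetitions: int,
                   timeout_s: float = 600.0) -> List[int]:
    """Run `cmd` `repetitions` times; return indices of runs that failed.

    A run counts as failed if it exits with a nonzero status or exceeds
    `timeout_s` seconds (hypothetical default, not taken from the issue).
    """
    failed = []
    for i in range(repetitions):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            if result.returncode != 0:
                failed.append(i)
        except subprocess.TimeoutExpired:
            # Treat a hang as a failure, since the test was also timing out.
            failed.append(i)
    return failed


# Demo with a trivially passing command (the Python interpreter itself).
demo = count_failures([sys.executable, "-c", "pass"], repetitions=3)
print(f"{len(demo)} failures in 3 runs")
```

In practice `cmd` would be the path to the unit test executable in the build directory. A loop like this only exercises the test in isolation, so it may not catch failures that depend on other tests sharing the GPU at the same time.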
@ndellingwood, it may only occur when running it with all of the other tests. I have updated the instructions as such. |
@ndellingwood, could this happen when multiple kernels are running on the same GPU at the same time? |
@bartlettroscoe I'm not certain. It's not clear to me whether this could impact the atomic operations or disrupt something in the test that's pounding the atomics. |
FYI: As shown in this query, there are more builds that show the error which include:
This is not a fluke. Again, the error is in the unit test
|
FYI: As shown in this query, we are also seeing failures in the unit test
with history:
|
FYI: As shown in this query and this query, this test:
is also now failing every testing day starting 2020-03-22 in the build:
showing the errors:
and
|
@ndellingwood, sorry I missed this comment of yours from before. Yes, we can disable just those unit tests for just the ATDM Trilinos builds (or just the CUDA builds) as described in: I think we likely just want to disable these for all ATDM Trilinos CUDA builds. If that is the case, the instructions for doing that are in: and use the CMake cache var
But I think you want to put this in the file:
in the if block for CUDA builds and just disable these unit tests in all CUDA builds |
FYI: As shown here, we also saw this test:
failing today 2020-03-25 in the build:
showing:
That suggests that these unit tests should be disabled in all of the ATDM Trilinos CUDA builds for the time being. @ndellingwood, do you just want me to do this, create the PR, and have you review it? |
@bartlettroscoe that would be great thanks, I pinged the corresponding kokkos issue as well with your comment. |
@ndellingwood, okay, I assigned this issue to myself for now and will work to get these unit tests disabled. |
@crtrott yes, I can group the tests together, and if you like I'll place them in a new test with a descriptive name like @bartlettroscoe mentioned |
I opened kokkos/kokkos#3405 and self-assigned with the request to group these time-based tests into a common executable |
Test results for issue #6799 as of 2020-10-04
Tests with issue trackers Passed: twip=6
This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat. |
Test results for issue #6799 as of 2020-10-11
Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=1 |
Test results for issue #6799 as of 2020-10-18
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-10-25
Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=1 |
Test results for issue #6799 as of 2020-11-01
Tests with issue trackers Passed: twip=6
Tests with issue trackers Missing: twim=1 |
Test results for issue #6799 as of 2020-11-08
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-11-15
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-11-22
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-11-29
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-12-06
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-12-13
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2020-12-21
Tests with issue trackers Passed: twip=2 |
Test results for issue #6799 as of 2020-12-27
Tests with issue trackers Passed: twip=2 |
Test results for issue #6799 as of 2021-01-10
Tests with issue trackers Passed: twip=7 |
Test results for issue #6799 as of 2021-01-17
Tests with issue trackers Passed: twip=7 |
Shouldn't this be closed? It has been passing consistently for a while now, hasn't it? |
Basically it looks like this was fixed in Kokkos 3.2.01 |
Closing since Grover posted more than two consecutive comments showing this test as passing. |
CC: @trilinos/kokkos, @kddevin (Trilinos Data Services Product Lead)
Next Action Status
Unit tests cuda.debug_pin_um_to_host and cuda.debug_serial_execution are fragile and need to be rewritten (see kokkos/kokkos#2506). These two unit tests are disabled in all ATDM Trilinos CUDA PR builds in PR #7407, which has been merged to 'atdm-nightly' in commit 4804b08. Next: Waiting for confirmation on CDash that this test is passing and the unit tests are not running in ATDM Trilinos CUDA builds starting testing day 2020-05-21 ...
Description
As shown in this query and this query the test:
KokkosCore_UnitTest_Cuda_MPI_1
in the builds:
Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
Trilinos-atdm-waterman_cuda-9.2_shared_opt
started failing and timing out on 'waterman' on testing day 2020-02-04, which was the first day after the Kokkos 2.99 update.
As shown in this query, when the test does not time out, it fails the unit test
cuda.debug_pin_um_to_host
showing:
(That output gives zero clue why the test failed but at least it gives a line number.)
Current Status on CDash
Steps to Reproduce
One should be able to reproduce this failure on the machine 'waterman' as described in:
The specific commands given for the system 'waterman' are provided at:
The exact commands to reproduce this failing test, for the build Trilinos-atdm-waterman-cuda-9.2-opt, for example, should be: