Test NOX_Tpetra_1DFEM_MPI_4 random failure showing "Concurrent modification of host and device views in DualView" starting 2020-02-09 #6790

Closed
bartlettroscoe opened this issue Feb 10, 2020 · 9 comments
Labels
  • ATDM Sev: Blocker - Problems that make Trilinos unfit to be adopted by one or more ATDM APPs
  • client: ATDM - Any issue primarily impacting the ATDM project
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Data Services - Issues that fall under the Trilinos Data Services Product Area
  • PA: Linear Solvers - Issues that fall under the Trilinos Linear Solvers Product Area
  • pkg: Kokkos
  • pkg: NOX
  • pkg: Tpetra
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

CC: @trilinos/kokkos, @trilinos/tpetra, @trilinos/nox, @kddevin (Data Services Product Lead), @rppawlo (Nonlinear Solvers Product Lead)

As shown here, the test:

  • NOX_Tpetra_1DFEM_MPI_4

in the build:

  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-debug

failed with the error:

0. NOX_Tpetra_1DFEM_AnalyticJacobian_NoPrec_UnitTest ... Kokkos::DualView::modify_host ERROR: Concurrent modification of host and device views in DualView "MV::DualView"
[sems-srn-rhel6-slave-01:15867] *** Process received signal ***
[sems-srn-rhel6-slave-01:15867] Signal: Aborted (6)
[sems-srn-rhel6-slave-01:15867] Signal code:  (-6)
Kokkos::DualView::modify_host ERROR: Concurrent modification of host and device views in DualView "MV::DualView"
Kokkos::DualView::modify_host ERROR: Concurrent modification of host and device views in DualView "MV::DualView"
[sems-srn-rhel6-slave-01:15869] *** Process received signal ***
[sems-srn-rhel6-slave-01:15870] *** Process received signal ***
[sems-srn-rhel6-slave-01:15870] Signal: Aborted (6)
[sems-srn-rhel6-slave-01:15870] Signal code:  (-6)
[sems-srn-rhel6-slave-01:15869] Signal: Aborted (6)
[sems-srn-rhel6-slave-01:15869] Signal code:  (-6)
Kokkos::DualView::modify_host ERROR: Concurrent modification of host and device views in DualView "MV::DualView"
[sems-srn-rhel6-slave-01:15868] *** Process received signal ***
[sems-srn-rhel6-slave-01:15868] Signal: Aborted (6)
[sems-srn-rhel6-slave-01:15868] Signal code:  (-6)
[sems-srn-rhel6-slave-01:15867] [ 0] /lib64/libpthread.so.0[0x309a60f7e0]
[sems-srn-rhel6-slave-01:15870] [ 0] /lib64/libpthread.so.0[0x309a60f7e0]
[sems-srn-rhel6-slave-01:15870] [ 1] [sems-srn-rhel6-slave-01:15869] [ 0] /lib64/libpthread.so.0[0x309a60f7e0]
[sems-srn-rhel6-slave-01:15869] [ 1] [sems-srn-rhel6-slave-01:15867] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x309a2324f5]
[sems-srn-rhel6-slave-01:15867] [ 2] /lib64/libc.so.6(gsignal+0x35)[0x309a2324f5]
[sems-srn-rhel6-slave-01:15869] [ 2] /lib64/libc.so.6(gsignal+0x35)[0x309a2324f5]
[sems-srn-rhel6-slave-01:15870] [ 2] /lib64/libc.so.6(abort+0x175)[0x309a233cd5]
[sems-srn-rhel6-slave-01:15870] [ 3] /lib64/libc.so.6(abort+0x175)[0x309a233cd5]

This is the first such failure I can find since the Kokkos 2.99 promotion on 2020-02-02. However, given that this appears to be a random failure, and given the large cost of the last major random error found in Trilinos, I thought it would be good to raise this issue early in case it turns into something big. (But this might just be a defect in this unit test and not in production code. Or it might just be a hardware fluke that we never see again; who knows.)

@bartlettroscoe bartlettroscoe added pkg: Kokkos pkg: Tpetra pkg: NOX client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area type: bug The primary issue is a bug in Trilinos code or tests labels Feb 10, 2020
@mhoemmen
Contributor

@ndellingwood just fixed the one Tpetra issue. Usually this is an issue in using Tpetra, where users just assume that they can get views without respecting the sync / modify interface. The only thing that changed was that Kokkos now checks DualView flags by default in a debug build. It didn't use to do that.
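
A minimal sketch of the kind of flag check described above. `ToyDualView` and the two helper functions are hypothetical names invented for illustration; this is a toy model, not the Kokkos::DualView implementation, and it throws an exception where the real debug check aborts the process (the "Signal: Aborted (6)" lines in the log).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a buffer with a host copy and a device copy.
// NOT the Kokkos API; only the flag bookkeeping is modeled, no data.
class ToyDualView {
public:
  explicit ToyDualView(std::string label) : label_(std::move(label)) {}

  // Mark the host copy as modified. Refuse if the device copy already
  // holds changes that were never synced to the host.
  void modify_host() {
    if (device_modified_) fail("modify_host");
    host_modified_ = true;
  }

  // Mark the device copy as modified, with the symmetric check.
  void modify_device() {
    if (host_modified_) fail("modify_device");
    device_modified_ = true;
  }

  // A real sync would copy data across; the toy only clears the flag.
  void sync_host() { device_modified_ = false; }
  void sync_device() { host_modified_ = false; }

  bool need_sync_host() const { return device_modified_; }
  bool need_sync_device() const { return host_modified_; }

private:
  void fail(const std::string& where) const {
    throw std::runtime_error(
        where + " ERROR: Concurrent modification of host and device views"
                " in DualView \"" + label_ + "\"");
  }

  std::string label_;
  bool host_modified_ = false;
  bool device_modified_ = false;
};

// Device write followed by a host write with no sync in between.
// Returns true if the check rejects the sequence.
bool unsafe_sequence_is_rejected() {
  ToyDualView v("MV::DualView");
  v.modify_device();
  try {
    v.modify_host();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}

// Same sequence, but with sync_host() before the host write.
// Returns true if the host write is accepted and leaves only the
// device copy out of date.
bool safe_sequence_is_accepted() {
  ToyDualView v("MV::DualView");
  v.modify_device();
  v.sync_host();
  v.modify_host();
  return v.need_sync_device() && !v.need_sync_host();
}
```

In this model, calling `sync_host()` between the device write and the host write is what clears the flag, which is the sync / modify discipline the comment refers to.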

@ndellingwood
Contributor

Looks like an error in nox/test/tpetra/ME_Tpetra_1DFEM_def.hpp. I tested a fix that made the test pass for me, though there may be some additional issues with missing braces. Off to a meeting, then I'll put in a PR.

ndellingwood added a commit that referenced this issue Feb 10, 2020
Address issue #6790

 Changes to be committed:
	modified:   test/tpetra/ME_Tpetra_1DFEM_def.hpp
@ndellingwood
Contributor

PR #6793 issued.
There may be problems or confusion about the handling of uninitialized Tpetra MVs, where the "device" view is marked modified (hence the failure from the modify_host call immediately afterward). The code may be assuming that creating an MV (initialized or not) returns an MV whose host and device views are in a "clean" sync state.

@mhoemmen can you clarify if an initialized Tpetra MV returns with views having clean sync states?

@mhoemmen
Contributor

@ndellingwood wrote:

can you clarify if an initialized Tpetra MV returns with views having clean sync states?

Not necessarily. MultiVector reserves the right to do the initialization wherever it likes. Even uninitialized MultiVector objects may be marked modified on either side in debug mode (since debug mode prefills with NaN).

@mhoemmen
Contributor

@ndellingwood Just to clarify:

Tpetra::MultiVector<> X(map, numVecs);
// X may either be fully sync'd, or may be modified on either side.
assert((X.need_sync_host() && ! X.need_sync_device()) ||
  (! X.need_sync_host() && X.need_sync_device()) ||
  (! X.need_sync_host() && ! X.need_sync_device()));
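
A sketch of what that assertion implies for calling code, assuming nothing about the initial state beyond the three cases listed. `ToyFlags`, `is_valid_initial_state`, and `prepare_host_write` are hypothetical names for illustration only; this models the two flags and is not Tpetra or Kokkos code.

```cpp
#include <cassert>

// Hypothetical model of the two sync flags on a freshly created object.
struct ToyFlags {
  bool host_modified = false;    // host copy has changes the device lacks
  bool device_modified = false;  // device copy has changes the host lacks

  bool need_sync_host() const { return device_modified; }
  bool need_sync_device() const { return host_modified; }
};

// The three-way assertion above reduces to: never out of date on both sides.
bool is_valid_initial_state(const ToyFlags& f) {
  return !(f.need_sync_host() && f.need_sync_device());
}

// Make a host write legal from any valid starting state: bring the host
// copy up to date first if needed, then mark the host as modified.
void prepare_host_write(ToyFlags& f) {
  if (f.need_sync_host()) {
    // A real implementation would copy device data to the host here.
    f.device_modified = false;
  }
  f.host_modified = true;
}
```

The point of the pattern is that the caller checks and syncs before marking the host side modified, rather than assuming a new object starts out clean.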

@ndellingwood
Contributor

Thanks for the clarification @mhoemmen !

ndellingwood added a commit that referenced this issue Feb 10, 2020
Address issue #6790

 Changes to be committed:
  modified:   test/tpetra/ME_Tpetra_1DFEM_def.hpp
@ndellingwood
Contributor

PR #6793 merged.

@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Feb 11, 2020
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Feb 12, 2020
Address issue trilinos#6790

 Changes to be committed:
  modified:   test/tpetra/ME_Tpetra_1DFEM_def.hpp
@bartlettroscoe bartlettroscoe added the ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs label Feb 24, 2020
@grover-trilinos

Test results for issue #6790 as of 2020-08-16

Tests with issue trackers Passed: twip=1

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| sems-rhel6 | Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-debug | NOX_Tpetra_1DFEM_MPI_4 | Passed | Completed | 30 | 0 | 30 | #6790 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

@grover-trilinos

Test results for issue #6790 as of 2020-08-23

Tests with issue trackers Passed: twip=1

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| sems-rhel6 | Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-debug | NOX_Tpetra_1DFEM_MPI_4 | Passed | Completed | 29 | 0 | 29 | #6790 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.


5 participants