MueLu_UnitTestsTpetra_MPI* tests failing in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting 2019-06-02 #5310

Closed
fryeguy52 opened this issue Jun 4, 2019 · 8 comments
Labels
ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs client: ATDM Any issue primarily impacting the ATDM project impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests

Comments

@fryeguy52
Contributor

Bug Report

CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Description

As shown in this query, the tests:

  • MueLu_UnitTestsTpetra_MPI_1
  • MueLu_UnitTestsTpetra_MPI_4

have been failing since 2019-06-02 in the build:

  • Trilinos-atdm-waterman_cuda-9.2_shared_opt
New commits on 2019-06-02
*** Base Git Repo: Trilinos
3f8ed2b:  Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett <[email protected]>
Date:   Sat Jun 1 21:05:19 2019 -0600

d7322ba:  Merge pull request #5287 from william76/xpetra-eti-TpetraBlockCrsMatrix-v001
Author: Chris Siefert <[email protected]>
Date:   Sat Jun 1 13:52:33 2019 -0600

e873f77:  Tpetra: Allow CrsMatrix with StaticProfile to resize during import/export (#5268)
Author: Tim Fuller <[email protected]>
Date:   Sat Jun 1 08:29:53 2019 -0600

M	packages/tpetra/core/src/Tpetra_CrsGraph_decl.hpp
M	packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp
M	packages/tpetra/core/src/Tpetra_CrsMatrix_decl.hpp
M	packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp
M	packages/tpetra/core/src/Tpetra_Details_crsUtils.hpp
M	packages/tpetra/core/test/CrsMatrix/CMakeLists.txt
A	packages/tpetra/core/test/CrsMatrix/CrsMatrix_StaticImportExport.cpp
M	packages/tpetra/core/test/CrsMatrix/Tpetra_Test_CrsMatrix_WithGraph.hpp

dd2d23e:  Xpetra: ETI TpetraBlockCrsMatrix bug fixes #4 (compiles)
Author: William McLendon <[email protected]>
Date:   Thu May 30 17:16:10 2019 -0600

M	packages/muelu/test/unit_tests/BlackBoxPFactory.cpp
M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

b738ce6:  Xpetra: ETI TpetraBlockCrsMatrix bug fixes #3
Author: William McLendon <[email protected]>
Date:   Thu May 30 09:19:32 2019 -0600

M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

95840ee:  Xpetra: ETI TpetraBlockCrsMatrix bug fixes #2
Author: William McLendon <[email protected]>
Date:   Wed May 29 17:35:14 2019 -0600

M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

488da42:  Xpetra: ETI TpetraBlockCrsMatrix bug fixes #1
Author: William McLendon <[email protected]>
Date:   Tue May 28 18:03:22 2019 -0600

A	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
A	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

df2dc43:  Xpetra: ETI TpetraBlockCrsMatrix initial commit
Author: William McLendon <[email protected]>
Date:   Tue May 28 17:40:05 2019 -0600

M	packages/xpetra/src/CMakeLists.txt
D	packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix.hpp
M	packages/xpetra/src/Utils/ClassList/SC-LO-GO-NO.classList
M	packages/xpetra/src/Utils/ExplicitInstantiation/ETI_SC_LO_GO_NO_classes.cmake

Current Status on CDash

The current status of these tests can be found here

Steps to Reproduce

One should be able to reproduce this failure on as described in:

More specifically, the commands given for are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman_cuda-9.2_shared_opt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
 $TRILINOS_DIR
$ make NP=16
$ <system-run-tests-command>
@fryeguy52 fryeguy52 added type: bug The primary issue is a bug in Trilinos code or tests pkg: MueLu client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area labels Jun 4, 2019
@fryeguy52 fryeguy52 changed the title PackageName: General Summary of the Bug MueLu: MueLu_UnitTestsTpetra_MPI* tests failing on ATDM waterman build Jun 4, 2019
@bartlettroscoe
Member

@trilinos/muelu,

These tests are showing errors like the ones shown here:

 ...
 STS::magnitude(diagVec->norm1() - diagVec->getGlobalLength()) < 100*TMT::eps() = false == true = true : FAILED ==> /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:362
 ...
 [FAILED]  (0.00249 sec) TentativePFactory_kokkos_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_MakeTentativeVectorBasedUsingDefaultNullSpace_UnitTest
 Location: /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:264

Can someone please update this unit testing code to use TEST_FLOATING_EQUALITY() so we can see the actual numbers and how far off the comparison is? For example, you could use:

TEST_FLOATING_EQUALITY(
   STS::magnitude(diagVec->norm1()),
   STS::magnitude(diagVec->getGlobalLength()),
   STS::magnitude(100*TMT::eps()));

That would print the numbers being compared and the tolerance so we can see why this is failing.
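
To illustrate why printing the operands helps, here is a minimal standalone sketch that does not use Teuchos. The helper names `relativeError`, `absoluteCheck`, and `relativeCheck` are hypothetical and only approximate the idea behind a relative floating-point comparison; they are not the Teuchos implementation. One point worth keeping in mind is that the original assertion compares an absolute difference against 100*eps, while a relative check scales the difference by the magnitude of the operands, so the two can disagree for large norms.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

// Hypothetical stand-in for a relative-error helper (an assumption for
// illustration, not the Teuchos implementation).
inline double relativeError(double a, double b) {
  const double tiny = std::numeric_limits<double>::min();  // avoids 0/0
  return std::abs(a - b) / (tiny + std::max(std::abs(a), std::abs(b)));
}

// Absolute check in the style of the original assertion: it reports only
// pass/fail, so the size of the discrepancy stays hidden.
inline bool absoluteCheck(double a, double b, double tol) {
  return std::abs(a - b) < tol;
}

// Relative check that also prints both operands, the error, and the tolerance.
inline bool relativeCheck(double a, double b, double tol, std::ostream& out) {
  const double err = relativeError(a, b);
  const bool ok = err <= tol;
  out << "relErr(" << a << ", " << b << ") = " << err
      << (ok ? " <= " : " > ") << "tol = " << tol
      << (ok ? " : passed" : " : FAILED") << "\n";
  return ok;
}
```

For example, with a norm of 100 + 1e-12 and a global length of 100, `absoluteCheck` with a tolerance of 100*eps fails while `relativeCheck` with the same tolerance passes, and the printed line shows how large the error is.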

@lucbv
Contributor

lucbv commented Jun 5, 2019

@bartlettroscoe Sure, I can work on this. I recently modified a few things in MueLu to use the epsilon function from Teuchos instead of hard-coding a value, and that could be the reason for this failure.

lucbv added a commit to lucbv/Trilinos that referenced this issue Jun 5, 2019
Three updates are made to the testing infrastructure of MueLu:
      1) Fixing a bug in Structured Aggregation unit_test
      2) Using more appropriate Teuchos macros in TentativePFactory_kokkos unit_test
      3) reducing the amount of smoothing in StructuredRegion to speed up these tests
@lucbv
Contributor

lucbv commented Jun 5, 2019

@bartlettroscoe @fryeguy52 I have a PR in progress that updates a few tests in MueLu that were failing for various reasons. Among them is the offending unit test pointed out above.

I also want to point out that the pull request auto-tester does not seem to turn on Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS, which seems odd and potentially dangerous!

@bartlettroscoe
Member

@lucbv said:

I also want to point out that the pull request auto-tester does not seem to turn on Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS, which seems odd and potentially dangerous!

Right. Someone needs to clean up all of the flaky Trilinos tests so we can enable that. But for tests you control, just use the unit test driver Teuchos_StandardParallelUnitTestMain.cpp, which will globally reduce unit test results.
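
As a rough illustration of why a missing global reduction can hide failures, the sketch below simulates per-rank pass/fail flags with a plain vector instead of real MPI. The function names are hypothetical, and the assumption that only the first rank's result is reported without a reduction is made purely for illustration. In an actual MPI program the reduction would typically be a minimum (logical AND) taken across all ranks.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: combine per-rank results the way a global reduction
// would. Each entry is 1 if that rank's local tests passed and 0 otherwise.
inline bool globallyReducedSuccess(const std::vector<int>& perRankSuccess) {
  if (perRankSuccess.empty()) return true;  // nothing ran, so nothing failed
  // The minimum over all ranks is 1 only if every rank passed.
  return *std::min_element(perRankSuccess.begin(), perRankSuccess.end()) == 1;
}

// Without a reduction, only one rank's result is seen (the first rank here,
// which is an assumption for this sketch).
inline bool rootOnlySuccess(const std::vector<int>& perRankSuccess) {
  return perRankSuccess.empty() || perRankSuccess.front() == 1;
}
```

With flags `{1, 1, 0, 1}`, `rootOnlySuccess` reports a pass while `globallyReducedSuccess` reports a failure, which is the kind of masked failure the comment above warns about.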

trilinos-autotester added a commit that referenced this issue Jun 5, 2019
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: MueLu: updating some tests/unit_tests, see issue #5310
PR Author: lucbv
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 6, 2019
…s:develop' (8380623).

* trilinos-develop: (78 commits)
  MueLu: fixing an issue in the structured region driver data
  Updates to handle when a non RK stepper observer is passed in. (trilinos#5324)
  Xpetra: add forced map writing to unit test
  MueLu: updating some tests/unit_tests, see issue trilinos#5310
  Xpetra: add option to force writing of all maps
  Fixes to make stk build when BoostLib is not enabled.
  Add Intel 18.0.5 compiler to ATDM environment.
  Tpetra: Fix trilinos#4627
  Xpetra: don't write maps unnecessarily
  Fix trilinos#5151.
  Fix stk cmake files to disable unit and doc tests that depend on STKNGP_TEST if STKNGP_TEST is not enabled.
  MueLu: removing the Epetra path tests for unit-tests-kokkos, see issue trilinos#4325
  Xpetra: fix MultiVector unit-test for Epetra=OFF configuration, see issue trilinos#5300
  fix another type of quoting error
  TrilinosCouplings: More updates for Avatar-as-external-package
  Correct shell quoting in the package file replacement for python
  SEACAS: bug fix for may snapshot
  Python testing - resolve issues found during testing
  Stk update (trilinos#5289)
  Xpetra: fixing a specialization of the Xpetra::TpetraOperator see issue trilinos#5293
  ...
@lucbv
Contributor

lucbv commented Jun 6, 2019

@fryeguy52 @bartlettroscoe There is some progress: at least the serial test is now passing, see this query.
@csiefer2 do you have any time to look at the issue with the RAPShift factory on waterman?

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 6, 2019
…s:develop' (8380623).

* trilinos-develop: (81 commits)
  epetra: Fix for unguarded macro defined in SLU
  epetra: Fix for unguarded SuperLU macro
  MueLu: fixing an issue in the structured region driver data
  Xpetra: ETI conversion for Xpetra_TpetraExport.hpp
  Updates to handle when a non RK stepper observer is passed in. (trilinos#5324)
  Xpetra: add forced map writing to unit test
  MueLu: updating some tests/unit_tests, see issue trilinos#5310
  Xpetra: add option to force writing of all maps
  Fixes to make stk build when BoostLib is not enabled.
  Add Intel 18.0.5 compiler to ATDM environment.
  Tpetra: Fix trilinos#4627
  Xpetra: don't write maps unnecessarily
  Fix trilinos#5151.
  Fix stk cmake files to disable unit and doc tests that depend on STKNGP_TEST if STKNGP_TEST is not enabled.
  MueLu: removing the Epetra path tests for unit-tests-kokkos, see issue trilinos#4325
  Xpetra: fix MultiVector unit-test for Epetra=OFF configuration, see issue trilinos#5300
  fix another type of quoting error
  TrilinosCouplings: More updates for Avatar-as-external-package
  Correct shell quoting in the package file replacement for python
  SEACAS: bug fix for may snapshot
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2019
…s:develop' (8380623).

* trilinos-develop: (86 commits)
  Fixed review erros for AMGX rebuilding
  MueLu: Avatar interface mods
  TrilinosCouplings: More cleanup to example
  Fixed rebuild issue for MueLu+AMGX run
  epetra: Fix for unguarded macro defined in SLU
  epetra: Fix for unguarded SuperLU macro
  MueLu: fixing an issue in the structured region driver data
  Fix std::complex<> interface to GEEV/GGEVX in Teuchos::LAPACK
  Xpetra: ETI conversion for Xpetra_TpetraExport.hpp
  Updates to handle when a non RK stepper observer is passed in. (trilinos#5324)
  Xpetra: add forced map writing to unit test
  MueLu: updating some tests/unit_tests, see issue trilinos#5310
  Xpetra: add option to force writing of all maps
  Fixes to make stk build when BoostLib is not enabled.
  Add Intel 18.0.5 compiler to ATDM environment.
  Tpetra: Fix trilinos#4627
  Xpetra: don't write maps unnecessarily
  Fix trilinos#5151.
  Fix stk cmake files to disable unit and doc tests that depend on STKNGP_TEST if STKNGP_TEST is not enabled.
  MueLu: removing the Epetra path tests for unit-tests-kokkos, see issue trilinos#4325
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 8, 2019
…s:develop' (8380623).

* trilinos-develop: (89 commits)
  bug fixes
  Panzer: fix multiblock test for order check
  Tpetra: Fix trilinos#5336
  Fixed review erros for AMGX rebuilding
  MueLu: Avatar interface mods
  TrilinosCouplings: More cleanup to example
  Fixed rebuild issue for MueLu+AMGX run
  epetra: Fix for unguarded macro defined in SLU
  epetra: Fix for unguarded SuperLU macro
  MueLu: fixing an issue in the structured region driver data
  Fix std::complex<> interface to GEEV/GGEVX in Teuchos::LAPACK
  Xpetra: ETI conversion for Xpetra_TpetraExport.hpp
  Updates to handle when a non RK stepper observer is passed in. (trilinos#5324)
  Xpetra: add forced map writing to unit test
  MueLu: updating some tests/unit_tests, see issue trilinos#5310
  Xpetra: add option to force writing of all maps
  Fixes to make stk build when BoostLib is not enabled.
  Add Intel 18.0.5 compiler to ATDM environment.
  Tpetra: Fix trilinos#4627
  Xpetra: don't write maps unnecessarily
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 9, 2019
…s:develop' (8380623).

* trilinos-develop: (93 commits)
  Upgrade Intel default environment to 18.0.5 and add sems-rhel7 drivers.
  Automatic snapshot commit from tribits at 6858917
  Automatic snapshot commit from tribits at c8b8ef0
  Xpetra: remove GO check in binary reader
  bug fixes
  Panzer: fix multiblock test for order check
  Tpetra: Fix trilinos#5336
  Fixed review erros for AMGX rebuilding
  MueLu: Avatar interface mods
  TrilinosCouplings: More cleanup to example
  Fixed rebuild issue for MueLu+AMGX run
  epetra: Fix for unguarded macro defined in SLU
  epetra: Fix for unguarded SuperLU macro
  MueLu: fixing an issue in the structured region driver data
  Fix std::complex<> interface to GEEV/GGEVX in Teuchos::LAPACK
  Xpetra: ETI conversion for Xpetra_TpetraExport.hpp
  Updates to handle when a non RK stepper observer is passed in. (trilinos#5324)
  Xpetra: add forced map writing to unit test
  MueLu: updating some tests/unit_tests, see issue trilinos#5310
  Xpetra: add option to force writing of all maps
  ...
@bartlettroscoe
Member

FYI: Still lots of random failures of these tests in the build Trilinos-atdm-waterman_cuda-9.2_shared_opt as shown here. The test MueLu_UnitTestsTpetra_MPI_1 failed once in the last 10 days and the test MueLu_UnitTestsTpetra_MPI_4 failed 8 times in the last 10 days.

But as shown in this query it looks like the build Trilinos-atdm-waterman_cuda-9.2_shared_opt is the only build where these tests failed in the last 10 days.

@bartlettroscoe
Member

FYI: These tests failed in unit tests with Compat_KokkosCudaWrapperNode in the unit test name 4 times from 9/1/2019 through 10/10/2019, as shown in this query:

| Test Name | Status | Time | Details | Build Time | Processors |
|---|---|---|---|---|---|
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |

However, there are 16 failures of these tests in this build over that time period, as shown in this query, with errors such as:

mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 6 (Aborted).

and

mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 9 (Killed).

Lest one think these are just random failures that impact more than just MueLu tests in this build, this query shows that only MueLu tests are showing this error (20 in all). There are no Panzer, Tempus or other downstream packages that show these errors in this build. The set of MueLu tests showing this are:

| Test Name | Status | Time | Details | Build Time | Processors |
|---|---|---|---|---|---|
| MueLu_BlockCrs-Tpetra_MPI_4 | Failed | 5s 600ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 4 |
| MueLu_DriverTpetra_WithGlobalConstants_MPI_4 | Failed | 6s 210ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
| MueLu_ImportPerformance_Tpetra_MPI_4 | Failed | 4s 320ms | Completed (Failed) | 2019-10-04T03:05:06 MDT | 4 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 18s 800ms | Completed (Failed) | 2019-10-06T03:09:36 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 11s 280ms | Completed (Failed) | 2019-09-27T03:07:24 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 29s 720ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 32s 10ms | Completed (Failed) | 2019-09-20T03:06:56 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 44s 560ms | Completed (Failed) | 2019-09-24T03:06:13 MDT | 4 |
| MueLu_SimpleTpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-09-13T03:06:22 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 510ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 100ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 4s 520ms | Completed (Failed) | 2019-09-14T03:06:46 MDT | 4 |
| MueLu_Structured_Laplace2D_Shift_Tpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-10-05T03:04:22 MDT | 4 |
| MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 110ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
| MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 740ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 4 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |
| MueLu_VarDofDriver_MPI_2 | Failed | 15s 880ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 2 |

@bartlettroscoe bartlettroscoe changed the title MueLu: MueLu_UnitTestsTpetra_MPI* tests failing on ATDM waterman build MueLu_UnitTestsTpetra_MPI* tests failing in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting 2019-06-02 Dec 11, 2019
@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Dec 12, 2019