Modify 'make baseline' to continue running when a baseline fails. #3044

pbosler · 2019-07-06T02:15:26Z

Fixed an issue where a failed test would cause 'make baseline' to crash without running other tests.

[BFB]

…sh without running other tests.

ambrad · 2019-07-06T02:24:31Z

I like this change. It can be useful on testbed machines to make it through the baseline process even if a test fails.

What do you think of cherry-picking this related commit into the PR? It disables the SL test when SL isn't built. This fixes the problem of running this test on a GPU machine. SL doesn't yet run end-to-end on GPU: ambrad@ef95295

ambrad · 2019-07-06T02:27:32Z

I suppose a counterargument is that someone might think the fact that make baseline passes means the build is good on a new machine. Currently that is true; with this PR, it no longer would be.

Edit: I withdraw this comment. The output will show the error messages, so it will be clear there was a failure.

pbosler · 2019-07-06T02:56:36Z

I think it's good that each baseline test gets a chance to run -- it will cover more code, at a minimum. Plus, if a test is known to fail, like the SL test on GPU, that shouldn't stop other dev from moving forward. I'll add "Test failed" to the output to make sure it's clear.

I agree with cherry-picking the above SL commit.

oksanaguba · 2019-07-06T03:05:21Z

What if we are making baselines on some machine thru ./create_test ... ? Does this new change mean case status will not show that some baseline was not generated? If there is nothing alarming in the case status, there is no reason to read log files closely. I guess i can test this by messing with some namelist to cause it crash...

…' into pbosler/homme/make-baseline-crash-fix

pbosler · 2019-07-06T03:09:51Z

Sample output:

test baroCamMoist-run was run successfully
Running test baroCamMoistSL-run ... /ascldap/users/pabosle/hommexx/cuda/tests/baseline/baroCamMoistSL/baroCamMoistSL-run.sh > baroCamMoistSL-run.out 2> baroCamMoistSL-run.err
Test failed.
baroCamMoistSL-run.err file not found. Check baroCamMoistSL-run directory for details.```

ambrad · 2019-07-06T03:14:25Z

Pete, re: Oksana's comment. You could move the exit -7 line to the end of the routine. Accumulate a fail count. If the count is 0 at the end, do nothing; if it is >0, then exit -7. That way, the baselines all get a chance to get created, but the exit status is still bad if at least one fails.

pbosler · 2019-07-06T14:16:51Z

With the new change, the error return is back:

1 of 79 tests failed.
make[3]: *** [test_execs/CMakeFiles/baseline] Error 249
make[2]: *** [test_execs/CMakeFiles/baseline.dir/all] Error 2
make[1]: *** [test_execs/CMakeFiles/baseline.dir/rule] Error 2
make: *** [baseline] Error 2```

ambrad

Oksana and Mark should have the final say, but I think this looks like a great change, and the error output seems good to me.

pbosler · 2019-07-09T21:29:01Z

Add a fix for #3049

Can now use mpicxx as CMAKE_CXX_COMPILER with cuda and internal Kokkos build; just set env var OMPI_CXX=<kokkos_path>/bin/nvcc_wrapper, as recommended & supported by Kokkos (see https://github.com/kokkos/kokkos/wiki/Compiling , section 4.4).

jgfouca · 2019-07-11T17:35:18Z

@oksanaguba , are you OK with this PR?

oksanaguba · 2019-07-11T17:39:59Z

@jgfouca not yet. I want to run this from cime and confirm it behaves as it should. this is a low priority feature, not a bug (I will change labels).

components/homme/test/reg_test/run_tests/test-list.cmake

rljacob · 2019-07-12T20:38:39Z

It is fixing a bug so adding back the bug-fix label.

oksanaguba · 2019-07-12T20:46:15Z

@rljacob this PR fixes infrastructure issues; one is about nvcc-wrapper, maybe, it can be considered a bug; I only meant it does not fix bugs in homme as in dycore.

pbosler · 2019-07-13T17:35:13Z

@rljacob @oksanaguba: I originally added "bug fix" label because I thought it was a bug that one baseline failure would prevent all of the others from running. Depending on the order of the tests relative to the failure, this could be a large number. In my case, for example, a test we expected to fail, did fail; it was number 8 of 91 and its failure prevented the other 83 baselines from being run -- to me that seemed like a bug since we depend on those baselines to verify subsequent PRs.

I've no problem changing labels if "bug fix" is reserved for the Homme source code or if another label is a better fit.

rljacob · 2019-08-08T20:55:45Z

@mt5555 please update your review if your changes have been made.

oksanaguba · 2019-08-08T20:57:46Z

I still need to test it,so, it is not ready.

rljacob · 2019-08-08T21:03:25Z

This PR is a month old already. Make time to test it.

oksanaguba · 2019-08-15T22:53:44Z

I am trying this now: I merged it and broke 1st run for baroCamMoist by meddling with nl. The 1st run for this set did not run, the rest ran (because they are omp runs with different namelists). Total output is

[onguba@skybridge-login7 acmebld]$ ./cs.status.20190815_162407_bsqt7c 
20190815_162407_bsqt7c: 1 test
  HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel (Overall: PASS) details:
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel CREATE_NEWCASE
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel XML
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SETUP
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHAREDLIB_BUILD time=0
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel NLCOMP
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel MODEL_BUILD time=274
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SUBMIT
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel RUN time=456
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel GENERATE
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHORT_TERM_ARCHIVER

I moved to break a simpler test, baro2b. It did not run, movies folder is empty, there is an error, but the output at the end contains

Running test baro2b-run ... /gpfs1/onguba/acme_scratch/sandiatoss3/HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel.G.20190815_163733_6dnjlo/bld/tests/baseline/baro2b/baro2b-run.sh > baro2b-run.out 2> baro2b-run.err
test baro2b-run was run successfully

So, for now the code completes baselines (the next test, baro2c, has movies), but it also shows PASS

PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel GENERATE
PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHORT_TERM_ARCHIVER

I think it should show FAIL.

oksanaguba · 2019-08-15T22:55:10Z

@pbosler ^^

oksanaguba · 2019-08-16T20:44:51Z

Since in this version case status returns pass even when some baselines failed (which can lead to broken baselines), could I suggest that the change about SL test is moved to a new PR and this PR is not being merged for now?

rljacob · 2019-08-20T19:01:08Z

@pbosler, @oksanaguba what is the status of this PR?

pbosler · 2019-08-20T20:39:22Z

@oksanaguba The reason the test passes is because pass/fail is determined by the executable's return value to the shell, e.g., the RUN_STAT variable in testing_utils.sh:

cmd="homme_executable > $THIS_STDOUT 2> &THIS_STDERR"
$cmd
RUN_STAT=$?

Hence, in order to show failure, the test must crash, throw an exception, or otherwise not return 0 to the shell. Is it possible that the break you introduced succeeded in writing an error message but still returned 0?

oksanaguba · 2019-08-20T21:35:25Z

Thanks, Pete @pbosler . @rljacob I will get back to this PR in 1-2 weeks.

rljacob · 2019-10-10T17:39:23Z

@oksanaguba what is the status of this PR?

oksanaguba · 2019-10-11T01:08:28Z

I did not have time to work on this.

rljacob · 2019-10-11T02:50:54Z

But what's the status going forward?

rljacob · 2019-10-14T19:51:54Z

Closing until this can be worked on.

…x_testing_docs Automatically Merged using E3SM Pull Request AutoTester PR Title: Update EAMxx docs and improve help formatters PR Author: jgfouca PR LABELS: AT: AUTOMERGE, AT: Integrate Without Testing

ambrad and others added 2 commits July 5, 2019 17:13

COMPOSE: Disable standalone SL test if SL is not built.

ef95295

Fixed an issue where a failed test would cause 'make baseline' to cra…

d430c94

…sh without running other tests.

pbosler added bug fix PR minor HOMME labels Jul 6, 2019

ambrad self-requested a review July 6, 2019 02:22

ambrad requested a review from oksanaguba July 6, 2019 02:24

Added explicit comment for the case of a single test failing.

a2e565b

Merge remote-tracking branch 'ambrad/ambrad/homme/baroCamMoistSL-flag…

a0ebde4

…' into pbosler/homme/make-baseline-crash-fix

ambrad changed the title ~~Fixed an issue where a failed test would cause 'make baseline' to cra…~~ Modify 'make baseline' to continue running when a baseline fails. Jul 6, 2019

Added exit error code and a failure counter for baseline tests.

c0d5aef

ambrad approved these changes Jul 8, 2019

View reviewed changes

rljacob assigned jgfouca Jul 8, 2019

Add support for MPI/CUDA using OMPI_CXX variable

875c243

pbosler mentioned this pull request Jul 9, 2019

Improve HOMME's cmake logic to check compiler in a CUDA build #3049

Closed

oksanaguba added enhancement and removed bug fix PR labels Jul 11, 2019

mt5555 requested changes Jul 12, 2019

View reviewed changes

components/homme/test/reg_test/run_tests/test-list.cmake Show resolved Hide resolved

rljacob added the bug fix PR label Jul 12, 2019

mt5555 approved these changes Aug 9, 2019

View reviewed changes

rljacob closed this Oct 14, 2019

pbosler deleted the pbosler/homme/make-baseline-crash-fix branch November 11, 2021 17:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify 'make baseline' to continue running when a baseline fails. #3044

Modify 'make baseline' to continue running when a baseline fails. #3044

pbosler commented Jul 6, 2019 •

edited by rljacob

Loading

ambrad commented Jul 6, 2019

ambrad commented Jul 6, 2019 •

edited

Loading

pbosler commented Jul 6, 2019

oksanaguba commented Jul 6, 2019 •

edited

Loading

pbosler commented Jul 6, 2019

ambrad commented Jul 6, 2019

pbosler commented Jul 6, 2019

ambrad left a comment

pbosler commented Jul 9, 2019 •

edited

Loading

jgfouca commented Jul 11, 2019

oksanaguba commented Jul 11, 2019

rljacob commented Jul 12, 2019 •

edited

Loading

oksanaguba commented Jul 12, 2019

pbosler commented Jul 13, 2019 •

edited

Loading

rljacob commented Aug 8, 2019

oksanaguba commented Aug 8, 2019

rljacob commented Aug 8, 2019

oksanaguba commented Aug 15, 2019 •

edited

Loading

oksanaguba commented Aug 15, 2019

oksanaguba commented Aug 16, 2019

rljacob commented Aug 20, 2019

pbosler commented Aug 20, 2019

oksanaguba commented Aug 20, 2019

rljacob commented Oct 10, 2019

oksanaguba commented Oct 11, 2019

rljacob commented Oct 11, 2019

rljacob commented Oct 14, 2019

Modify 'make baseline' to continue running when a baseline fails. #3044

Modify 'make baseline' to continue running when a baseline fails. #3044

Conversation

pbosler commented Jul 6, 2019 • edited by rljacob Loading

ambrad commented Jul 6, 2019

ambrad commented Jul 6, 2019 • edited Loading

pbosler commented Jul 6, 2019

oksanaguba commented Jul 6, 2019 • edited Loading

pbosler commented Jul 6, 2019

ambrad commented Jul 6, 2019

pbosler commented Jul 6, 2019

ambrad left a comment

Choose a reason for hiding this comment

pbosler commented Jul 9, 2019 • edited Loading

jgfouca commented Jul 11, 2019

oksanaguba commented Jul 11, 2019

rljacob commented Jul 12, 2019 • edited Loading

oksanaguba commented Jul 12, 2019

pbosler commented Jul 13, 2019 • edited Loading

rljacob commented Aug 8, 2019

oksanaguba commented Aug 8, 2019

rljacob commented Aug 8, 2019

oksanaguba commented Aug 15, 2019 • edited Loading

oksanaguba commented Aug 15, 2019

oksanaguba commented Aug 16, 2019

rljacob commented Aug 20, 2019

pbosler commented Aug 20, 2019

oksanaguba commented Aug 20, 2019

rljacob commented Oct 10, 2019

oksanaguba commented Oct 11, 2019

rljacob commented Oct 11, 2019

rljacob commented Oct 14, 2019

pbosler commented Jul 6, 2019 •

edited by rljacob

Loading

ambrad commented Jul 6, 2019 •

edited

Loading

oksanaguba commented Jul 6, 2019 •

edited

Loading

pbosler commented Jul 9, 2019 •

edited

Loading

rljacob commented Jul 12, 2019 •

edited

Loading

pbosler commented Jul 13, 2019 •

edited

Loading

oksanaguba commented Aug 15, 2019 •

edited

Loading