Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify 'make baseline' to continue running when a baseline fails. #3044

Closed
wants to merge 6 commits into from

Conversation

pbosler
Copy link
Contributor

@pbosler pbosler commented Jul 6, 2019

Fixed an issue where a failed test would cause 'make baseline' to crash without running other tests.

Fixes #3049

[BFB]

@ambrad
Copy link
Member

ambrad commented Jul 6, 2019

I like this change. It can be useful on testbed machines to make it through the baseline process even if a test fails.

What do you think of cherry-picking this related commit into the PR? It disables the SL test when SL isn't built. This fixes the problem of running this test on a GPU machine. SL doesn't yet run end-to-end on GPU: ambrad@ef95295

@ambrad ambrad requested a review from oksanaguba July 6, 2019 02:24
@ambrad
Copy link
Member

ambrad commented Jul 6, 2019

I suppose a counterargument is that someone might think the fact that make baseline passes means the build is good on a new machine. Currently that is true; with this PR, it no longer would be.

Edit: I withdraw this comment. The output will show the error messages, so it will be clear there was a failure.

@pbosler
Copy link
Contributor Author

pbosler commented Jul 6, 2019

I think it's good that each baseline test gets a chance to run -- it will cover more code, at a minimum. Plus, if a test is known to fail, like the SL test on GPU, that shouldn't stop other dev from moving forward. I'll add "Test failed" to the output to make sure it's clear.

I agree with cherry-picking the above SL commit.

@oksanaguba
Copy link
Contributor

oksanaguba commented Jul 6, 2019

What if we are making baselines on some machine thru ./create_test ... ? Does this new change mean case status will not show that some baseline was not generated? If there is nothing alarming in the case status, there is no reason to read log files closely. I guess i can test this by messing with some namelist to cause it crash...

…' into pbosler/homme/make-baseline-crash-fix
@pbosler
Copy link
Contributor Author

pbosler commented Jul 6, 2019

Sample output:

test baroCamMoist-run was run successfully
Running test baroCamMoistSL-run ... /ascldap/users/pabosle/hommexx/cuda/tests/baseline/baroCamMoistSL/baroCamMoistSL-run.sh > baroCamMoistSL-run.out 2> baroCamMoistSL-run.err
Test failed.
baroCamMoistSL-run.err file not found. Check baroCamMoistSL-run directory for details.```

@ambrad
Copy link
Member

ambrad commented Jul 6, 2019

Pete, re: Oksana's comment. You could move the exit -7 line to the end of the routine. Accumulate a fail count. If the count is 0 at the end, do nothing; if it is >0, then exit -7. That way, the baselines all get a chance to get created, but the exit status is still bad if at least one fails.

@ambrad ambrad changed the title Fixed an issue where a failed test would cause 'make baseline' to cra… Modify 'make baseline' to continue running when a baseline fails. Jul 6, 2019
@pbosler
Copy link
Contributor Author

pbosler commented Jul 6, 2019

With the new change, the error return is back:

1 of 79 tests failed.
make[3]: *** [test_execs/CMakeFiles/baseline] Error 249
make[2]: *** [test_execs/CMakeFiles/baseline.dir/all] Error 2
make[1]: *** [test_execs/CMakeFiles/baseline.dir/rule] Error 2
make: *** [baseline] Error 2```

Copy link
Member

@ambrad ambrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oksana and Mark should have the final say, but I think this looks like a great change, and the error output seems good to me.

@pbosler
Copy link
Contributor Author

pbosler commented Jul 9, 2019

Add a fix for #3049

Can now use mpicxx as CMAKE_CXX_COMPILER with cuda and internal Kokkos build; just set env var OMPI_CXX=<kokkos_path>/bin/nvcc_wrapper, as recommended & supported by Kokkos (see https://github.com/kokkos/kokkos/wiki/Compiling , section 4.4).

@jgfouca
Copy link
Member

jgfouca commented Jul 11, 2019

@oksanaguba , are you OK with this PR?

@oksanaguba
Copy link
Contributor

@jgfouca not yet. I want to run this from cime and confirm it behaves as it should. this is a low priority feature, not a bug (I will change labels).

@rljacob
Copy link
Member

rljacob commented Jul 12, 2019

It is fixing a bug so adding back the bug-fix label.

@oksanaguba
Copy link
Contributor

@rljacob this PR fixes infrastructure issues; one is about nvcc-wrapper, maybe, it can be considered a bug; I only meant it does not fix bugs in homme as in dycore.

@pbosler
Copy link
Contributor Author

pbosler commented Jul 13, 2019

@rljacob @oksanaguba: I originally added "bug fix" label because I thought it was a bug that one baseline failure would prevent all of the others from running. Depending on the order of the tests relative to the failure, this could be a large number. In my case, for example, a test we expected to fail, did fail; it was number 8 of 91 and its failure prevented the other 83 baselines from being run -- to me that seemed like a bug since we depend on those baselines to verify subsequent PRs.

I've no problem changing labels if "bug fix" is reserved for the Homme source code or if another label is a better fit.

@rljacob
Copy link
Member

rljacob commented Aug 8, 2019

@mt5555 please update your review if your changes have been made.

@oksanaguba
Copy link
Contributor

I still need to test it,so, it is not ready.

@rljacob
Copy link
Member

rljacob commented Aug 8, 2019

This PR is a month old already. Make time to test it.

@oksanaguba
Copy link
Contributor

oksanaguba commented Aug 15, 2019

I am trying this now: I merged it and broke 1st run for baroCamMoist by meddling with nl. The 1st run for this set did not run, the rest ran (because they are omp runs with different namelists). Total output is

[onguba@skybridge-login7 acmebld]$ ./cs.status.20190815_162407_bsqt7c 
20190815_162407_bsqt7c: 1 test
  HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel (Overall: PASS) details:
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel CREATE_NEWCASE
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel XML
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SETUP
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHAREDLIB_BUILD time=0
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel NLCOMP
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel MODEL_BUILD time=274
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SUBMIT
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel RUN time=456
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel GENERATE
    PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHORT_TERM_ARCHIVER

I moved to break a simpler test, baro2b. It did not run, movies folder is empty, there is an error, but the output at the end contains

Running test baro2b-run ... /gpfs1/onguba/acme_scratch/sandiatoss3/HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel.G.20190815_163733_6dnjlo/bld/tests/baseline/baro2b/baro2b-run.sh > baro2b-run.out 2> baro2b-run.err
test baro2b-run was run successfully

So, for now the code completes baselines (the next test, baro2c, has movies), but it also shows PASS

PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel GENERATE
PASS HOMME_P24.f19_g16_rx1.A.sandiatoss3_intel SHORT_TERM_ARCHIVER

I think it should show FAIL.

@oksanaguba
Copy link
Contributor

@pbosler ^^

@oksanaguba
Copy link
Contributor

Since in this version case status returns pass even when some baselines failed (which can lead to broken baselines), could I suggest that the change about SL test is moved to a new PR and this PR is not being merged for now?

@rljacob
Copy link
Member

rljacob commented Aug 20, 2019

@pbosler, @oksanaguba what is the status of this PR?

@pbosler
Copy link
Contributor Author

pbosler commented Aug 20, 2019

@oksanaguba The reason the test passes is because pass/fail is determined by the executable's return value to the shell, e.g., the RUN_STAT variable in testing_utils.sh:

cmd="homme_executable > $THIS_STDOUT 2> &THIS_STDERR"
$cmd
RUN_STAT=$?

Hence, in order to show failure, the test must crash, throw an exception, or otherwise not return 0 to the shell. Is it possible that the break you introduced succeeded in writing an error message but still returned 0?

@oksanaguba
Copy link
Contributor

Thanks, Pete @pbosler . @rljacob I will get back to this PR in 1-2 weeks.

@rljacob
Copy link
Member

rljacob commented Oct 10, 2019

@oksanaguba what is the status of this PR?

@oksanaguba
Copy link
Contributor

I did not have time to work on this.

@rljacob
Copy link
Member

rljacob commented Oct 11, 2019

But what's the status going forward?

@rljacob
Copy link
Member

rljacob commented Oct 14, 2019

Closing until this can be worked on.

@rljacob rljacob closed this Oct 14, 2019
@pbosler pbosler deleted the pbosler/homme/make-baseline-crash-fix branch November 11, 2021 17:23
jgfouca pushed a commit that referenced this pull request Nov 7, 2024
…x_testing_docs

Automatically Merged using E3SM Pull Request AutoTester
PR Title: Update EAMxx docs and improve help formatters
PR Author: jgfouca
PR LABELS: AT: AUTOMERGE, AT: Integrate Without Testing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve HOMME's cmake logic to check compiler in a CUDA build
6 participants