Incorrect job status returned from squeue in rt.sh #2379
I am running a regression test on Gaea and I noticed that some test jobs fail due to a wall clock timeout error, but the status of those jobs is incorrectly interpreted by the rt scripts. This is part of the log file:

In this specific case the job status is 'COMPLETININVALID', but the script gets the status 'Dusan.Jovic'. Obviously wrong. This happens because the default format of the squeue command is not the same across different systems running the Slurm scheduler:

On Hera:

On Gaea:

On Hera the job state is in column 5, while on Gaea it is in column 6. Instead of just running squeue with the default format, we must explicitly set the desired format in order to get the same output on all platforms. For example, I suggest setting an explicit format so that the first column will always be JOBID and the second column will always be STATE.
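A minimal sketch of such an explicit format (the `%i`/`%T` format codes and the `-h`/`-o`/`-j` flags are standard squeue options; the exact invocation adopted in the fix is not reproduced here, and `jobid` is an illustrative variable):

```bash
# Pin squeue's output format so parsing is identical on every Slurm
# system: %i = job ID, %T = full job state name (e.g. RUNNING, TIMEOUT).
# -h suppresses the header line.
squeue -h -o '%i %T' -j "${jobid}"

# The state is then always in column 2, regardless of platform defaults:
job_state=$(squeue -h -o '%i %T' -j "${jobid}" | awk '{print $2}')
```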
Comments
As I'm testing the updates that fix the above issue, I'm finding a few other problems with the current scripts. Consider this test configuration:
This happens because the exit code of a job is never checked. Once the job finishes we must check its error code. Instead the script (run_test.sh) proceeds by comparing the output files, finds that they are missing, labels the test as failed, and exits successfully. So the workflow manager, ecflow in this case, never reruns the failed job; instead it thinks the job finished successfully and submits the dependent task (control_restart_p8), which it should not, since the previous job failed.
This used to work before. In previous versions of rt_utils.sh we used the checks at ufs-weather-model/tests/rt_utils.sh line 214 (commit 10974d9) and line 238 (commit 10974d9).
These checks have been removed for some reason. I think we should add them back. |
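For illustration, the kind of post-completion check being discussed might look like this (a hedged sketch only; the actual removed commands are at the permalinked lines above, and the variable names here are made up):

```bash
# Ask Slurm for the job's final state once it has left the queue.
# sacct -n suppresses the header, -j selects the job, --format=State
# prints the state column. "${slurm_job_id}" is an illustrative name.
job_state=$(sacct -n -j "${slurm_job_id}" --format=State | head -n 1 | tr -d ' ')
if [[ ${job_state} != "COMPLETED" ]]; then
  echo "job ${slurm_job_id} finished with state ${job_state}" >&2
  exit 1
fi
```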
Good catch; it seems the Slurm/PBS default output format changed and we need to be more specific. For the second issue, am I correct in my understanding that the run test script, which is what ecflow executes, returns exit 0 even when a test fails? I'm not trying to say we shouldn't check the job's exit status in the batch system (this may end up being the easiest approach), but wouldn't it be best to try to catch the non-zero exit status in the run_test script and ensure it gets propagated such that ecflow will properly handle the job? |
Currently, the run test script (run_test.sh) returns exit code 0 regardless of whether the job failed or not. By job I mean the script (job_card) that is executed by PBS or Slurm.
Yes, and 'catch the non-zero exit status in the run_test script' is exactly what these two commands do. |
See the changes here: develop...DusanJovic-NOAA:ufs-weather-model:rt_squeue_state |
OK. Let's bring them back in, then. I think we'd have to wrap these commands in a set +e because, if I recall correctly, they would cause an error even on success. |
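For instance, roughly like this (a sketch; the grep call and variable names are placeholders, not the actual commands from rt_utils.sh):

```bash
# Relax errexit around a command whose non-zero exit we want to
# inspect rather than die on (placeholder status check shown):
set +e
grep -q "COMPLETED" "${job_status_file}"   # hypothetical; returns 1 on no match
rc=$?
set -e
if [[ ${rc} -ne 0 ]]; then
  echo "job did not complete successfully" >&2
  exit 1
fi
```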
I also noticed that the ecflow_client commands that set the task labels were also removed: ufs-weather-model/tests/rt_utils.sh lines 180 to 183 (commit 10974d9).
The task labels are very useful while using the ecflow UI to monitor test execution during development/debugging. I find this very useful. I suggest we add those ecflow_client commands back. |
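For reference, the child command in question looks roughly like this (a sketch; the label name and value are illustrative, not necessarily what rt_utils.sh used):

```bash
# Update a task label shown in the ecflow UI; this relies on the usual
# ECF_HOST/ECF_PORT/ECF_NAME environment that an ecflow job sets up.
ecflow_client --label=job_status "running test control_p8, compare step"
```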
You may be the only person I know who uses the ecflow gui. |
I found one more minor issue. If, for whatever reason, any command in the check_results function in rt_utils.sh fails, the error will not be detected. This is the log I see in one of my tests:
Copying test outputs to the baseline holding area failed, but the test script continued and the test was labeled as PASS. Even though we have '-o errexit' set, the function continues execution, and the calling script (run_test.sh) reports successful test execution. This happens due to a recent change in how the check_results function is invoked from the calling script. In order to catch errors from a command (mkdir and cp in this case) run by the function, the function must not be called inside an if test.

```diff
diff --git a/tests/run_test.sh b/tests/run_test.sh
index ace4fd0c..9adf9f1d 100755
--- a/tests/run_test.sh
+++ b/tests/run_test.sh
@@ -398,7 +398,8 @@ fi
 skip_check_results=${skip_check_results:-false}
 results_okay=YES
 if [[ ${skip_check_results} = false ]]; then
-  if ( ! check_results ) ; then
+  check_results
+  if [[ $? -ne 0 ]]; then
     results_okay=NO
   fi
 else
```
|
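The underlying bash behavior, for anyone curious: errexit is suspended for any command that is part of a condition, including every command inside a function called from one. A minimal standalone demonstration (my own sketch, not code from the repo):

```bash
#!/usr/bin/env bash
set -o errexit

f() {
  false                        # a failing command, like the cp above
  echo "f: reached the line after the failure"
}

if ( ! f ); then               # condition context: errexit is suspended,
  echo "f reported failure"    # so 'false' does not abort; f runs to the
fi                             # end, returns 0, and this branch is skipped

f                              # plain call: 'false' now aborts the script
echo "never reached"
```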
That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like that format? |
I am not sure about the linter. Do we have it available on some machine, or is it only in the GitHub workflow? |
Thanks. When I check this test script, which is similar to what's happening in run_test.sh:

```bash
#!/usr/bin/bash
set -eux
set -o pipefail

trap '[ "$?" -eq 0 ] || write_fail_test' EXIT

write_fail_test() {
  echo "failed in run_test"
  exit 1
}

check_results() {
  local test_status='PASS'
  cp missing_file misssing_directory/
  if [[ ${test_status} == 'FAIL' ]]; then
    return 1
  else
    return 0
  fi
}

results_okay=YES
check_results
if [[ $? -ne 0 ]]; then
  results_okay=NO
fi
echo "results_okay: ${results_okay}"
```

I get:

```
$ shellcheck myscript

Line 27:
if [[ $? -ne 0 ]]; then
   ^-- SC2181 (style): Check exit code directly with e.g. 'if ! mycmd;', not indirectly with $?.

$
```

Looks like this message is about 'style', not an error or warning. |
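For completeness, the form SC2181 prefers would be the one below (my sketch; note that under errexit it reintroduces the masking problem, since the tested command again runs in a condition context, which is part of why inlining won out):

```bash
# SC2181-clean: test the command directly rather than via $?.
# Caveat: a command tested by 'if' runs with errexit suspended, so
# failures *inside* check_results would again go undetected.
if ! check_results; then
  results_okay=NO
fi
```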
Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think? |
That's where it should reside then. Agreed. |
Done. See DusanJovic-NOAA@9690557 |
I ran the tests: jobs that are expected to pass passed, and jobs that are expected to fail failed. |
Fixed in #2389 |