Incorrect job status returned from squeue in rt.sh #2379
I am running a regression test on Gaea and I noticed that some test jobs fail due to a wall clock timeout error, but the status of those jobs is incorrectly interpreted by the rt scripts. This is part of the log file:

In this specific case the job status is 'COMPLETININVALID', but the script gets the status 'Dusan.Jovic'. Obviously wrong. This happens because the default format of the squeue command is not the same across different systems running the Slurm scheduler:

On Hera:

On Gaea:

On Hera the job state is in column 5, while on Gaea it is in column 6. Instead of just running squeue with the default format, we must explicitly set the desired format in order to get the same output on all platforms. For example, I suggest setting an explicit format so that the first column will always be JOBID and the second column will always be STATE.
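A minimal sketch of such an explicit format (the `%i`/`%T` format codes and the `-h`/`-o`/`-j` flags are standard squeue options; the exact invocation adopted in the fix is not reproduced here, and `jobid` is an illustrative variable):

```bash
# Pin squeue's output format so parsing is identical on every Slurm
# system: %i = job ID, %T = full job state name (e.g. RUNNING, TIMEOUT).
# -h suppresses the header line.
squeue -h -o '%i %T' -j "${jobid}"

# The state is then always in column 2, regardless of platform defaults:
job_state=$(squeue -h -o '%i %T' -j "${jobid}" | awk '{print $2}')
```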
Comments
As I'm testing the updates that fix the above issue, I'm finding a few other problems with the current scripts. Consider this test configuration:
This happens because the exit code of a job is never checked. Once the job finishes we must check its error code. Instead the script (run_test.sh) proceeds by comparing the output files, finds that they are missing, labels the test as failed, and exits successfully. So the workflow manager, ecflow in this case, never reruns the failed job; instead it thinks the job finished successfully and submits the dependent task (control_restart_p8), which it should not, since the previous job failed.
This used to work before. In previous versions of rt_utils.sh we used the checks at ufs-weather-model/tests/rt_utils.sh line 214 (commit 10974d9) and line 238 (commit 10974d9).
These checks have been removed for some reason. I think we should add them back. |
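For illustration, the kind of post-completion check being discussed might look like this (a hedged sketch only; the actual removed commands are at the permalinked lines above, and the variable names here are made up):

```bash
# Ask Slurm for the job's final state once it has left the queue.
# sacct -n suppresses the header, -j selects the job, --format=State
# prints the state column. "${slurm_job_id}" is an illustrative name.
job_state=$(sacct -n -j "${slurm_job_id}" --format=State | head -n 1 | tr -d ' ')
if [[ ${job_state} != "COMPLETED" ]]; then
  echo "job ${slurm_job_id} finished with state ${job_state}" >&2
  exit 1
fi
```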
Good catch; it seems the Slurm/PBS default output format changed and we need to be more specific. For the second issue, am I correct in my understanding that the run test script, which is what ecflow executes, returns exit 0 even when a test fails? I'm not trying to say we shouldn't check the job's exit status in the batch system (this may end up being the easiest approach), but wouldn't it be best to try to catch the non-zero exit status in the run_test script and ensure it gets propagated such that ecflow will properly handle the job? |
Currently, the run test script (run_test.sh) returns exit code 0 regardless of whether the job failed or not. By job I mean the script (job_card) that is executed by PBS or Slurm.
Yes, and 'catch the non-zero exit status in the run_test script' is exactly what these two commands do. |
See the changes here: develop...DusanJovic-NOAA:ufs-weather-model:rt_squeue_state |
OK. Let's bring them back in, then. I think we'd have to wrap these commands in a set +e because, if I recall correctly, they would cause an error even on success. |
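For instance, roughly like this (a sketch; the grep call and variable names are placeholders, not the actual commands from rt_utils.sh):

```bash
# Relax errexit around a command whose non-zero exit we want to
# inspect rather than die on (placeholder status check shown):
set +e
grep -q "COMPLETED" "${job_status_file}"   # hypothetical; returns 1 on no match
rc=$?
set -e
if [[ ${rc} -ne 0 ]]; then
  echo "job did not complete successfully" >&2
  exit 1
fi
```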
I also noticed that the ecflow_client commands that set the task labels were also removed: ufs-weather-model/tests/rt_utils.sh lines 180 to 183 (commit 10974d9).
The task labels are very useful while using the ecflow UI to monitor test execution during development/debugging. I find this very useful. I suggest we add those ecflow_client commands back. |
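For reference, the child command in question looks roughly like this (a sketch; the label name and value are illustrative, not necessarily what rt_utils.sh used):

```bash
# Update a task label shown in the ecflow UI; this relies on the usual
# ECF_HOST/ECF_PORT/ECF_NAME environment that an ecflow job sets up.
ecflow_client --label=job_status "running test control_p8, compare step"
```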
You may be the only person I know who uses the ecflow gui. |
I found one more minor issue. If, for whatever reason, any command in the check_results function in rt_utils.sh fails, the error will not be detected. This is the log I see in one of my tests:
Copying test outputs to the baseline holding area failed, but the test script continued and the test was labeled as PASS. Even though we have '-o errexit' set, the function continues execution, and the calling script (run_test.sh) reports successful test execution. This happens due to a recent change in how the check_results function is invoked from the calling script. In order to catch errors from a command (mkdir and cp in this case) run by the function, the function must not be called inside an if test.

```diff
diff --git a/tests/run_test.sh b/tests/run_test.sh
index ace4fd0c..9adf9f1d 100755
--- a/tests/run_test.sh
+++ b/tests/run_test.sh
@@ -398,7 +398,8 @@ fi
 skip_check_results=${skip_check_results:-false}
 results_okay=YES
 if [[ ${skip_check_results} = false ]]; then
-  if ( ! check_results ) ; then
+  check_results
+  if [[ $? -ne 0 ]]; then
     results_okay=NO
   fi
 else
```
|
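The underlying bash behavior, for anyone curious: errexit is suspended for any command that is part of a condition, including every command inside a function called from one. A minimal standalone demonstration (my own sketch, not code from the repo):

```bash
#!/usr/bin/env bash
set -o errexit

f() {
  false                        # a failing command, like the cp above
  echo "f: reached the line after the failure"
}

if ( ! f ); then               # condition context: errexit is suspended,
  echo "f reported failure"    # so 'false' does not abort; f runs to the
fi                             # end, returns 0, and this branch is skipped

f                              # plain call: 'false' now aborts the script
echo "never reached"
```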
That last catch is a good one. I've seen many of these that I've tried to address in previous updates. Does the linter like that format? |
I am not sure about the linter. Do we have it available on some machine, or is it only in the GitHub workflow? |
Thanks. When I check this test script, which is similar to what's happening in run_test.sh:

```bash
#!/usr/bin/bash
set -eux
set -o pipefail

trap '[ "$?" -eq 0 ] || write_fail_test' EXIT

write_fail_test() {
  echo "failed in run_test"
  exit 1
}

check_results() {
  local test_status='PASS'
  cp missing_file misssing_directory/
  if [[ ${test_status} == 'FAIL' ]]; then
    return 1
  else
    return 0
  fi
}

results_okay=YES
check_results
if [[ $? -ne 0 ]]; then
  results_okay=NO
fi
echo "results_okay: ${results_okay}"
```

I get:

```
$ shellcheck myscript

Line 27:
if [[ $? -ne 0 ]]; then
   ^-- SC2181 (style): Check exit code directly with e.g. 'if ! mycmd;', not indirectly with $?.

$
```

Looks like this message is about 'style', not an error or warning. |
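For completeness, the form SC2181 prefers would be the one below (my sketch; note that under errexit it reintroduces the masking problem, since the tested command again runs in a condition context, which is part of why inlining won out):

```bash
# SC2181-clean: test the command directly rather than via $?.
# Caveat: a command tested by 'if' runs with errexit suspended, so
# failures *inside* check_results would again go undetected.
if ! check_results; then
  results_okay=NO
fi
```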
Since check_results is called from only one place in run_test.sh, would it make sense to simply inline the code of check_results in run_test.sh? That would simplify error handling. What do you think? |
That's where it should reside then. Agreed. |
Done. See DusanJovic-NOAA@9690557 |
I ran the tests: jobs that are expected to pass passed, and jobs that are expected to fail failed. |
Fixed in #2389 |