Corrupted status files on chrysalis (NING bug) #242

golaz · 2022-05-13T00:19:09Z

This is likely a new issue that appeared after the monthly maintenance on Chrysalis this past Monday.

For certain zppy tasks (mpas_analysis, tc_analysis), the status files upon successful completion of a task are corrupt. The content of the status file should simply be

OK

but instead, it looks something like

OK
NING 176138

The bash line of code that updates the file looks like

echo 'OK' > /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status

I don't understand how the status file can become corrupt, and I have not been able to reproduce it in a simple test (although the full task repeatedly and consistently produces the erroneous status).

A simple workaround that appears effective is to first delete the file:

rm -f /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status
echo 'OK' > /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status
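For illustration, the delete-then-write workaround could be wrapped in a small helper so every status update goes through the same path. This is only a sketch: `write_status` and the temporary demo directory are hypothetical names, not part of zppy, and the sketch does not explain why the plain redirect misbehaves.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: remove the status file before rewriting it.
# write_status is a hypothetical helper, not an existing zppy function.
write_status() {
    local status_file="$1"
    local message="$2"
    # Delete any previous content so the redirect below creates a fresh file.
    rm -f "${status_file}"
    echo "${message}" > "${status_file}"
}

# Demo in a throwaway directory (stand-in for the real scripts directory).
demo_dir="$(mktemp -d)"
write_status "${demo_dir}/demo.status" "RUNNING 176138"
write_status "${demo_dir}/demo.status" "OK"
cat "${demo_dir}/demo.status"
```

After both calls the demo file should contain only the single line `OK`.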


golaz · 2022-05-13T00:20:29Z

Also tagging @xylar out of pure curiosity. Have you ever seen anything like that?

forsyth2 · 2022-05-13T00:22:08Z

This is a duplicate of #241. As this issue provides more info, I'm keeping this open and closing #241.

golaz · 2022-05-13T00:31:02Z

Also tagging @rljacob. Do you know what was done during the Chrysalis maintenance last Monday? Any upgrades to bash or the file system? In this particular case, the bug is mostly harmless. But if it is a symptom of a larger issue, it could be quite serious (file integrity).

xylar · 2022-05-13T00:46:21Z

I would say this is almost certainly a race condition. This call:

zppy/zppy/templates/mpas_analysis.bash, line 23 in 335b33a:

echo "RUNNING ${id}" > {{ scriptDir }}/{{ prefix }}.status

most likely happens at the same time as this call:

zppy/zppy/templates/mpas_analysis.bash, line 388 in 335b33a:

echo 'OK' > {{ scriptDir }}/{{ prefix }}.status

But I can't immediately see why that would happen.
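One detail that may help narrow this down: the corrupt content is byte-for-byte what results when `OK` plus a newline is written over the start of `RUNNING <id>` without the file being truncated first. The sketch below reproduces the symptom deliberately by opening the file read-write (`1<>`), which does not truncate. That redirect is an assumption used only to imitate a lost truncation; the template uses a plain `>`, and this does not show what actually happens on Chrysalis.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the "NING" symptom by overwriting without truncation.
# status_file is a temporary stand-in for {{ scriptDir }}/{{ prefix }}.status.
status_file="$(mktemp)"

# First update, as in the template (176138 is the job ID from the report).
echo "RUNNING 176138" > "${status_file}"

# Second update, opened read-write with 1<>, which does NOT truncate.
# Assumption: this only imitates a '>' redirect whose truncation was lost.
echo 'OK' 1<> "${status_file}"

# The first three bytes ("RUN") are replaced by "OK\n"; the rest survives.
cat "${status_file}"
```

This prints `OK` followed by `NING 176138`, matching the reported file, which suggests the final write landed at offset 0 while the old content stayed in place.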

xylar · 2022-05-13T00:47:45Z

I don't have much experience with bash redirects like this. I use python almost exclusively for logging.

golaz · 2022-05-13T02:07:14Z

Thanks, Xylar. In most cases, the first and last updates to status files should be many minutes apart.

rljacob · 2022-05-13T03:51:05Z

Adding @amametjanov who might know more about what was done during maintenance.

amametjanov · 2022-05-13T17:06:06Z

Based on NING $JobID, this is likely an issue with scripts rather than maintenance.

rljacob · 2022-05-13T18:57:00Z

There was an update to the GPFS software. You should send details about file corruption to [email protected].

rljacob · 2022-05-13T18:59:11Z

I agree a readable string at the end of a file doesn't look like corruption. File corruption usually results in unprintable characters when you try to more/cat/edit the file.

forsyth2 · 2022-05-13T22:50:49Z

@golaz and I noticed the issue in different branches (#227 and #237) and hadn't seen it before. That leads me to believe it's a machine issue. That said, I can try running the last release to see if the error still occurs. If the last release still works fine, I can run git bisect to see if a certain zppy commit introduces this error.

forsyth2 · 2022-05-18T20:51:36Z

@golaz @rljacob I re-ran zppy as of the latest official release and the output below shows the bug has affected even this release. I don't see how this is a zppy issue -- something must have changed on Chrysalis.

$ grep -v "OK" *status
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_0001-0020.status:ERROR (1)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_0001-0050.status:WAITING 179072
mpas_analysis_ts_0001-0050_climo_0021-0050.status:NING 179073
mpas_analysis_ts_0001-0100_climo_0051-0100.status:NING 179074
tc_analysis_0001-0020.status:NING 179069
tc_analysis_0001-0050.status:RUNNING 179070

$ cat mpas_analysis_ts_0001-0050_climo_0021-0050.status
OK
NING 179073
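Note that `grep -v "OK"` also lists legitimately unfinished tasks (`ERROR`, `WAITING`, `RUNNING`). A rough way to isolate only the corrupt files is to flag any status file with more than one line, assuming a valid status is always a single line. The helper name `list_corrupt_status` and the demo files are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list status files that contain more than one line.
# Assumption: a healthy status file holds exactly one line (e.g. "OK").
list_corrupt_status() {
    local dir="$1"
    local f lines
    for f in "${dir}"/*.status; do
        [ -e "${f}" ] || continue        # no matches: the glob stays literal
        lines=$(wc -l < "${f}")
        if (( lines > 1 )); then
            echo "${f##*/}"              # print the file name without the path
        fi
    done
}

# Demo with fabricated files in a throwaway directory.
demo_dir="$(mktemp -d)"
printf 'OK\n' > "${demo_dir}/good.status"
printf 'OK\nNING 179073\n' > "${demo_dir}/bad.status"
printf 'RUNNING 179070\n' > "${demo_dir}/running.status"
list_corrupt_status "${demo_dir}"
```

In the demo only `bad.status` is reported; the single-line `OK` and `RUNNING` files are left alone.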

forsyth2 · 2022-05-18T20:56:20Z

> this is likely an issue with scripts rather than maintenance.

@amametjanov Did Chrysalis upgrade the bash version (or something along those lines) such that output redirection would be affected? The bug is affecting a previous release that we know for sure worked earlier.

forsyth2 · 2022-05-18T22:32:23Z

I'm also not seeing this bug when running on Compy.

amametjanov · 2022-05-18T23:42:00Z

Please see Slack channel chrysalis-users about OS upgrade around April 28: from CentOS 8 to RHEL 8.5 https://acmeclimate.slack.com/archives/C01ER9J9TEJ/p1651178895618699

chrlogin1 is back up from that kernel security vulnerability patching, please directly `ssh chrlogin1.lcrc.anl.gov` and try things out, chrlogin2 will be taken offline for patching tomorrow.
Before:
$ hostnamectl
   Static hostname: chrlogin2.lcrc.anl.gov
         Icon name: computer-server
           Chassis: server
        Machine ID: 1223eff44ddb4f608acd23a5878f24be
           Boot ID: 58dacba3b29a45e892939571228a4d58
  Operating System: CentOS Linux 8
       CPE OS Name: cpe:/o:centos:centos:8
            Kernel: Linux 4.18.0-240.10.1.el8_3.x86_64
      Architecture: x86-64
After:
$ hostnamectl
   Static hostname: chrlogin1.lcrc.anl.gov
         Icon name: computer-server
           Chassis: server
        Machine ID: b62a99e3760b44279d0961058b9a12b6
           Boot ID: d96630b422b44a6bbdf15a6b68898bb3
  Operating System: Red Hat Enterprise Linux 8.5 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
            Kernel: Linux 4.18.0-348.23.1.el8_5.x86_64
      Architecture: x86-64
Had to upgrade CentOS 8 to 8.5, Mellanox OFED drivers 5.2 to 5.5 and GPFS.

rljacob · 2022-05-19T04:51:26Z

The compute nodes still have the old image. Try running zppy in an interactive session.

Bash is a little different:
old:
GNU bash, version 4.4.19(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.

New:
GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.

golaz · 2022-05-19T16:28:09Z

@rljacob : this is strange. The bug manifests itself when we run on the compute nodes.

@forsyth2 : I doubt that we are going to find an explanation for this. So it might be best to simply implement the workaround.

golaz added the semver: bug Bug fix (will increment patch version) label May 13, 2022

golaz assigned forsyth2 May 13, 2022

forsyth2 mentioned this issue May 13, 2022

"OK" overwriting "RUNNING" status #241

Closed

golaz changed the title ~~Corrupted status files on chrysalis~~ Corrupted status files on chrysalis (NING bug) May 13, 2022

forsyth2 mentioned this issue May 20, 2022

Workaround for status file bug #246

Merged

forsyth2 closed this as completed in #246 May 20, 2022
