Corrupted status files on chrysalis (NING bug) #242

Closed
golaz opened this issue May 13, 2022 · 17 comments · Fixed by #246
Labels
semver: bug Bug fix (will increment patch version)

Comments

@golaz
Collaborator

golaz commented May 13, 2022

This is likely a new issue that appeared after the monthly maintenance on Chrysalis this past Monday.

For certain zppy tasks (mpas_analysis, tc_analysis), the status files upon successful completion of a task are corrupt. The content of the status file should simply be

OK

but instead, it looks something like

OK
NING 176138

The bash line of code that updates the file looks like

echo 'OK' > /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status

I don't understand how the status file can become corrupt, and I have not been able to reproduce it in a simple test (but the full task repeatedly and consistently produces the erroneous status).

A simple workaround that appears effective is to first delete the file:

rm -f /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status
echo 'OK' > /lcrc/group/e3sm/ac.golaz/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/mpas_analysis_ts_1850-1860_climo_1855-1860.status
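
A variant of this workaround, offered only as a sketch: write the new status to a temporary file in the same directory and rename it over the old one, so the existing file is never overwritten in place. The function name `write_status` and the scratch paths are hypothetical and are not part of zppy.

```shell
# Hypothetical helper (not zppy code): replace a status file by renaming
# a freshly written temporary file over it.
write_status() {
    status_file="$1"
    message="$2"
    # Create the temporary file next to the target so that mv is a rename
    # within one file system rather than a copy.
    tmp_file="$(mktemp "${status_file}.XXXXXX")" || return 1
    printf '%s\n' "${message}" > "${tmp_file}" || { rm -f "${tmp_file}"; return 1; }
    mv -f "${tmp_file}" "${status_file}"
}

# Example usage in a scratch directory:
demo_dir="$(mktemp -d)"
write_status "${demo_dir}/task.status" "RUNNING 12345"
write_status "${demo_dir}/task.status" "OK"
cat "${demo_dir}/task.status"
```

Note that `mktemp` creates the file with owner-only permissions, so a real script might need a `chmod` if other users read the status files.
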
golaz added the "semver: bug" label (Bug fix, will increment patch version) May 13, 2022
@golaz
Collaborator Author

golaz commented May 13, 2022

Also tagging @xylar out of pure curiosity. Have you ever seen anything like that?

@forsyth2
Collaborator

This is a duplicate of #241. As this issue provides more info, I'm keeping this open and closing #241.

@golaz
Collaborator Author

golaz commented May 13, 2022

Also tagging @rljacob. Do you know what was done during the Chrysalis maintenance last Monday? Any upgrades to bash or the file system? In this particular case, the bug is mostly harmless. But if it is a symptom of a larger issue, it could be quite serious (file integrity).

golaz changed the title from "Corrupted status files on chrysalis" to "Corrupted status files on chrysalis (NING bug)" May 13, 2022
@xylar
Contributor

xylar commented May 13, 2022

I would say this is almost certainly a race condition. This call:

echo "RUNNING ${id}" > {{ scriptDir }}/{{ prefix }}.status

most likely happens at the same time as this call:

echo 'OK' > {{ scriptDir }}/{{ prefix }}.status

But I can't immediately see why that would happen.
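
For illustration, the observed byte pattern matches what results when `OK` is written at the start of a file that still holds `RUNNING <id>` and the old contents are not truncated. The sketch below forces that situation with the `1<>` redirection, which opens a file for reading and writing without truncating it. It only reproduces the pattern and does not establish what happened on Chrysalis; the scratch file is a stand-in for a real status file.

```shell
# Scratch file standing in for a status file.
status_file="$(mktemp)"

# First update: the normal truncating redirect.
echo "RUNNING 176138" > "${status_file}"

# Second update, deliberately WITHOUT truncation: the three bytes "OK\n"
# overwrite "RUN", and the rest of the old line is left behind.
echo 'OK' 1<> "${status_file}"

cat "${status_file}"
# prints:
# OK
# NING 176138
```
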

@xylar
Contributor

xylar commented May 13, 2022

I don't have much experience with bash redirects like this. I use python almost exclusively for logging.

@golaz
Collaborator Author

golaz commented May 13, 2022

Thanks, Xylar. In most cases, the first and last updates to status files should be many minutes apart.

@rljacob
Member

rljacob commented May 13, 2022

Adding @amametjanov who might know more about what was done during maintenance.

@amametjanov
Member

Based on NING $JobID, this is likely an issue with scripts rather than maintenance.

@rljacob
Member

rljacob commented May 13, 2022

There was an update to the GPFS software. You should send details about file corruption to [email protected].

@rljacob
Member

rljacob commented May 13, 2022

I agree a readable string at the end of a file doesn't look like corruption. File corruption usually results in unprintable characters when you try to more/cat/edit the file.

@forsyth2
Collaborator

forsyth2 commented May 13, 2022

@golaz and I noticed the issue in different branches (#227 and #237) and hadn't seen it before. That leads me to believe it's a machine issue. That said, I can try running the last release to see if the error still occurs. If the last release still works fine, I can run git bisect to see if a certain zppy commit introduces this error.

@forsyth2
Collaborator

forsyth2 commented May 18, 2022

@golaz @rljacob I re-ran zppy as of the latest official release and the output below shows the bug has affected even this release. I don't see how this is a zppy issue -- something must have changed on Chrysalis.

$ grep -v "OK" *status
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_0001-0020.status:ERROR (1)
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_0001-0050.status:WAITING 179072
mpas_analysis_ts_0001-0050_climo_0021-0050.status:NING 179073
mpas_analysis_ts_0001-0100_climo_0051-0100.status:NING 179074
tc_analysis_0001-0020.status:NING 179069
tc_analysis_0001-0050.status:RUNNING 179070
$ cat mpas_analysis_ts_0001-0050_climo_0021-0050.status
OK
NING 179073
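
A stricter check than `grep -v "OK"` compares the whole content of each status file against `OK`. The following is a minimal sketch; the function name `find_bad_status` and the demo files are hypothetical.

```shell
# Hypothetical helper: print every *.status file in a directory whose
# entire content is not exactly "OK" (trailing newlines are ignored).
find_bad_status() {
    for f in "$1"/*.status; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal
        if [ "$(cat "$f")" != "OK" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example usage with scratch files:
check_dir="$(mktemp -d)"
printf 'OK\n' > "${check_dir}/good.status"
printf 'OK\nNING 179073\n' > "${check_dir}/bad.status"
printf 'RUNNING 179070\n' > "${check_dir}/running.status"
find_bad_status "${check_dir}"   # lists bad.status and running.status
```
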

@forsyth2
Collaborator

> this is likely an issue with scripts rather than maintenance.

@amametjanov Did Chrysalis upgrade the bash version (or something along those lines) such that output redirection would be affected? The bug is affecting a previous release that we know for sure worked earlier.

@forsyth2
Collaborator

I'm also not seeing this bug when running on Compy.

@amametjanov
Member

Please see Slack channel chrysalis-users about OS upgrade around April 28: from CentOS 8 to RHEL 8.5 https://acmeclimate.slack.com/archives/C01ER9J9TEJ/p1651178895618699

chrlogin1 is back up from that kernel security vulnerability patching, please directly `ssh chrlogin1.lcrc.anl.gov` and try things out, chrlogin2 will be taken offline for patching tomorrow.
Before:
$ hostnamectl
   Static hostname: chrlogin2.lcrc.anl.gov
         Icon name: computer-server
           Chassis: server
        Machine ID: 1223eff44ddb4f608acd23a5878f24be
           Boot ID: 58dacba3b29a45e892939571228a4d58
  Operating System: CentOS Linux 8
       CPE OS Name: cpe:/o:centos:centos:8
            Kernel: Linux 4.18.0-240.10.1.el8_3.x86_64
      Architecture: x86-64
After:
$ hostnamectl
   Static hostname: chrlogin1.lcrc.anl.gov
         Icon name: computer-server
           Chassis: server
        Machine ID: b62a99e3760b44279d0961058b9a12b6
           Boot ID: d96630b422b44a6bbdf15a6b68898bb3
  Operating System: Red Hat Enterprise Linux 8.5 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
            Kernel: Linux 4.18.0-348.23.1.el8_5.x86_64
      Architecture: x86-64
Had to upgrade CentOS 8 to 8.5, Mellanox OFED drivers 5.2 to 5.5 and GPFS.

@rljacob
Member

rljacob commented May 19, 2022

The compute nodes still have the old image. Try running zppy in an interactive session.

Bash is a little different:
Old:
GNU bash, version 4.4.19(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.

New:
GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.

@golaz
Collaborator Author

golaz commented May 19, 2022

@rljacob : this is strange. The bug manifests itself when we run on the compute nodes.

@forsyth2 : I doubt that we are going to find an explanation for this. So it might be best to simply implement the workaround.
