cpld_control_p8_faster & cpld_debug_p8 fail on Gaea #1790

Closed
zach1221 opened this issue Jun 7, 2023 · 36 comments

@zach1221
Collaborator

zach1221 commented Jun 7, 2023

Description

cpld_control_p8_faster & cpld_debug_p8 fail on Gaea with an apparent memory issue.

To Reproduce:

This happens with the Intel compilers when running the regression tests on Gaea. Steps to reproduce are below.

  1. git clone --recurse-submodules https://github.com/ufs-community/ufs-weather-model
  2. cd ufs-weather-model/tests/
  3. ./rt.sh -a nggps_emc -c -n cpld_control_p8_faster intel or ./rt.sh -a nggps_emc -c -n cpld_debug_p8 intel
  4. cd to the test directory, for example /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_21757/cpld_control_p8_faster_intel
  5. vi err

Additional context

I am turning off these two tests on Gaea for the time being and will investigate and troubleshoot through this ticket.

These tests have been disabled for Gaea in UFS-WM PR#1754.

Output

image

@zach1221 zach1221 added the bug Something isn't working label Jun 7, 2023
@zach1221 zach1221 self-assigned this Jun 7, 2023
@DeniseWorthen
Collaborator

In the cpld_debug_p8 case located at /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_42017/cpld_debug_p8_intel, the err log shows an error in sfcsub at line 8338, which is an allocate statement:

      allocate(data8(1:mdata))

@zach1221
Collaborator Author

zach1221 commented Jun 7, 2023

@DeniseWorthen so allocate(data8(1:mdata)) is a line of code in sfcsub.F ?
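
One quick way to confirm what sits at a reported line number is to print that line from the source file. The sketch below uses `sed`; the helper name `show_line` is made up for illustration, and the location of `sfcsub.F` inside the checkout is left as a placeholder because the thread does not give it.

```shell
# show_line FILE N: print line N of FILE.
# Hypothetical helper for illustration; not part of the UFS tooling.
show_line() {
  sed -n "${2}p" "$1"
}

# Example usage (replace the placeholder with the real path in your checkout):
#   show_line path/to/sfcsub.F 8338
```

The line number only matches if the checkout is the same revision that produced the err log.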

@jkbk2004
Collaborator

jkbk2004 commented Jun 7, 2023

The job was killed by the cgroup out-of-memory handler, so I think this is an out-of-memory issue on the system.
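
A small check like the one below can tell an out-of-memory kill apart from other failures when scanning many err files. It is only a sketch: the helper name is hypothetical, and the search patterns are assumptions based on the wording quoted in this thread, so they may need adjusting to the exact message in your log.

```shell
# has_oom_kill LOGFILE: succeed if the log mentions a cgroup OOM kill.
# The patterns are assumptions derived from the message quoted above.
has_oom_kill() {
  grep -qiE 'oom-kill|out-of-memory handler' "$1"
}

# Example usage, run from the test directory:
#   has_oom_kill err && echo "OOM kill reported in err"
```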

@DeniseWorthen
Collaborator

@jkbk2004 It doesn't really make sense that the debug test would fail and the control case would not. These tests differ only in the test length (3 hrs vs 24 hrs) and the compile options.

@zach1221
Collaborator Author

zach1221 commented Jun 7, 2023

@DeniseWorthen cpld_control_p8_faster fails, or are you referring to control_p8 or cpld_control_p8?

@DeniseWorthen
Collaborator

The cpld_debug_p8 test also fails, correct? The title says the debug test also fails; it is that test I was referring to.

@zach1221
Collaborator Author

zach1221 commented Jun 7, 2023

Correct, both cpld_control_p8_faster & cpld_debug_p8 fail with the same error.
Edit: I just received the same error when running cpld_control_p8 on Gaea as well.

I'll make some changes to the job file to see if I can find a workaround for the issue.

@jkbk2004
Collaborator

jkbk2004 commented Jun 7, 2023

@zach1221 can we try using the --mem-per-cpu= option in sbatch?
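
For reference, a per-CPU memory request in a Slurm batch script looks roughly like the header printed below. `--mem-per-cpu` and `--ntasks-per-node` are standard sbatch options; the helper name and the values (36 and 2G) are placeholders for illustration, not settings taken from the UFS job file.

```shell
# print_sbatch_header TPN MEM: emit illustrative #SBATCH directives.
# Hypothetical helper; the values passed in are placeholders.
print_sbatch_header() {
  printf '#!/bin/bash\n'
  printf '#SBATCH --ntasks-per-node=%s\n' "$1"
  printf '#SBATCH --mem-per-cpu=%s\n' "$2"
}

print_sbatch_header 36 2G
```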

@zach1221
Collaborator Author

zach1221 commented Jun 7, 2023

@jkbk2004 sure, I'll do that.

@jieshunzhu
Collaborator

It is probably a Gaea issue. I got the same error hours ago in my other tests, but now the issue has disappeared.

@zach1221
Collaborator Author

zach1221 commented Jun 8, 2023

I'm still receiving the error on Gaea when trying to run the cpld_control_p8, cpld_debug_p8 and cpld_control_p8_faster cases. However, they pass if I set ulimit -s unlimited and ulimit -l unlimited while also allocating more nodes than the job needs.
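
As a sketch of the limit part of that workaround, the two `ulimit` calls can be placed near the top of the job script, before the model is launched. The wrapper function and the warning messages are illustrative additions; whether the limits can actually be raised depends on the hard limits set on the node.

```shell
# raise_limits: try to lift the stack and locked-memory limits for this shell
# and its children; warn instead of failing if the hard limit forbids it.
raise_limits() {
  ulimit -s unlimited 2>/dev/null || echo "warning: could not raise stack limit" >&2
  ulimit -l unlimited 2>/dev/null || echo "warning: could not raise locked-memory limit" >&2
  return 0
}

raise_limits
echo "stack limit: $(ulimit -s)"
echo "locked-memory limit: $(ulimit -l)"
```

Printing the resulting limits into the job output makes it easier to see later which settings a passing or failing run actually had.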

@zach1221
Collaborator Author

zach1221 commented Jun 9, 2023

Following up here. In my testing, increasing the node allocation (reducing tasks per node) is the only consistently effective workaround.

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

In my test with the develop branch, these two cases run ok with TPN=18 on Gaea. We can apply the change to #1754 and monitor how the cases behave.

@junwang-noaa
Collaborator

@zach1221 I am a little confused. I see that the cpld_debug_p8 test passed on Gaea in yesterday's PR log file:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/logs/RegressionTests_gaea.log#L946

But in this ticket, it failed on Gaea? Am I missing anything?

@zach1221
Collaborator Author

zach1221 commented Jun 9, 2023

Hi, @junwang-noaa . My understanding is Jong used the workaround mentioned above, to allow the test to pass.

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

We didn't apply any trick to develop. The full regression test ran ok on Gaea. But in some cases the kernel and Slurm trigger the out-of-memory interruption, as in @zach1221's tests with only these cases. I tested a few times with the develop branch with TPN=18. It looks stable with the increased resources. I applied the change in #1754.

@DeniseWorthen
Collaborator

So yesterday we suggested that you back up and bisect the commits to determine when the error arose. Did you do that?

@zach1221
Collaborator Author

zach1221 commented Jun 9, 2023

Hi, @DeniseWorthen . That's still a work in progress, and I will have more details as soon as possible. I'm leaving this issue open in my name, so I can continue to investigate.

@junwang-noaa
Collaborator

@jkbk2004 So if you don't apply the change, the RT will still pass? I'd suggest not applying the change until Zach gets some conclusion on the causes.

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

@junwang-noaa it's a resource issue. The Gaea kernel restricts memory use like that. Also, the OS was updated a few times during March and April.

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

I agree we can trace back through some PRs. But it's still a trial-and-error approach.

@jieshunzhu
Collaborator

Just for your information: the memory problem on Gaea does not only occur in the cpld_control_p8_faster & cpld_debug_p8 tests that @zach1221 ran. I also hit the problem occasionally with cpld_control_nowave_noaero_p8 (with the same executable, it could fail, but might succeed when resubmitted).
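
One way to put a number on "fails sometimes, passes on resubmit" is to run the same command several times and count the failures. The helper below is a hypothetical sketch; `true` and `false` stand in for the real job so the example runs anywhere. On a Slurm system the command could be a blocking submission such as `sbatch --wait` with the job file, assuming the job's exit status reflects the model failure.

```shell
# count_failures N CMD [ARGS...]: run CMD N times and print how many runs failed.
# Hypothetical helper; variables are global because POSIX sh has no 'local'.
count_failures() {
  n=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}

# Stand-in commands for demonstration only.
count_failures 5 false
count_failures 5 true
```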

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

@jieshunzhu thanks for the information! I think the Gaea kernel behaves like that depending on overall system workload conditions, and Slurm follows with the feedback. Giving the job enough resources might be the only practical way on the application side. BTW, the p8 tag is running ok on Gaea, right @jieshunzhu ?

@jieshunzhu
Collaborator

@jkbk2004 yes, p8 tag is running ok after some minor modifications. Thanks for the help.

@junwang-noaa
Collaborator

@jkbk2004 What confuses me is that if this is an sbatch or kernel issue, we should see more failure cases. In particular, cpld_control_nowave_noaero_p8 has problems while there is no report for cpld_control_p8, yet cpld_control_p8 uses more memory than cpld_control_nowave_noaero_p8. If the test depends on the overall system status, then test failures should be random, not just these two or three.

@jkbk2004
Collaborator

jkbk2004 commented Jun 9, 2023

I agree it must be random.

@jieshunzhu
Collaborator

@junwang-noaa Yesterday I got the same problem with cpld_control_p8, too.

@DeniseWorthen
Collaborator

@jkbk2004 What confuses me is that if this is an sbatch or kernel issue, we should see more failure cases. In particular, cpld_control_nowave_noaero_p8 has problems while there is no report for cpld_control_p8, yet cpld_control_p8 uses more memory than cpld_control_nowave_noaero_p8. If the test depends on the overall system status, then test failures should be random, not just these two or three.

Agreed. Why does cpld_control_p8 fail and not cpld_control_ciceC_p8? Those are identical in every respect except for a namelist parameter switching CICE to the C-grid. Are we going to keep just turning off tests as they randomly fail?

@jieshunzhu
Collaborator

Another piece of information: the same problem can occur in the middle of my runs, i.e., the model crashes after integrating some model days. It looks like it is related to the Gaea system.

@junwang-noaa
Collaborator

@jieshunzhu So the failure is pretty random in your testing. Can you do some testing with fewer tasks per node (30 or 24)?
@jkbk2004 If this is confirmed to be a system issue, I'd suggest changing the Gaea TPN in defaultvars.sh; the changes in PR#1754 only applied to three tests. Also, changing TPN from 36 -> 18 doubles the number of nodes. Can we test TPN 30 or 24 to see if the issue is resolved?
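
The node cost of each TPN choice follows from a ceiling division, nodes = ceil(tasks / TPN). The sketch below works it through for a made-up total of 360 tasks; that task count is purely hypothetical and not taken from any UFS test configuration.

```shell
# nodes_needed TASKS TPN: ceiling of TASKS / TPN using integer arithmetic.
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

tasks=360  # hypothetical total task count, for illustration only
for tpn in 36 30 24 18; do
  echo "TPN=$tpn -> $(nodes_needed "$tasks" "$tpn") nodes"
done
```

With these numbers, TPN=18 needs twice the nodes of TPN=36, while TPN=30 and TPN=24 fall in between.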

@zach1221
Collaborator Author

zach1221 commented Jun 9, 2023

Hi, @junwang-noaa . I can definitely test that out. I'll keep you posted.

@zach1221
Collaborator Author

zach1221 commented Jun 9, 2023

In my testing, the cpld cases in question pass with both TPN=30 and TPN=24. I assume it would be best to use 30, as that occupies fewer nodes.

@jieshunzhu
Collaborator

I tried TPN=30 together with --#SBATCH --exclusive, and ran cpld_control_nowave_noaero_p8 for ~8 months. Previously, using TPN=36, I could also run it for ~8 months (but it sometimes failed). So with only one test I am not sure TPN=30 will fix the problem permanently, but I will keep using it and add my updates here.

@zach1221 zach1221 moved this to In Progress in Backlog: platforms and RT Jun 13, 2023
@natalie-perlin
Collaborator

@zach1221,
Is this still an issue on Gaea, or could it be closed now?
A full RT suite from 08/24/2023, which used the UFS-WM version before the merge of #1707, passes all the tests, including cpld_control_p8_faster and cpld_debug_p8 on Gaea; see the full log attached.

RegressionTests_Gaea_24Aug2023.log

@jieshunzhu
Collaborator

Just want to share a late update on my side: after using more nodes (e.g. TPN=36 to 30) for my job, I did not have the memory problem anymore (I completed hundreds of 9-month runs with no problems).

@natalie-perlin
Collaborator

@jieshunzhu - thank you for the update.
@jkbk2004 - please close this issue now.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Backlog: platforms and RT Oct 17, 2023