cpld_control_p8_faster & cpld_debug_p8 fail on Gaea #1790
Comments
In the cpld_debug_p8 case located at
@DeniseWorthen so allocate(data8(1:mdata)) is a line of code in sfcsub.F?
The job was killed by the cgroup out-of-memory handler, so I think it is an out-of-memory issue on the system.
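For anyone checking a similar failure, slurm records OOM kills in its accounting database; a quick check might look like the following (the job ID is a placeholder):

```bash
# Inspect a failed job (12345 is a placeholder job ID). A step killed by the
# OOM handler typically shows State=OUT_OF_MEMORY, and MaxRSS gives the
# per-task memory high-water mark to compare against the requested ReqMem.
sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem
```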
@jkbk2004 It doesn't really make sense that the debug test would fail and the control case would not. These tests differ only in the test length (3 hrs vs 24 hrs) and the compile options.
@DeniseWorthen cpld_control_p8_faster fails, or are you referring to control_p8 or cpld_control_p8?
The cpld_debug_p8 test also fails, correct? The title says the debug test also fails; it is that test I was referring to.
Correct, both cpld_control_p8_faster & cpld_debug_p8 fail with the same error. I'll make some changes to the job file to see if I can find a workaround for the issue.
@zach1221 can we try using the --mem-per-cpu= option in sbatch?
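For reference, a minimal sketch of what that directive could look like in the job card (the 4G value is purely illustrative, not a tested setting for these cases):

```bash
# Hypothetical job-card fragment: give slurm an explicit per-CPU memory
# budget so the cgroup limit is known up front. The 4G value is a
# placeholder, not a tested setting.
#SBATCH --mem-per-cpu=4G
```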
@jkbk2004 sure, I'll do that.
It is probably a Gaea issue. I got the same error a few hours ago for my other tests, but now the issue has disappeared.
I'm still receiving the error on Gaea when trying to run the cpld_control_p8, cpld_debug_p8, and cpld_control_p8_faster cases. However, they will pass if I use the stack options ulimit -s unlimited and ulimit -l unlimited while allocating more nodes than needed to run the job.
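For context, a sketch of that workaround as it might appear in the job script (assuming a bash job script; the node count and executable name are illustrative, not the exact rt.sh output):

```bash
#!/bin/bash
#SBATCH --nodes=6        # hypothetical: more nodes than the job strictly needs
#SBATCH --exclusive

# Remove the stack-size and locked-memory limits before launching the model;
# combined with the extra nodes, this avoided the cgroup OOM kill here.
ulimit -s unlimited
ulimit -l unlimited

srun ./ufs_model    # executable name is illustrative
```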
Following up here. In my testing, increasing the node allocation (reducing tasks per node) is the only consistently effective workaround.
In my test with the develop branch, these two cases run OK with TPN=18 on Gaea. We can apply the change in #1754 and monitor how the cases behave.
@zach1221 I am a little confused; I see the cpld_debug_p8 test passed on Gaea in yesterday's PR log file, but in this ticket it failed on Gaea. Am I missing anything?
Hi @junwang-noaa. My understanding is that Jong used the workaround mentioned above to allow the test to pass.
We didn't apply any trick to develop. The full regression test ran OK on Gaea, but in some cases the kernel and slurm trigger the out-of-memory interruption, as @zach1221 saw with only these cases. I tested a few times with the develop branch with TPN=18; it looks stable with the increased resources. I applied the change in #1754.
So yesterday we suggested that you back up and bisect the commits to determine when the error arose. Did you do that? |
Hi @DeniseWorthen. That's still a work in progress, and I will have more details as soon as possible. I'm leaving this issue open in my name so I can continue to investigate.
@jkbk2004 So if you don't apply the change, the RT will still pass? I'd suggest not applying the change until Zach reaches some conclusion on the causes.
@junwang-noaa it's a resource issue. The Gaea kernel restricts memory use like that. Also, the OS was updated a few times during March and April.
I agree we can trace back some PRs, but it's still a trial-and-error approach.
@jieshunzhu thanks for the information! I think the Gaea kernel imposes controls like that depending on overall system workload, and slurm follows its feedback. Giving the job enough resources might be the only practical approach on the application side. BTW, the p8 tag is running OK on Gaea, right @jieshunzhu?
@jkbk2004 yes, the p8 tag is running OK after some minor modifications. Thanks for the help.
@jkbk2004 What confuses me is that if this is an sbatch or kernel issue, we should see more failure cases. In particular, cpld_control_nowave_noaero_p8 has a problem while there is no report on cpld_control_p8, even though cpld_control_p8 uses more memory than cpld_control_nowave_noaero_p8. If the outcome depends on overall system status, then test failures should be random, not limited to just these two or three tests.
I agree it must be random. |
@junwang-noaa Yesterday I got the same problem with cpld_control_p8, too. |
Agreed. Why does cpld_control_p8 fail and not cpld_control_ciceC_p8? Those are identical in every respect except for a namelist parameter switching CICE to the C-grid. Are we going to just keep turning off tests as they randomly fail?
One more piece of information: the same problem can occur in the middle of my runs, i.e., crashing after integrating some model days. It looks like it is related to the Gaea system.
@jieshunzhu So the failure is pretty random in your testing. Can you do some testing with fewer tasks per node (30 or 24)?
Hi @junwang-noaa. I can definitely test that out. I'll keep you posted.
In my testing, it looks like the cpld cases in question pass with both TPN=30 and TPN=24. I assume it would be best to use 30, as that occupies fewer nodes.
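As an illustration of what the TPN change amounts to at the slurm level (the rank, node, and task counts below are assumptions for a hypothetical 240-rank job, not the actual rt.sh settings):

```bash
# Hypothetical job card for a 240-rank run. Dropping tasks-per-node from
# 36 (which needed 7 nodes) to 30 spreads the same ranks over 8 nodes,
# leaving each rank more memory headroom.
#SBATCH --ntasks=240
#SBATCH --ntasks-per-node=30
#SBATCH --nodes=8
```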
I tried TPN=30 together with #SBATCH --exclusive and ran cpld_control_nowave_noaero_p8 for ~8 months. Previously, using TPN=36, I could also run it for ~8 months (but it sometimes failed). So with only one test I am not sure TPN=30 will fix the problem permanently, but I will use it going forward and add my updates here.
Just want to share a late update on my side: after using more nodes (e.g., going from TPN=36 to 30) for my job, I did not have the memory problem anymore (I completed hundreds of 9-month runs with no problems).
@jieshunzhu - thank you for the update. |
Description
cpld_control_p8_faster & cpld_debug_p8 fail on Gaea with apparent memory issue.
To Reproduce:
This happens with the Intel compilers when attempting to run the regression tests on Gaea. Steps to reproduce are below.
Additional context
These two tests are being turned off on Gaea for the time being; we will investigate/troubleshoot through this ticket.
These tests have been disabled for Gaea in UFS-WM PR#1754.
Output