cpld_control_p8_faster & cpld_debug_p8 fail on Gaea #1790
Comments
In the cpld_debug_p8 case located at
@DeniseWorthen so allocate(data8(1:mdata)) is a line of code in sfcsub.F?
The job was killed by the cgroup out-of-memory handler, so I think it is an out-of-memory issue on the system.
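For anyone checking a similar failure, slurm records OOM kills in its accounting database; a quick check might look like the following (the job ID is a placeholder):

```bash
# Inspect a failed job (12345 is a placeholder job ID). A step killed by the
# OOM handler typically shows State=OUT_OF_MEMORY, and MaxRSS gives the
# per-task memory high-water mark to compare against the requested ReqMem.
sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem
```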
@jkbk2004 It doesn't really make sense that the debug test would fail and the control case would not. These tests differ only in the test length (3 hrs vs 24 hrs) and the compile options.
@DeniseWorthen cpld_control_p8_faster fails, or are you referring to control_p8 or cpld_control_p8?
The cpld_debug_p8 test also fails, correct? The title says the debug test also fails; it is that test I was referring to.
Correct, both cpld_control_p8_faster & cpld_debug_p8 fail with the same error. I'll make some changes to the job file to see if I can find a workaround for the issue.
@zach1221 can we try using the --mem-per-cpu= option in sbatch?
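For reference, a minimal sketch of what that directive could look like in the job card (the 4G value is purely illustrative, not a tested setting for these cases):

```bash
# Hypothetical job-card fragment: give slurm an explicit per-CPU memory
# budget so the cgroup limit is known up front. The 4G value is a
# placeholder, not a tested setting.
#SBATCH --mem-per-cpu=4G
```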
@jkbk2004 sure, I'll do that.
It is probably a Gaea issue. I got the same error a few hours ago for my other tests, but now the issue has disappeared.
I'm still receiving the error on Gaea when trying to run the cpld_control_p8, cpld_debug_p8, and cpld_control_p8_faster cases. However, they will pass if I use the stack options ulimit -s unlimited and ulimit -l unlimited while allocating more nodes than needed to run the job.
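For context, a sketch of that workaround as it might appear in the job script (assuming a bash job script; the node count and executable name are illustrative, not the exact rt.sh output):

```bash
#!/bin/bash
#SBATCH --nodes=6        # hypothetical: more nodes than the job strictly needs
#SBATCH --exclusive

# Remove the stack-size and locked-memory limits before launching the model;
# combined with the extra nodes, this avoided the cgroup OOM kill here.
ulimit -s unlimited
ulimit -l unlimited

srun ./ufs_model    # executable name is illustrative
```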
Following up here. In my testing, increasing the node allocation (reducing tasks per node) is the only consistently effective workaround.
In my test with the develop branch, these two cases run OK with TPN=18 on Gaea. We can apply the change in #1754 and monitor how the cases behave.
@zach1221 I am a little confused; I see the cpld_debug_p8 test passed on Gaea in yesterday's PR log file, but in this ticket it failed on Gaea. Am I missing anything?
Hi @junwang-noaa. My understanding is that Jong used the workaround mentioned above to allow the test to pass.
We didn't apply any trick to develop. The full regression test ran OK on Gaea, but in some cases the kernel and slurm trigger the out-of-memory interruption, as @zach1221 saw with only these cases. I tested a few times with the develop branch with TPN=18; it looks stable with the increased resources. I applied the change in #1754.
So yesterday we suggested that you back up and bisect the commits to determine when the error arose. Did you do that? |
Hi @DeniseWorthen. That's still a work in progress, and I will have more details as soon as possible. I'm leaving this issue open in my name so I can continue to investigate.
@jkbk2004 So if you don't apply the change, the RT will still pass? I'd suggest not applying the change until Zach reaches some conclusion on the causes.
@junwang-noaa it's a resource issue. The Gaea kernel restricts memory use like that. Also, the OS was updated a few times during March and April.
I agree we can trace back some PRs, but it's still a trial-and-error approach.
@jieshunzhu thanks for the information! I think the Gaea kernel imposes controls like that depending on overall system workload, and slurm follows its feedback. Giving the job enough resources might be the only practical approach on the application side. BTW, the p8 tag is running OK on Gaea, right @jieshunzhu?
@jkbk2004 yes, the p8 tag is running OK after some minor modifications. Thanks for the help.
@jkbk2004 What confuses me is that if this is an sbatch or kernel issue, we should see more failure cases. In particular, cpld_control_nowave_noaero_p8 has a problem while there is no report on cpld_control_p8, even though cpld_control_p8 uses more memory than cpld_control_nowave_noaero_p8. If the outcome depends on overall system status, then test failures should be random, not limited to just these two or three tests.
I agree it must be random. |
@junwang-noaa Yesterday I got the same problem with cpld_control_p8, too. |
Agreed. Why does cpld_control_p8 fail and not cpld_control_ciceC_p8? Those are identical in every respect except for a namelist parameter switching CICE to the C-grid. Are we going to just keep turning off tests as they randomly fail?
One more piece of information: the same problem can occur in the middle of my runs, i.e., crashing after integrating some model days. It looks like it is related to the Gaea system.
@jieshunzhu So the failure is pretty random in your testing. Can you do some testing with fewer tasks per node (30 or 24)?
Hi @junwang-noaa. I can definitely test that out. I'll keep you posted.
In my testing, it looks like the cpld cases in question pass with both TPN=30 and TPN=24. I assume it would be best to use 30, as that occupies fewer nodes.
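As an illustration of what the TPN change amounts to at the slurm level (the rank, node, and task counts below are assumptions for a hypothetical 240-rank job, not the actual rt.sh settings):

```bash
# Hypothetical job card for a 240-rank run. Dropping tasks-per-node from
# 36 (which needed 7 nodes) to 30 spreads the same ranks over 8 nodes,
# leaving each rank more memory headroom.
#SBATCH --ntasks=240
#SBATCH --ntasks-per-node=30
#SBATCH --nodes=8
```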
I tried TPN=30 together with #SBATCH --exclusive and ran cpld_control_nowave_noaero_p8 for ~8 months. Previously, using TPN=36, I could also run it for ~8 months (but it sometimes failed). So with only one test I am not sure TPN=30 will fix the problem permanently, but I will use it going forward and add my updates here.
Just want to share a late update on my side: after using more nodes (e.g., going from TPN=36 to 30) for my job, I did not have the memory problem anymore (I completed hundreds of 9-month runs with no problems).
@jieshunzhu - thank you for the update. |
Description
cpld_control_p8_faster & cpld_debug_p8 fail on Gaea with apparent memory issue.
To Reproduce:
This happens with the Intel compilers when attempting to run the regression tests on Gaea. Steps to reproduce are below.
Additional context
These two tests are being turned off on Gaea for the time being; we will investigate/troubleshoot through this ticket.
These tests have been disabled for Gaea in UFS-WM PR#1754.
Output