Problems with early versions of PIO2 in CLM #124
Comments
Erik Kluzek < erik > - 2015-12-10 12:09:49 -0700
Removing the workaround (avoid_pnetcdf) in bug 1730 removes some issues, but not all.
Erik Kluzek < erik > - 2015-12-10 12:11:27 -0700
Another workaround is to use PIO1 as follows...
cd cime/externals
The ERP_D_Ld5.f19_g16.ICRUCLM50BGC.yellowstone_intel.clm-fire_emis test was shown to work with pio1.9.23.
Bill Sacks < sacks > - 2015-12-10 12:17:13 -0700 (In reply to Erik Kluzek from comment #1)
Does this point to a broader problem in PIO2? i.e., why does PIO2 not like it when you use netcdf for some files? Is this a problem with the netcdf interface in general, or just when you have some files that use pnetcdf and some that use netcdf? e.g., if you set the pio type to netcdf for everything, would things work fine in these cases?
Jim Edwards < jedwards > - 2015-12-10 12:44:02 -0700
The problem is that pio2 has two rearranger methods instead of just one. The default rearranger is subset (the new one), which improves pnetcdf performance but hurts serial netcdf performance, so if you want to use netcdf you should use the box rearranger. My sandbox now appears to work without being forced to use serial netcdf for the clm history file; this will be in cime4.3.2.
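The rule of thumb in the comment above can be sketched as follows. This is only an illustration in Python, not CIME or PIO code: the function name `suggest_rearranger` is hypothetical, and the numeric codes (1 for box, 2 for subset) are an assumption about how the rearranger setting is encoded.

```python
# Illustrative sketch only; not part of CIME, PIO, or CLM.
# Assumed numeric codes for the rearranger setting: 1 = box, 2 = subset.
REARR_BOX = 1
REARR_SUBSET = 2


def suggest_rearranger(io_type: str) -> int:
    """Return a rearranger code for an I/O type, following the rule of thumb above."""
    io_type = io_type.lower()
    if io_type == "pnetcdf":
        # Subset is described as improving pnetcdf performance.
        return REARR_SUBSET
    if io_type == "netcdf":
        # Box is recommended when writing through serial netcdf.
        return REARR_BOX
    raise ValueError(f"unhandled I/O type: {io_type!r}")


for io_type in ("pnetcdf", "netcdf"):
    print(io_type, suggest_rearranger(io_type))
```

Later comments in this thread show that the picture is more complicated in practice, so this is a starting point rather than a firm rule.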
Erik Kluzek < erik > - 2015-12-10 14:48:59 -0700
A bunch of tests fail on hobart as well, and it looks like it's this problem (a timeout that happens after the simulation has finished, while it's writing a bunch of output).
RUN ERI_D_Ld9_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.C.151208-160543
I verified the timeout in SMS.f10_f10.IRCP45CN.hobart_pgi.clm-reduceOutput.C.151208-160543; I haven't looked at the others. The mpi-serial tests with hobart_nag were successful. And it looks like bug 2213 was fixed on hobart_nag in r158.
Erik Kluzek < erik > - 2016-01-05 13:59:19 -0700
Looks like these issues get cleared up with cime4.3.9 (at least on yellowstone).
Erik Kluzek < erik > - 2016-01-07 17:27:35 -0700
OK, on hobart with clm4_5_7_r164 and cime4.3.9 I still have a list of failures due to the run taking too long (over 2 hours). These are short simulations and should all finish in much less time than that; other cases do run to completion much more quickly.
ERI_D_Ld9_P24x1.f10_f10.ICRUCLM50BGC.hobart_nag.clm-reduceOutput.GC.160106-153439
Jim Edwards < jedwards > - 2016-01-07 20:16:43 -0700
It appears that someone (Bill Sacks according to svn blame) commented out the initdecomp at line 2394 of ncdio_pio.F90.in and replaced it with the older PIO_REARR_BOX version. The variable LEVGRND_CLASS is causing initdecomp to
Jim Edwards < jedwards > - 2016-01-07 20:29:49 -0700 (In reply to Jim Edwards from comment #8)
I wonder if this also explains the degraded performance that Bill reported recently?
Erik Kluzek < erik > - 2016-01-07 22:54:09 -0700 (In reply to Jim Edwards from comment #8)
Jim, this was the commit of the pio2 branch that Bill brought to the CLM trunk in December of 2014.
r65959 | sacks | 2014-12-03 06:24:30 -0700 (Wed, 03 Dec 2014) | 1 line
merge changes from pio2_dev2 branch: update pio calls to pio2 API
I'll try a few with the PIO_REARR_SUBSET option and see if that goes.
Erik Kluzek < erik > - 2016-01-08 00:08:24 -0700
OK, giving more time to the job does NOT work, but using the SUBSET rearranger does! I'll see if the other cases that failed now work.
Bill Sacks < sacks > - 2016-01-08 05:52:02 -0700
Yes, the initdecomp change was actually Jim's change. I just brought it to the trunk for him. Jim made this change in revision 64202.
Jim Edwards < jedwards > - 2016-01-08 07:46:48 -0700
I extracted the decomp for variable LEVGRND_CLASS and ran it in the PIO standalone test suite. Not only does it work fine for both PIO_REARR_BOX and PIO_REARR_SUBSET, but there is also no notable difference in performance. I will continue to investigate.
Erik Kluzek < erik > - 2016-01-08 10:05:07 -0700
OK, changing PIO REARR to SUBSET allows all the cases that failed on hobart to run successfully.
Erik Kluzek < erik > - 2016-01-08 14:41:33 -0700
OK, I redid several tests on both hobart and yellowstone, and all but one ran OK. However, performance looks abysmal for SUBSET, so I'm not sure we want to use it, for that reason alone. It looks to me like the performance of PIO2 for CLM is poor compared to PIO1, and subset is even worse. Also, the following KitchenSink test fails on yellowstone...
SMS_Lm1.f09_g16_gl5.IG1850CRUCLM50BGC.yellowstone_intel.clm-clm50KitchenSink
with the following error...
601:Open file /glade/p/cesm/lmwg/atm_forcing.datm7.cruncep_qianFill.0.5d.V4.c130305/TPHWL6Hrly/clmforc.cruncep.V4.c2011.0.5d.TPQWL.1901-01.nc 0
Erik Kluzek < erik > - 2016-01-12 12:06:57 -0700
OK, clm4_5_7_r164 updates to cime4.3.9 and also uses the setting of LND_PIO_REARRANGER rather than hardcoding the rearranger in CLM source (SUBSET for clm40 and BOX for clm45/clm50). The default for LND_PIO_REARRANGER is the same as before. Our testing on hobart runs with some tests set to SUBSET for clm45/clm50 tests that failed. So testing works.
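One way to picture the change described above: the rearranger used to be fixed per CLM version in the source, and is now taken from LND_PIO_REARRANGER, whose defaults match the old hardcoded values. A minimal Python sketch follows, with hypothetical names (`OLD_HARDCODED`, `resolve_rearranger`) and hypothetical version keys; it is not the actual CLM or CIME implementation.

```python
from typing import Optional

# Hypothetical illustration; names and keys are not taken from CLM or CIME.
# Values previously hardcoded in the CLM source, per the comment above.
OLD_HARDCODED = {"clm40": "SUBSET", "clm45": "BOX", "clm50": "BOX"}


def resolve_rearranger(clm_version: str,
                       lnd_pio_rearranger: Optional[str] = None) -> str:
    """Use the LND_PIO_REARRANGER setting if given, else the old per-version default."""
    if lnd_pio_rearranger is not None:
        return lnd_pio_rearranger.upper()
    return OLD_HARDCODED[clm_version]


# Default behaviour is unchanged; an explicit setting overrides it.
print(resolve_rearranger("clm50"))
print(resolve_rearranger("clm50", "subset"))
```

In this picture, the hobart tests that previously failed correspond to the second call: a clm45/clm50 case explicitly set to SUBSET.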
Bill Sacks < sacks > - 2016-01-28 11:17:44 -0700
In my branch off of r164, these tests take > 10 hours to complete. I am putting them in the xfail list since I don't typically allow that much time for tests in the test suite:
ERP_D_P4x30_Ld5.ne30_g16.ICN.yellowstone_intel.clm-40default
Bill Sacks < sacks > - 2016-03-17 14:07:28 -0600
For the workaround in comment 2 (using pio1) to work for me (on hobart-nag), I needed to set PIO_REARRANGER to 1; it didn't work to set LND_PIO_REARRANGER to 1.
Bill Sacks < sacks > - 2016-03-17 14:11:33 -0600
This test now fails consistently with the pio2 version in CLM, in my branch slated to become r173:
ERP_Ly5.1x1_numaIA.ICRUCLM50BGCCROP.hobart_nag.clm-monthly
It looks like it's dying when writing the .rh1 file. It passes with pio1, using the workarounds documented in comment 2 and comment 18. A debug version of that test passes with pio2, and both production and debug versions pass with pio2 with all 3 yellowstone compilers. I'm not sure why this started failing all of a sudden.
In addition, this test fails about half the time now; again, I can't tell why the changes on my branch would trigger these sporadic failures:
ERP_Ld5_P24x1.f10_f10.I1850CLM45BGC.hobart_nag.clm-default
When it fails, it seems to be in writing the .h1 file. Oddly, one traceback pointed to a death in the pnetcdf library, despite the fact that there was a message from CLM saying that it was using the workaround for bug 1730: using netcdf rather than pnetcdf.
Erik Kluzek < erik > - 2016-06-17 16:52:26 -0600
In clm4_5_8_r181 you can now choose to use PIO1 or PIO2, and PIO1 is the default.
I think this is likely not a problem anymore, as both CLM and PIO2 have progressed. We should still run the latest test list with PIO2 on hobart and cheyenne to confirm that we don't have problems, though. CESM does want to be moving to PIO2.
Let's wait to test this until we have the go-ahead from Jim, with a suggestion that pio2 should work well now for all CESM use cases and that we want to move to it.
A fix for PIO2 in DEBUG mode was included in ctsm1.0.dev070 (see #810), though note that I haven't actually run tests with PIO2. I'm not sure whether there are other outstanding problems that will still need to be resolved.
These issues seem to be resolved now (see #1095).
Erik Kluzek < erik > - 2015-12-10 12:07:05 -0700
Bugzilla Id: 2256
Bugzilla Depends: 1730,
Bugzilla CC: andre, jedwards, mvertens, sacks,
Most CLM tests work fine when CIME is updated to a version that uses PIO2, but several have problems. One problem is a hang when creating files.
Here is a list of tests that fail with PIO2 in clm4_5_6_r159:
ERP_D_P4x30_Ld5.ne30_g16.ICN.yellowstone_intel.clm-40default
ERP_D_Ld5.f19_g16.ICRUCLM50BGC.yellowstone_intel.clm-fire_emis
ERP_D_Ld5.hcru_hcru.ICRUCN.yellowstone_pgi.clm-40default
SMS_D_Ld5_Mmpi-serial.5x5_amazon.ICLM45ED.yellowstone_pgi.clm-edTest
ERS_P192x1_Ld211.f19_g16.ICNDVCROP.yellowstone_intel.clm-crop
ERI_Ld9.ne30_g16.I4804.yellowstone_pgi.clm-40default
This is with cime4.3.1, which uses PIO2.0.27.