Sparse grid simulation with PPE.n12_ctsm5.1.dev030 fails when trying to close h5 file at end of simulation. #1449
Comments
@olyson is going to try this with PIO1 just for kicks (change PIO_VERSION), and also try it in the latest version on CTSM main-dev. We should also check which NetCDF format is being used (check PIO_NETCDF_FORMAT). It may need to be upgraded to a newer format if these 10-year 3-hourly files are reaching the NetCDF data limit (the default is likely 64bit_offset, but you could try 64bit_data).
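Both of these are CIME XML variables, so the suggestions above could be tried from the case directory with xmlchange. This is a sketch, not a verified fix; the values are just the ones suggested in the comment:

```shell
# From the case directory: try PIO1 instead of PIO2
./xmlchange PIO_VERSION=1
# Or keep PIO2 but switch to the newer NetCDF large-data format
./xmlchange PIO_NETCDF_FORMAT=64bit_data
# PIO_VERSION changes require a clean rebuild
./case.build --clean-all && ./case.build
```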
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/mct_mod.F90(766): remark #5140: Unrecognized directive
OK, none of our suggestions work. Jim Edwards is going on vacation, but when he gets back we should point this out to him. Since this works for the full grid but fails for the sparse grid (which would be a smaller data size), it shouldn't be the NetCDF format. And since it worked before but fails now, that might be a hint about what's going on. Jim is back the week of the 16th...
Closing this since we have a workaround for this.
@olyson can you let us know what workaround you figured out?
It's noted above: "Splitting the h5 output into yearly files DOES work. So we (the PPE) could operate in that configuration."
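For reference, the arithmetic behind the yearly-file workaround: at hist_nhtfrq(6) = -3 (3-hourly output) there are 8 records per day, so one noleap year fits in 2920 records. This is a hypothetical hist_mfilt value; the exact setting used for the workaround isn't recorded in this issue:

```shell
# 3-hourly history => 8 records/day; one noleap model year = 365 days
records_per_year=$(( 365 * 8 ))
echo "hist_mfilt(6) = ${records_per_year}"   # prints: hist_mfilt(6) = 2920
```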
Brief summary of bug
A sparse grid simulation that uses the PPE.n12_ctsm5.1.dev030 tag fails when trying to close the h5 file at the end of the simulation.
General bug information
CTSM version you are using: branch_tags/PPE.n12_ctsm5.1.dev030
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: --compset 2000_DATM%GSWP3v1_CLM51%SP_SICE_SOCN_SROF_SGLC_SWAV_SIAC_SESP --res f19_g17
Details of bug
The simulation dies at the end with this error:
73:MPT: #2 MPI_SGI_stacktraceback (
73:MPT: header=header@entry=0x7ffdd3c24b40 "MPT ERROR: Rank 73(g:73) received signal SIGSEGV(11).\n\tProcess ID: 35712, Host: r11i5n9, Program: /glade/scratch/oleson/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000/bld/cesm.exe\n\tMPT Version:"...) at sig.c:340
73:MPT: #3 0x00002b4a1a3e7a62 in first_arriver_handler (signo=signo@entry=11,
73:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b4a24a00080) at sig.c:489
73:MPT: #4 0x00002b4a1a3e7dfb in slave_sig_handler (signo=11,
73:MPT: siginfo=, extra=) at sig.c:565
73:MPT: #5
73:MPT: #6 0x0000000001853a03 in intel_avx_rep_memcpy ()
73:MPT: #7 0x000000000118c60b in PIOc_write_darray_multi ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_darray.c:378
73:MPT: #8 0x0000000001193081 in flush_buffer ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_darray_int.c:1846
73:MPT: #9 0x000000000114dc10 in PIOc_closefile ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_file.c:420
73:MPT: #10 0x00000000010b8da3 in piolib_mod_mp_closefile ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/flib/piolib_mod.F90:1447
73:MPT: #11 0x00000000005d6626 in histfilemod_mp_hist_htapes_wrapup ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/main/histFileMod.F90:3869
73:MPT: #12 0x0000000000543d99 in clm_driver_mp_clm_drv_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/main/clm_driver.F90:1398
73:MPT: #13 0x00000000004fc853 in lnd_comp_mct_mp_lnd_run_mct_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/cpl/mct/lnd_comp_mct.F90:455
73:MPT: #14 0x000000000042b214 in component_mod_mp_component_run_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/component_mod.F90:737
73:MPT: #15 0x000000000040a1dd in cime_comp_mod_mp_cime_run_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/cime_comp_mod.F90:2855
73:MPT: #16 0x000000000042ae55 in MAIN__ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/cime_driver.F90:153
73:MPT: #17 0x00000000004079e2 in main ()
Line 3869 of histFileMod.F90 is the file-close call in hist_htapes_wrapup (frame #11 in the traceback above); according to the lnd log it's trying to close the h5 file.
About 45 minutes elapse between the end of the simulation and the error.
This is a 3-hourly file, and I've requested that 10 years of data be written to a single file.
The h5 configuration is:
hist_mfilt(6) = 29201
hist_dov2xy(6) = .false.
hist_nhtfrq(6) = -3
hist_type1d_pertape(6) = 'GRID'
hist_fincl6 += 'TSA','RH2M','FSH','EFLX_LH_TOT','TSKIN:I','FPSN'
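For context, hist_mfilt(6) = 29201 corresponds to 10 noleap years of 3-hourly records plus one extra sample (presumably the initial time step; that interpretation is my assumption, not stated in the issue):

```shell
# hist_nhtfrq = -3 => 3-hourly => 8 records/day; 10 noleap years + 1
records=$(( 10 * 365 * 8 + 1 ))
echo "${records}"   # prints 29201, matching hist_mfilt(6)
```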
If I completely remove the h5 file from my user_nl_clm and rerun, the simulation ends normally.
Reconfiguring the h5 file with dov2xy = .true. (i.e., lat/lon) DOESN'T work.
Splitting the h5 output into yearly files DOES work. So we (the PPE) could operate in that configuration.
I'm filing this because this worked fine in a previous tag (PPE.n08_ctsm5.1.dev023).
One significant difference I see is that the new tag uses PIO2 instead of PIO1.
A full grid version of this simulation (including the h5 file as configured above) runs successfully.
Important details of your setup / configuration so we can reproduce the bug
Case directory:
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/scripts/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000
Run directory:
/glade/scratch/oleson/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000/run