The high-resolution regional simulation with NUOPC is interrupted at the beginning #1937
Note: this was originally raised on the FATES GitHub issue board: NGEET/fates#975
@niuhanlin this is all very odd behavior to me. I'm wondering if the real core issue is running into memory limitations on your machine. I gather that this is a specific problem on your particular region and on your particular machine, so I don't think it's a general problem with the model. What size is your grid in terms of total number of gridcells? And how many processors are you using? I would recommend using more processors, even going toward one processor per gridcell, which is the maximum you could scale out to. If the behavior calms down with more processors, this is likely just a problem of not using enough of them.
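As a minimal sketch of that suggestion, assuming the standard CIME xmlchange workflow and that the ESMF mesh file records the number of gridcells in its elementCount dimension, you could check the grid size and raise the task count before rebuilding:

# Count the gridcells in the regional mesh (elementCount is the element dimension of an ESMF mesh file)
ncdump -h lnd_mesh.nc | grep elementCount

# Request more MPI tasks, e.g. 64; one task per gridcell is the useful upper bound
./xmlchange NTASKS=64
./case.setup --reset
./case.build

If the regional run then survives, that points to per-task memory limits rather than a model bug.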
Since this problem appears to be specific to my own machine, I will close this issue.
Hi! I am currently using CTSM-FATES to run simulations over the Tibetan Plateau, with NUOPC as recommended.
Unfortunately, the run failed.
The error output does not point to any clear problem, which is where I get confused.
The compiler I use is Intel. Do I need to switch to the GNU compiler?
It should be noted that both single-point and regional runs do work with MCT:
a single-point run is fine, while a regional simulation is killed after three simulated years, which I suspect is caused by excessive memory usage.
With NUOPC, a single-point simulation is also perfectly fine,
but the regional simulation aborts straight away with an MPI error and does not report the real problem.
Here are some of the settings I used to create the case.
./create_newcase --compset 2000_DATM%QIA_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV --res CLM_USRDAT --case TP_5days_test_nuopc_2 --run-unsupported --machine niuhanlin
./xmlchange DATM_YR_START=1979
./xmlchange DATM_YR_END=1979
./xmlchange RUN_STARTDATE=1979-01-01
./xmlchange CLM_FORCE_COLDSTART=on
./xmlchange CLM_ACCELERATED_SPINUP=on
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=5
./xmlchange LND_DOMAIN_MESH=lnd_mesh.nc
./xmlchange ATM_DOMAIN_MESH=lnd_mesh.nc
./xmlchange MASK_MESH=mask_mesh.nc
./case.setup
Add the surface dataset (fsurdat) to user_nl_clm (see the sketch after these steps).
./case.build
sbatch cesm.sh (this file contains the batch submission settings; a sketch is given below.)
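For reference, the user_nl_clm step above amounts to pointing fsurdat at the regional surface dataset. The actual file name produced by subset_data is not given in this issue, so the path below is only a placeholder in a hedged sketch:

cat >> user_nl_clm << 'EOF'
! placeholder path: use the surface dataset generated by subset_data for this region
fsurdat = '/path/to/regional_surfdata.nc'
EOF

The cesm.sh submission script is also not shown here. A minimal Slurm script along the following lines would match the single-node, single-core setup described below; the wall-clock limit is an assumption, and it relies on CIME's case.submit --no-batch option to run the case inside the allocation:

#!/bin/bash
#SBATCH --job-name=TP_5days_test_nuopc_2
#SBATCH --nodes=1
#SBATCH --ntasks=1           # single-node, single-core run as described below
#SBATCH --time=02:00:00      # assumed wall-clock limit

cd "$SLURM_SUBMIT_DIR"       # submit the job from the case directory
./case.submit --no-batch     # let CIME run the case inside this allocation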
The following is what appears in the log files after the run is interrupted. Incidentally, both single-node single-core and multi-node multi-core runs failed in the same way.
In the cesm.log file:
application called MPI_Abort(comm=0x84000000, 1) - process 0
In the lnd.log file:
LND: PIO numiotasks= 1
LND: PIO stride= 1
LND: PIO rearranger= 2
LND: PIO root= 1
1 pes participating in computation for CLM
NODE# NAME
( 0) comput20
atm component = datm
rof component = srof
glc component = sglc
atm_prognostic = F
rof_prognostic = F
glc_present = F
flds_scalar_name = cpl_scalars
flds_scalar_num = 4
flds_scalar_index_nx = 1
flds_scalar_index_ny = 2
flds_scalar_index_nextsw_cday = 3
flds_co2a= F
flds_co2b= F
flds_co2c= F
sending co2 to atm = F
receiving co2 from atm = F
(shr_drydep_read) Read in drydep_inparm namelist from: drv_flds_in
(shr_drydep_read) No dry deposition fields will be transfered
(shr_fire_emis_readnl) Read in fire_emis_readnl namelist from: drv_flds_in
(shr_megan_readnl) Read in megan_emis_readnl namelist from: drv_flds_in
(shr_carma_readnl) Read in carma_inparm namelist from: drv_flds_in
shr_carma_readnl: no carma_inparm namelist found in drv_flds_in
(shr_ndep_readnl) Read in ndep_inparm namelist from: drv_flds_in
In the atm.log file:
ATM: PIO numiotasks= 1
ATM: PIO stride= 1
ATM: PIO rearranger= 1
ATM: PIO root= 1
((atm_comp_nuopc)) case_name = TP_5days_test_nuopc_2
((atm_comp_nuopc)) datamode = CLMNCEP
((atm_comp_nuopc)) model_meshfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) model_maskfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) nx_global = 1
((atm_comp_nuopc)) ny_global = 1
((atm_comp_nuopc)) restfilm = null
((atm_comp_nuopc)) iradsw = 1
((atm_comp_nuopc)) factorFn_data = null
((atm_comp_nuopc)) factorFn_mesh = null
((atm_comp_nuopc)) flds_presaero = T
((atm_comp_nuopc)) flds_presndep = T
((atm_comp_nuopc)) flds_preso3 = T
((atm_comp_nuopc)) flds_co2 = F
((atm_comp_nuopc)) flds_wiso = F
((atm_comp_nuopc)) skip_restart_read = F
datm datamode = CLMNCEP
(dshr_mesh_init) (dshr_mod:dshr_mesh_init) obtained ATM mesh and mask from /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_strdata_set_stream_domain) stream_nlev = 1
(shr_sdat_init) Creating field bundle array fldbun_data of size 2 for stream 1
adding field Faxa_swdn to fldbun_data for stream 1
In the drv.log file:
(esm_time_clockInit):: driver start_ymd: 19790101
(esm_time_clockInit):: driver start_tod: 0
(esm_time_clockInit):: driver curr_ymd: 19790101
(esm_time_clockInit):: driver curr_tod: 0
(esm_time_clockInit):: driver time interval is : 1800
(esm_time_clockInit):: driver stop_ymd: 99990101
(esm_time_clockInit):: driver stop_tod: 0
PIO rearranger options:
comm type = 0 (p2p)
comm fcd = 0 (2denable)
max pend req (comp2io) = -2
enable_hs (comp2io) = T
enable_isend (comp2io) = F
max pend req (io2comp) = 64
enable_hs (io2comp) = F
enable_isend (io2comp) = T
8 MB memory alloc in MB is 8.00
8 MB memory dealloc in MB is 0.00
Memory block size conversion in bytes is 1019.02
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F