The high-resolution region simulation with NUOPC is interrupted at the beginning #1937

Closed
niuhanlin opened this issue Jan 18, 2023 · 3 comments

Comments

@niuhanlin

Hi! I am currently using CTSM-FATES to run simulations over the Tibetan Plateau, with NUOPC as recommended.
Unfortunately, the run failed.
The error output does not point to a clear problem, which is what confuses me.
The compiler I use is Intel. Do I need to switch to the GNU compiler?

It should be noted that both single-point and regional runs are feasible with MCT.
A single-point run is fine, but a regional simulation gets forced out after about three simulated years.
I suspect that is caused by excessive memory usage.

With NUOPC, a single-point simulation is perfectly fine.
The regional simulation aborts directly with an MPI error and does not show the real problem.

Here are the settings I used to create the case.

./create_newcase --compset 2000_DATM%QIA_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV --res CLM_USRDAT --case TP_5days_test_nuopc_2 --run-unsupported --machine niuhanlin
./xmlchange DATM_YR_START=1979
./xmlchange DATM_YR_END=1979
./xmlchange RUN_STARTDATE=1979-01-01
./xmlchange CLM_FORCE_COLDSTART=on
./xmlchange CLM_ACCELERATED_SPINUP=on
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=5
./xmlchange LND_DOMAIN_MESH=lnd_mesh.nc
./xmlchange ATM_DOMAIN_MESH=lnd_mesh.nc
./xmlchange MASK_MESH=mask_mesh.nc
./case.setup
Add the surface dataset in user_nl_clm (see the sketch after these steps).
./case.build
sbatch cesm.sh    (this file contains the job submission settings)
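
The user_nl_clm addition was along these lines; the path and file name below are placeholders (not from the original post) and should point to the surface dataset produced by subset_data:

! user_nl_clm (placeholder path to the regional surface dataset)
fsurdat = '/path/to/surfdata_TibetanPlateau_hist_78pfts_c230118.nc'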

The following is what is written in the log files after the abort. By the way, both single-node single-core and multi-node multi-core runs failed.

in cesm.log file:
application called MPI_Abort(comm=0x84000000, 1) - process 0

in lnd.log file:
LND: PIO numiotasks= 1
LND: PIO stride= 1
LND: PIO rearranger= 2
LND: PIO root= 1

1 pes participating in computation for CLM

NODE# NAME
( 0) comput20
atm component = datm
rof component = srof
glc component = sglc
atm_prognostic = F
rof_prognostic = F
glc_present = F
flds_scalar_name = cpl_scalars
flds_scalar_num = 4
flds_scalar_index_nx = 1
flds_scalar_index_ny = 2
flds_scalar_index_nextsw_cday = 3
flds_co2a= F
flds_co2b= F
flds_co2c= F
sending co2 to atm = F
receiving co2 from atm = F
(shr_drydep_read) Read in drydep_inparm namelist from: drv_flds_in
(shr_drydep_read) No dry deposition fields will be transfered
(shr_fire_emis_readnl) Read in fire_emis_readnl namelist from: drv_flds_in
(shr_megan_readnl) Read in megan_emis_readnl namelist from: drv_flds_in
(shr_carma_readnl) Read in carma_inparm namelist from: drv_flds_in
shr_carma_readnl: no carma_inparm namelist found in drv_flds_in
(shr_ndep_readnl) Read in ndep_inparm namelist from: drv_flds_in

in atm.log file:
ATM: PIO numiotasks= 1
ATM: PIO stride= 1
ATM: PIO rearranger= 1
ATM: PIO root= 1
((atm_comp_nuopc)) case_name = TP_5days_test_nuopc_2
((atm_comp_nuopc)) datamode = CLMNCEP
((atm_comp_nuopc)) model_meshfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) model_maskfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) nx_global = 1
((atm_comp_nuopc)) ny_global = 1
((atm_comp_nuopc)) restfilm = null
((atm_comp_nuopc)) iradsw = 1
((atm_comp_nuopc)) factorFn_data = null
((atm_comp_nuopc)) factorFn_mesh = null
((atm_comp_nuopc)) flds_presaero = T
((atm_comp_nuopc)) flds_presndep = T
((atm_comp_nuopc)) flds_preso3 = T
((atm_comp_nuopc)) flds_co2 = F
((atm_comp_nuopc)) flds_wiso = F
((atm_comp_nuopc)) skip_restart_read = F
datm datamode = CLMNCEP
(dshr_mesh_init) (dshr_mod:dshr_mesh_init) obtained ATM mesh and mask from /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_strdata_set_stream_domain) stream_nlev = 1
(shr_sdat_init) Creating field bundle array fldbun_data of size 2 for stream 1
adding field Faxa_swdn to fldbun_data for stream 1

in drv.log file:
(esm_time_clockInit):: driver start_ymd: 19790101
(esm_time_clockInit):: driver start_tod: 0
(esm_time_clockInit):: driver curr_ymd: 19790101
(esm_time_clockInit):: driver curr_tod: 0
(esm_time_clockInit):: driver time interval is : 1800
(esm_time_clockInit):: driver stop_ymd: 99990101
(esm_time_clockInit):: driver stop_tod: 0
PIO rearranger options:
comm type = 0 (p2p)
comm fcd = 0 (2denable)
max pend req (comp2io) = -2
enable_hs (comp2io) = T
enable_isend (comp2io) = F
max pend req (io2comp) = 64
enable_hs (io2comp) = F
enable_isend (io2comp) = T
8 MB memory alloc in MB is 8.00
8 MB memory dealloc in MB is 0.00
Memory block size conversion in bytes is 1019.02
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F

@ekluzek added the "next: this should get some attention in the next week or two. Normally each Thursday SE meeting." label on Jan 19, 2023
@glemieux
Collaborator

glemieux commented Jan 23, 2023

Note this was originally raised in the fates github issue board: NGEET/fates#975

@ekluzek
Collaborator

ekluzek commented Jan 23, 2023

@niuhanlin this is all very odd behavior to me. I'm wondering if the real core issue is running into memory limitations on your machine. I gather that this is a specific problem on your particular region and on your particular machine. So I don't think it's a general problem with the model.

What size is your grid in terms of total number of gridcells? And how many processors are you using? I would recommend using more processors, even going toward one processor per gridcell, which is the maximum you could scale it out to. If the behavior calms down when you use more processors, I think this is likely just a problem with not using enough.
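
Something along these lines would raise the task count (the value here is just an example; choose one that fits your machine and is no larger than your number of gridcells):

./xmlchange NTASKS=36      # example task count applied to all components
./case.setup --reset
./case.build
sbatch cesm.sh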

@ekluzek ekluzek removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Jan 23, 2023
@niuhanlin
Author

Since this problem appears to be specific to my own machine, I will close this issue.
