The high-resolution regional simulation with NUOPC is interrupted at the beginning #975

Closed
niuhanlin opened this issue Jan 18, 2023 · 18 comments


@niuhanlin

niuhanlin commented Jan 18, 2023

Hi! I am currently using CTSM-FATES to run simulations over the Tibetan Plateau, with NUOPC as recommended.
Unfortunately, the run failed.
The error message does not point to a clear problem, which is what confuses me.
The compiler I use is Intel. Do I need to switch to the GNU compiler?

It should be noted that both single-point and regional runs work with MCT.
A single-point run is fine, but a regional simulation gets killed after three years;
I suspect this is caused by excessive memory usage.

With NUOPC, a single-point simulation works perfectly.
The regional simulation aborts immediately with an MPI error and does not show the real problem.

Here are the settings I used to create the case.

./create_newcase --compset 2000_DATM%QIA_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV --res CLM_USRDAT --case TP_5days_test_nuopc_2 --run-unsupported --machine niuhanlin
./xmlchange DATM_YR_START=1979
./xmlchange DATM_YR_END=1979
./xmlchange RUN_STARTDATE=1979-01-01
./xmlchange CLM_FORCE_COLDSTART=on
./xmlchange CLM_ACCELERATED_SPINUP=on
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=5
./xmlchange LND_DOMAIN_MESH=lnd_mesh.nc
./xmlchange ATM_DOMAIN_MESH=lnd_mesh.nc
./xmlchange MASK_MESH=mask_mesh.nc
./case.setup
Add the surface file in user_nl_clm (see the example below).
./case.build
sbatch cesm.sh (this file contains the submission settings)
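For reference, the surface-file entry in user_nl_clm is a single namelist line of the following form (the path shown is only a placeholder for the dataset produced by subset_data, not my actual file):

fsurdat = '/path/to/surfdata_TP_region.nc'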

The following is what the log files show after the abort. By the way, both single-node single-core and multi-node multi-core runs failed.

in cesm.log file:
application called MPI_Abort(comm=0x84000000, 1) - process 0

in lnd.log file:
LND: PIO numiotasks= 1
LND: PIO stride= 1
LND: PIO rearranger= 2
LND: PIO root= 1

1 pes participating in computation for CLM


NODE# NAME
( 0) comput20
atm component = datm
rof component = srof
glc component = sglc
atm_prognostic = F
rof_prognostic = F
glc_present = F
flds_scalar_name = cpl_scalars
flds_scalar_num = 4
flds_scalar_index_nx = 1
flds_scalar_index_ny = 2
flds_scalar_index_nextsw_cday = 3
flds_co2a= F
flds_co2b= F
flds_co2c= F
sending co2 to atm = F
receiving co2 from atm = F
(shr_drydep_read) Read in drydep_inparm namelist from: drv_flds_in
(shr_drydep_read) No dry deposition fields will be transfered
(shr_fire_emis_readnl) Read in fire_emis_readnl namelist from: drv_flds_in
(shr_megan_readnl) Read in megan_emis_readnl namelist from: drv_flds_in
(shr_carma_readnl) Read in carma_inparm namelist from: drv_flds_in
shr_carma_readnl: no carma_inparm namelist found in drv_flds_in
(shr_ndep_readnl) Read in ndep_inparm namelist from: drv_flds_in

in atm.log file:
ATM: PIO numiotasks= 1
ATM: PIO stride= 1
ATM: PIO rearranger= 1
ATM: PIO root= 1
((atm_comp_nuopc)) case_name = TP_5days_test_nuopc_2
((atm_comp_nuopc)) datamode = CLMNCEP
((atm_comp_nuopc)) model_meshfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) model_maskfile = /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
((atm_comp_nuopc)) nx_global = 1
((atm_comp_nuopc)) ny_global = 1
((atm_comp_nuopc)) restfilm = null
((atm_comp_nuopc)) iradsw = 1
((atm_comp_nuopc)) factorFn_data = null
((atm_comp_nuopc)) factorFn_mesh = null
((atm_comp_nuopc)) flds_presaero = T
((atm_comp_nuopc)) flds_presndep = T
((atm_comp_nuopc)) flds_preso3 = T
((atm_comp_nuopc)) flds_co2 = F
((atm_comp_nuopc)) flds_wiso = F
((atm_comp_nuopc)) skip_restart_read = F
datm datamode = CLMNCEP
(dshr_mesh_init) (dshr_mod:dshr_mesh_init) obtained ATM mesh and mask from /public/home/huser053/nhl/CTSM-221203/CTSM-master/tools/site_and_regional/subset_data_regional/lnd_mesh.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Solar3Hrly/clmforc.Qian.c2006.T62.Solr.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/Precip3Hrly/clmforc.Qian.c2006.T62.Prec.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/data/TmpPrsHumWnd3Hrly/clmforc.Qian.c2006.T62.TPQW.1979-01.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_WACCM.ensmean_monthly_hist_1849-2015_0.9x1.25_CMIP6_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/cdeps/datm/ozone/O3_surface.f09_g17.CMIP6-historical-WACCM.001.monthly.185001-201412.nc
(shr_stream_getCalendar) opening stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_stream_getCalendar) closing stream filename = /public/home/huser053/nhl/inputdata/atm/datm7/topo_forcing/topodata_0.9x1.25_USGS_070110_stream_c151201.nc
(shr_strdata_set_stream_domain) stream_nlev = 1
(shr_sdat_init) Creating field bundle array fldbun_data of size 2 for stream 1
adding field Faxa_swdn to fldbun_data for stream 1

in drv.log file:
(esm_time_clockInit):: driver start_ymd: 19790101
(esm_time_clockInit):: driver start_tod: 0
(esm_time_clockInit):: driver curr_ymd: 19790101
(esm_time_clockInit):: driver curr_tod: 0
(esm_time_clockInit):: driver time interval is : 1800
(esm_time_clockInit):: driver stop_ymd: 99990101
(esm_time_clockInit):: driver stop_tod: 0
PIO rearranger options:
comm type = 0 (p2p)
comm fcd = 0 (2denable)
max pend req (comp2io) = -2
enable_hs (comp2io) = T
enable_isend (comp2io) = F
max pend req (io2comp) = 64
enable_hs (io2comp) = F
enable_isend (io2comp) = T
8 MB memory alloc in MB is 8.00
8 MB memory dealloc in MB is 0.00
Memory block size conversion in bytes is 1019.02
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F

Attached is my lnd_in file
lnd_in.txt

@rosiealice
Contributor

Hi @niuhanlin,

This sounds like an exciting experiment! Have you tried to run it with a default (non-FATES) CLM compset? With relatively complex setups like this it might be useful to confirm whether it is a FATES-specific error or not...

If so, I am not sure we have done a whole lot of testing (someone correct me if I'm wrong) with the accelerated spinup activated.

Cheers!
Rosie

@slevis-lmwg
Contributor

Have you tried to run it with a default (non-FATES) CLM compset? With relatively complex setups like this it might be useful to confirm whether it is a FATES-specific error or not...

If so, I am not sure we have done a whole lot of testing (someone correct me if I'm wrong) with the accelerated spinup activated.

@rosiealice raises a good point. I think that the accelerated spinup option is reserved for BGC cases, and your compset does not include BGC from what I can tell.

Regardless, if you haven't done the following, I recommend that you do this first:

Another comment:
In your create_newcase I see that you set --machine niuhanlin, which may or may not be a problem. The examples that I mentioned above have worked for me and others on Cheyenne. I'm not sure whether @XiulinGao has also tested elsewhere.

@niuhanlin
Author

@rosiealice Following your suggestion, I have tested some other configurations; the results are as follows.

1. compset: 2000_DATM%QIA_CLM51%SP_SICE_SOCN_SROF_SGLC_SWAV_SESP
Everything else stays the same as above; only the mode is changed to SP.
The result is the same as above: an error at runtime.

2. compset: X, res: f19_g16
An error is reported at the case-creation stage:
ERROR: Config file .../CTSM-master/cime/src/components/xcpl_comps_nuopc/xlnd/cime_config/config_component.xml for component xlnd not found.
After searching, I found the same error reported by LIU, but unfortunately no solution was given.

3. compset: B1850, res: f19_g16
The error is reported as follows:
ERROR: Invalid compset name, B1850, all stub components generated

Based on testing so far, I am guessing that this is due to a version update and that I need to add some important settings to the config_* files.
Do you have any suggestions?

@slevis-lmwg
Contributor

Something that I don't see you using in your create_newcase is this:
--mpilib mpi-serial
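For example, this flag would just be appended to the create_newcase call from your first post, something like (everything else unchanged):

./create_newcase --compset 2000_DATM%QIA_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV --res CLM_USRDAT --case TP_5days_test_nuopc_2 --run-unsupported --machine niuhanlin --mpilib mpi-serial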

@niuhanlin
Author

@slevisconsulting In fact, I did follow that step as you mentioned, except on a local cluster.
I also think the error could come from the machine entry niuhanlin.
Here is my local machine setup. Regarding MPI, I set it in the machine and compiler files, but I'm not sure it took effect.

Sorry, I couldn't find a way to show it to you directly, so I attached it as files.
compiler-niuhanlin.txt
machines-niuhanlin.txt

@slevis-lmwg
Contributor

Porting to a local cluster or other platform is beyond my expertise.

@ekluzek does NCAR offer community support for porting CTSM to other platforms?

@niuhanlin
Author

@rosiealice @slevisconsulting I ran an experiment with FATES and it worked! The case was created as follows:
--res f45_g37 --compset I2000Clm51FatesSpRsGs

The current test results indicate that the problem is with the compset 2000_DATM%QIA_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV, not with my port to the local machine.

It also shows that my local port can be used by others.

@slevis-lmwg
Contributor

That's great news @niuhanlin

Just to be clear:
Did you submit exactly what you shared in the first post (at the top) but with the different compset and different resolution?

@niuhanlin
Author

Yes, the other settings I used are the same.

But the surface file and the atmospheric forcing file are the defaults rather than the ones I made.
It should be noted that the surface data and atmospheric forcing data I produced do run with MCT.

@rosiealice
Contributor

So the compset that doesn't work is a FATES-SP (satellite phenology) case, and the one that does work is a fully dynamic (default) FATES case? I am not expert enough in compset names to figure out the other differences, but just to check: is an SP case actually what you want to run?

@niuhanlin
Author

It is the opposite: FATES-SP works and FATES-fixed_biogeog does not.
But I suspect the failure may be caused by the QIAN atmospheric forcing.
I will do further tests in fixed-biogeography mode to verify which component is the problem.

@jkshuman
Contributor

@niuhanlin thank you for documenting and testing this so thoroughly. Can you try running your setup with the long name for the compset? I include here the long name for GSWP3, but I see you are using QIA. Maybe it is a problem with the alias.

I2000Clm51FatesRs
2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV
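As a sketch, the long name would simply replace the alias in create_newcase, something like this (the case name here is just an example; other arguments as in the first post):

./create_newcase --compset 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV --res CLM_USRDAT --case TP_5days_test_gswp3 --run-unsupported --machine niuhanlin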

@niuhanlin
Author

@jkshuman You are right! Following the information you provided, I created a new case with GSWP3 and it ran successfully!
I will try a longer simulation to see whether it gets killed.

@jkshuman
Contributor

Thanks for checking that @niuhanlin. Can you open an issue on the CTSM side with the details of the failure with the alias? Tagging @ekluzek to discuss this alias problem.

Glad to hear it works with the long name for the compset.

@ekluzek
Collaborator

ekluzek commented Jan 21, 2023

OK, it sounds like the issue here is that QIAN forcing doesn't run well with FATES. This isn't a configuration we test, as we only test FATES with GSWP3 and CRUNCEP forcing. In principle you should be able to use any datm forcing with any CTSM configuration, so even though we don't test QIAN with FATES I'd expect it to work. But, we only test QIAN forcing with CTSM-BGC. With software if you don't test something it can mean it's broken.

In this case I'd wonder if the problem is too few processors for this specific case, because it's a custom resolution on a custom machine. But, actually QIAN forcing has less data than GSWP3, which is the opposite of what I'd expect. I suppose it could still be something different between QIAN and GSWP3 forcing. And still could be something with the machine or processor setup.

I looked at the compset aliases for FATES and didn't see any problems. The main problem would be a mismatch of the alias name with how the long-compset name is given.

@jkshuman am I understanding what's going on here? Is there a specific compset alias you think I should check? Also, do you think it's important for FATES to be able to run with QIAN forcing? Since QIAN forcing is our oldest, lowest-resolution forcing dataset, I wasn't thinking it was that important. But, if so, I could check some cases with FATES and QIAN. We could at least provide a warning about using the two together. But, it still could be something specific to this resolution and machine.

@jkshuman
Contributor

Thanks for talking this through @ekluzek and looking things over. Glad to hear the aliases look fine. I think I got myself mixed up on this one, but at least @niuhanlin got a successful run. I agree with you on QIAN being low priority based on your comments, @ekluzek.

@niuhanlin can you confirm that using the GSWP3 compset 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV will work for you?

@niuhanlin
Author

I did some testing and found that both GSWP3v1 and QIAN work. The problem was with mapalgo.
With the default bilinear setting the run fails; it runs when nn (nearest neighbor) is used.
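For anyone hitting the same issue, this kind of mapalgo override can be made per stream in user_nl_datm_streams; a rough sketch is below (the stream names are only examples for the QIAN streams, so check datm.streams.xml in the run directory for the exact names in your case):

CLM_QIAN.Solar:mapalgo = nn
CLM_QIAN.Precip:mapalgo = nn
CLM_QIAN.TPQW:mapalgo = nn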

@glemieux
Contributor

Closing this here, to continue discussion on the CTSM side: ESCOMP/CTSM#1937

github-project-automation bot moved this from ❕Todo to ✔ Done in FATES issue board Jan 23, 2023