Derecho transition: Get CTSM5.2 mksurfdata_esmf tool working on Derecho, Casper, Izumi #2201

Closed
samsrabin opened this issue Oct 16, 2023 · 10 comments
@samsrabin
Collaborator

Originally part of #1995 from @slevis-lmwg.

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 9, 2023

  1. I have confirmed that ./gen_mksurfdata_build.sh still works on cheyenne.
  2. It fails on derecho, casper, and izumi with differing errors that I do not know how to address. There was a time when it worked on casper and izumi (Enable mksurfdata esmf to build and run on casper and izumi #1748).

UPDATE: See below for latest test on derecho.
On derecho it fails while trying the configure command. Output to the screen is:

cime Machine is: derecho...
Run cime configure for machine derecho...
configure for the default MPI-library and compiler...
ERROR: No machine derecho found
Error doing configure for machine name: derecho

On casper it fails with a different error:

cime Machine is: casper...
Run cime configure for machine casper...
configure for the default MPI-library and compiler...
ERROR: module command /glade/u/apps/dav/opt/lmod/7.7.29/libexec/lmod python purge  failed with message:
/glade/u/apps/dav/opt/lua/5.3.4/bin/lua: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
Error doing configure for machine name: casper

On izumi it fails later in the build:

[100%] Linking Fortran executable mksurfdata
/usr/bin/cmake3 -E cmake_link_script CMakeFiles/mksurfdata.dir/link.txt --verbose=1
/cluster/mvapich2-2.3.3-intel-cluster-20.0.1/bin/mpif90   -g CMakeFiles/mksurfdata.dir/mkagfirepkmonthMod.F90.o CMakeFiles/mksurfdata.dir/mkchecksMod.F90.o CMakeFiles/mksurfdata.dir/mkdiagnosticsMod.F90.o CMakeFiles/mksurfdata.dir/mkdomainMod.F90.o CMakeFiles/mksurfdata.dir/mkesmfMod.F90.o CMakeFiles/mksurfdata.dir/mkfileMod.F90.o CMakeFiles/mksurfdata.dir/mkgdpMod.F90.o CMakeFiles/mksurfdata.dir/mkglacierregionMod.F90.o CMakeFiles/mksurfdata.dir/mkglcmecMod.F90.o CMakeFiles/mksurfdata.dir/mkharvestMod.F90.o CMakeFiles/mksurfdata.dir/mkindexmapMod.F90.o CMakeFiles/mksurfdata.dir/mkinputMod.F90.o CMakeFiles/mksurfdata.dir/mklaiMod.F90.o CMakeFiles/mksurfdata.dir/mklanwatMod.F90.o CMakeFiles/mksurfdata.dir/mkpeatMod.F90.o CMakeFiles/mksurfdata.dir/mkpioMod.F90.o CMakeFiles/mksurfdata.dir/mkpftMod.F90.o CMakeFiles/mksurfdata.dir/mkpftConstantsMod.F90.o CMakeFiles/mksurfdata.dir/mkpctPftTypeMod.F90.o CMakeFiles/mksurfdata.dir/mkpftUtilsMod.F90.o CMakeFiles/mksurfdata.dir/mksoilcolMod.F90.o CMakeFiles/mksurfdata.dir/mksoilfmaxMod.F90.o CMakeFiles/mksurfdata.dir/mksoiltexMod.F90.o CMakeFiles/mksurfdata.dir/mksoildepthMod.F90.o CMakeFiles/mksurfdata.dir/mktopostatsMod.F90.o CMakeFiles/mksurfdata.dir/mkurbanparMod.F90.o CMakeFiles/mksurfdata.dir/mkutilsMod.F90.o CMakeFiles/mksurfdata.dir/mkvarctl.F90.o CMakeFiles/mksurfdata.dir/mkvarpar.F90.o CMakeFiles/mksurfdata.dir/mkvocefMod.F90.o CMakeFiles/mksurfdata.dir/mkVICparamsMod.F90.o CMakeFiles/mksurfdata.dir/nanMod.F90.o CMakeFiles/mksurfdata.dir/shr_const_mod.F90.o CMakeFiles/mksurfdata.dir/shr_kind_mod.F90.o CMakeFiles/mksurfdata.dir/shr_string_mod.F90.o CMakeFiles/mksurfdata.dir/shr_sys_mod.F90.o CMakeFiles/mksurfdata.dir/mksurfdata.F90.o  -o mksurfdata  /project/esmf/PROGS/esmf/8.4.1b05/mvapich2/2.3.3/intel/20.0.1/lib/libO/Linux.intel.64.mvapich2.default/libesmf.a /project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a /project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpioc.a 
-Wl,-rpath,/project/esmf/PROGS/esmf/8.4.1b05/mvapich2/2.3.3/intel/20.0.1/lib/libO/Linux.intel.64.mvapich2.default -L/project/esmf/PROGS/esmf/8.4.1b05/mvapich2/2.3.3/intel/20.0.1/lib/libO/Linux.intel.64.mvapich2.default -L/project/esmf/PROGS/esmf/8.4.1b05/mvapich2/2.3.3/intel/20.0.1/lib/libO/Linux.intel.64.mvapich2.default -L/usr/local/netcdf-c-4.7.4-f-4.5.2-intel-cluster-20.0.1/lib -L/usr/local/netcdf-c-4.7.4-f-4.5.2-intel-cluster-20.0.1/lib -L/usr/local/hdf5-1.12.0-intel-cluster-20.0.1/lib -L/usr/local/netcdf-c-4.7.4-f-4.5.2-intel-cluster-20.0.1/lib -L/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib -cxxlib -lrt -ldl -lnetcdf -lnetcdff -lnetcdf -lnetcdf -lpioc -m64 -mcmodel=small -pthread -threads -Wl,--no-as-needed  -qopenmp 
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(piolib_mod.F90.o): In function `piolib_mod_mp_pio_initdecomp_dof_i8_':
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1035: undefined reference to `perf_mod_mp_t_startf_'
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1044: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(piolib_mod.F90.o): In function `piolib_mod_mp_init_intracom_':
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1107: undefined reference to `perf_mod_mp_t_startf_'
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1118: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(piolib_mod.F90.o): In function `piolib_mod_mp_createfile_':
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1373: undefined reference to `perf_mod_mp_t_startf_'
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1387: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(piolib_mod.F90.o): In function `piolib_mod_mp_pio_openfile_':
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1429: undefined reference to `perf_mod_mp_t_startf_'
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1444: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(piolib_mod.F90.o): In function `piolib_mod_mp_closefile_':
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1543: undefined reference to `perf_mod_mp_t_startf_'
/fs/cgd/data0/fischer/ParallelIO/src/flib/piolib_mod.F90:1548: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(pionfput_mod.F90.o): In function `pionfput_mod_mp_put_var1_int_':
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:127: undefined reference to `perf_mod_mp_t_startf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:138: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(pionfput_mod.F90.o): In function `pionfput_mod_mp_put_var1_real_':
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:127: undefined reference to `perf_mod_mp_t_startf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:138: undefined reference to `perf_mod_mp_t_stopf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpiof.a(pionfput_mod.F90.o): In function `pionfput_mod_mp_put_var1_double_':
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:127: undefined reference to `perf_mod_mp_t_startf_'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/src/flib//fs/cgd/data0/fischer/ParallelIO/src/flib/pionfput_mod.F90.in:138: undefined reference to `perf_mod_mp_t_stopf_'
[...]
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpioc.a(pioc_support.c.o): In function `pio_start_timer':
/fs/cgd/data0/fischer/ParallelIO/src/clib/pioc_support.c:66: undefined reference to `GPTLstart'
/project/esmf/PROGS/intel/20.0.1/mvapich2/2.3.3/pio/2_5_10/lib/libpioc.a(pioc_support.c.o): In function `pio_stop_timer':
/fs/cgd/data0/fischer/ParallelIO/src/clib/pioc_support.c:82: undefined reference to `GPTLstop'
CMakeFiles/mksurfdata.dir/build.make:646: recipe for target 'mksurfdata' failed
make[2]: *** [mksurfdata] Error 1
make[2]: Leaving directory '/fs/cgd/data0/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/tool_bld'
CMakeFiles/Makefile2:95: recipe for target 'CMakeFiles/mksurfdata.dir/all' failed
make[1]: *** [CMakeFiles/mksurfdata.dir/all] Error 2
make[1]: Leaving directory '/fs/cgd/data0/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/tool_bld'
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
Error doing make for izumi mvapich2 intel
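
In case it helps whoever picks this up: every unresolved symbol in the izumi log looks like a timing call (t_startf/t_stopf from perf_mod, and GPTLstart/GPTLstop), so the PIO libraries may have been built with timing enabled while no timing library is on the link line. That is only my reading of the log. A minimal sketch for reducing a long link log to the distinct missing symbols; the function name `undefined_symbols` is just something made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of CTSM): read a linker log on stdin and
# print each distinct symbol reported as an "undefined reference".
undefined_symbols() {
    # GNU ld quotes the symbol as `name' ; capture what is between the quotes.
    sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" | LC_ALL=C sort -u
}

# Small demonstration on a single made-up log line.
printf '%s\n' "x.F90:1: undefined reference to \`GPTLstart'" | undefined_symbols
# prints: GPTLstart
```

To use it on a real build, redirect the saved build output into the function, e.g. `undefined_symbols < build.log` (the file name is an assumption).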

@slevis-lmwg changed the title from "Derecho transition: Get CTSM5.2 mksurfdata_esmf file creation working on Derecho" to "Derecho transition: Get CTSM5.2 mksurfdata_esmf tool working on Derecho, Casper, Izumi" on Nov 17, 2023
@samsrabin moved this from Todo to In Progress in "CTSM: Cheyenne to Derecho transition" on Dec 1, 2023
@slevis-lmwg
Contributor

UPDATE:
git describe in the ctsm5.2.mksurfdata branch gives
alpha-ctsm5.2.mksrf.18_ctsm5.1.dev123

We took Externals.cfg from ctsm5.1.dev158 and re-ran the externals checkout:

git show ctsm5.1.dev158:Externals.cfg > Externals.cfg
./manage_externals/checkout_externals

We also updated /tools/mksurfdata_esmf/gen_mksurfdata_build.sh:

 # May overwrite this default with command-line option --machine
 hostname=`hostname --short`
 case $hostname in
-  ##cheyenne
+  derecho* | r* )
+      export MACH="derecho"
+      pio_iotype=1
+      ;;
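
For anyone who wants to try the hostname matching outside the build script, here is a minimal standalone sketch of the same idea. Only the `derecho* | r* )` pattern and the `MACH`/`pio_iotype` values come from the diff above; the function name `detect_machine` and the fallback branch are assumptions for illustration, not what gen_mksurfdata_build.sh actually does:

```shell
#!/bin/sh
# Hypothetical standalone version of the hostname case statement.
# Sets MACH and pio_iotype for a recognized host, returns 1 otherwise.
detect_machine() {
    case $1 in
        derecho* | r* )          # pattern copied from the diff above
            export MACH="derecho"
            pio_iotype=1
            ;;
        * )                      # assumed fallback, not from the script
            return 1
            ;;
    esac
}

# Try it on the current host; HOSTNAME may be unset in some shells.
h=${HOSTNAME:-}
if detect_machine "${h%%.*}"; then
    echo "MACH=$MACH pio_iotype=$pio_iotype"
else
    echo "host not recognized; pass --machine explicitly" >&2
fi
```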

Now the build proceeds most of the way, but fails with this error:

[ 97%] Building Fortran object CMakeFiles/mksurfdata.dir/mksurfdata.F90.o
/glade/u/apps/derecho/23.09/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2023.2.1/6q5s/bin/mpi/mpif90 -DPIO2 -I/include -I/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/esmf-8.6.0b04-un2qwjvc54ac5lwa63x62gwgaxfhswp5/include -I/glade/work/csgteam/spack-deployments/derecho/23.09/envs/build/opt/__spack_path_placeholder__/__spack_path_placeholder__/__spack/netcdf-c/4.9.2/cray-mpich/8.1.27/oneapi/2023.2.1/njnx/include -I/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf-fortran/4.6.1/cray-mpich/8.1.27/oneapi/2023.2.1/w55c/include -I/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.27/oneapi/2023.2.1/zyhu/include -g -c /glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/src/mksurfdata.F90 -o CMakeFiles/mksurfdata.dir/mksurfdata.F90.o
make[2]: *** No rule to make target '/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.27/oneapi/2023.2.1/zyhu/lib/libpiof.a', needed by 'mksurfdata'.  Stop.
make[2]: Leaving directory '/glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/tool_bld'
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/mksurfdata.dir/all] Error 2
make[1]: Leaving directory '/glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/tool_bld'
make: *** [Makefile:136: all] Error 2
Error doing make for derecho mpich intel

@slevis-lmwg
Contributor

slevis-lmwg commented Jan 16, 2024

Update:
I get the same error on derecho with git describe ctsm5.2.mksrf.19_ctsm5.1.dev125-3-g1b6ac0e2c,
after correcting a simple syntax error in mksurfdata.F90.

@wwieder
Contributor

wwieder commented Jan 17, 2024

@ekluzek this seems like a blocker for continued development of the 5.2 branch. Is this an issue that can be addressed quickly?

@fang-bowen
Contributor

Thank you everyone! As discussed with Keith @olyson, I am posting my recent experience trying to get mksurfdata_esmf to work on Derecho. I hope some of this information is useful.

I started on ctsm5.2.mksrf.19_ctsm5.1.dev125, when Derecho was not yet in config_machines.xml (ccs_config_cesm0.0.64), so I copied the latest Derecho section from ccs_config_cesm/machines/derecho/config_machines.xml, as well as the files config_machines.xsd and env_mach_specific.xsd under CIME/data/config/xml_schemas/. I also added the part Sam mentioned above (set pio_iotype=1 for Derecho) before building the tool.

I noticed an update on 1/23 which changed the version to ctsm5.1.dev163-482-g1e0bca41f. This seems to fix the problems above.

Then I made the following modifications (I was not completely sure what I was doing, but the tool seems to work after these changes):

  1. Since the build complained that ndiag6 was not declared:
    /tools/mksurfdata_esmf/src/mksurfdata.F90 #L1377
   write(ndiag,*) 'n, suma, pctlak, pctwet, pctgla, pcturb, pctnatveg, pctcrop, pctocn = '
-  write(ndiag6,*) n, suma, pctlak(n), pctwet(n), pctgla(n), pcturb(n), &
+  write(ndiag,*) n, suma, pctlak(n), pctwet(n), pctgla(n), pcturb(n), &
     pctnatpft(n)%get_pct_l2g(), pctcft(n)%get_pct_l2g(), pctocn(n)
  2. Since libpioc.a and libpiof.a were not found under the PIO lib directory, but similar .so files were there:
    /tools/mksurfdata_esmf/src/CMakeLists.txt #L53
-  set_property(TARGET pioc PROPERTY IMPORTED_LOCATION $ENV{PIO}/lib/libpioc.a)
-  set_property(TARGET piof PROPERTY IMPORTED_LOCATION $ENV{PIO}/lib/libpiof.a)
+  set_property(TARGET pioc PROPERTY IMPORTED_LOCATION $ENV{PIO}/lib/libpioc.so)
+  set_property(TARGET piof PROPERTY IMPORTED_LOCATION $ENV{PIO}/lib/libpiof.so)
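
Before editing CMakeLists.txt it may be worth checking which flavor of the PIO libraries a machine actually provides. The sketch below is one possible way to do that from the shell, assuming `$PIO` points at the PIO install as in the CMake lines above; the helper name `pick_pio_lib` is hypothetical, and preferring the static `.a` over the `.so` is only my choice to mirror the original CMake setting:

```shell
#!/bin/sh
# Hypothetical helper: print the path of lib<name>.a if it exists in <libdir>,
# otherwise lib<name>.so; return 1 if neither file is present.
pick_pio_lib() {
    for ext in a so; do
        if [ -f "$1/lib$2.$ext" ]; then
            printf '%s\n' "$1/lib$2.$ext"
            return 0
        fi
    done
    return 1
}

# Report what is available for both PIO libraries (PIO may be unset here).
for name in pioc piof; do
    pick_pio_lib "${PIO:-/nonexistent}/lib" "$name" \
        || echo "lib$name: neither .a nor .so found" >&2
done
```

If only the `.so` paths are printed, that matches the situation described above on Derecho.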

Then

(conda activate ctsm_pylib)
(../../manage_externals/checkout_externals)
./gen_mksurfdata_build.sh --machine derecho

The tool was built successfully on Derecho and I was able to generate surfdata and landuse.timeseries for (1) a default namelist for 0.9x1.25, 1850-2015, 16-pft and (2) a 16-pft SSP3-7.0 case (0.9x1.25, 1850-2100). I also tried 78-pft, with no success (3). The main error was "OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Bad address)". I suspect my setup or input files could be causing the issue. mklai() also complained that the input is 16pft instead of 78.

Relevant file locations:
(1) ongoing - /glade/work/bowen/ctsm5.2.mksurfdata_new/tools/mksurfdata_esmf/surfdata_0.9x1.25_hist_1850_16pfts_c240124.namelist
a previous one with 78pft did not work (mksurfdata_hist_test.o2906701)

(2) success - /glade/work/bowen/ctsm5.2.mksurfdata/tools/mksurfdata_esmf/SSP370_for_Yuan/ (this one was ctsm5.2.mksrf.19_ctsm5.1.dev125).

Steps I took for this one:

./gen_mksurfdata_namelist --res 0.9x1.25 --start-year 1850 --end-year 2100 --ssp-rcp SSP3-7.0 --nosurfdata
# manually edit the namelist, changing pft to 16
./gen_mksurfdata_jobscript_single --number-of-nodes 2 --tasks-per-node 128 --namelist-file surfdata_0.9x1.25_SSP3-7.0_1850_16pfts_c240123.namelist
# manually edit the job script, changing project code etc. and submit. 

I used select=4:ncpus=128:mpiprocs=32 and the job ran for ~5 hours.

(3) failed - multiple, e.g. /glade/work/bowen/ctsm5.2.mksurfdata_new/tools/mksurfdata_esmf/hist_urban_for_Song/surfdata_0.9x1.25_hist_1850_78pfts_c240124.namelist


I feel I probably did not change the right thing, but would be happy in case this helps in any way...

-Bowen

@slevis-lmwg
Contributor

A big thank you to you, @fang-bowen
I was able to generate this file:
/glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf/surfdata_1.9x2.5_hist_1850_78pfts_c240126.nc

@slevis-lmwg
Contributor

Looking in my mksurfdata_jobscript_single, I see
#PBS -l select=2:ncpus=128:mpiprocs=128
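
For comparison with the select=4:ncpus=128:mpiprocs=32 request mentioned earlier in this thread, here is a rough sketch that works out the total number of MPI ranks each request asks for. It assumes the usual PBS reading of `select=N:...:mpiprocs=M` as N chunks with M ranks each, and it only handles a single chunk specification (no `+`). The function name `total_ranks` is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: total MPI ranks = (number of chunks) x (mpiprocs per chunk).
total_ranks() {
    chunks=$(printf '%s\n' "$1" | sed -n 's/.*select=\([0-9][0-9]*\).*/\1/p')
    procs=$(printf '%s\n' "$1" | sed -n 's/.*mpiprocs=\([0-9][0-9]*\).*/\1/p')
    # Missing fields default to 1 so the arithmetic never sees an empty value.
    echo $(( ${chunks:-1} * ${procs:-1} ))
}

total_ranks 'select=4:ncpus=128:mpiprocs=32'            # prints 128
total_ranks '#PBS -l select=2:ncpus=128:mpiprocs=128'   # prints 256
```

So the two setups differ in ranks per node (32 versus 128) as well as in node count, which may matter if memory per rank turns out to be the limiting factor.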

@ekluzek
Collaborator

ekluzek commented Jan 26, 2024

Yes, thanks for sharing your experience @fang-bowen! This gets us going. I think this might not be the right thing long term in order to get it working on multiple machines and compilers and MPI libraries. But, this gets us going and others can benefit as well, so it's great all around...

@slevis-lmwg
Contributor

I do find some cases failing. I currently suspect memory issues. I will play with the number of nodes that I request.

@slevis-lmwg
Contributor

Opened new issue for casper and izumi. Closing this one as fixed for derecho.
