Test of UFS on WCOSS2-acorn for latest library updates #1621

Closed
Hang-Lei-NOAA opened this issue Feb 21, 2023 · 212 comments · Fixed by #1745

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Feb 21, 2023

Description

At the request of UFS code manager Jun, we installed updated versions of the following libraries on the WCOSS2 machines for testing. However, not all of the tests succeeded. I will first post the installations here, then my test results and sample loading methods.
The purpose of this ticket is further testing and debugging.

To Reproduce:

What compilers/machines are you seeing this with?
WCOSS2 acorn

  1. The next note will include the detailed library installations.
  2. Then I will post a summary of the test results and how to load the libraries correctly.

Additional context


Library updates:
hdf5/1.10.6 => hdf5/1.14.0
netcdf/4.7.4 => netcdf/4.9.1
esmf/8.3.0b09 => esmf/8.4.0
mapl/2.23.1 => mapl/2.34.0

All base supporting libraries use the system-installed versions.

@Hang-Lei-NOAA Hang-Lei-NOAA added the bug Something isn't working label Feb 21, 2023
@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Feb 21, 2023

The test libraries are installed on Acorn using absolute paths in the modulefiles, which follows the system standard.
Supporting libraries use the system-installed versions where they are available on Acorn; otherwise, we supply our own builds. We combined the different options and libraries into two sets based on the netcdf version.

Potential issue with esmf/8.4.0: starting with this version, ESMF can handle PIO in two ways, either an internally bundled PIO or an externally installed PIO. According to the ESMF code manager, the required PIO version is PIO/2.5.10, and the internal PIO is recommended for UFS use.

=============set based on netcdf 4.7.4 series==========
module use /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs2/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/
esmf:
8.4.0i.lua (esmf with internal pio option)
8.4.0.lua (esmf with external pio option)

fms:
2022.04.lua

hdf5:
1.10.6.lua

mapl:
2.23.1-esmf-8.4.0.lua (esmf with external pio option)
2.34.0-esmf-8.4.0i.lua (esmf with internal pio option)
2.34.0-esmf-8.4.0.lua (esmf with external pio option)

netcdf:
4.7.4.lua

pio:
2.5.10.lua

=============set based on netcdf 4.9.1 series========
module use /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/
esmf:
8.4.0i.lua (esmf with internal pio option)
8.4.0.lua (esmf with external pio option)

fms:
2022.04.lua

hdf5:
1.14.0.lua

mapl:
2.23.1-esmf-8.4.0.lua (esmf with external pio option)
2.34.0-esmf-8.4.0i.lua (esmf with internal pio option)
2.34.0-esmf-8.4.0.lua (esmf with external pio option)

netcdf:
4.9.0.lua
4.9.1.lua

pio:
2.5.10.lua
2.5.7.lua
2.5.9.lua

========================================
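The two sets above differ mainly in the module root (libs2 vs. libs) and in the hdf5/netcdf versions, so picking one can be sketched as a small dry-run helper. This is only a sketch, not the loading method used in the testufs modulefiles: the function name `print_module_cmds` is hypothetical, the load order is borrowed from the "Currently Loaded Modules" list that appears later in this thread, and skipping the standalone pio module for the internal-PIO build is an assumption. It prints the commands instead of running them.

```shell
# Dry-run helper (hypothetical): prints the module commands for one library
# set instead of executing them.
# Usage: print_module_cmds <netcdf_series: 4.7.4|4.9.1> <pio_mode: internal|external>
print_module_cmds() {
    series=$1
    pio_mode=$2
    base=/lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works
    suffix=modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9
    case $series in
        4.7.4) root=$base/libs2/$suffix; hdf5=1.10.6 ;;
        4.9.1) root=$base/libs/$suffix;  hdf5=1.14.0 ;;
        *) echo "unknown netcdf series: $series" >&2; return 1 ;;
    esac
    case $pio_mode in
        internal) esmf=8.4.0i ;;
        external) esmf=8.4.0 ;;
        *) echo "unknown PIO mode: $pio_mode" >&2; return 1 ;;
    esac
    echo "module use $root/"
    echo "module load hdf5/$hdf5"
    echo "module load netcdf/$series"
    echo "module load fms/2022.04"
    # Assumption: the standalone pio module is only needed for the external-PIO build.
    if [ "$pio_mode" = external ]; then
        echo "module load pio/2.5.10"
    fi
    echo "module load esmf/$esmf"
    echo "module load mapl/2.34.0-esmf-$esmf"
}
```

For example, `print_module_cmds 4.9.1 internal` lists the commands for the netcdf/4.9.1 set with esmf/8.4.0i. The modulefiles under testufs remain the reference for the actual loading sequence.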

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Feb 21, 2023

Testing:
UFS is set up on Acorn for testing, based on the model code provided by Jun and Pan Li. The directory is:
/lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs
It is open for access. The libraries are loaded through the following files:
/lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs/modulefiles
ufs_acorn.intel.lua.4.7.4 (loading libraries with netcdf/4.7.4 series)
ufs_acorn.intel.lua.4.9.1 (loading libraries with netcdf/4.9.1 series)
ufs_acorn.intel.lua.spack (loading libraries with spack installation)

===========Test on acorn=================

  1. Using the netcdf/4.7.4 installations with the internal-PIO esmf/8.4.0i (shown in file ufs_acorn.intel.lua.4.7.4; see the file for the loading details).
    Compile with:
    ./compile.sh acorn.intel "-DAPP=ATM -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON" 001 (ATM only)
    ./compile.sh acorn.intel "-DAPP=ATMAERO -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON" 001 (Aerosol plus)
    ./compile.sh acorn.intel "-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8,FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp" 001 (Full coupled)

    All compiles passed, and the test "./rt.sh -l rt.conf3" was successful.

  2. Still the netcdf/4.7.4 series, loading with the external PIO option:
    esmf/8.4.0 and mapl/2.34.0-esmf-8.4.0 or mapl/2.23.1-esmf-8.4.0.
    All compiles succeeded,
    but the run test failed at line 775 of rt.sh (compile line).

  3. Using the netcdf/4.9.1 installations with the internal-PIO esmf/8.4.0i (shown in file ufs_acorn.intel.lua.4.9.1; see the file for the loading details).
    All compiles succeeded,
    but the run test failed at line 775 of rt.sh (compile line).

  4. Still the netcdf/4.9.1 series, loading with the external PIO option:
    esmf/8.4.0 and mapl/2.34.0-esmf-8.4.0 or mapl/2.23.1-esmf-8.4.0.
    All compiles succeeded,
    but the run test failed at line 775 of rt.sh (compile line).
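The three compile configurations listed under the first test case can be run as one pass/fail sweep per loaded library set. The following is a minimal sketch under stated assumptions: `run_compile_matrix` is a hypothetical helper, not part of ufs-weather-model, and it only wraps the `./compile.sh acorn.intel "<options>" 001` invocations quoted above. The compile command is a parameter so the loop can be exercised with a stub.

```shell
# run_compile_matrix [compile_cmd]: run the three builds in turn and report
# PASS/FAIL per application. Returns the number of failed builds.
run_compile_matrix() {
    compile_cmd=${1:-./compile.sh}
    failures=0
    while IFS='|' read -r label opts; do
        [ -n "$label" ] || continue
        # stdin is redirected so the build cannot consume the option list below
        if "$compile_cmd" acorn.intel "$opts" 001 </dev/null >/dev/null 2>&1; then
            echo "PASS $label"
        else
            echo "FAIL $label"
            failures=$((failures + 1))
        fi
    done <<'EOF'
ATM|-DAPP=ATM -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON
ATMAERO|-DAPP=ATMAERO -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON
S2SWA|-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8,FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp
EOF
    return "$failures"
}
```

Run from the tests directory after loading one of the module sets, `run_compile_matrix` gives a three-line summary. It covers the compile step only; the run-time failures reported here still need `rt.sh`.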

========== Additional other tests=========
1. Test of UFS on Acorn using the spack-stack installations (ufs_acorn.intel.lua.spack):
module use /lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/spack-stack/envs/ufs-weather-model-static/install/modulefiles/intel/19.1.3.304
module use /lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/spack-stack/envs/ufs-weather-model-static/install/modulefiles/cray-mpich/8.1.9/intel/19.1.3.304

The ATM-only build compiled successfully but failed at run time (rt.sh line 775).

2. Repeating all of the Acorn tests on Dogwood gave intermittent passes and failures. Either the system limits the use of self-installed libraries, or the file system is not stable.
On Cactus, I only had time to repeat test 1 before the operational switch. The ATM-only and Aerosol-plus tests passed, but the fully coupled test failed.

======================

Please test and debug on Acorn using our installations. For any questions, please contact the NCEPLIBS group.

@arunchawla-NOAA

@Hang-Lei-NOAA what is the failure?

@Hang-Lei-NOAA
Author

The UFS failure logs are kept in /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs/tests/log_acorn.intel/*
In summary, most of the failures in my tests occur at run time (rt.sh line 775).

Pan Li reported a segmentation fault at run time, which may be different from my failures.
I suspected a conflict in the library loading sequence, so I made my test files open for everyone to access.

@DusanJovic-NOAA
Collaborator

UFS failure are kept in the /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs/tests/log_acorn.intel/* As a summary, most of failures in my run time tests are die on run time (rt.sh 775 line).

Pan Li reported segmentation failure in his run time, may be different from mine. I was suspecting the libraries loading sequence conflict for it, so, I set my test files are open to everyone to access.

I see this in /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_74145/control_CubedSphereGrid/err

+ mpiexec -n 256 -ppn 128 -depth 1 ./fv3.exe
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
fv3.exe            0000000003E2DCBA  Unknown               Unknown  Unknown
libpthread-2.31.s  000014CD6817E8C0  Unknown               Unknown  Unknown
libnetcdff.so.7.1  000014CD6CF79C6D  netcdf_mp_nf90_in     Unknown  Unknown
fv3.exe            00000000006BF4D2  esmf_iogridmosaic         279  ESMF_IOGridmosaic.F90
fv3.exe            000000000077EFE3  esmf_gridmod_mp_e       15925  ESMF_Grid.F90
fv3.exe            0000000001B79D76  module_wrt_grid_c         488  module_wrt_grid_comp.F90
fv3.exe            0000000000DE0804  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
fv3.exe            0000000000DE443A  ESMCI_FTableCallE         824  ESMCI_FTable.C
fv3.exe            00000000013C938B  _ZN5ESMCI3VMK5ent        1124  ESMCI_VMKernel.C
fv3.exe            00000000013B2C49  _ZN5ESMCI2VM5ente        1216  ESMCI_VM.C
fv3.exe            0000000000DE1C47  c_esmc_ftablecall         981  ESMCI_FTable.C
fv3.exe            00000000009C8AF0  esmf_compmod_mp_e        1223  ESMF_Comp.F90
fv3.exe            00000000007C9E86  esmf_gridcompmod_        1412  ESMF_GridComp.F90
fv3.exe            0000000001ACCB97  fv3gfs_cap_mod_mp         529  fv3_cap.F90
fv3.exe            0000000000C3787F  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
fv3.exe            0000000000C377FD  _ZN5ESMCI11Method         563  ESMCI_MethodTable.C
fv3.exe            0000000000C381B1  c_esmc_methodtabl         347  ESMCI_MethodTable.C
fv3.exe            0000000000B32D4B  esmf_attachmethod        1280  ESMF_AttachMethods.F90
fv3.exe            00000000023A83D8  nuopc_modelbase_m         687  NUOPC_ModelBase.F90
fv3.exe            0000000000DE0804  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
fv3.exe            0000000000DE443A  ESMCI_FTableCallE         824  ESMCI_FTable.C
fv3.exe            00000000013C919F  _ZN5ESMCI3VMK5ent        2320  ESMCI_VMKernel.C
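In a trace like the one above, the frames whose Line and Source columns read "Unknown" carry no location, so it can help to list only the frames that do. This is a minimal sketch assuming the five-column Intel traceback layout shown here (Image, PC, Routine, Line, Source); `frames_with_source` is a hypothetical helper name.

```shell
# frames_with_source: read a forrtl traceback on stdin and print
# "Source:Line Routine" for every frame that has a known source location.
frames_with_source() {
    awk 'NF == 5 && length($2) == 16 && $2 ~ /^[0-9A-Fa-f]+$/ && $4 ~ /^[0-9]+$/ {
        print $5 ":" $4, $3
    }'
}
```

Applied to the err file quoted above, for example `frames_with_source < err | head -n 1`, the innermost located frame is the call from ESMF_IOGridmosaic.F90 line 279, directly below the libnetcdff frame.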

@arunchawla-NOAA

Seems like a netcdf error. Is netcdf looking for the pthreads library?

@AlexanderRichert-NOAA
Collaborator

Looks like it uses /lib64/libpthread-2.31.so

@Hang-Lei-NOAA Can you point me to the stack yaml used in your number 3 case (4.9.1 installations and internal esmf/8.4.0i installations)?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Feb 22, 2023 via email

@junwang-noaa
Collaborator

Comments from Gerhard:

Thank you for your replies on my 5 bullet points below. This does sort out a few items for me. I also took a look at the failure you referenced under the other thread (posted by Dusan in this issue/ticket:#1621). So that looks like a SEGV during execution. This makes a lot more sense now to me. I always had understood that there were numerical differences in the written out files, or something of this nature you were hunting down.
So here is how I see the current situation and the path forward from an ESMF perspective:
(A) We definitely encourage you to move away from tag v8.3.0b09. It is an old development snapshot, and it even predates our switch from PIO-1 to PIO-2 with the internal/external option. It seems we are all on the same page with moving away from this tag.
(B) There is the official v8.3.1 release, which might actually work out of the box. In fact, one of the driving forces behind this patch release was to have a working version for UFS. But I am not sure it was ever tried out. On the other hand, by now ESMF has moved on, and it might be a bit of a waste of time to target v8.3.1 at this point.

(C) v8.4.0 is an official release, BUT it has a serious bug!! The bug can lead to memory corruption, and the SEGV you are observing might indicate that you are running into this very issue. We are just about to release a patch, and we do not recommend spending any more time with v8.4.0.

(D) The patch release will be v8.4.1. Our release target for it is end of this month (February). It has the memory corruption bug in v8.4.0 fixed, and a few other small improvements. Otherwise identical to v8.4.0. At this point we recommend you try out our latest release candidate: v8.4.1b07. Knowing how this tag works for ufs-weather-model will give us confidence for the v8.4.1 release!
(E) There is of course also the ESMF develop branch. It is currently in the 8.5.0 beta phase. My recommendation for anything aiming at operational implementation is to NOT use any v8.5.0b tags yet. Let's concentrate on the 8.4.1 patch release. This will also give a good starting point for the future transition to v8.5.0 when that release comes out, or for upcoming development work.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Would you please install the latest ESMF release candidate: v8.4.1b07 so that we can test it in ufs wm? Please let us know when it is ready on acorn. Thanks.

@Hang-Lei-NOAA
Author

@junwang-noaa ESMF/8.4.1b07 and the associated new mapl versions have been added to the netcdf/4.9.1 suite. Please see below, and load the internal or external PIO option for your tests. Please also refer to my testufs folder mentioned above to load the modules in the correct sequence on Acorn:

====netcdf 4.9.1 series========

module use /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/

esmf:
8.4.1b07i.lua (esmf with internal pio option)
8.4.1b07.lua (esmf with external pio option)

fms:
2022.04.lua

hdf5:
1.14.0.lua

mapl:
2.34.0-esmf-8.4.1b07i.lua (esmf with internal pio option)
2.34.0-esmf-8.4.1b07.lua (esmf with external pio option)

netcdf:
4.9.1.lua

pio:
2.5.10.lua

@Hang-Lei-NOAA
Author

@arunchawla-NOAA @junwang-noaa
My UFS tests (ATM; Aerosol plus) with the upgraded set (hdf5/1.14.0; netcdf/4.9.1; esmf/8.4.1b07; PIO/2.5.10; MAPL/2.34.0) passed on Acorn, for both the ESMF internal and external PIO options. The setup is still in my open testufs folder.
Other UFS developers, please do further testing.

@junwang-noaa
Collaborator

junwang-noaa commented Feb 22, 2023

@Hang-Lei-NOAA That's great! Thank you! I assume your testufs code with these module file updates is still at: /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs

@Hang-Lei-NOAA
Author

Yes, Jun. But you have to switch between esmf/8.4.1b07 (external pio) and esmf/8.4.1b07i (internal pio) in the file ../modulefiles/ufs_acorn.intel.lua for your tests.
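Switching between the two variants amounts to rewriting the version string wherever it appears, including in the matching mapl module name. Below is a minimal sketch with `sed`. The helper name `set_pio_variant` is hypothetical, and since the contents of ufs_acorn.intel.lua are not shown in this thread, it assumes the version appears there literally as 8.4.1b07 or 8.4.1b07i.

```shell
# set_pio_variant <internal|external>: filter stdin to stdout, rewriting every
# 8.4.1b07 / 8.4.1b07i version string to the requested variant.
set_pio_variant() {
    case $1 in
        internal) new=8.4.1b07i ;;
        external) new=8.4.1b07 ;;
        *) echo "usage: set_pio_variant internal|external" >&2; return 1 ;;
    esac
    # The optional trailing "i" is matched too, so repeated runs are idempotent.
    sed "s/8\.4\.1b07i\{0,1\}/$new/g"
}
```

A possible use is `set_pio_variant internal < ufs_acorn.intel.lua > ufs_acorn.intel.lua.new`, followed by a diff of the two files before replacing the original.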

@junwang-noaa
Collaborator

I tested the library on acorn. So far the coupled test failed in GOCART (the coupled test without GOCART ran fine). I see the error message:

pe=00098 FAIL at line=03053    Base_Base_implementation.F90             <status=57>
pe=00098 FAIL at line=00685    SU2G_GridCompMod.F90                     <status=57>
pe=00098 FAIL at line=01817    MAPL_Generic.F90                         <status=57>
pe=00098 FAIL at line=00193    BaseProfiler.F90                         <Timer <GOCART2G> does not match start timer <SU>>
pe=00098 FAIL at line=01838    MAPL_Generic.F90                         <status=1>
pe=00098 FAIL at line=00161    Aerosol_GridComp.F90                     <Failed to run child component>
pe=00098 FAIL at line=01817    MAPL_Generic.F90                         <status=1>

It seems to be an error from the MAPL library. I tried updating GOCART to the latest develop branch (revision acc574ff8) in https://github.com/GEOS-ESM/GOCART, but I got the same error.

@tclune @weiyuan-jiang @theurich FYI.

@Hang-Lei-NOAA Is it possible for you to build MAPL 2.23.1 with esmf/8.4.1b07 for us to test? Thanks.
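In the FAIL chain quoted above, most entries only propagate a status code, and the descriptive messages are easy to miss. The filter below is a small sketch that keeps only the entries with a text message. It assumes the `pe=... FAIL at line=... <file> <message>` layout shown in this comment, and `mapl_fail_messages` is a hypothetical name, not a MAPL tool.

```shell
# mapl_fail_messages: read MAPL "FAIL at line=" records on stdin and print
# "file:line: message" for entries whose message is not a bare status code.
mapl_fail_messages() {
    awk '$2 == "FAIL" && $3 == "at" {
        line = $4
        sub(/^line=0*/, "", line)      # drop the "line=" tag and zero padding
        msg = $0
        sub(/^[^<]*</, "", msg)        # strip everything up to the opening "<"
        sub(/>[ \t]*$/, "", msg)       # strip the closing ">"
        if (msg !~ /^status=[0-9]+$/) print $5 ":" line ": " msg
    }'
}
```

On the seven lines above this leaves two entries: the timer mismatch reported from BaseProfiler.F90 and the "Failed to run child component" message from Aerosol_GridComp.F90.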

@Hang-Lei-NOAA
Author

@junwang-noaa mapl/2.23.1 has been installed with internal/external esmf/8.4.1b07 on:
module use /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/
mapl:
2.23.1-esmf-8.4.1b07i.lua (esmf with internal pio option)
2.23.1-esmf-8.4.1b07.lua (esmf with external pio option)

I will reproduce the rest of the installations under netcdf/4.7.4 today.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA I got a library error message when using 2.23.1-esmf-8.4.1b07.lua:

Currently Loaded Modules:
  1) craype-x86-rome     (H)  15) sp/2.3.3
  2) libfabric/1.11.0.0. (H)  16) gftl_shared/1.5.0
  3) craype-network-ofi  (H)  17) w3emc/2.9.2
  4) envvar/1.0               18) crtm/2.4.0
  5) PrgEnv-intel/8.1.0       19) g2tmpl/1.10.2
  6) intel/19.1.3.304         20) hdf5/1.14.0
  7) craype/2.7.13            21) netcdf/4.9.1
  8) cray-mpich/8.1.7         22) fms/2022.04
  9) cmake/3.20.2             23) pio/2.5.10
 10) jasper/2.0.25            24) esmf/8.4.1b07
 11) zlib/1.2.11              25) mapl/2.23.1-esmf-8.4.1b07
 12) libpng/1.6.37            26) ufs_acorn.intel
 13) g2/3.4.5                 27) bacio/2.4.1
 14) ip/3.3.3
...

Calling CCPP code generator (ccpp_prebuild.py) for suites --suites=FV3_GFS_v17_coupled_p8 ...
Force 32-bit build for GOCART
CMake Error at GOCART/CMakeLists.txt:64 (find_package):
  By not providing "FindMAPL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "MAPL", but
  CMake did not find one.

  Could not find a package configuration file provided by "MAPL" with any of
  the following names:

    MAPLConfig.cmake
    mapl-config.cmake

My module file is at: /lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230222/gocart_test/oldmapl/ufs-weather-model/modulefiles/ufs_acorn.intel.lua. I don't have this error when compiling with mapl 2.34.0. Thanks
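One way to narrow down a find_package failure like this is to check whether any directory on `CMAKE_PREFIX_PATH` actually contains one of the two config files CMake names in the error. The sketch below assumes the loaded modules export `CMAKE_PREFIX_PATH`, which is not confirmed in this thread, and `find_pkg_config` is a hypothetical helper. It searches each prefix recursively, which is broader than CMake's own search procedure.

```shell
# find_pkg_config <PackageName>: print every <Name>Config.cmake or
# <lowercase-name>-config.cmake found under the CMAKE_PREFIX_PATH entries.
find_pkg_config() {
    pkg=$1
    lower=$(printf '%s' "$pkg" | tr '[:upper:]' '[:lower:]')
    (
        # Split the colon-separated path list inside a subshell so the
        # caller's IFS is left untouched.
        IFS=:
        for prefix in ${CMAKE_PREFIX_PATH:-}; do
            [ -d "$prefix" ] || continue
            find "$prefix" \( -name "${pkg}Config.cmake" -o -name "${lower}-config.cmake" \) -print
        done
    )
}
```

With the modules loaded, `find_pkg_config MAPL` printing nothing would suggest that the mapl module did not add a usable prefix, which is worth comparing against a working mapl/2.34.0 environment.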

@Hang-Lei-NOAA
Author

@junwang-noaa Please try again. I forgot to set the absolute path, so you hit conflicts with the system environment variables. It has been corrected.

@junwang-noaa
Collaborator

@lipan-NOAA Thanks. Which tests have you run?

@junwang-noaa
Collaborator

@Hang-Lei-NOAA I tried to compile the code with your module file. The first compile job finished successfully, but the second one failed. The compile log file is at:

/lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230531/ufs-weather-model/tests/xxxtcmlp2

@lipan-NOAA
Collaborator

/lfs/h1/emc/couple/noscrub/li.pan/ufs-weather-model/tests/my.conf

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jun 12, 2023

The problem was located last week and reported in the NCO/GDIT ticket Ticket#2023030910000069.
GDIT's build of the code, which followed our instructions, is correct. But they refused to switch to absolute paths in the modulefiles, which caused the test failures. I reset the modulefiles and all regression tests passed.
Please see my script at: /lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua

BUT GDIT has taken no further action on this. It looks like they are on vacation.

@GeorgeVandenberghe-NOAA
Collaborator

@Hang-Lei-NOAA
Author

Bongi from GDIT is on vacation. After I contacted him yesterday, he took extra time out of his vacation to fix the problem in his installation.
I have tested it.
Now everything works fine for the libraries that are already installed.
I updated my script at: /lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua
I am still pushing to finish the mapl installation. @junwang-noaa you can try it now.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Do we have the MAPL library available on Acorn for testing? I am using the module file /lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua, but got the following error:

Lmod has detected the following error: These module(s) or extension(s) exist
but cannot be loaded as requested: "netcdf/4.9.2", "esmf/8.4.2", "fms/2023.01"
Try: "module spider netcdf/4.9.2 esmf/8.4.2 fms/2023.01" to see how to load
the module(s).

While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
mapl/2.35.2-esmf-8.4.2 /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs3/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/mapl/2.35.2-esmf-8.4.2.lua
ufs_acorn.intel /lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230626/ufs-weather-model/modulefiles/ufs_acorn.intel.lua

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jun 26, 2023 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Any updates today?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jun 29, 2023 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Any updates on the library installation on wcoss2?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 3, 2023 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Any progress on the new library installation?
If this takes too long, can you install the test library on acorn under nceplibs so that we can move forward with PR#1745?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 10, 2023 via email

@junwang-noaa
Collaborator

Thanks, Hang. Any updates on wcoss2 library installation?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 10, 2023 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA I tested the new module files in nceplibs on acorn, it is working so far (still waiting for full RT to finish). But without the new libraries, the RT test failed on wcoss2. Do you know when we can have the libraries installed on wcoss2? Thanks

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 11, 2023 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA I can use the nceplibs library on acorn. I'd like to know when the libraries will be installed on wcoss2.

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 11, 2023 via email

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 11, 2023 via email
