Test of UFS on WCOSS2-acorn for latest library updates #1621
Comments
The test libraries are installed on Acorn with absolute paths set in the modulefiles, which follows the system standard.
Potential issue with esmf/8.4.0: starting with this version, ESMF can handle PIO in two ways, using either the PIO bundled internally with ESMF or an externally installed PIO. According to the ESMF code manager, the required PIO version is PIO/2.5.10, and the internal PIO is recommended for UFS use.
=============set based on netcdf 4.7.4 series==========
fms: hdf5: mapl: netcdf: pio:
=============set based on netcdf 4.9.1 series========
fms: hdf5: mapl: netcdf: pio:
======================================== |
Testing: ===========Test on acorn=================
All compilations passed and the test "./rt.sh -l rt.conf3" was successful.
4. Still the netcdf/4.9.1 series, loaded with the external PIO option.
========== Additional tests =========
1. Compiling with the ATM-only option passed, but the run failed (rt.sh, line 775).
2. Repeating all of the Acorn tests on Dogwoods gives intermittent passes and failures. The system may have limitations on using self-installed libraries, or the file system may be unstable.
======================
Please test and debug on Acorn based on our installations. If you have any questions, please contact the NCEPLIBS group. |
@Hang-Lei-NOAA what is the failure? |
The UFS failure logs are kept in /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs/tests/log_acorn.intel/*. Pan Li reported a segmentation fault at run time, which may be different from my failure. |
I see this in /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_74145/control_CubedSphereGrid/err
|
Seems like a netcdf error. Is netcdf looking for pthreads library? |
Looks like it uses /lib64/libpthread-2.31.so @Hang-Lei-NOAA Can you point me to the stack yaml used in your number 3 case (4.9.1 installations and internal esmf/8.4.0i installations)? |
@alexander Richert - NOAA Affiliate ***@***.***> Yes, all libs are installed from the hpc-stack:
/lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/nco_wcoss2/
I also have the spack-stack version, with ESMF/8.4.0 using internal PIO, installed on
/lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/spack-stack/envs/ufs-weather-model-static/install/modulefiles/cray-mpich/8.1.9/intel/19.1.3.304
…On Tue, Feb 21, 2023 at 6:36 PM Alex Richert ***@***.***> wrote:
Looks like it uses /lib64/libpthread-2.31.so
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> Can you point me to the
stack yaml used in your number 3 case (4.9.1 installations and internal
esmf/8.4.0i installations)?
|
Comments from Gerhard:
Thank you for your replies on my 5 bullet points below. This does sort out a few items for me. I also took a look at the failure you referenced under the other thread (posted by Dusan in this issue/ticket: #1621). So that looks like a SEGV during execution. This makes a lot more sense now to me. I always had understood that there were numerical differences in the written out files, or something of this nature you were hunting down.
(C) v8.4.0 is an official release, BUT it has a serious bug!! The bug can lead to memory corruption, and the SEGV you are observing might indicate that you are running into this very issue. We are just about to release a patch, and we do not recommend spending any more time with v8.4.0.
(D) The patch release will be v8.4.1. Our release target for it is end of this month (February). It has the memory corruption bug in v8.4.0 fixed, and a few other small improvements. Otherwise identical to v8.4.0. At this point we recommend you try out our latest release candidate: v8.4.1b07. Knowing how this tag works for ufs-weather-model will give us confidence for the v8.4.1 release! |
@Hang-Lei-NOAA Would you please install the latest ESMF release candidate: v8.4.1b07 so that we can test it in ufs wm? Please let us know when it is ready on acorn. Thanks. |
@junwang-noaa ESMF/8.4.1b07 and the associated new MAPL versions have been added to the netcdf/4.9.1 suite. Please see below, and load the internal or external PIO option for your tests. Please also refer to my testufs folder mentioned above to load the modules in the correct sequence on acorn.
====netcdf 4.9.1 series========
module use /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/
esmf: fms: hdf5: mapl: netcdf: pio: |
@arunchawla-NOAA @junwang-noaa |
@Hang-Lei-NOAA That's great! Thank you! I assume your testufs code with these module file updates is still at: /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/testufs |
Yes, Jun. But you have to switch between esmf/8.4.1b07 (external pio) and esmf/8.4.1b07i (internal pio) in the file ../modulefiles/ufs_acorn.intel.lua for your tests. |
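Switching between the two builds can be scripted. The following is a minimal sketch, assuming the modulefile refers to the library with the literal string `esmf/8.4.1b07` or `esmf/8.4.1b07i` (the actual contents of `ufs_acorn.intel.lua` are not shown in this thread). The function name `switch_esmf_pio` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every esmf/8.4.1b07[i] reference in a modulefile
# so that it points at the internal-PIO ("i") or the external-PIO build.
# Usage: switch_esmf_pio <modulefile> internal|external
switch_esmf_pio() {
    file=$1
    case $2 in
        internal) target='esmf/8.4.1b07i' ;;
        external) target='esmf/8.4.1b07' ;;
        *) echo "usage: switch_esmf_pio <modulefile> internal|external" >&2
           return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # Match the version with or without the trailing "i", then write the target.
    if sed -E "s#esmf/8\.4\.1b07i?#${target}#g" "$file" > "$tmp"; then
        # Copy the contents back so the original file permissions are kept.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

For example, `switch_esmf_pio ../modulefiles/ufs_acorn.intel.lua internal` would select the internal-PIO build before recompiling, under the assumption stated above.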
I tested the library on acorn. So far the coupled test failed in GOCART (the coupled test without GOCART ran fine). I see the error message:
It seems to be an error from the MAPL library. I tried updating GOCART to the latest develop branch (revision acc574ff8) in https://github.com/GEOS-ESM/GOCART, but I got the same error. @tclune @weiyuan-jiang @theurich FYI. @Hang-Lei-NOAA Is it possible for you to build MAPL 2.23.1 with esmf/8.4.1b07 for us to test? Thanks. |
@junwang-noaa mapl/2.23.1 has been installed with internal/external esmf/8.4.1b07 on: I will reproduce all of the other installations under netcdf/4.7.4 today. |
@Hang-Lei-NOAA I got a library error message when using the 2.23.1-esmf-8.4.1b07.lua modulefile:
My module file is at: /lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230222/gocart_test/oldmapl/ufs-weather-model/modulefiles/ufs_acorn.intel.lua. I don't have this error when compiling with mapl 2.34.0. Thanks |
@junwang-noaa Please try it again. I forgot to set the absolute path, so you ran into conflicts with the system environment variables. It has been corrected. |
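Setting the absolute path in a modulefile can be done with a small script instead of by hand. This is only a sketch: it assumes the Lua modulefile defines its install prefix on a line of the form `local base = ...`, and the function name `set_local_base` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace the `local base = ...` line of a Lua modulefile
# with a fixed absolute path, leaving every other line untouched.
# Usage: set_local_base <modulefile> <absolute-install-path>
set_local_base() {
    file=$1
    base=$2
    case $base in
        /*) ;;
        *) echo "set_local_base: path must be absolute: $base" >&2
           return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    if awk -v base="$base" '
        /^[ \t]*local[ \t]+base[ \t]*=/ {
            print "local base = \"" base "\""
            next
        }
        { print }
    ' "$file" > "$tmp"; then
        # Copy the contents back so the original file permissions are kept.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

Hard-coding the prefix this way means the modulefile no longer depends on environment variables that the system modules may also set, which is the kind of conflict described in the comment above.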
@lipan-NOAA Thanks. which tests have you run? |
@Hang-Lei-NOAA I tried to compile the code with your module file. The first compile job finished successfully, but the second one failed. The compile log file is at: /lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230531/ufs-weather-model/tests/xxxtcmlp2 |
/lfs/h1/emc/couple/noscrub/li.pan/ufs-weather-model/tests/my.conf |
The problem was located last week and reported in the NCO/GDIT ticket Ticket#2023030910000069, but GDIT has taken no further action on it. It looks like they are on vacation. |
It's probably time to start estimating how many quarters (one or two, it looks to me) the milestones for UFS-Weather-Model development will slip because of the slow installation of required dependencies on WCOSS2.
…On Mon, Jun 12, 2023 at 2:07 PM Hang-Lei-NOAA ***@***.***> wrote:
The problem was located last week and replied to NCO/GDIT ticket
Ticket#2023030910000069
The code build of GDIT following our instruction is correct. But they
refuse to change the absolute path in modulefiles, which caused the test
failure. I reset the modulefiles and all regression test passed.
Please see my script at:
/lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua
BUT no further actions from GDIT is made on this. Looks like they are in
vacation.
--
George W Vandenberghe
*Lynker Technologies at * NOAA/NWS/NCEP/EMC
|
Bongi from GDIT is on vacation. After I contacted him yesterday, he took extra time out of his vacation to fix the problem in his installation. |
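@Hang-Lei-NOAA Do we have the MAPL library available on acorn for testing? I am using the module file /lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua, but got the following error:
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "netcdf/4.9.2", "esmf/8.4.2", "fms/2023.01"
Try: "module spider netcdf/4.9.2 esmf/8.4.2 fms/2023.01" to see how to load the module(s).
While processing the following module(s):
Module fullname Module Filename
mapl/2.35.2-esmf-8.4.2 /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs3/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/mapl/2.35.2-esmf-8.4.2.lua
ufs_acorn.intel /lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230626/ufs-weather-model/modulefiles/ufs_acorn.intel.lua |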
@jun Wang - NOAA Federal ***@***.***> Bongi is still on vacation.
Your problem is caused by changes made by WCOSS2 management. This has happened many times before.
The installation folder is gone for the time being. You cannot see
"/apps/test/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__e-8.4.2/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12"
now, but management will bring it back sometime later.
***@***.***:~> ls /apps/test
grads lmodules modules python-modules test_rsync_then_delete
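Because the installation directory can disappear and reappear, it may help to check for it before calling `module use`. A minimal sketch; the function name `modulepath_available` is hypothetical and not part of Lmod or hpc-stack.

```shell
#!/bin/sh
# Hypothetical helper: report whether a modulefile directory is currently visible.
# Returns 0 if the directory exists and is readable, 1 otherwise.
modulepath_available() {
    dir=$1
    if [ -d "$dir" ] && [ -r "$dir" ]; then
        echo "available: $dir"
        return 0
    fi
    echo "missing: $dir" >&2
    return 1
}

# Intended use on a system with environment modules (not executed here):
#   p=/apps/test/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__e-8.4.2/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12
#   modulepath_available "$p" && module use "$p"
```

A missing directory is one possible cause of the Lmod "exist but cannot be loaded as requested" message quoted in this thread, since the dependent modules can no longer be found on the module path.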
…On Mon, Jun 26, 2023 at 3:56 PM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> Do we have MAPL library
available on acorn for testing? I am using the module file
/lfs/h1/emc/eib/noscrub/Hang.Lei/ufs-weather-model/modulefiles/ufs_acorn.intel.lua,
but got following error:
Lmod has detected the following error: These module(s) or extension(s)
exist
but cannot be loaded as requested: "netcdf/4.9.2", "esmf/8.4.2",
"fms/2023.01"
Try: "module spider netcdf/4.9.2 esmf/8.4.2 fms/2023.01" to see how to load
the module(s).
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
mapl/2.35.2-esmf-8.4.2
/lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/libs3/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/mapl/2.35.2-esmf-8.4.2.lua
ufs_acorn.intel
/lfs/h1/emc/nems/noscrub/jun.wang/ufs-wm/20230626/ufs-weather-model/modulefiles/ufs_acorn.intel.lua
|
@Hang-Lei-NOAA Any updates today? |
Bongi is back. He is working on this.
…On Thu, Jun 29, 2023 at 10:31 AM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> Any updates today?
|
@Hang-Lei-NOAA Any updates on the library installation on wcoss2? |
I pushed them last Friday and again today. It is still being worked on.
The ticket number is Ticket#2023030910000069.
…On Mon, Jul 3, 2023 at 11:31 AM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> Any updates on the
library installation on wcoss2?
|
@Hang-Lei-NOAA Any progress on the new library installation? |
Hi Jun,
I installed the testing libs into the EMC-maintained hpc-stack weeks ago:
module use /lfs/h1/emc/nceplibs/noscrub/hpc-stack/libs/hpc-stack/modulefiles/stack
…On Mon, Jul 10, 2023 at 11:52 AM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> Any progress on the new
library installation?
If this takes too long, can you install the test library on acorn under
nceplibs so that we can move forward with PR#1745
<#1745>?
|
Thanks, Hang. Any updates on wcoss2 library installation? |
Bongi is still working on it.
…On Mon, Jul 10, 2023 at 12:00 PM Jun Wang ***@***.***> wrote:
Thanks, Hang. Any updates on wcoss2 library installation?
|
@Hang-Lei-NOAA I tested the new module files in nceplibs on acorn, it is working so far (still waiting for full RT to finish). But without the new libraries, the RT test failed on wcoss2. Do you know when we can have the libraries installed on wcoss2? Thanks |
@jun Wang - NOAA Federal ***@***.***> They just got them recovered on acorn:
module purge
module load envvar
module use /apps/test/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__e-8.4.2/modulefiles/stack
module load hpc
module load hpc-intel
module load craype cray-mpich hpc-cray-mpich
module load hdf5 netcdf esmf mapl pio
module -t list 2>&1 | while read line; do module show $line 2>&1 | sed -n -e '2p'; done | sort
module avail
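The pipeline near the end of the command above takes the terse list of loaded modules and, for each one, keeps only the second line of `module show`, then sorts the result. The sketch below isolates that pattern so it can be tried without a module system. It assumes, as the original command does, that the second line of the show output is the line of interest (on Lmod this is normally the modulefile path). The function name `second_lines_sorted` and its optional command argument are hypothetical.

```shell
#!/bin/sh
# Sketch: for each name read from stdin, run a "show" command and keep only the
# second line of its output, then sort the collected lines.
# $1 is the command to run per name; it defaults to `module show`, which is only
# available on a system with environment modules.
second_lines_sorted() {
    show_cmd=${1:-module show}
    while read -r name; do
        # Skip blank lines in the input list.
        [ -n "$name" ] || continue
        # $show_cmd is left unquoted on purpose so "module show" splits in two.
        $show_cmd "$name" 2>&1 | sed -n -e '2p'
    done | sort
}

# Intended use on a system with environment modules (not executed here):
#   module -t list 2>&1 | second_lines_sorted
```

Passing the command as an argument makes it possible to substitute a stand-in function when testing away from Acorn.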
…On Tue, Jul 11, 2023 at 11:00 AM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> I tested the new module
files in nceplibs on acorn, it is working so far (still waiting for full RT
to finish). But without the new libraries, the RT test failed on wcoss2. Do
you know when we can have the libraries installed on wcoss2? Thanks
|
@Hang-Lei-NOAA I can use the nceplibs library on acorn. I'd like to know when the libraries will be installed on wcoss2. |
@jun Wang - NOAA Federal ***@***.***> I have asked Bongi to do it as soon as possible and gave a demo on Dogwoods last week. I can only push; I don't know what they are waiting for.
…On Tue, Jul 11, 2023 at 11:34 AM Jun Wang ***@***.***> wrote:
@Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> I can use the nceplibs
library on acorn. I'd like to know when the libraries will be installed on
wcoss2.
|
This is a summary of the whole demo process on Dogwoods. Anyone with an account can repeat it.
The following is the complete process I used for a fresh installation on Dogwoods.
=================
1005 mkdir forgdit
1006 cd forgdit/
1007 mkdir install
1008 ls
1009 git clone https://github.com/NOAA-EMC/hpc-stack.git nco_wcoss2
1010 cd nco_wcoss2/
1011 git checkout nco-wcoss2
1012 ls
1013 mkdir -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install
1014 ./setup_modules.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh
1015 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_hdf5_v1_14_0.yaml -m
1016 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/hdf5/1.14.0.lua
1017 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_netcdf_v4_9_2.yaml -m
1018 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/netcdf/4.9.2.lua
1019 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_pio_v2_5_10.yaml -m
1020 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/pio/2.5.10.lua
1021 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_fms_2023_01.yaml -m
1022 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/fms/2023.01.lua
1023 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_esmf_v8_4_2.yaml -m
1024 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/esmf/8.4.2.lua
1025 module use /apps/dev/lmodules/core
1026 module load ecbuild/3.7.0
1027 module use /apps/dev/lmodules/intel/19.1.3.304
1028 module load gftl_shared/1.5.0
1029 module load yafyaml/1.0.4
1030 ./build_stack.sh -p /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install -c config/config_nco_wcoss2.sh -y stack/stack_mapl_v2_35_2.yaml -m
1031 vi /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9/mapl/2.35.2-esmf-8.4.2.lua
1032 history
========== Interpretation of the steps ===============
1005-1008: make a working folder.
1009-1011: check out the code.
1013: make sure the installation directory (<prefix>) exists.
1014: set up the hpc-stack environment; answer NO to all questions.
1015-1016: install hdf5 and modify the path for "local base" in its modulefile.
1017-1018: install netcdf and modify the path for "local base" in its modulefile.
1019-1020: install pio and modify the path for "local base" in its modulefile.
1021-1022: install fms and modify the path for "local base" in its modulefile.
1023-1024: install ESMF and modify the path for "local base" in its modulefile.
1025-1031: preload the system-installed support libs, then install mapl and modify the path for "local base" in its modulefile.
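The build steps above follow a fixed order, so the `build_stack.sh` calls can be generated rather than typed one by one. The sketch below only prints the commands (a dry run) and reuses the config and yaml file names from the history. The function name `print_build_commands` is hypothetical, and the sketch leaves out the manual modulefile edits and the `module load` steps (1025-1029) that precede the mapl build.

```shell
#!/bin/sh
# Sketch: print the build_stack.sh invocations, in the order used in the demo,
# for a given install prefix. Nothing is built; the commands are only echoed.
# Usage: print_build_commands <install-prefix>
print_build_commands() {
    prefix=$1
    config=config/config_nco_wcoss2.sh
    # Order follows the demo: later libraries depend on earlier ones
    # (for example netcdf on hdf5, and mapl on esmf).
    for yaml in \
        stack_hdf5_v1_14_0.yaml \
        stack_netcdf_v4_9_2.yaml \
        stack_pio_v2_5_10.yaml \
        stack_fms_2023_01.yaml \
        stack_esmf_v8_4_2.yaml \
        stack_mapl_v2_35_2.yaml
    do
        echo "./build_stack.sh -p $prefix -c $config -y stack/$yaml -m"
    done
}
```

Running `print_build_commands /lfs/h2/emc/eib/save/Hang.Lei/forgdit/install` from the `nco_wcoss2` checkout would list the six commands for review before any of them is run by hand.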
On Tue, Jul 11, 2023 at 11:41 AM Hang Lei - NOAA Affiliate <
***@***.***> wrote:
… @jun Wang - NOAA Federal ***@***.***> I have asked Bongi to do it
as soon as possible and had made demo on dogwoods last week. I can only
push. Don't know what they are waiting for.
On Tue, Jul 11, 2023 at 11:34 AM Jun Wang ***@***.***>
wrote:
> @Hang-Lei-NOAA <https://github.com/Hang-Lei-NOAA> I can use the nceplibs
> library on acorn. I'd like to know when the libraries will be installed on
> wcoss2.
|
Description
At the request of UFS code manager Jun, we installed updated versions of the following libraries on WCOSS2 machines for testing. However, not all of the results are successful. I first post the installations here, and then include my test results and sample loading methods.
The purpose of this ticket is further testing and debugging.
To Reproduce:
What compilers/machines are you seeing this with?
WCOSS2 acorn
Additional context
Library updates:
hdf5/1.10.6 => hdf5/1.14.0
netcdf/4.7.4 => netcdf/4.9.1
esmf/8.3.0b09 => esmf/8.4.0
mapl/2.23.1 => mapl/2.34.0
All base supporting libs use the system-installed libraries.