Compiling with hpc-stack a WW3 wave model program produces empty grib2 files #137

Closed
RobertoPadilla-NOAA opened this issue Jan 12, 2021 · 97 comments · Fixed by #160
Labels
bug Something isn't working

Comments

@RobertoPadilla-NOAA

Describe the bug
Using hpc-stack to compile the program that produces grib2 files from the wave component of the coupled system builds the executable successfully, but that executable produces empty grib2 files.
If hpc-stack is not used and the modules are loaded separately, the executable produces valid grib2 files.

To Reproduce
==Install and run the test as described at
https://github.com/NOAA-EMC/global-workflow/blob/feature/coupled-crow/README.md
== Except for the first steps "Checkout the source code and scripts"
== Use the following instructions
git clone https://github.com/Jessica-Meixner-NOAA/global-workflow coupled-workflow
cd coupled-workflow
git checkout feature/p5ww3post
git submodule update --init --recursive #Update submodules

==Follow the instructions.
cd sorc
sh checkout.sh coupled # Check out the coupled code, EMC_post, gsi, ...
sh build_ncep_post.sh #This command will build ncep_post
sh build_ww3prepost.sh #This command will build ww3 prep and post exes
sh build_fv3_coupled.sh #This command will build ufs-s2s-model
sh build_reg2grb2.sh #This command will build exes for ocean-ice post
=To link fixed files and executable programs for the coupled application:
=On Hera:
sh link_fv3gfs.sh emc hera coupled
=On Orion:
sh link_fv3gfs.sh emc orion coupled
cd ../workflow
cp user.yaml.default user.yaml
=Then, open and edit user.yaml:
=EXPROOT: Place for experiment directory, make sure you have write access.
=FIX_SCRUB: True if you would like to fix the path to ROTDIR (under COMROOT) and RUNDIR (under DATAROOT); False if you would like CROW to detect available disk space automatically. *** Please use FIX_SCRUB: True on Hera/Orion until further notice (2020/03)
=COMROOT: Place to generate ROTDIR for this experiment.
=DATAROOT: Place for temporary storage for each job of this experiment.
=cpu_project: cpu project that you are working with.
=hpss_project: hpss project that you are working with.

==IMPORTANT: the next step differs from the README linked above
=On Hera:
./setup_case.sh -p HERA ../cases/coupled_free_forecast_wave.yaml test2d
=On Orion:
./setup_case.sh -p ORION ../cases/coupled_free_forecast_wave.yaml test2d

=This will create an experiment directory ($EXPERIMENT_DIRECTORY). In the current example, $EXPERIMENT_DIRECTORY=$EXPROOT/test2d.

=For ORION: First make sure you have python loaded:
module load contrib
module load rocoto #Make sure to use 1.3.2
module load intelpython3

./make_rocoto_xml_for.sh $EXPERIMENT_DIRECTORY

=Run the model using the workflow
cd $EXPERIMENT_DIRECTORY
module load rocoto
=Run rocotorun several times until all processes are done
rocotorun -w workflow.xml -d workflow.db
=Check the status of your test
rocotostat -w workflow.xml -d workflow.db

=You'll find the grib2 files in the directory you created:
cd $COMROOT/test2d/gfs.20130401/00/wave/gridded
wgrib2 -V gfswave.t00z.global.0p50.f000.grib2

=You'll see that for every wave variable the min, average, and max are the same value.
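
The constant-field symptom from the last step can be flagged automatically instead of reading the listing by eye. This is a minimal sketch, not part of the workflow: flag_constant_records is a hypothetical helper name, and it assumes the wgrib2 -V listing has the shape shown elsewhere in this thread (a record header line starting with number:offset:, followed by a statistics line that contains min= and max=).

```shell
# Read `wgrib2 -V` text on stdin and print the header of every record
# whose min and max are identical (a constant-valued field).
flag_constant_records() {
  awk '
    /^[0-9]+:[0-9]+:/ { header = $0 }      # remember the latest record header
    /min=/ && /max=/ {
      min = ""; max = ""
      n = split($0, fields, ":")
      for (i = 1; i <= n; i++) {
        if (fields[i] ~ /^[ \t]*min=/) { min = fields[i]; sub(/^[ \t]*min=/, "", min) }
        if (fields[i] ~ /^[ \t]*max=/) { max = fields[i]; sub(/^[ \t]*max=/, "", max) }
      }
      # Compare numerically so that 0 and 0.0 count as the same value.
      if (min != "" && max != "" && min + 0 == max + 0) print "CONSTANT: " header
    }
  '
}
```

Usage with the file from the steps above: wgrib2 -V gfswave.t00z.global.0p50.f000.grib2 | flag_constant_records. A file in which every record is flagged matches the behavior reported here; a few flagged records on their own may be legitimate.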

Expected behavior
Produce valid grib2 files from the wave model using the modules from the hpc-stack.

System:
Hera and Orion

@RobertoPadilla-NOAA added the bug label Jan 12, 2021
@JessicaMeixner-NOAA

@RobertoPadilla-NOAA can we give them a smaller test case where they don't have to run the whole workflow?

Also, the updates from my fork that you pointed them to have long since gone back into feature/coupled-crow. I'd prefer that people not use the fork anymore.

@aerorahul
Contributor

Please provide a single standalone script to reproduce the behavior.

@kgerheiser
Contributor

kgerheiser commented Jan 12, 2021

Which hpc-stack are you using? I just ran wgrib2 -V on a random grib2 file I have and it returned what I think are correct answers.

There was a bug in our wgrib2 build, but it has since been fixed in a newer version of hpc-stack.

Here is my output:

53:103480536:vt=2021011206:surface:anl:HPBL Planetary Boundary Layer Height [m]:
    ndata=4718592:undef=0:mean=564.815:min=17.8054:max=4805.09
    grid_template=40:winds(N/S):
	Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
	number of latitudes between pole-equator=768 #points=4718592
	lat 89.910324 to -89.910324
	lon 0.000000 to 359.882813 by 0.117188

54:109292212:vt=2021011206:surface:anl:LAND Land Cover (0=sea, 1=land) [Proportion]:
    ndata=4718592:undef=0:mean=0.337744:min=0:max=1
    grid_template=40:winds(N/S):
	Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
	number of latitudes between pole-equator=768 #points=4718592
	lat 89.910324 to -89.910324
	lon 0.000000 to 359.882813 by 0.117188

55:109385913:vt=2021011206:surface:anl:ICEC Ice Cover [Proportion]:
    ndata=4718592:undef=0:mean=0.108133:min=0:max=1
    grid_template=40:winds(N/S):
	Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
	number of latitudes between pole-equator=768 #points=4718592
	lat 89.910324 to -89.910324
	lon 0.000000 to 359.882813 by 0.117188

@JessicaMeixner-NOAA

@kgerheiser those are atm grib files, not wave grib files. I've tried with both 1.0.0 and 1.1.0, without success.

@kgerheiser
Contributor

I thought it might be an issue with the -V option of the executable, but if it's still broken with v1.1.0 then that's a problem. I thought that might fix it.

@RobertoPadilla-NOAA
Author

RobertoPadilla-NOAA commented Jan 12, 2021 via email

@kgerheiser
Contributor

The command used to create the grib file would be a good start

@JessicaMeixner-NOAA

No @RobertoPadilla-NOAA, but you made multiple tests without the workflow, so I assumed you have something. I think we want the test to be super small and simple: just the binary output from a model run, the ww3_grib.inp file, and a simple script for building the ww3_grib exe they need to run, one way with the non-hpc-stack modules that work and one way with the hpc-stack modules.

@RobertoPadilla-NOAA
Author

OK @kgerheiser @aerorahul, I'll get back to you once I have the small test ready.

@RobertoPadilla-NOAA
Author

@kgerheiser @aerorahul I was working on the canned test for you on Hera, but now the hpc-stack modules cannot be found. This is probably related to this morning's data loss (do you know if this is true?).
On Orion, looking in detail, hpc-stack itself was not the issue; it was the version of the jasper module. After I replaced jasper/2.0.15 with jasper/1.900.1, ww3_grib works properly using the hpc-stack.

@kgerheiser
Contributor

If they weren't working before they seem to be working now. I just tried loading the modules on Hera.

@climbfuji
Contributor

> If they weren't working before they seem to be working now. I just tried loading the modules on Hera.

I had success loading them on one of the login nodes (hfe11), but compiling on the compute nodes failed. Maybe some compute nodes lost their /scratch1 mounts?

@kgerheiser
Contributor

@RobertoPadilla-NOAA you changed Jasper/2.0.15 to Jasper/1.900.1, or 1.900.1 to 2.0.15? That's something that should be investigated.

@RobertoPadilla-NOAA
Author

@kgerheiser on Orion I changed jasper/2.0.15 to jasper/1.900.1 in order to build ww3_grib properly.

@RobertoPadilla-NOAA
Author

On Hera I'm working on scratch1 and compiling on the login nodes hfe04 and hfe10, and loading hpc-stack fails.

@kgerheiser
Contributor

In what way does it fail?

I'm on hfe04 on scratch1 at /scratch1/NCEPDEV/nems/Kyle.Gerheiser

I run:

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel
etc

And it works.

@RobertoPadilla-NOAA
Author

I don't know what is happening.
On Hera, several days ago I was using this file
/scratch1/NCEPDEV/stmp2/Roberto.Padilla/GitHub/WW3_hpc-satck_test/modulefiles/modulefile.ww3.hera_Original
to load the hpc-stack. It contains
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack
module load hpc/1.0.0
It was working.
Now that path doesn't work; the one you sent has an extra "hpc-stack" in the path.
Also notice the hpc module version: 1.0.0 was loading.
That file (modulefile.ww3.hera_Original) was loading all modules; the issue was that it produced a ww3_grib executable
that produced empty grib2 files.

OK, now I changed the path and am loading the modules on the command line:
[Roberto.Padilla@hfe10 Run_test]$ module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
[Roberto.Padilla@hfe10 Run_test]$ module load hpc/1.1.0
[Roberto.Padilla@hfe10 Run_test]$ module load hpc-intel/18.0.5.274
[Roberto.Padilla@hfe10 Run_test]$ module load hpc-impi/2018.0.4
[Roberto.Padilla@hfe10 Run_test]$ module load jasper/2.0.15
Lmod has detected the following error: The following module(s) are unknown: "jasper/2.0.15"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load "jasper/2.0.15"

Also make sure that all modulefiles written in TCL start with the string #%Module

Thanks,
Roberto

@climbfuji
Contributor

Please note that there was a filesystem problem last night, resulting in about 45TB of corrupted (that is, lost) data.

@RobertoPadilla-NOAA
Author

[Roberto.Padilla@hfe10 Run_test]$ module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
[Roberto.Padilla@hfe10 Run_test]$ module load bacio/2.4.0
Lmod has detected the following error: The following module(s) are unknown: "bacio/2.4.0"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load "bacio/2.4.0"
Lmod has detected the following error: The following module(s) are unknown: "g2/3.4.0"
Lmod has detected the following error: The following module(s) are unknown: "ip/3.3.0"
Lmod has detected the following error: The following module(s) are unknown: "nemsio/2.5.1"

@aerorahul
Contributor

@RobertoPadilla-NOAA
bacio needs the intel compiler module loaded.

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module avail
module load hpc/1.1.0
module load hpc-intel
module load bacio/2.4.1
module list

Currently Loaded Modules:
  1) hpc/1.1.0   2) intel/18.0.5.274   3) hpc-intel/18.0.5.274   4) bacio/2.4.1

@RobertoPadilla-NOAA
Author

@climbfuji, yes, that was my question in my first comment today: is the filesystem failure affecting the loading of hpc-stack?
None of the modules are loading using
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack

@aerorahul
Contributor

> @climbfuji, yes, that was my question in my first comment today: is the filesystem failure affecting the loading of hpc-stack?
> None of the modules are loading using
> module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack

@RobertoPadilla-NOAA
You are not using the modules correctly.
You are pointing at the stack with module use, but you still need to load the modules with module load, in hierarchical order.

The correct use of hpc-stack and the software stack underneath it is outlined here
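
A minimal sketch of that hierarchical order, using the Hera path and versions quoted elsewhere in this thread. The function name load_hpc_stack and the MODULE_CMD variable are hypothetical, added only so the sequence can be dry-run; the point is the ordering, in which libraries such as bacio become loadable only after the compiler module is loaded.

```shell
# Load the stack top-down: stack path, hpc meta-module, compiler, MPI,
# and only then the libraries built against that compiler and MPI.
# MODULE_CMD is a hypothetical knob: set it to "echo module" for a dry run.
load_hpc_stack() {
  local run
  run=${MODULE_CMD:-module}
  # $run is intentionally unquoted so "echo module" splits into two words.
  $run use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
  $run load hpc/1.1.0
  $run load hpc-intel/18.0.5.274
  $run load hpc-impi/2018.0.4
  $run load bacio/2.4.1
  $run load g2/3.4.1
}
```

Running MODULE_CMD="echo module" load_hpc_stack prints the commands without touching the environment; calling load_hpc_stack with MODULE_CMD unset runs the real module command.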

@kgerheiser
Contributor

The new version of hpc-stack has also updated some libraries (bacio, for example, is now version 2.4.1), which is why it's not finding the versions you specified. The updated libraries should have no effect on your code and should not change results (they are mainly build system changes).

@JessicaMeixner-NOAA

@RobertoPadilla-NOAA do I need to help make the test case or a file using the new module versions of hpc-stack?

@kgerheiser
Contributor

kgerheiser commented Jan 15, 2021

I'm not sure about what you were originally using at /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack as there's nothing there anymore. I didn't touch it, and it's on scratch2 which supposedly isn't affected by the data loss.

That seems to be an old version of hpc-stack (which would also contribute to your wgrib2 problem).

I would update to the most recent version of hpc-stack if you can, if only to get the updated wgrib2.

@RobertoPadilla-NOAA
Author

@JessicaMeixner-NOAA if you can help make the file with the new module versions of hpc-stack, that would be great. Thanks.

@kgerheiser
Contributor

kgerheiser commented Jan 15, 2021

Try this:

#%Module######################################################################
## module for ww3 before base uses hpc-stack
module use /contrib/sutils/modulefiles
module load sutils

module load cmake/3.16.1

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack

module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4

module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35

module load hdf5/1.10.6
module load netcdf/4.7.4
module load esmf/8_1_0_beta_snapshot_27

module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.9.1
module load ip/3.3.3
module load upp/10.0.0
module load nemsio/2.5.2
module load sp/2.3.3
module load w3emc/2.7.3
module load w3nco/2.4.1

@RobertoPadilla-NOAA
Author

@kgerheiser only g2 failed to load:
Lmod has detected the following error: The following module(s) are unknown: "g2/3.4.0"

@aerorahul
Contributor

@RobertoPadilla-NOAA
The g2 version is 3.4.1 in @kgerheiser's message ⬆️
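
Stale version pins like this one can be spotted before the build by comparing what a modulefile requests with what the stack offers. This is a rough sketch: requested_modules and missing_modules are hypothetical helper names, and the list of available modules is simply read from stdin as one name/version per line.

```shell
# List every name/version requested by "module load" lines in a modulefile.
requested_modules() {
  awk '$1 == "module" && $2 == "load" {
         for (i = 3; i <= NF; i++) {
           if ($i ~ /^#/) break          # stop at a trailing comment
           if ($i ~ /\//) print $i       # keep only name/version tokens
         }
       }' "$1"
}

# Compare those requests with a list of available modules read from stdin
# (one name/version per line) and report the ones that are not available.
missing_modules() {
  local available mod
  available=$(cat)
  requested_modules "$1" | while read -r mod; do
    printf '%s\n' "$available" | grep -qxF -- "$mod" || echo "not available: $mod"
  done
}
```

One possible way to feed it, assuming Lmod's terse listing prints one name/version per line on stderr, is module -t avail 2>&1 | missing_modules modulefile.ww3.hera_Original, run after hpc, hpc-intel, and hpc-impi are loaded so that the compiler-dependent libraries are visible.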

@kgerheiser
Contributor

Looks like there's a new release of Jasper, 2.0.25, with the fix. I think we should update to that immediately, and we'll continue to look at phasing out Jasper.

@kgerheiser
Contributor

kgerheiser commented Feb 11, 2021

@JessicaMeixner-NOAA or @WalterKolczynski-NOAA would you try out my nightly build of hpc-stack (develop)? I just want to make sure that the fix works before we install it everywhere.

Hera: /scratch1/NCEPDEV/stmp2/Kyle.Gerheiser/hpc-stack/nightly-develop/install/modulefiles/stack

Orion: /work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/modulefiles/stack

I have it built on Hera and Orion and it contains Jasper 2.0.25.

@WalterKolczynski-NOAA

@kgerheiser What about WCOSS Dell?

@kgerheiser
Contributor

I don't have a test build on there at the moment. I can do one if you like. I have a cron job set to build and test hpc-stack, but cron doesn't work on WCOSS Dell.

@WalterKolczynski-NOAA

On Hera:

  • The current hpc/1.1.0 has esmf/8_1_0_beta_snapshot_36, but the nightly build only has snapshot 27. Don't know if this will be an issue.
  • WGRIB_LIB and WGRIB2_LIBAPI are not set correctly. This had previously been fixed by Hang, but it looks like those changes didn't make it back.

@kgerheiser
Contributor

The ESMF thing doesn't matter.

That's a good catch. We recently fixed that in the code so it wasn't hardcoded, but wgrib2 was missed. I have fixed it in the existing build.

@WalterKolczynski-NOAA

> I don't have a test build on there at the moment. I can do one if you like. I have a cron job set to build and test hpc-stack, but cron doesn't work on WCOSS Dell.

I've never had a problem with cron on WCOSS Dell. Are you using the mycrontab file?

@kgerheiser
Contributor

No, how do I do that?

@WalterKolczynski-NOAA

> No, how do I do that?

In your home directory, there should be a cron directory with a file named mycrontab inside. Works just like editing a normal crontab, except it will automatically be turned on/off when production switches (and you don't have to play 'which login node did I put the cron job on?').

@JessicaMeixner-NOAA

@WalterKolczynski-NOAA do you have the testing done? I could run my quick test set-up for this case on Orion if that would help. Should I just switch out jasper, or did you want me to use the whole hpc-stack from the nightly build?

@kgerheiser
Contributor

Just use the whole hpc-stack. Everything should work.

@WalterKolczynski-NOAA

On Hera, a bunch more wrong envvar libs:

  • landsfcutil/2.4.1
  • ip/3.3.3
  • sp/2.3.3
  • w3nco/2.4.1
  • bacio/2.4.1

@WalterKolczynski-NOAA

> @WalterKolczynski-NOAA do you have the testing done? I could run my quick test set-up for this case on Orion if that would help. Should I just switch out jasper, or did you want me to use the whole hpc-stack from the nightly build?

I'm trying to get everything built and set up now.

@kgerheiser
Contributor

Yep, just realized that would happen. Sorry about that. I fixed them.

@WalterKolczynski-NOAA

It looks like Orion has the same lib variable issues.

@JessicaMeixner-NOAA

@kgerheiser in my test on Orion, I'm finding that the files pointed to by the following two variables, which I use when building the model:

G2_LIB4=/work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/intel-2018.4/g2/3.4.1/lib64/libg2_4.a
W3NCO_LIB4=/work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/intel-2018.4/w3nco/2.4.1/lib64/libw3nco_4.a

don't actually exist.

The modules I used:
module load contrib noaatools
module load cmake/3.17.3
module use /work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load netcdf/4.7.4
module load esmf/8_1_0_beta_snapshot_27
module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.9.1
module load ip/3.3.3
module load nceppost/dceca26
module load sp/2.3.3
module load w3emc/2.7.3
module load w3nco/2.4.1
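
The symptom reported in this comment (library variables such as G2_LIB4 and W3NCO_LIB4 pointing at files that are not there) can be checked right after the modules are loaded. This is a minimal sketch: check_lib_vars is a hypothetical helper name, and it assumes the modules export these variables into the environment.

```shell
# For each named environment variable, report whether it is unset,
# points at a missing file, or points at an existing file.
# Returns non-zero if any variable is unset or its file is missing.
check_lib_vars() {
  local var path rc
  rc=0
  for var in "$@"; do
    path=$(printenv "$var" || true)   # empty when the variable is not exported
    if [ -z "$path" ]; then
      echo "UNSET   $var"
      rc=1
    elif [ ! -f "$path" ]; then
      echo "MISSING $var=$path"
      rc=1
    else
      echo "OK      $var=$path"
    fi
  done
  return "$rc"
}
```

Usage after loading the modules listed above: check_lib_vars G2_LIB4 W3NCO_LIB4. A MISSING line for a path ending in lib64 would match what is described here.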

@kgerheiser
Contributor

I believe I have fixed all the modules in both of the builds. I also put in a PR #163 to fix it.

@WalterKolczynski-NOAA

@JessicaMeixner-NOAA looks like the pio version has to be updated to 2.5.2 as well

@kgerheiser
Contributor

PIO 2.5.1 will also be there, but 2.5.2 is now the version we're moving to. Feel free to remain on 2.5.1 for now.

@WalterKolczynski-NOAA

It isn't available in the nightly build, which makes it difficult to test without changing.

@JessicaMeixner-NOAA

> It isn't available in the nightly build, which makes it difficult to test without changing.

I don't need pio to test, but I do need the libraries to exist so I can link to them and test ww3_grib.

@WalterKolczynski-NOAA

I needed it to build the model.

@WalterKolczynski-NOAA

I've successfully built on both Hera and Orion using the nightly build.

@JessicaMeixner-NOAA

@WalterKolczynski-NOAA what do you use for G2_LIB4 and W3NCO_LIB4 ?

@WalterKolczynski-NOAA

I didn't make any changes to the model except the jasper and pio versions. The build log says:

G2_LIB4=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/g2/3.4.1/lib/libg2_4.a
W3NCO_LIB4=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/w3nco/2.4.1/lib/libw3nco_4.a

@JessicaMeixner-NOAA

I ran a test on Orion and it worked for my test case. @kgerheiser, sorry it took a while.

WalterKolczynski-NOAA added a commit to WalterKolczynski-NOAA/global-workflow that referenced this issue Feb 18, 2021
Updates the jasper version to fix a bug that was causing constant-valued
grib files for wave output.

Refs: NOAA-EMC#161 NOAA-EMC#164 NOAA-EMC/hpc-stack#137
7 participants