Compile errors on orion and hera with develop #1450

Closed
JessicaMeixner-NOAA opened this issue Oct 5, 2022 · 22 comments
Labels
bug Something isn't working

Comments

@JessicaMeixner-NOAA
Collaborator

Description

When running the ufs-weather-model develop branch (hash e6da626), I get a failure for most of the compile jobs on orion (@pjpegion and others have gotten similar errors) and for compile 011 on hera (@MatthewMasarik-NOAA gets the same error).

To Reproduce:

Check out the develop branch, then run ./rt.sh -e (from the ecflow server on hera).

Additional context

I know that develop worked for me on orion last week. I have not tried to backtrack through versions yet, as I'm curious whether this is a larger issue.

Output

Orion:
Code on orion is here: /work2/noaa/marine/jmeixner/ufs-develop/tests
rt dir: /work2/noaa/marine/jmeixner/stmp/jmeixner/FV3_RT/rt_445868

Main error is not being able to find crtm:

CMake Error at FV3/upp/CMakeLists.txt:48 (find_package):
  By not providing "Findcrtm.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "crtm", but
  CMake did not find one.
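
CMake's config-mode find_package(crtm) looks for a file named crtmConfig.cmake or crtm-config.cmake under the prefixes it searches, so one quick check is whether any entry in CMAKE_PREFIX_PATH actually contains such a file. The following is a minimal sketch only: the helper name find_crtm_config is hypothetical, and it assumes the loaded modules export CMAKE_PREFIX_PATH as a colon-separated list, which may not match how crtm is exposed on orion.

```shell
# Hypothetical helper: print every crtm package-config file found under a
# colon-separated list of prefixes (defaults to $CMAKE_PREFIX_PATH).
# Returns 1 when nothing is found.
find_crtm_config() {
  local prefixes hits
  prefixes="${1:-${CMAKE_PREFIX_PATH:-}}"
  hits=$(
    printf '%s\n' "$prefixes" | tr ':' '\n' | while IFS= read -r prefix; do
      [ -d "$prefix" ] || continue
      # Config-mode find_package accepts either of these file names.
      find "$prefix" \( -name 'crtmConfig.cmake' -o -name 'crtm-config.cmake' \) \
        2>/dev/null || true
    done
  )
  [ -n "$hits" ] || return 1
  printf '%s\n' "$hits"
}

# Usage, after loading the build modules:
#   find_crtm_config || echo "no crtm config file under CMAKE_PREFIX_PATH"
```

If this prints nothing after the modules are loaded, the crtm install is likely not visible to CMake from that environment, which would be consistent with the error above.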

Hera:
code: /scratch1/NCEPDEV/climate/Jessica.Meixner/ufs-weather-model/tests
rt dir: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239

Main error:

Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced.   [CDATA]
               ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2

@JessicaMeixner-NOAA added the bug label on Oct 5, 2022

@JessicaMeixner-NOAA
Collaborator Author

@jkbk2004 I have retested on both hera and orion and am still having this same issue.

@WalterKolczynski-NOAA

I can confirm the same error on Orion when I call tests/compile.sh directly to attempt a build for global-workflow.

@jkbk2004
Collaborator

@JessicaMeixner-NOAA I reset the permissions on the whole hpc-stack directory on orion. Can you give it a try?

@MatthewMasarik-NOAA
Collaborator

Hi @jkbk2004, @JessicaMeixner-NOAA is away until Wednesday. I can test on orion and let you know the outcome.

@MatthewMasarik-NOAA
Collaborator

@jkbk2004 I just tested on orion and I get the same error.

@jkbk2004
Collaborator

@MatthewMasarik-NOAA @BrianCurtis-NOAA @ChunxiZhang-NOAA @zach1221 can you take a look at /work/noaa/epic-ps/jongkim/4debug? As err.log shows, I am able to load the modules, including crtm, without problems. Can you try running the jobs_card I put there, so that we can check whether module loading works for everyone?

@MatthewMasarik-NOAA
Collaborator

@jkbk2004 I copied that directory and submitted the job_card. Here is the output of err.log (out.log is empty):

[matma@Orion-login-1 4debug]$ cat err.log 
++ date +%s
+ echo -n ' 1665579328,'
+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"ufs_common"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "ufs_common"

Also make sure that all modulefiles written in TCL start with the string
#%Module

P.S. I don't have access to account=nems, so I set account=marine-cpu.

@BrianCurtis-NOAA
Collaborator

Make sure you run module use modulefiles && module load ufs_<machine>.<compiler> (e.g. ufs_hera.intel).

@MatthewMasarik-NOAA
Collaborator

> make sure you module use modulefiles && module load ufs_<machine>.<compiler> (i.e. ufs_hera.intel)

Is this message for me, @BrianCurtis-NOAA?

If I try that (module use modulefiles && module load ufs_orion.intel) in the directory I copied, I get an Lmod error saying "ufs_orion.intel" is unknown.
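
Because module use modulefiles names a directory relative to where the command is run, the load can only succeed if that directory holds a modulefile with the requested name. A quick pre-check, sketched below, would show whether the copied directory contains one. The helper name has_modulefile is hypothetical, and the sketch assumes a modulefile is either a plain file named after the module (TCL) or the same name with a .lua suffix (Lmod's Lua format).

```shell
# Hypothetical helper: report whether a directory contains a modulefile
# for the given module name, as either "<name>" (TCL) or "<name>.lua".
has_modulefile() {
  local dir="$1" name="$2"
  if [ -f "$dir/$name" ] || [ -f "$dir/$name.lua" ]; then
    echo "found modulefile for $name in $dir"
  else
    echo "no modulefile for $name in $dir" >&2
    return 1
  fi
}

# Usage, from the directory where "module use modulefiles" would be run:
#   has_modulefile modulefiles ufs_orion.intel \
#     && module use modulefiles && module load ufs_orion.intel
```

A failure here would suggest the copied debug directory simply has no modulefiles directory of its own, as opposed to a problem with Lmod or the hpc-stack install.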

@JessicaMeixner-NOAA
Collaborator Author

@BrianCurtis-NOAA @jkbk2004 @MatthewMasarik-NOAA I am back from leave and can try this again today. Brian, I have a question about the module load you said we should do: I don't have to do this on other machines, and I've never had to do it on orion before.

@BrianCurtis-NOAA
Collaborator

@JessicaMeixner-NOAA @MatthewMasarik-NOAA @jkbk2004 Sorry for the confusion. I ran into an issue on Orion while testing a build with the develop branch: I had to load git/2.28.0 before git would pull everything cleanly without error.

For more context on the module use and module load, here is how I set up my environment for running the RTs.

git clone git@github.com:ufs-community/ufs-weather-model --recursive
cd ufs-weather-model
module use modulefiles
module load ufs_<machine>.<compiler>
cd tests
./rt.sh -e > rt.out 2>&1 &

On Orion, at least, running module load git/2.28.0 let git pull successfully, in case that was an issue you saw as well.

@JessicaMeixner-NOAA
Collaborator Author

I tried loading the git module last week, and that did not solve my issue either. I've never had to load the ufs modules for any other machine...

@BrianCurtis-NOAA
Collaborator

> I've tried loading the git module last week and that did not solve my issue either. I've never had to load the ufs modules for any other machine...

rt.sh should automatically do it, yes.

@JessicaMeixner-NOAA
Collaborator Author

I was able to run on orion this morning with the latest ufs version. I'll try hera now.

@JessicaMeixner-NOAA
Collaborator Author

Compile 11 on hera is still failing for me

@RatkoVasic-NOAA
Collaborator

Same thing for me: it fails when compiling with the DEBUG option on Hera.

Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced.   [CDATA]
               ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2
make: *** [all] Error 2

I cloned a fresh copy of ufs-weather-model.

@JessicaMeixner-NOAA
Collaborator Author

I tried again today and compile 011 is still failing for me on hera. Orion has been working for me, and I was going to test it again, but there are /work issues.

@DusanJovic-NOAA
Collaborator

@JessicaMeixner-NOAA One option is to reduce the number of CCPP SDFs (suite definition files) in the FV3/ccpp/suites directory to only those actually used by the regression tests. The failing test does not explicitly list suites, so it tries to build them all. There are currently more than 90 suite definitions there, and the regression tests use only about a third of them.
Can you run this script:
/scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/suites_run.sh
in the FV3/ccpp/suites directory of your working copy, and then rerun that test?
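
The contents of suites_run.sh are not shown in this thread, so the following is only a sketch of the general idea and not that script. It assumes a hypothetical helper keep_suites and a hypothetical unused/ subdirectory, and it moves aside every suite_*.xml file that is not named on the command line instead of deleting it, so the change is easy to undo.

```shell
# Hypothetical helper: keep only the named suite definition files in a
# directory and move every other suite_*.xml into "<dir>/unused".
keep_suites() {
  local dir="$1"
  shift
  local f base keep wanted
  mkdir -p "$dir/unused"
  for f in "$dir"/suite_*.xml; do
    # When nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    base=$(basename "$f")
    keep=no
    for wanted in "$@"; do
      if [ "$base" = "$wanted" ]; then
        keep=yes
      fi
    done
    if [ "$keep" = no ]; then
      mv "$f" "$dir/unused/"
    fi
  done
}

# Usage (file names here are placeholders, not real suite names):
#   keep_suites FV3/ccpp/suites suite_A.xml suite_B.xml
```

Moving the files back out of unused/ restores the original set of suites.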

@RatkoVasic-NOAA
Collaborator

I just ran with 90 of the 91 SDFs and it worked. I only excluded the first one on the list (suite_FV3_CPT_v0.xml). I still have no idea why this is happening, and only to a few of us.

@ChunxiZhang-NOAA
Contributor

suite_FV3_CPT_v0.xml is a deprecated SDF. Reducing the number of SDFs in the suites directory is a good option, and we could make it happen soon.

@JessicaMeixner-NOAA
Collaborator Author

@DusanJovic-NOAA - after running your script first, the regression tests succeeded. Like @RatkoVasic-NOAA, I'm still wondering why this is happening to only a few of us.

@jkbk2004
Collaborator

jkbk2004 commented Mar 9, 2023

The git/2.28.0 module requirement (on orion) is case-by-case. If there is a git clone issue, the newer version clearly resolves it. I am closing this issue. If the problem persists, we can re-open it.

8 participants