cpld_control_p8 fails in same spot on Gaea C5 & Hercules w/ spack-stack #1791
Comments
@mathomp4 fyi
@ulmononian Do you have the full log file available?
@mathomp4 yes. i've attached both the
@ulmononian Can you compile and run with debugging flags on? Because that is not the most elucidating traceback... Or maybe compile MAPL with debugging? I have to imagine there is more to that traceback.
@mathomp4 i can quickly do
@mathomp4 sorry for the delay on this. i did run
Huh. You are dying in History at: which is some pretty old code. Can you share the HISTORY.rc you are using? I might need to call in @bena-nasa (our History expert) to see if maybe the file is oddly set up.
circling back here, @mathomp4 and i discussed (offline) that the history file in question is https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/gocart/AERO_HISTORY.rc.IN. it appears that
That is quite bizarre. Your History file looks fine, you are not doing anything fancy, and there's nothing I see there that could explain this error. It's almost like this is some sort of compiler bug/memory bug. You are failing on a call that is simply passing an optional string to a procedure; it's like the memory of the string being passed there is fubarred.
thanks for looking into this @bena-nasa! so far, we have not successfully been able to run the coupled model (S2SWA - ocean + atm w/ waves and aerosols enabled) on hercules or c5. these are machines we are just beginning to use and so are trying to get the weather model working on each.
as for what is different between these failing runs on hercules/c5 and successful runs: it is only the stack. we use a common set of libraries on each machine (so hercules/c5 libraries are identical in version to the other machines where the model works), but the compiler/mpi versions differ between machines. there is nothing different in the model setup/configuration or input data.
we have only run into issues w/ compiler/mpi version on NCAR cheyenne, where the gnu/openmpi combination simply does not work with the aerosol model. it could easily be that we are seeing some unknown compiler/mpi issues with hercules/c5 when the aerosol model is turned on. but given that the model fails in the same spot on each of the machines, i am more inclined to believe it could be a memory issue as you suggested, as the compiler/mpi pair is different on the two machines. i will look at some node adjustments/stack size settings and see if we can make any progress there. thanks so much!
@mathomp4 following up on the compiler versions on hercules and gaea c5: hercules: i can get more info about these if it is helpful. please correct me if i misunderstood you, but you were saying that GEOS/MAPL has shown issues running w/ intel compiler versions newer than 2021.9.1?
@ulmononian I think so, yes, but I need to refresh my memory. At the moment, our Intel version of choice is Intel Classic ifort 2021.6.0. I do see we have intel 2021.7 on our cluster, so I can try that out and see, maybe it shows the issue... The Intel MPI version you use on hercules is good. As for cray-mpich, 🤷🏼 , but if it worked before, probably will now.
unfortunately, we do not have access to compilers older than those listed in my previous comment. do you know if support for the newer intel compilers is planned, and if so, if there's a timeline for that support? perhaps this should be tracked as an issue on the mapl github...
this issue was resolved by upgrading intel compiler version to intel/2023.1.0 so that ifort/2021.7.1 is not used (see JCSDA/spack-stack#675 and JCSDA/spack-stack#673).
@jkbk2004 the issue is resolved and could be closed!
Description
strangely, cpld_control_p8 fails during the model run step in the same place on gaea c5 and hercules when using spack-stack/1.4.0, i.e. here:
to me, the only meaningful line in the err file is:
108: fv3.exe 0000000001D9F59B aerosol_cap_mp_mo 348 Aerosol_Cap.F90
and some mentions of libmpi.so and libc.so (toward the very end).
rundirs are (hercules) /work2/noaa/epic-ps/cbook/HERCULES/add_hercules/rt_403736/cpld_control_p8_intel and (gaea c5) /lustre/f2/dev/role.epic/sandbox/cam_tests/test_c5/rt_232052/cpld_control_p8_intel.
the mapl and esmf versions on each machine are the same (i.e. 2.35.2 and 8.4.2). the specific module env on c5 is:
and on hercules:
cpld_control_noaero_p8 works on both machines.
i tried updating the gocart hash and parm/gocart files that @junwang-noaa updates in #1745, but this did not resolve the issue.
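For anyone digging through a long err file for a line like the fv3.exe one quoted above, here is a minimal Python sketch of a filter. It is not part of the UFS or MAPL tooling. It assumes each traceback frame has the same column layout as that single quoted line (image, hex address, routine, line number, source file) and that the leading "108:" is a per-task label such as an MPI rank. The names Frame, parse_frame and model_frames are hypothetical helpers made up for this illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Frame(NamedTuple):
    """One traceback frame; the field layout is assumed from the quoted line."""
    rank: Optional[int]  # assumed to be a task/rank label, e.g. the "108:" prefix
    image: str           # executable or library name, e.g. fv3.exe
    pc: str              # hex program counter
    routine: str         # (possibly truncated) routine name
    line: int            # source line number
    source: str          # source file name


# Optional "NNN:" prefix, then image, hex address, routine, line number, source.
FRAME_RE = re.compile(
    r"^\s*(?:(?P<rank>\d+):\s+)?"
    r"(?P<image>\S+)\s+"
    r"(?P<pc>[0-9A-Fa-f]{8,16})\s+"
    r"(?P<routine>\S+)\s+"
    r"(?P<line>\d+)\s+"
    r"(?P<source>\S+)\s*$"
)


def parse_frame(text: str) -> Optional[Frame]:
    """Return a Frame if the line matches the assumed layout, else None."""
    m = FRAME_RE.match(text)
    if m is None:
        return None
    rank = m.group("rank")
    return Frame(
        rank=int(rank) if rank is not None else None,
        image=m.group("image"),
        pc=m.group("pc"),
        routine=m.group("routine"),
        line=int(m.group("line")),
        source=m.group("source"),
    )


def model_frames(lines: Iterable[str], image: str = "fv3.exe") -> List[Frame]:
    """Keep only frames from the given image that carry a numeric line number."""
    frames = []
    for text in lines:
        frame = parse_frame(text)
        if frame is not None and frame.image == image:
            frames.append(frame)
    return frames


if __name__ == "__main__":
    quoted = "108: fv3.exe 0000000001D9F59B aerosol_cap_mp_mo 348 Aerosol_Cap.F90"
    for frame in model_frames([quoted]):
        print(frame.rank, frame.routine, frame.source, frame.line)
```

Frames without a numeric line number do not match the pattern and are dropped, so what remains is only the model frames that point at a source line. To scan a real file, pass an open file handle to model_frames in place of the list.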
To Reproduce:
on gaea:
for hercules, just change the branch to -b feature/add_hercules
Additional context
Needs to be fixed before #1707, #1733, or #1784 can be merged.