Add support for MSU Hercules #1732

Closed
ulmononian opened this issue May 1, 2023 · 18 comments · Fixed by #1733
Assignees
Labels
enhancement New feature or request

Comments

@ulmononian
Collaborator

Description

The WM needs to be updated to support MSU's new HPC, Hercules.

Solution

Update all necessary files to enable WM functionality on Hercules. spack-stack/1.3.1 is currently being installed, so testing can begin there shortly.
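
For reference, a hedged sketch of the kinds of files involved, based on the Hercules-specific files discussed later in this thread; exact paths are illustrative and the definitive list will be in the support PR:

# illustrative only -- the authoritative file list is in the support PR
# new-platform support in the WM typically touches files along these lines:
#   modulefiles/ufs_hercules.intel.lua     (build modulefile, modeled on the existing ufs_orion.intel.lua)
#   tests/fv3_conf/fv3_slurm.IN_hercules   (Slurm job_card template)
#   tests/rt.sh and tests/rt_utils.sh      (machine detection and job submission/monitoring logic)
grep -rn hercules modulefiles/ tests/ | head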

Relates to

PR #1707, Issue #1651

@ulmononian ulmononian added the enhancement New feature or request label May 1, 2023
@ulmononian ulmononian mentioned this issue May 2, 2023
@ulmononian
Collaborator Author

Some sacct commands are not working as expected on Hercules (on login nodes 1,3,4, at least). Both of the commands sacctmgr show assoc where user=$USER format=cluster,partition,account,user%20,qos%60 and sacctmgr show assoc where user=$USER format=account are returning:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:hercules-mn-1:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused

Simply entering sacct produces the same error.

Since some sacct commands are run automatically in the ufs-weather-model regression test framework (e.g., in the submit_and_wait function of rt_utils.sh, which submits the compile job_card and checks its status), the job can't get past the compile step (even when the compile itself succeeds) because the failing sacct commands cause rt.sh to abort.
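
For context, a minimal sketch of the kind of sacct-based status poll the RT framework performs (illustrative only, not the actual rt_utils.sh code; job_card and the polling interval are placeholders):

# illustrative sketch of an sacct-based job status poll (not the actual rt_utils.sh code)
jobid=$(sbatch --parsable job_card)
while true; do
  # when sacct itself errors out (as above), a check like this fails and rt.sh aborts
  state=$(sacct -n -X -j "${jobid}" --format=State) || exit 1
  case ${state} in
    *COMPLETED*) break ;;
    *FAILED*|*CANCELLED*|*TIMEOUT*) exit 1 ;;
    *) sleep 30 ;;
  esac
done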

@ulmononian
Collaborator Author

sacct issue resolved itself as of 5/9/23. however, it was suggested that loading noaatools before using the command may be necessary...
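
for reference, the suggested workaround would amount to something like the following (module names as mentioned in this thread; untested here):

# load the site modules before calling sacct (suggested workaround, not verified)
module load contrib noaatools
sacct -u ${USER}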

@ulmononian
Collaborator Author

some issues with the newly installed rocoto. debugging efforts underway. error screencap below

[screenshot: rocoto error output, 2023-05-09]

@natalie-perlin
Collaborator

natalie-perlin commented May 12, 2023

@jkbk2004 @ulmononian
(Originally posted in NOAA-EMC/hpc-stack#521 (comment) )

A new hpc-stack has been built on Hercules with the following compiler and MPI modules:
intel-oneapi-compilers/2022.2.1
intel-oneapi-mpi/2021.7.1

The stack can be loaded as follows:

module use /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2_hrcs/modulefiles/stack
module load hpc
module load hpc-intel-oneapi-compilers
module load hpc-intel-oneapi-mpi

Please see below the modules provided by the stack, as listed by "module avail":

---- /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2_hrcs/modulefiles/mpi/intel-oneapi-compilers/2022.2.1/intel-oneapi-mpi/2021.7.1 ----
   atlas/ecmwf-0.24.1        fms/2022.04                      ncio/1.1.2             upp/10.0.10
   crtm/2.4.0                hdf5/1.10.6               (D)    nemsio/2.5.4    (D)    wrf_io/1.2.0
   eckit/ecmwf-1.16.0        madis/4.3                        nemsiogfs/2.5.3
   esmf/8.3.0b09      (D)    mapl/2.22.0-esmf-8.3.0b09        netcdf/4.7.4    (D)
   fckit/ecmwf-0.9.2         ncdiag/1.0.0                     pio/2.5.7
   
---- /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2_hrcs/modulefiles/compiler/intel-oneapi-compilers/2022.2.1 ----
   bacio/2.4.1                          ip/3.3.3                 sfcio/1.4.1
   bufr/11.7.0                          jasper/2.0.25     (D)    sigio/2.3.2
   g2/3.4.5                             jpeg/9.1.0               sp/2.3.3
   g2c/1.6.4                            landsfcutil/2.4.1        szip/2.1.1
   g2tmpl/1.10.0                        libpng/1.6.37            udunits/2.2.28 (D)
   gfsio/1.4.1                          metplus/4.1.3            w3emc/2.9.2
   gftl-shared/v1.5.0                   nccmp/1.8.9.0     (D)    w3nco/2.4.1    (D)
   grib_util/1.2.4                      nco/5.0.6         (D)    yafyaml/v0.5.1
   gsl/2.7.1                     (D)    nemsio/2.5.4             zlib/1.2.11    (D)
   hdf5/1.10.6                          netcdf/4.7.4
   hpc-intel-oneapi-mpi/2021.7.1 (L)    prod_util/1.2.2

Feel free to test it if possible. There could be errors with ESMF at run time; let me know if any errors occur!

@MichaelLueken
Collaborator

@natalie-perlin I was able to create a new modulefile for Hercules from the ufs_orion.intel.lua modulefile. The file can be found at /work/noaa/epic-ps/mlueken/ufs-weather-model/modulefiles/ufs_hercules.intel.lua. After making the necessary changes to run the RTs on Hercules, I submitted the tests, but they failed while trying to compile. The message in the err file is:

CMake Error at CMakeLists.txt:143 (find_package):
  By not providing "FindNetCDF.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "NetCDF", but
  CMake did not find one.

  Could not find a package configuration file provided by "NetCDF" (requested
  version 4.7.4) with any of the following names:

    NetCDFConfig.cmake
    netcdf-config.cmake

  Add the installation prefix of "NetCDF" to CMAKE_PREFIX_PATH or set
  "NetCDF_DIR" to a directory containing one of the above files.  If "NetCDF"
  provides a separate development package or SDK, be sure it has been
  installed.

I do see that the netcdf/4.7.4 module is loaded:

Currently Loaded Modules:
  1) rocoto/1.3.5                         14) esmf/8.3.0b09
  2) contrib/0.1                          15) fms/2022.04
  3) hpc/1.2.0                            16) bacio/2.4.1
  4) intel-oneapi-compilers/2022.2.1      17) crtm/2.4.0
  5) hpc-intel-oneapi-compilers/2022.2.1  18) g2/3.4.5
  6) intel-oneapi-mpi/2021.7.1            19) g2tmpl/1.10.2
  7) hpc-intel-oneapi-mpi/2021.7.1        20) ip/3.3.3
  8) jasper/2.0.25                        21) sp/2.3.3
  9) zlib/1.2.11                          22) w3emc/2.9.2
 10) libpng/1.6.37                        23) gftl-shared/v1.5.0
 11) hdf5/1.10.6                          24) mapl/2.22.0-esmf-8.3.0b09
 12) netcdf/4.7.4                         25) ufs_common
 13) pio/2.5.7                            26) ufs_hercules.intel

Any clarification on why the ufs-weather-model might not be accepting netcdf/4.7.4 would be greatly appreciated!

Please see /work/noaa/stmp/mlueken/stmp/mlueken/FV3_RT/rt_2447792 for the test output.
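
As a quick sanity check, one thing to confirm is whether the loaded netcdf actually exposes one of the CMake config files named in the error, and whether CMake can see its install prefix (${NETCDF} below is an assumption about what the netcdf module sets; adjust to however the stack exposes the prefix):

# sanity check (illustrative; ${NETCDF} assumed to point at the netcdf install prefix)
echo ${NETCDF}
find ${NETCDF} -name NetCDFConfig.cmake -o -name netcdf-config.cmake 2>/dev/null
# if a config file exists but CMake still cannot find it, expose the prefix explicitly:
export CMAKE_PREFIX_PATH=${NETCDF}:${CMAKE_PREFIX_PATH}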

@ulmononian
Collaborator Author

ulmononian commented May 17, 2023

@MichaelLueken would you be willing to open up permissions to your RT path? i haven't been testing with hpc-stack on hercules (focusing on spack-stack), but i'd be happy to take a look for you. one thing to note is that esmf versions prior to 8.4.1 have been shown to cause segfault failures on hercules, so i would suggest updating the esmf/mapl versions for your hpc-stack tests.

if you're interested in testing spack-stack on hercules (currently 112/126 RTs pass w/ spack-stack/1.3.1 there), feel free to check out my fork branch at #1733.

@MichaelLueken
Collaborator

@ulmononian I thought that I opened up the permissions to my RT path -
drwxr-Sr-x 3 mlueken stmp 4096 May 17 10:59 mlueken

I did a second pass and it looks like everything in my stmp path should be opened now.
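
For reference, opening up read access on the whole tree would look something like this (base path as above; flags illustrative):

# recursively grant read/traverse access to group and others on the RT output tree
chmod -R go+rX /work/noaa/stmp/mlueken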

I'm certainly willing to test your fork's branch at #1733 as well!

@ulmononian
Collaborator Author

ulmononian commented May 17, 2023

just to note: sacct issue re-emerged on hercules login-3. this is with contrib and noaatools loaded.

[cbook@hercules-login-3 log_hercules.intel]$ sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:hercules-mn-1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

this issue causes the job status checks in the RT scripts to fail. for example, even if the compile step completes successfully, the run step won't proceed because the job monitoring logic hits this sacct error and rt.sh aborts.

@ulmononian
Collaborator Author

ulmononian commented May 17, 2023

@MichaelLueken i think i found the cause of the netcdf issue you mentioned: it looks like in /work/noaa/epic-ps/mlueken/ufs-weather-model, the CMakeModules submodule directory is empty (along with some other submodule directories). perhaps the checkout/recursive clone did not complete properly?

@ulmononian
Collaborator Author

ulmononian commented May 17, 2023

@MichaelLueken i can confirm that after doing git submodule update --init --recursive within my copy of /work/noaa/epic-ps/mlueken/ufs-weather-model, the model compiles as expected.
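
for anyone else hitting this, the symptom is empty submodule directories (e.g. CMakeModules/) after cloning; either clone recursively up front or initialize the submodules afterwards (repo URL shown for illustration):

# clone with all submodules in one step
git clone --recurse-submodules https://github.com/ufs-community/ufs-weather-model
# or, inside an existing clone whose submodule directories are empty:
git submodule update --init --recursive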

@zach1221 zach1221 moved this from Todo to In Progress in Backlog: platforms and RT May 23, 2023
@ulmononian
Collaborator Author

for some reason, the srun command w/ --mpi=pmi2 that was added to fix previously reported srun errors started failing with

+ srun --label --mpi=pmi2 -n 160 ./fv3.exe
 80: slurmstepd: error: mpi/pmi2: value not properly terminated in client request
 80: slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
 80: slurmstepd: error: mpi/pmi2: full request is: 0000000000000000000000000000000000000000000000
 80: cmd=put kvsname=18323.0 key=bc-80-seg-2/3 value=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 80: slurmstepd: error: mpi/pmi2: invalid client request
 80: slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
 80: slurmstepd: error: mpi/pmi2: full request is: 

to fix this, export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so was added to the fv3_slurm.IN_hercules file. the model will now run again w/ srun.
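
for reference, the relevant lines in the job template now look roughly like this (sketch only; the surrounding fv3_slurm.IN_hercules content is omitted and the task count is taken from the log above):

# workaround added to fv3_slurm.IN_hercules (sketch)
export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so
srun --label --mpi=pmi2 -n 160 ./fv3.exe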

@MichaelLueken
Collaborator

Hi @ulmononian, with respect to the requirement to include --mpi=pmi2 in the srun command, have you opened a Hercules helpdesk ticket to see if they would be able to build the MPI on the machine with this flag turned on? This seems to be something that the sys admins should include in the MPI build, rather than require users to add these extra steps into the srun command or adding export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so to the fv3_slurm.IN_hercules file.

@ulmononian
Collaborator Author

ulmononian commented Jun 9, 2023

> Hi @ulmononian, with respect to the requirement to include --mpi=pmi2 in the srun command, have you opened a Hercules helpdesk ticket to see if they would be able to build the MPI on the machine with this flag turned on? This seems to be something that the sys admins should include in the MPI build, rather than require users to add these extra steps into the srun command or adding export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so to the fv3_slurm.IN_hercules file.

@MichaelLueken this is a great question and point. i am not sure why this is necessary on the user side, so i can certainly put a helpdesk ticket in to inquire. i should note that i have only tested the ufs-wm w/ spack-stack-built executables on hercules, so i am not sure whether this is a general issue or spack-specific. i should also note that the MSU hercules docs explicitly mention using --mpi=pmi2 as the transport for srun, but it is unclear to me whether this applies only to openmpi or to the intel mpi family as well.

perhaps more bizarre to me is that, for a month or more, exporting I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so was not needed in conjunction with --mpi=pmi2; this just started happening, ostensibly at random. exporting I_MPI_PMI_LIBRARY was an ad-hoc fix i tried, and not one suggested by MSU, but it allows the model to run, so i went with it for now.
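
one way to see what the site's slurm actually supports and defaults to (without waiting on a ticket) is to ask slurm directly:

srun --mpi=list                             # MPI plugin types supported by this slurm build
scontrol show config | grep -i MpiDefault   # site-wide default MPI plugin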

@ulmononian
Collaborator Author

@MichaelLueken i contacted msu hercules helpdesk. i will let you know what i hear!

@MichaelLueken
Collaborator

@ulmononian Thanks! I'm definitely interested in hearing why they want users to add --mpi=pmi2 to their srun command.

@ulmononian
Collaborator Author

> @ulmononian Thanks! I'm definitely interested in hearing why they want users to add --mpi=pmi2 to their srun command.

i forgot to follow up on this, but the hercules team made some changes and --mpi=pmi2 is no longer needed in srun commands. i tested without it and the model runs as expected. with this, exporting I_MPI_PMI_LIBRARY is also no longer needed.

@ulmononian
Collaborator Author

i opened an issue on the mapl repo regarding newer intel compiler/mapl compatibility issues: GEOS-ESM/MAPL#2213

@ulmononian
Collaborator Author

cpld_control_p8 passes when the RT is run against a test installation of spack-stack that was built w/ intel 2023.1.0 (which includes ifort 2021.9.0 rather than ifort 2021.7.1).

the msu sys admins recently (7/14/23) installed intel 2023 by request, as the mapl team had indicated that there were known issues with ifort 2021.7.1 (the fortran compiler included w/ the original intel 2022 installation on hercules).
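
for reference, a quick way to confirm which fortran compiler a given module set provides (module name below is illustrative):

module load intel-oneapi-compilers/2023.1.0   # illustrative module name for the new intel 2023 install
ifort --version                               # expect ifort (IFORT) 2021.9.0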

@github-project-automation github-project-automation bot moved this from In Progress to Done in Backlog: platforms and RT Sep 20, 2023