
After #75, Compsets using MPAS-O on GPUs fail during run-time #84

Open
gdicker1 opened this issue Dec 4, 2024 · 17 comments

Comments


gdicker1 commented Dec 4, 2024

Though #75 contains an initial GPU port of MPAS-Ocean, tests that run MPAS-O compsets on GPUs now fail at run-time: some fields get NaNs and MPAS-O aborts.

GPU builds tested at the ewm-2.3.006 tag (before the GPU MPAS-O port) succeed.

Example steps to re-create this problem

  1. Clone EarthWorks, using a version equivalent to tag ewm-2.3.010 or later
  2. Create a case that uses GPUs and some non-simple CAM physics (e.g. F2000dev, which uses cam7 physics)
    • Using GPUs in EarthWorks and CESM is under active development; please ask if you are unsure how to request GPUs for a case.
  3. Run ./case.setup, ./case.build, and ./case.submit
  4. The simulation will run for some time before eventually failing when MPAS-O aborts after finding NaNs in its fields.
    • The output from MPAS-O (during the run) will be in a file like fort.99, with extra error information also written to the mpas_ocean_block_stats_${RANK} files. (See the grep sketch below for one way to locate these reports.)
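
A quick way to locate these reports (a minimal sketch; it assumes the error strings match the excerpt below and that you are in the case run directory):

  # List the block-stats files that recorded a NaN ...
  grep -l "NaN Detected" mpas_ocean_block_stats_*
  # ... then print the first report from one of them with some context.
  grep -m 1 -A 6 "NaN Detected" mpas_ocean_block_stats_0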

Example output

Excerpt from a mpas_ocean_block_stats_0 file [1]:

  ERROR: NaN Detected in state see below for which field contained a NaN.
    -- Statistics information for block fields
      Field: latCell
          Min:   -0.6014597403056250
          Max:   -0.1402792456120920
      Field: lonCell
          Min:     4.136882946640790
          Max:     4.613029934416640
      Field: xCell
          Min:    -3356045.861461610
          Max:    -605355.3493913120
      Field: yCell
          Min:    -6131081.962771160
          Max:    -4865592.014447640
      Field: zCell
          Min:    -3605138.567225930
          Max:    -890822.8347366140
      Field: areaCell
          Min:     12398456274.30820
          Max:     12854325071.01830
 
 
  ERROR: NaN Detected in layerThickness.
    -- Statistics information for layerThickness fields
...

Footnotes

  1. Full file path on Derecho: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx/run/mpas_ocean_block_stats_0"

@gdicker1 gdicker1 added invalid This doesn't seem right external Has to do with externals EW specific This has to do with EarthWorks only - files, goals, code that probably won't be wanted upstream OpenACC Involves OpenACC porting labels Dec 4, 2024

dazlich commented Jan 13, 2025

@gdicker1 Rich brought this issue to my attention. I'd like to test this out - do you have a script I can use to try this?

gdicker1 commented

@dazlich, sure!

CASEDIR="you name it"
./cime/scripts/create_newcase --case "${CASEDIR}" --res mpasa120_oQU120 --compset CHAOS2000dev --project UCSU0085 --input-dir /glade/campaign/univ/ucsu0085/inputdata --output-root "${CASEDIR}/.." --driver nuopc --compiler nvhpc --ngpus-per-node 4 --gpu-type a100 --gpu-offload openacc
cd "${CASEDIR}"
./case.setup
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --sharedlib-only
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --model-only
./case.submit

Very little needs to change. Just note:

  • The last 4 arguments to create_newcase are the most important: you must use the nvhpc compiler and give the correct values for the 3 GPU-related args.
  • The test infrastructure builds shared libraries and the model in separate steps, and I think doing so speeds up the NVHPC build. It still takes a long time with NVHPC (even longer for GPU builds than CPU-only), but this seems faster to me.
  • CHAOS2000dev isn't required; I got the "NaN Detected in layerThickness" error with a "MPASOOnly" compset within this Derecho testdir: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx"
    • MPASOOnly = 2000_DATM%NYF_SLND_DICE%SSMI_MPASO_SROF_SGLC_SWAV_SESP


dazlich commented Jan 15, 2025

I've dug a little further. I've run gpu and cpu (nvhpc) cases for ewm-2.3.006 and ewm-2.3.007 for both the split-explicit and split-implicit time integration schemes.

  • I have successful two-month simulations for all cpu cases.
  • ewm-2.3.006, split-explicit fails after about 15 days, dying from signal 15. There are empty mpas_ocean_block_stats files, so apparently there was a state validation failure but no useful diagnostic message.
  • ewm-2.3.006, split-implicit fails after about two days. There is an mpas_ocean_block_stats file implying a state validation failure, but it is again empty. Again, the run died from signal 15.
  • ewm-2.3.007, split-explicit fails on timestep 1. Here the messages are explicit (state validation failure), and the mpas_ocean_block_stats files have data. The exit code is 255.
  • ewm-2.3.007, split-implicit fails on timestep 1. This time it stops due to exceeding an iteration limit. The exit code is 255.

The code never ran satisfactorily on gpu, but the ewm-2.3.007 tag fails immediately. I will now see if I can track where the solutions diverge from the cpu solutions.

@dazlich dazlich self-assigned this Jan 17, 2025

dazlich commented Feb 20, 2025 via email

gdicker1 commented

For just one case, look at the end of "${CASEDIR}/cmake_macros/nvhpc.cmake". You could just add ,autocompare to all the -gpu=... lines.

If you want it for any cases you make, then look at "ccs_config/machines/cmake_macros/nvhpc.cmake" instead.
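
For illustration, a hypothetical before/after of one such line (the GPUFLAGS variable and the exact flag set here are assumptions; match whatever your nvhpc.cmake actually contains):

  # before (hypothetical line in nvhpc.cmake)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma")
  # after: autocompare enables NVHPC's PCAST feature, which runs compute
  # regions on both CPU and GPU and reports where the results diverge
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma,autocompare")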


dazlich commented Feb 23, 2025

@areanddee @gdicker1 The gpu ocean is failing because the GPUFLAGS were not making it into the build command. I have hacked the build line for mpas.ocean in cime_config/buildlib to force them to be there. Now the ocean runs 1 month.

diff --git a/cime_config/buildlib b/cime_config/buildlib
index 84d83fb..90e4c27 100755
--- a/cime_config/buildlib
+++ b/cime_config/buildlib
@@ -284,7 +284,7 @@ def _build_mpaso():
     # build the library
     makefile = os.path.join(casetools, "Makefile")
     complib = os.path.join(libroot, "libocn.a")

-    cmd = "{} complib -j {} MODEL=mpaso COMPLIB={} -f {} USER_CPPDEFS=\"-DUSE_PIO2 -DMPAS_PIO_SUPPORT -D_MPI -DEXCLUDE_INIT_MODE -DMPAS_NO_ESMF_INIT -DMPAS_EXTERNAL_ESMF_LIB -DMPAS_PERF_MOD_TIMERS -DUSE_LAPACK -DMPAS_NAMELIST_SUFFIX=ocean\" FIXEDFLAGS={} {}" \
+    cmd = "{} complib -j {} MODEL=mpaso COMPLIB={} -f {} USER_CPPDEFS=\"-DMPAS_OPENACC -DUSE_PIO2 -DMPAS_PIO_SUPPORT -D_MPI -DEXCLUDE_INIT_MODE -DMPAS_NO_ESMF_INIT -DMPAS_EXTERNAL_ESMF_LIB -DMPAS_PERF_MOD_TIMERS -DUSE_LAPACK -DMPAS_NAMELIST_SUFFIX=ocean -acc -gpu=cc80,lineinfo,nofma -Minfo=accel \" FIXEDFLAGS={} {}" \
         .format(gmake, gmake_j, complib, makefile, fixedflags, get_standard_makefile_args(case))

I ran the update/ocean3p75 branch of EarthWorks with the split-explicit time-integration scheme. I still need to examine the output, but the fact that it ran one simulated month gives me hope. I also need to test this with the semi-implicit scheme.

For now, fixing this one line in your build script should work regardless of your branch. For the long term, we need to fix this script to get the GPUFLAGS out of the nvhpc.cmake file in ccs_config, and to get -DMPAS_OPENACC into buildlib in some general way.
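
One possible shape for that general fix (a hypothetical sketch only: it assumes the CIME case object exposes GPU_TYPE, set by --gpu-type at create_newcase, and no helper like this exists in buildlib today):

  def _mpaso_gpu_args(case):
      """Hypothetical helper: derive the MPAS-O GPU build arguments from
      the case instead of hard-coding them into the build command."""
      cppdefs = ("-DUSE_PIO2 -DMPAS_PIO_SUPPORT -D_MPI -DEXCLUDE_INIT_MODE "
                 "-DMPAS_NO_ESMF_INIT -DMPAS_EXTERNAL_ESMF_LIB "
                 "-DMPAS_PERF_MOD_TIMERS -DUSE_LAPACK -DMPAS_NAMELIST_SUFFIX=ocean")
      gpu_flags = ""
      if case.get_value("GPU_TYPE"):
          cppdefs = "-DMPAS_OPENACC " + cppdefs
          # Ideally these would be read from GPUFLAGS in ccs_config's
          # nvhpc.cmake so the flags live in exactly one place.
          gpu_flags = "-acc -gpu=cc80,lineinfo,nofma -Minfo=accel"
      return cppdefs, gpu_flags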


areanddee commented Feb 23, 2025 via email

gdicker1 commented

That is good news!

@dazlich, just FYI, there are two ways we currently pass GPUFLAGS in EarthWorks:

  1. The "CESM way" via Depends.nvhpc. Add your source files to an object list and then add a compile rule, like PUMAS_OBJS in ccs_config/machines/Depends.nvhpc.
  2. The "MPAS-A way" in its Makefile. See CAM/src/dynamics/mpas/Makefile, where I append -DMPAS_OPENACC to the CPPFLAGS at the top and always add the GPUFLAGS to the build commands at the bottom (sketched below).

Any way that works seems acceptable to me (buildlib, Depends.nvhpc, or Makefile)! The Depends.nvhpc way may be preferred by CESM developers.
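
For reference, a minimal sketch of the "MPAS-A way" (a hypothetical fragment; the real CAM/src/dynamics/mpas/Makefile differs in its details):

  # Enable the OpenACC code paths for everything built by this Makefile ...
  CPPFLAGS += -DMPAS_OPENACC
  # ... and make sure the GPU flags reach every compile command.
  FFLAGS += $(GPUFLAGS)

  %.o: %.F90
  	$(FC) -c $(CPPFLAGS) $(FFLAGS) $<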


dazlich commented Feb 24, 2025 via email


dazlich commented Feb 24, 2025 via email


dazlich commented Feb 24, 2025

@gdicker1 @areanddee - I've had a look at the output after one simulated month (120km). Four runs: split-explicit/cpu (A), split-implicit/cpu (B), split-explicit/gpu (C), and split-implicit/gpu (D). To the eye, maps of things like the surface temperature and salinity fields look similar. But the difference between A and C is significantly larger than the difference between A and B. This suggests there are some small things that the gpu is not quite doing right. Time-series plots of various domain-integrated quantities like kinetic energy also show this.

So we appear to have some subtle debugging work to do, but at least we can run the thing to do so.


dazlich commented Feb 24, 2025

Some info regarding the split-implicit scheme: in src/mode_forward/mpas_ocn_time_integration_si.F, the directive at line 515 couldn't satisfy present(sshSubcycleCur, sshSubcycleNew), so I added copyin for them to the directive at line 487. This appears to be the right fix, since the two gpu runs now agree as closely as the two cpu runs do.
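
The pattern, as a minimal standalone sketch (not the actual MPAS-O source; only the array names are kept from it):

  ! Adding the subcycled SSH arrays to an enclosing data construct lets
  ! the later present(...) clause find them on the device.
  subroutine si_sketch(n, sshSubcycleCur, sshSubcycleNew)
     integer, intent(in) :: n
     real(kind=8), intent(in)    :: sshSubcycleCur(n)
     real(kind=8), intent(inout) :: sshSubcycleNew(n)
     integer :: i
     !$acc data copyin(sshSubcycleCur) copy(sshSubcycleNew)
     !$acc parallel loop present(sshSubcycleCur, sshSubcycleNew)
     do i = 1, n
        sshSubcycleNew(i) = sshSubcycleCur(i)
     end do
     !$acc end data
  end subroutine si_sketch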


dazlich commented Feb 24, 2025

One more piece of info: the intel and nvhpc compilers give indistinguishable solutions on cpu. The gpu is definitely going somewhere else even on the first day.

(image attached)

gdicker1 commented

> But the difference between A and C is significantly larger than the difference between A and B. This suggests there are some small things that the gpu is not quite doing right.

This is somewhat to be expected, since GPUs implement some math routines differently - but it is also undesirable in climate runs. I'd defer to @sjsprecious about what differences are acceptable and which tools can help check this. (Jian, feel free to chime in if you have time.)

From the recent MPAS-A work I know that exponentiation of floats (e.g. something like 6.02^2.01; some of our exner calcs have this) and transcendental functions (e.g. sin) give slightly different results on GPUs. You can add -Kieee (if not already added) and -gpu=math_uniform to your compilation to help confirm this is the case. At least it has helped us as we iteratively port function-by-function, letting us narrow the answer changes down to particular loops. (Unfortunately, there seems to be close to no documentation on math_uniform...)
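
In the nvhpc.cmake macros that could look like the following (hypothetical lines; merge them with whatever flags are already set):

  # -Kieee forces IEEE-conformant floating point on the host, while
  # math_uniform asks the compiler to use the same math routine
  # implementations on host and device.
  string(APPEND FFLAGS " -Kieee")
  string(APPEND GPUFLAGS " -gpu=cc80,math_uniform")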


dazlich commented Feb 24, 2025 via email


dazlich commented Feb 25, 2025 via email


dazlich commented Feb 25, 2025 via email
