After #75, Compsets using MPAS-O on GPUs fail during run-time #84
Comments
@gdicker1 Rich brought this issue to my attention. I'd like to test this out - do you have a script I can use to try this? |
@dazlich, sure!

```sh
CASEDIR="you name it"
./cime/scripts/create_newcase --case "${CASEDIR}" --res mpasa120_oQU120 --compset CHAOS2000dev --project UCSU0085 --input-dir /glade/campaign/univ/ucsu0085/inputdata --output-root "${CASEDIR}/.." --driver nuopc --compiler nvhpc --ngpus-per-node 4 --gpu-type a100 --gpu-offload openacc
cd "${CASEDIR}"
./case.setup
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --sharedlib-only
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --model-only
./case.submit
```

Very little changes. Just note:

* The last 4 arguments to create_newcase are the most important: you must use the nvhpc compiler and give correct values for the 3 GPU-related args.
* The test infrastructure builds the shared libraries and the model in separate steps, which I think also speeds up the NVHPC build. It still takes a long time with NVHPC (even longer for GPU builds than CPU-only), but at least this seems faster to me.
* CHAOS2000dev isn't required; I got the "NaN detected in LayerThickness" error with a "MPASOOnly" compset within this Derecho testdir: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx"
* MPASOOnly = 2000_DATM%NYF_SLND_DICE%SSMI_MPASO_SROF_SGLC_SWAV_SESP
|
I've dug a little further. I've run gpu and cpu (nvhpc) cases for ewm-2.3.006 and ewm-2.3.007 for both the split-explicit and split-implicit time integration schemes.
The code never ran satisfactorily on gpu, but the ewm-2.3.007 tag fails immediately. I will now see if I can track where the solutions diverge from the cpu solutions. |
@gdicker1
I am back to tackling this. I am trying to use PCAST and want to add a -gpu=autocompare flag for compilation. Where would I do this?
|
For just one case, look at the end of …. If you want it for any cases you make, then look at …. |
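A minimal sketch of what a per-case edit could look like, assuming the -gpu=autocompare flag is appended to the Fortran flags for nvhpc builds in a case-local Makefile or Macros file (the file and variable names here are illustrative, since the exact references depend on the CIME setup):

```makefile
# Hypothetical per-case tweak: enable PCAST auto-compare for nvhpc builds only
ifeq ($(strip $(COMPILER)),nvhpc)
  FFLAGS += -gpu=autocompare
endif
```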
@areanddee @gdicker1 The gpu ocean is failing because the GPUFLAGS were not making it into the build command. I have hacked the build line for mpas.ocean in cime_config/buildlib to force them to be there. Now the ocean runs 1 month.

```diff
diff --git a/cime_config/buildlib b/cime_config/buildlib
index 84d83fb..90e4c27 100755
--- a/cime_config/buildlib
+++ b/cime_config/buildlib
@@ -284,7 +284,7 @@ def _build_mpaso():
     # build the library
     makefile = os.path.join(casetools, "Makefile")
     complib = os.path.join(libroot, "libocn.a")
-    cmd = "{} complib -j {} MODEL=mpaso COMPLIB={} -f {} USER_CPPDEFS=\"-DUSE_PIO2 -DMPAS_PIO_SUPPORT -D_MPI -DEXCLUDE_INIT_MODE -DMPAS_NO_ESMF_INIT -DMPAS_EXTERNAL_ESMF_LIB -DMPAS_PERF_MOD_TIMERS -DUSE_LAPACK -DMPAS_NAMELIST_SUFFIX=ocean\" FIXEDFLAGS={} {}" \
+    cmd = "{} complib -j {} MODEL=mpaso COMPLIB={} -f {} USER_CPPDEFS=\"-DMPAS_OPENACC -DUSE_PIO2 -DMPAS_PIO_SUPPORT -D_MPI -DEXCLUDE_INIT_MODE -DMPAS_NO_ESMF_INIT -DMPAS_EXTERNAL_ESMF_LIB -DMPAS_PERF_MOD_TIMERS -DUSE_LAPACK -DMPAS_NAMELIST_SUFFIX=ocean -acc -gpu=cc80,lineinfo,nofma -Minfo=accel \" FIXEDFLAGS={} {}" \
         .format(gmake, gmake_j, complib, makefile, fixedflags, get_standard_makefile_args(case))
     rc, out, err = run_cmd(cmd, from_dir=os.path.join(objroot, "ocn", "obj"))
```

I ran the update/ocean3p75 branch of EarthWorks. I ran the split-explicit time-integration scheme. I still need to examine the output, but the fact it ran one simulated month gives me hope. I also need to test this with the semi-implicit scheme.

For now, fixing this one line in your build script should work regardless of your branch. For the long term we need to fix this script to get the GPUFLAGS out of the nvhpc.cmake file in ccs_config, and the -DMPAS_OPENACC into buildlib in some general way. |
Great news!
Rich
|
That is good news! @dazlich, just FYI, there are two ways we pass GPUFLAGS currently in EarthWorks:

1. The "CESM way" via Depends.nvhpc. Add your source files to an object list and then add a compile rule, like PUMAS_OBJS in ccs_config/machines/Depends.nvhpc (https://github.com/EarthWorksOrg/ccs_config_cesm/blob/ew-main/machines/Depends.nvhpc); a rough sketch of this pattern follows below.
2. The "MPAS-A way" in its Makefile. See CAM/src/dynamics/mpas/Makefile (https://github.com/EarthWorksOrg/CAM/blob/ew-main/src/dynamics/mpas/Makefile), where I append -DMPAS_OPENACC to the CPPFLAGS at the top and always add the GPUFLAGS to the build commands at the bottom.

Any way that works seems acceptable to me (buildlib, Depends.nvhpc, or Makefile)! The Depends.nvhpc way may be preferred by CESM developers. |
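A rough sketch of the Depends.nvhpc pattern described in item 1, assuming hypothetical object names (MY_OCN_OBJS and the file list are placeholders, not the real contents of Depends.nvhpc, and the flag variables are illustrative):

```makefile
# Hypothetical Depends.nvhpc fragment: compile selected objects with GPU flags
MY_OCN_OBJS = mpas_ocn_thick_ale.o mpas_ocn_time_integration_si.o

ifeq ($(strip $(COMPILER)),nvhpc)
$(MY_OCN_OBJS): %.o: %.F
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(GPUFLAGS) $<
endif
```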
Thanks, @gdicker1 - I will play with those later today.
|
Actually what seems to work simply is to add this block:

```makefile
ifeq ($(COMP_NAME),mpaso)
  ifeq ($(strip $(COMPILER)),nvhpc)
    # mpas ocean files need gpuflags
    CPPDEFS += -DMPAS_OPENACC
    FFLAGS += $(GPUFLAGS)
  endif
endif
```

to cime/CIME/Tools/Makefile. I put it after the mpassi block.
|
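For a quick sanity check that the block took effect, one could grep the ocean build log for the GPU flags. The log path below is typical for a CIME case built with the output-root used in the script above, but it is an assumption and may vary by setup:

```sh
# Assumed log location: the GPU flags should now appear on the mpaso compile
# lines (use zgrep instead if the build logs are gzipped)
grep -- '-gpu=' "${CASEDIR}/bld/ocn.bldlog."* | head -n 2
```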
@gdicker1 @areanddee - I've had a look at the output after one simulated month (120km). Four runs: split-explicit/cpu (A), split-implicit/cpu (B), split-explicit/gpu (C), and split-implicit/gpu (D). To the eye, maps of things like the surface temperature and salinity fields look similar. But the difference between A and C is significantly larger than the difference between A and B. This suggests there are some small things that the gpu is not quite doing right. Time-series plots of various domain-integrated quantities like kinetic energy also show this. So we appear to have some subtle debugging work to do, but at least we can run the thing to do so. |
Some info regarding the split-implicit scheme: in src/mode_forward/mpas_ocn_time_integration_si.F, the directive at line 515 couldn't satisfy present(sshSubcycleCur, sshSubcycleNew), so I added copyin for them in the directive at line 487. This appears to be the right thing, since the two gpu runs now agree as closely as the two cpu runs. |
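A sketch of the shape of that fix, using the array names from the comment; the real directives and loop bodies in mpas_ocn_time_integration_si.F are more involved, and the loop here (iCell, nCells) is a stand-in, so treat this as illustrative only:

```fortran
! Illustrative only: the ssh subcycle arrays are added to the copyin clause
! of the enclosing data region (around line 487 in the real file)...
!$acc data copyin(sshSubcycleCur, sshSubcycleNew)

! ...so that a later compute construct (around line 515), which asserts the
! arrays are already on the device via present(), can find them.
!$acc parallel loop present(sshSubcycleCur, sshSubcycleNew)
do iCell = 1, nCells
   sshSubcycleNew(iCell) = sshSubcycleCur(iCell)
end do

!$acc end data
```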
This is somewhat to be expected, since GPUs implement some math routines differently - but it is also undesirable in climate runs. I'd defer to @sjsprecious about checking what differences are acceptable and the tools to use to help with this. (Jian, feel free to chime in if you have time.) From the recent MPAS-A work I know that exponentiation of floats (e.g. something like 6.02^2.01) is one operation where CPU and GPU results can differ. |
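To make the exponentiation point concrete, a minimal sketch that computes the same power on host and device so the two results can be diffed; the build line is an assumption, and whether (and by how much) the results differ depends on the compiler, math library, and GPU:

```fortran
! Minimal sketch: compare x**y on the host and via OpenACC on the device.
! Assumed build line: nvfortran -acc -gpu=cc80 pow_compare.f90
program pow_compare
   implicit none
   real(8) :: x, host_val, dev_val

   x = 6.02d0
   host_val = x**2.01d0   ! evaluated by the host math library

   !$acc kernels copyin(x) copyout(dev_val)
   dev_val = x**2.01d0    ! evaluated by the device math library
   !$acc end kernels

   print *, 'host:  ', host_val
   print *, 'device:', dev_val
   print *, 'diff:  ', abs(host_val - dev_val)
end program pow_compare
```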
Yes, so I just did an intel run. I assume intel vs nvhpc on cpu would have differences akin to gpu vs cpu. The two cpu solutions are indistinguishable after a month.
|
Ok, I’ve added -gpu=autocompare (PCAST). Something is different right off the bat.

```
deg0014.hsn.de.hpc.ucar.edu 85: PCAST Double ssh(:) in function ocn_ale_thickness, /glade/derecho/scratch/dazlich/EarthWorks.pcast.pcast/bld/ocn/source/mpas_ocn_thick_ale.F:152
deg0014.hsn.de.hpc.ucar.edu 85: idx: 0 FAIL ABS act: 0.00000000000000000e+00 exp: 2.96446599741934058e+00 dif: 2.96446599741934058e+00
deg0014.hsn.de.hpc.ucar.edu 85: idx: 1 FAIL ABS act: 0.00000000000000000e+00 exp: 2.92303812547556596e+00 dif: 2.92303812547556596e+00
```

This difference isn’t the math. Something failed to get initialized on the gpu (column 1).
|
Ignore the previous message
|
Though #75 contains an initial GPU port of MPAS-Ocean, tests involving this compset on GPUs now fail at run-time. Some fields get NaNs and MPAS-O aborts. Testing with GPU builds using the ewm-2.3.006 tag (before GPU MPAS-O) succeeds.

Example steps to re-create this problem

Example output

Excerpt from a mpas_ocean_block_stats_0 file [1]:

Footnotes
[1] Full file path on Derecho: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx/run/mpas_ocean_block_stats_0"