HIP and MPI+HIP updates #361

ohearnk · 2024-04-15T03:06:23Z

This MR ports the updated CUDA and MPI+CUDA codes (including the recently merged f-function optimizations) to the HIP and MPI+HIP versions, respectively. It also begins to unify the CUDA and HIP versions to simplify future GPU code maintenance.

Additional work included in this PR:

Integer-based atomic operation code paths (used in place of double-precision atomics, mainly for older GPUs lacking hardware support, i.e., pre-Pascal NVIDIA GPUs)
Wrappers around CUDA/HIP APIs (better error checking)
HIP diagonalization (rocSOLVER re-enabled, CPU code disabled)
Several bug fixes (memory leaks, initialization errors)
ROCm version detection and disabling HIP builds against known buggy ROCm versions (>= v5.4.3)
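
To illustrate the integer-based atomic code paths listed above, here is a minimal host-side C++ sketch of fixed-point accumulation using only integer atomics. It is not taken from the QUICK sources: the scale factor `kScale` and the function names are hypothetical, and `std::atomic` stands in for the device-side integer atomic add that a GPU kernel would use.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical scale factor: values are stored in integer units of 1e-12.
constexpr double kScale = 1.0e12;

// Add a double to an integer accumulator using only an integer atomic add.
inline void fixed_point_atomic_add(std::atomic<std::int64_t>& acc, double value) {
    // Round to the nearest fixed-point step. Contributions smaller than
    // 0.5 / kScale round to zero, which is the precision cost of this scheme.
    const auto step = static_cast<std::int64_t>(std::llround(value * kScale));
    acc.fetch_add(step, std::memory_order_relaxed);
}

// Convert the integer accumulator back to a double.
inline double fixed_point_to_double(const std::atomic<std::int64_t>& acc) {
    return static_cast<double>(acc.load(std::memory_order_relaxed)) / kScale;
}
```

Accumulating 0.25 and -0.125 this way gives 0.125, while a further contribution of 1e-14 falls below the resolution and is dropped. That loss is the trade-off of a fixed-point scheme compared with native double-precision atomics.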

Limitations / Known Issues:

f function code (CMake flag -DENABLEF=TRUE) does not compile on HIP versions due to resource limitations (stack frame size exceeded in device functions). Errors are similar to the following (from a build on the AMD Accelerator Cloud platform):

error: stack frame size (143248) exceeds limit (131056) in function '_Z19getGrad_kernel_ffffPiPS_PdPS1_PP15HIP_vector_typeIiLj2EEPPhS7_S2_ii'
26 warnings and 1 error generated when compiling for gfx90a.
CMake Error at quick_hip_kernels_generated_gpu_get2e_grad_ffff.cu.o.Release.cmake:287 (message):
  Error generating file
  QUICK/build_ubuntu22_hip_gfx90a_rocm5.4.2_enablef_commit_0456378d/src/gpu/hip/CMakeFiles/quick_hip_kernels.dir//./quick_hip_kernels_generated_gpu_get2e_grad_ffff.cu.o

gmake[2]: *** [src/gpu/hip/CMakeFiles/quick_hip_kernels.dir/build.make:2080: src/gpu/hip/CMakeFiles/quick_hip_kernels.dir/quick_hip_kernels_generated_gpu_get2e_grad_ffff.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:570: src/gpu/hip/CMakeFiles/quick_hip_kernels.dir/all] Error 2
gmake: *** [Makefile:156: all] Error 2

Closes #344.

ohearnk · 2024-04-15T03:23:18Z

As noted above, all tests (full test suite) are passing for the CUDA and MPI+CUDA (1 GPU) versions. However, some tests are failing for the HIP and MPI+HIP versions. See the logs below from tests on the MI210s on the AMD AAC. Interestingly, the test failures are slightly different between the HIP and MPI+HIP versions.

This is a bit difficult to debug, at least when comparing against the working HIP / MPI+HIP versions from the 23.08b release, as there are also a number of test failures there. It may be better to pick a commit from before the f-function optimizations and run the tests there for comparison.

Test configuration on the AMD AAC:

RHEL9 partition (1CN128C8G2H_2IB_MI210_RHEL9)
ROCm v5.7.1, UCX v1.15.0, OpenMPI v4.1.6, GCC v11.3.1 (gfortran)
f-function support disabled
CMake configuration (HIP version):

cmake .. -DCOMPILER=MANUAL -DCMAKE_C_COMPILER=hipcc -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_Fortran_COMPILER=gfortran -DMPI= -DHIP=TRUE -DQUICK_USER_ARCH=gfx90a -DENABLEF= -DCMAKE_INSTALL_PREFIX=${PWD}/../install_rhel9_hip_gfx90a_rocm5.7.1_ucx1.15.0_ompi4.1.6 -DHIP_TOOLKIT_ROOT_DIR=/shared/apps/rhel9/opt/rocm-5.7.1

HIP test summary and diffs:
runtest_hip.log
hip_test_diffs.log

MPI+HIP test summary and diffs:
runtest_mpi_hip_1gpu.log
mpi_hip_test_diffs.log

…regarding STORE_OPERATOR). Fix segfault in debug builds of GPU code when ERI f-function support is not enabled but the basis contains f functions. Remove unneeded DGEMM operation in CUDA codes in SCF/USCF methods. Other code clean-up.

… and replace with emulation at full double precision for pre-Pascal NVIDIA GPUs (previously toggled via USE_LEGACY_ATOMICS). Note that the old code was leading to slow and possibly failing SCF convergence which was only exposed during testing with tighter density matrix convergence thresholds and integral cut-offs. This is likely due to the truncation used for energy and gradient calculations (1e-6 and 1e-12, respectively).
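
As a rough illustration of emulating a double-precision atomic add at full precision, the sketch below uses the compare-and-swap loop on the 64-bit pattern of the double, which is the approach described in the CUDA programming guide for GPUs without a native double `atomicAdd`. This is host-side C++ with `std::atomic` standing in for the device atomic; the function names are hypothetical, and whether the QUICK code uses exactly this form is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as a double (and back) without aliasing issues.
inline double bits_to_double(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

inline std::uint64_t double_to_bits(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Emulated atomic add on a double stored as its 64-bit pattern.
// Returns the value held before the addition.
inline double emulated_atomic_add(std::atomic<std::uint64_t>& cell, double value) {
    std::uint64_t expected = cell.load(std::memory_order_relaxed);
    std::uint64_t desired;
    do {
        // Compute the new value from the last observed contents.
        desired = double_to_bits(bits_to_double(expected) + value);
        // On failure, compare_exchange_weak refreshes `expected` and we retry.
    } while (!cell.compare_exchange_weak(expected, desired,
                                         std::memory_order_relaxed));
    return bits_to_double(expected);
}
```

Unlike a fixed-point scheme with a truncation threshold, this loop keeps every bit of the double-precision sum, so a contribution such as 1e-14 added to 1.0 is retained. The cost is possible retries under contention.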

agoetz · 2025-02-18T19:45:34Z

I have tested git commit id ab27c99 on the following platforms. I tested without f functions on delorean and chinotto, and with f functions on Expanse A100 GPU nodes. All tests of the full test suite pass (serial, mpi, cuda, cudampi).

delorean

AMD Ryzen 9 3900X 12-Core
RTX 2080 Ti
Almalinux 9.5
cmake 3.26.5
GNU 11.5.0 compiler
OpenMPI 4.0.5
CUDA 12.5

chinotto

2x 8-core Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
2x RTX 2080 Ti
1x Titan V
CentOS 7.9.2009
cmake 3.28.1
GNU 10.2.1 compiler
OpenMPI 4.0.5
CUDA 12.0

Expanse

2x 20-core Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
4x A100 PCIe-40GB (2x used for testing)
Rocky Linux 8.9
cmake 3.28.1
GNU 10.2.0 compiler
OpenMPI 4.1.3
CUDA 11.2.2

agoetz · 2025-02-18T19:53:33Z

Timers are not correct. It looks like 2e GRADIENT TIME is now added to EXC GRADIENT TIME.

The outputs below are for the test QUICK-tests/benchmarks/input/morphine.in from the QUICK-tests repo, executed on an Expanse A100 GPU node.

QUICK-23.08a

------------- TIMING ---------------
| INITIAL GUESS TIME  =     1.127289064(  5.95%)
| DFT GRID OPERATIONS =     0.194754000(  1.03%)
| TOTAL LOAD BALANCING TIME =     0.007372800(  0.04%)
| TOTAL SCF TIME      =     6.730798000( 35.53%)
|       TOTAL OP TIME      =     6.292387000( 33.22%)
|             TOTAL 1e TIME      =     0.102761000(  0.54%)
|             TOTAL 2e TIME      =     2.693532000( 14.22%)
|             TOTAL EXC TIME     =     3.463376000( 18.28%)
|       TOTAL DII TIME      =     0.427670000(  2.26%)
|             TOTAL DIAG TIME    =     0.269553000(  1.42%)
| TOTAL GRADIENT TIME      =     8.303771000( 43.83%)
|       TOTAL 1e GRADIENT TIME      =     0.903336000( 4.82%)
|       TOTAL 2e GRADIENT TIME      =     5.969338000(31.51%)
|       TOTAL EXC GRADIENT TIME     =     1.421189000(  7.50%)
| TOTAL TIME          =    18.944405000
------------------------------------
| Job cpu time:  0 days  0 hours  0 minutes 18.9 seconds.

This PR 361, git commit id ab27c99

------------- TIMING ---------------
| INITIAL GUESS TIME  =     1.623547528(  8.93%)
| DFT GRID OPERATIONS =     0.137886000(  0.76%)
| TOTAL SCF TIME      =     6.646664000( 36.57%)
|       TOTAL OP TIME      =     6.225443000( 34.26%)
|             TOTAL 1e TIME      =     0.099403000(  0.55%)
|             TOTAL 2e TIME      =     2.631439000( 14.48%)
|             TOTAL EXC TIME     =     3.461705000( 19.05%)
|       TOTAL DII TIME      =     0.410465000(  2.26%)
|             TOTAL DIAG TIME    =     0.258074000(  1.42%)
| TOTAL GRADIENT TIME      =     8.619257000( 47.43%)
|       TOTAL 1e GRADIENT TIME      =     0.894716000( 4.98%)
|       TOTAL 2e GRADIENT TIME      =     0.000379000( 0.00%)     <-- This is wrong
|       TOTAL EXC GRADIENT TIME     =     7.714436000( 42.45%)    <-- This is wrong
| TOTAL TIME          =    18.172917000
------------------------------------
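
The two timing tables can be cross-checked with a little arithmetic. The sketch below copies the gradient timers from the outputs above and compares the combined 2e + EXC gradient time in each run; the struct and function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cmath>

// Gradient timers in seconds, copied from the two outputs above.
struct GradientTimers {
    double one_e;
    double two_e;
    double exc;
    double total;
};

constexpr GradientTimers kRelease{0.903336, 5.969338, 1.421189, 8.303771};  // QUICK-23.08a
constexpr GradientTimers kThisPr{0.894716, 0.000379, 7.714436, 8.619257};   // commit ab27c99

// Combined 2e + EXC gradient time.
inline double two_e_plus_exc(const GradientTimers& t) {
    return t.two_e + t.exc;
}

// Part of the total gradient time not covered by the three sub-timers.
inline double unaccounted(const GradientTimers& t) {
    return t.total - (t.one_e + t.two_e + t.exc);
}
```

The combined 2e + EXC gradient time is 7.390527 s for QUICK-23.08a and 7.714815 s for this PR, and in both runs the three sub-timers account for all but about 0.01 s of the total gradient time. That is consistent with the 2e gradient time being booked under the EXC gradient timer rather than being lost.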

Port updated CUDA and MPI+CUDA codes (f function support) to HIP and MPI+HIP codes. Unify CUDA and HIP code paths (CUDA / HIP => GPU, CUDA_MPIV / HIP_MPIV => MPIV_GPU, etc.).

c388fde

ohearnk requested review from agoetz and Madu86 April 15, 2024 03:06

ohearnk self-assigned this Apr 15, 2024

ohearnk added 11 commits April 17, 2024 21:47

Merge branch 'master' into hip-f-func-porting.

534ea10

Merge branch 'master' into hip-f-func-porting.

43e5ad9

Fix uninitialized variable usage.

a77c08c

Add missed file during CUDA source conversion via hipify-perl (*.cuh).

5e3f244

Fix source file permissions. Remove unused code.

6510281

Merge branch 'master' into hip-f-func-porting.

4815e0b

Deduplicate GPU codes (CUDA/HIP). Change several static constants to preprocessor definitions for performance and storage considerations. Refactor preprocessor definitions to avoid unnecessary arithmetic.

779130d

Fix include path for HIP builds. Match preprocessor controlled code paths for older HIP builds.

c0a1a64

Conditionally compile ROCsolver code (for SCF diagonalizations) if toggled on in CMake build.

0bf89e1

Further GPU code deduplication. Use faster math functions for simple power functions (inlined device functions calling pow to preprocessor definitions using multiplication operations). Other code clean-up.

f937da6
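
A minimal sketch of the kind of change this commit message describes: small integer powers written as multiplications via preprocessor definitions instead of calls to `pow`. The macro names `SQR` and `CUBE` and the helper functions are hypothetical, not taken from the QUICK sources.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical macro names. Note that the argument is evaluated more than
// once, so it should be free of side effects.
#define SQR(x) ((x) * (x))
#define CUBE(x) ((x) * (x) * (x))

// Squared distance using the general-purpose pow routine.
inline double r2_with_pow(double dx, double dy, double dz) {
    return std::pow(dx, 2.0) + std::pow(dy, 2.0) + std::pow(dz, 2.0);
}

// Same quantity using plain multiplications.
inline double r2_with_macro(double dx, double dy, double dz) {
    return SQR(dx) + SQR(dy) + SQR(dz);
}
```

Both functions return 14 for the displacement (1, 2, 3). The multiplication form avoids a library call per term, which is the likely motivation for using it in device code.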

ohearnk force-pushed the hip-f-func-porting branch from fbd9602 to f937da6 Compare June 26, 2024 18:33

ohearnk added 13 commits July 1, 2024 11:09

Remove unnecessary DGEMM in SCF for CUDA GPU codepaths.

815b8c9

Ensure QUICK GPU architectures are always set correctly for HIP builds. Add CMake option to enable LLVM-based address sanitizer (ASAN) for debugging with HIP builds.

72782c8

Fix declaration for emulated double precision atomic addition.

2ea2acc

Reduce the number of atomics used during computation of operator matrices.

dc9031b

Reduce the number of atomics used during computation of operator matrices.

55d9ccd

Fix memory leaks.

e2b3f9b

Fix deallocation issue.

b15e410

OEI code tuning.

0f0085c

Hand-tune ERI gradient code.

6aa0c5d

More ERI gradient tuning. Other code clean-up.

5a0f95e

Fix truncation of double precision absolute value calculations for cut-offs (abs -> fabs). Tune exchange correlation code.

609e55a
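
The abs -> fabs fix can be illustrated with a short C++ sketch. The explicit cast to `int` below is an assumption that stands in for whatever caused the integer version of `abs` to be selected in the original code; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Cut-off test in which the integer abs is applied: the double is truncated
// toward zero first, so any magnitude below 1 becomes 0.
inline bool above_cutoff_int_abs(double value, double cutoff) {
    return std::abs(static_cast<int>(value)) > cutoff;
}

// Cut-off test with the floating-point absolute value.
inline bool above_cutoff_fabs(double value, double cutoff) {
    return std::fabs(value) > cutoff;
}
```

With a value of -3.0e-7 and a cut-off of 1.0e-9, the integer version reports the value as below the cut-off, while the `fabs` version correctly reports it as above.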

Remove superfluous arithmetic in generated one electron integral code.

d57901c

ohearnk force-pushed the hip-f-func-porting branch from 52c8e65 to 53c25af Compare December 9, 2024 18:19

Merge branch 'master' into hip-f-func-porting.

508fc52

ohearnk force-pushed the hip-f-func-porting branch from 1a98223 to 508fc52 Compare December 10, 2024 15:23

ohearnk added 3 commits December 11, 2024 12:38

HIP: switch to non-deprecated functions. Use native atomic functions.

f3bfa32

Remove unused code.

870e0df

Further restrict HIP builds to known working ROCm versions.

f98e306

ohearnk marked this pull request as ready for review December 18, 2024 18:16

ohearnk added 6 commits December 21, 2024 18:56

Disable diagonalization on the GPU with rocSOLVER for older ROCm version (< v5.3.0) due to poor performance and use CPU diagonalization routines instead.

3724740

Make error messages GPU-agnostic.

4133692

Update log file citation to match that in README.

c04b2b8

Update README to reflect HIP support being restored.

5e4c79a

Add GFX942 target to CMake flags (QUICK_USER_ARCH) for supporting builds on AMD MI300 series GPUs.

0456378

Clean up ERI f-function specific code (ffff integrals). Remove unused function arguments to save stack space.

b43412e

ohearnk force-pushed the hip-f-func-porting branch from d2f68c5 to b43412e Compare January 9, 2025 20:24

ohearnk added 11 commits January 13, 2025 14:43

More ERI f-function cleanup (remove unused variables).

dae1f96

Clean up GPU ERI code drivers (remove unused variables, other code clean up).

f683f50

Clean up GPU horizontal recurrence relation code.

01d08a9

Remove Boys function differentiation codepaths (not implemented yet). Remove GPU functions from exported function interfaces (not required and may negatively impact compilation). Other code clean-up.

bcadc8c

Merge branch 'master' into hip-f-func-porting.

6a73b0d

Fix gradient calculations (error originated in commit bcadc8c due to code reorganization which led to incorrect preprocessor constants being used by several device functions [incorrect memory layout used in accumulating partial results]).

47cdf68

Split out GPU LRI-specific VRR functions for compilation only with LRI routines. Further device code reorganization to localize scopes for eventual removal of relocatable device code flags. Other code clean-up.

883fc99

Remove more unused variables.

c291c25

Merge branch 'master' into hip-f-func-porting.

6b7f3c0

Fix several GPU LRI code compilation issues (with Amber).

c49405d

One more GPU LRI code compilation fix (LEGACY_ATOMICS codepath).

ab27c99

ohearnk added the bug and Code cleanup labels Feb 19, 2025

ohearnk added this to the QUICK-25.03 milestone Feb 19, 2025
