
Xo_res initialization on CPU for OpenACC version and on GPU for CUDAFortran #153

Open
bellenlau opened this issue Nov 13, 2024 · 4 comments

bellenlau commented Nov 13, 2024

Line 69 in X_irredux_residuals.F fills Xo_res with zeros

  • on the GPU in the CUDA Fortran version (Xo_res is a device variable)
  • on the CPU in the OpenACC version (Xo_res is a host variable)

The time taken by this operation on the CPU and on the GPU differs significantly, and the difference becomes evident when the simulation is distributed in the OpenACC version, since the zeroing is done on the CPU rather than on the GPU.

In the following pictures, a small test running on 4 GPUs is traced with nsys; the nvtx range labelled "issue3" wraps lines 68-69.

OpenACC small test: [image: openacc-case]

CUDA Fortran small test: [image: cudaf-case]

The time becomes detrimental when running larger systems maximally distributed (e.g. GrCo-7k on 16 nodes in the following picture). This simulation takes 3 minutes in the CUDA Fortran version, but exceeds the walltime in the OpenACC version.

[image]

If zeroing Xo_res is actually needed, a possible fix is the OpenACC kernels construct: bellenlau@8efaf61, but I am not sure whether a counterpart exists in OpenMP offload.
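For reference, a minimal sketch of the two offloaded-zeroing variants (purely illustrative: the array rank, loop indices, and the collapse(2) assumption are not taken from the yambo sources):

```fortran
! OpenACC: the kernels construct lets the compiler offload the
! array assignment, so the zeroing runs on the GPU.
!$acc kernels
Xo_res = cZERO
!$acc end kernels

! A possible OpenMP-offload counterpart (assuming Xo_res is a 2D
! array already mapped to the device): explicit loops, since array
! syntax is not portably offloaded inside a target region.
!$omp target teams distribute parallel do collapse(2)
do j = 1, size(Xo_res,2)
   do i = 1, size(Xo_res,1)
      Xo_res(i,j) = cZERO
   enddo
enddo
!$omp end target teams distribute parallel do
```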

With kernels, small test on 4 GPUs, OpenACC version:

[image: openacc-kernels-fix]

Data: issue.tar.gz
Software stack: nvhpc/23.1, no present clauses, openmpi/4.1.4 on Leonardo

bellenlau changed the title from "X0_res initialization on CPU for OpenACC version and on GPU for CUDAFortran" to "Xo_res initialization on CPU for OpenACC version and on GPU for CUDAFortran" on Nov 13, 2024
@sangallidavide (Member)

Thanks Laura. Very accurate.

In the tech-gpu branch
https://github.com/yambo-code/yambo/blob/tech-gpu/src/pol_function/X_irredux_residuals.F#L69
I see this

 Xo_res    = cZERO
 call devxlib_memset_d(Xo_res,cZERO)

In CUDA Fortran, the "on GPU" zeroing is done via devxlib in the second line, while the "on CPU" zeroing is done via Fortran in the first line.

I understand from your fix

 !DEV_ACC kernels
 Xo_res    = cZERO
 !DEV_ACC end kernels
 call devxlib_memset_d(Xo_res,cZERO)

that, with the kernels construct, Xo_res = cZERO is no longer an "on CPU" operation in the OpenACC case.

Indeed, as Laura is asking, @andrea-ferretti, is Xo_res = cZERO needed at all?
From the logic I see, it should be a device-only variable.

Alternatively, could this

 call devxlib_memset_h(Xo_res,cZERO)
 call devxlib_memset_d(Xo_res,cZERO)

be an alternative which works also in OpenMP?

We need to discuss the logic of OpenACC and devxlib in yambo ...

@bellenlau (Author)

Hello Davide,

from what I see in the profiler, the zeroing of Xo_res in CUDA Fortran is done not once but twice on the GPU: first because Xo_res = cZERO is actually a GPU operation in CUDA Fortran, as Xo_res is a device variable (the offload is automatically managed by CUDA Fortran), and a second time via devxlib. In OpenACC this automatic offload does not exist because Xo_res is a host variable, so the first assignment performs the operation on the CPU and devxlib performs it on the GPU.
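A minimal sketch of the difference (illustrative declarations, not the actual yambo code):

```fortran
! CUDA Fortran: Xo_res carries the device attribute, so the plain
! assignment is compiled as a GPU operation.
complex, device, allocatable :: Xo_res(:,:)
Xo_res = cZERO                        ! first GPU zeroing
call devxlib_memset_d(Xo_res, cZERO)  ! second GPU zeroing

! OpenACC: Xo_res is an ordinary host array, so the plain assignment
! runs on the CPU and only devxlib touches the device copy.
complex, allocatable :: Xo_res(:,:)
Xo_res = cZERO                        ! CPU zeroing (the slow path in the trace)
call devxlib_memset_d(Xo_res, cZERO)  ! GPU zeroing
```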

If the zeroing on the CPU is needed, then using devxlib for the host and then for the device should fix it, so that all versions do the same. However, I expect this to slow down the simulation, as in the OpenACC trace, when the system is well distributed among MPI tasks... I hope the zeroing on the CPU is actually not needed :)

@sangallidavide (Member)

Ok. Clear.

Yeah, if the zeroing on the CPU is not done in CUDA Fortran, I'd say it is not needed in OpenACC either.

@andrea-ferretti (Member) commented Nov 13, 2024

Dear all,
thanks for the comments. Indeed, I think Xo_res is only needed on the device; we should just drop the first line in

 Xo_res    = cZERO
 call devxlib_memset_d(Xo_res,cZERO)

I am also wondering whether Xo_res in X_irredux.F (in CPU parts) should not be replaced by Xo_res_p (which would point to CPU workspace in that case)... we may have issues with pointer slicing, though (this may be the reason why the pointer is not used throughout).
