Xo_res initialization on CPU for OpenACC version and on GPU for CUDAFortran #153
Comments
Thanks Laura, very accurate. In the tech-gpu branch, in cudaf, the "on GPU" zeroing is done via devxlib in the second line, while the "on CPU" zeroing is done via Fortran in the first line. Indeed, as Laura is asking, @andrea-ferretti, could this be an alternative which works also in OpenMP? We need to discuss the logic of OpenACC and devxlib in yambo ...
Hello Davide, from what I see in the profiler, the zeroing of Xo_res in CUDAFortran is done not once but twice on the GPU: first because Xo_res = 0 is actually a GPU operation in CUDAFortran, since Xo_res is a device variable (the offload is managed automatically by CUDAFortran), and a second time due to devxlib. In OpenACC this automatic offload does not exist because Xo_res is a host variable, so the first Xo_res = 0 performs the operation on the CPU and devxlib performs it on the GPU. If the zeroing on the CPU is needed, then using devxlib first for the host and then for the device should fix it, so that all versions do the same. However, judging from the OpenACC trace, I expect this to slow down the simulation when the system is well distributed among MPI tasks... I hope the zeroing on the CPU is actually not needed :)
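To make the two behaviours concrete, here is a minimal sketch (a toy program with illustrative names and sizes, not the actual yambo code) of the CUDA Fortran side, where the device attribute makes the plain assignment offload automatically:

```fortran
program cudaf_zeroing
  use cudafor
  implicit none
  complex(8), device, allocatable :: Xo_res(:,:)  ! stand-in for the real array
  allocate(Xo_res(1024, 1024))
  ! Because Xo_res carries the device attribute, this assignment is
  ! compiled to a GPU operation: no host-side zeroing takes place.
  Xo_res = (0.0_8, 0.0_8)
  deallocate(Xo_res)
end program cudaf_zeroing
```

In an OpenACC build the same assignment would touch only the host copy, which is why the device copy has to be zeroed separately (via devxlib, or the kernels construct discussed below).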
Ok, clear. Yeah, if zeroing on CPU is not done in CUDAF, I'd say it is not needed in OpenACC either.
Dear all, I am also wondering whether Xo_res in X_irredux.F (in CPU parts) should not be replaced by Xo_res_p (which would point to CPU workspace in that case)... we may have issues with pointer slicing, though (this may be the reason why the pointer is not used throughout).
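For readers unfamiliar with the slicing concern: a pointer associated with an array section may be non-contiguous, which complicates device mapping and copies. A minimal sketch (names and sizes hypothetical, not the actual yambo workspace):

```fortran
program pointer_slicing
  implicit none
  complex(8), target, allocatable :: workspace(:,:)  ! hypothetical CPU workspace
  complex(8), pointer             :: Xo_res_p(:,:)
  allocate(workspace(8, 8))
  Xo_res_p => workspace               ! whole-array association: contiguous
  print *, is_contiguous(Xo_res_p)    ! T
  Xo_res_p => workspace(2:7, 2:7)     ! sliced association: non-contiguous,
  print *, is_contiguous(Xo_res_p)    ! F - harder to map/copy to the device
end program pointer_slicing
```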
Line 69 in X_irredux_residuals.F fills Xo_res with zeros. The time taken by this operation differs significantly between CPU and GPU, and the difference becomes evident when distributing the simulation in the OpenACC version, where the zeroing is done on the CPU and not on the GPU.
In the following pictures, a small test running on 4 GPUs is traced with nsys; the nvtx range labelled "issue3" wraps lines 68-69.
OpenACC small test
CUDAFortran small test
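For completeness, a minimal sketch of the kind of NVTX instrumentation used to delimit such a range (the interface binds directly to libnvToolsExt, so link with -lnvToolsExt; this is a toy reproduction, not the actual yambo instrumentation):

```fortran
module nvtx_mini
  use iso_c_binding
  implicit none
  interface
    subroutine nvtxRangePushA(name) bind(c, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=c_char, len=*) :: name
    end subroutine nvtxRangePushA
    subroutine nvtxRangePop() bind(c, name='nvtxRangePop')
    end subroutine nvtxRangePop
  end interface
end module nvtx_mini

program trace_zeroing
  use nvtx_mini
  use iso_c_binding, only: c_null_char
  implicit none
  complex(8), allocatable :: Xo_res(:,:)
  allocate(Xo_res(1024, 1024))
  call nvtxRangePushA('issue3'//c_null_char)  ! opens the range shown in nsys
  Xo_res = (0.0_8, 0.0_8)                     ! the zeroing under study
  call nvtxRangePop()                         ! closes the range
  deallocate(Xo_res)
end program trace_zeroing
```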
The time becomes detrimental when running larger systems maximally distributed (e.g. GrCo-7k on 16 nodes in the following picture). This simulation takes 3 minutes with the CUDAFortran version, but exceeds the walltime with the OpenACC version.
If zeroing Xo_res is actually needed, a possible fix is the kernels construct: bellenlau@8efaf61, but I am not sure whether a counterpart in OpenMP offload exists.
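As a sketch of both spellings (a toy program, not the referenced commit): OpenMP has no direct analogue of kernels, but the implicit array assignment can be rewritten as explicit loops under a combined target construct, assuming the array is already mapped to the device.

```fortran
program zero_on_device
  implicit none
  integer :: i, j
  complex(8), allocatable :: Xo_res(:,:)
  allocate(Xo_res(1024, 1024))

  ! OpenACC spelling of the fix: kernels turns the array assignment
  ! into a device operation (the array was created on the device first).
  !$acc enter data create(Xo_res)
  !$acc kernels
  Xo_res = (0.0_8, 0.0_8)
  !$acc end kernels
  !$acc exit data delete(Xo_res)

  ! A possible OpenMP-offload counterpart: explicit loops under a
  ! combined target construct replace the implicit assignment.
  !$omp target enter data map(alloc: Xo_res)
  !$omp target teams distribute parallel do collapse(2)
  do j = 1, size(Xo_res, 2)
    do i = 1, size(Xo_res, 1)
      Xo_res(i, j) = (0.0_8, 0.0_8)
    end do
  end do
  !$omp target exit data map(delete: Xo_res)

  deallocate(Xo_res)
end program zero_on_device
```

Compiled with nvfortran -acc or -mp=gpu, whichever set of directives matches the active model takes effect, while the other is ignored as comments.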
With kernels, small test on 4 GPUs, OpenACC version:
The data: issue.tar.gz
Software stack: nvhpc/23.1 (no present clauses), openmpi/4.1.4, on Leonardo.