Make G2G as default during the compilation... #874

Merged on Dec 10, 2024 (2 commits)

alazzaro (Member) commented Dec 9, 2024

... and control it at runtime via the DBCSR_USE_ACC_G2G environment variable (OFF by default).
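To illustrate what "compiled in by default, switched on at runtime" means in practice, here is a minimal sketch of an opt-in environment flag. It is not DBCSR's actual parsing code: the helper name `g2g_enabled` and the set of accepted "on" spellings are assumptions, since this thread does not state which values the library recognizes.

```python
import os

# Hypothetical spellings treated as "on"; the values DBCSR really accepts
# are not stated in this thread.
_TRUTHY = {"1", "on", "true", "yes"}


def g2g_enabled(environ=os.environ):
    """Return True only if DBCSR_USE_ACC_G2G is set to a truthy value.

    An unset or unrecognized value leaves the feature OFF, matching the
    "OFF by default" behavior described above.
    """
    value = environ.get("DBCSR_USE_ACC_G2G", "")
    return value.strip().lower() in _TRUTHY
```

Passing the environment as an argument keeps the check easy to exercise without touching the real process environment.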

alazzaro (Member, Author) commented Dec 9, 2024

FYI @gsitaram

hfp (Member) commented Dec 10, 2024

Quick Q, is G2G only hinging on GPU-aware MPI?

alazzaro (Member, Author)

> Quick Q, is G2G only hinging on GPU-aware MPI?

Yes, exactly.
The feature is now promoted to be "a runtime flag", but it is still experimental (there are a few things to consider). It will be officially released in 2025...

hfp (Member) commented Dec 10, 2024

> > Quick Q, is G2G only hinging on GPU-aware MPI?
>
> Yes, exactly. The feature is now promoted to be "a runtime flag", but it is still experimental (there are a few things to consider). It will be officially released in 2025...

Thanks!

I see __DBCSR_ACC_G2G also requires calculating the norms on the GPU (to keep data in place). In general, I wonder whether G2G would work in every case or whether some transfers are missing. Without norms on the GPU, I can imagine some transfers are missing, but is there anything else?

I am considering implementing norms on the GPU for OpenCL too. I think all contemporary MPI implementations have GPU support if, say, pointers are registered, etc.

alazzaro (Member, Author)

> I see __DBCSR_ACC_G2G also requires calculating the norms on the GPU (to keep data in place). In general, I wonder whether G2G would work in every case or whether some transfers are missing. Without norms on the GPU, I can imagine some transfers are missing, but is there anything else?

The short answer is: no, it would not work in every case, which is why it is still experimental. For this reason, I've added 31dc51a. Indeed, there can be cases where the kernels are too big (for instance, 50x50), so no kernel will be jitted and the library will fall back to the CPU (without host data!) and fail...
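The failure mode described here can be sketched as a small dispatch decision. This is only an illustration of the reasoning, not the content of 31dc51a: the function `select_path` and the `has_gpu_kernel` callback are hypothetical names, and the callback stands in for whatever lookup decides whether a kernel can be jitted for a given block shape.

```python
def select_path(block_shape, has_gpu_kernel, g2g_enabled):
    """Decide where a block multiplication of shape (m, n, k) can run.

    has_gpu_kernel: hypothetical callable (m, n, k) -> bool that reports
    whether a GPU kernel exists (or can be jitted) for this shape.
    """
    m, n, k = block_shape
    if has_gpu_kernel(m, n, k):
        return "gpu"
    if g2g_enabled:
        # With G2G the data lives only on the device, so the usual CPU
        # fallback has no host copy to work on. Fail with a clear message
        # instead of computing on missing data.
        raise RuntimeError(
            f"no GPU kernel for block shape {m}x{n}x{k}, and G2G keeps "
            "no host data for a CPU fallback"
        )
    return "cpu"
```

With G2G off, an oversized block simply takes the CPU path; with G2G on, the same block has nowhere valid to run, which is the case that makes the feature experimental.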

The norms are only one part of the story (and only relevant when we apply filtering). Another thing @gsitaram introduced: with G2G we move the B-transposed data, so we don't need to run the B-transpose for each step of the multiplication.
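A toy model can show why moving the already transposed B data saves work. It is not DBCSR's communication scheme: it assumes a simple ring in which every B panel visits every rank once per multiplication, and the function names are made up for the illustration.

```python
def transpose(block):
    """Transpose a dense block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def ring_transposes(panels, move_transposed):
    """Count transposes in a toy ring with one B panel per rank.

    Returns (number_of_transposes, transposed_panels_used_in_order).
    """
    nranks = len(panels)
    count = 0
    if move_transposed:
        # Transpose each panel once, then circulate the transposed copy.
        held = [transpose(p) for p in panels]
        count += nranks
    else:
        held = list(panels)
    used = []
    for _step in range(nranks):
        for rank in range(nranks):
            if move_transposed:
                bt = held[rank]
            else:
                # The untransposed panel arrived, so transpose it again.
                bt = transpose(held[rank])
                count += 1
            used.append(bt)
        # Shift every panel to the next rank in the ring.
        held = held[-1:] + held[:-1]
    return count, used
```

In this model the kernels see the same transposed panels either way, but the transpose count drops from one per panel per step to one per panel.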

Overall, the speed-up on LUMI was quite significant for the H2O-DFT-LS...

@alazzaro alazzaro merged commit a17f5d1 into develop Dec 10, 2024
22 checks passed
@alazzaro alazzaro deleted the compile_g2g branch December 10, 2024 09:44

hfp (Member) commented Dec 10, 2024

> Overall, the speed-up on LUMI was quite significant for the H2O-DFT-LS...

Good to know! I will try to work my way through it at some point.
