cuSolver fails with nvfortran >= 23.11 #76

Open
sangallidavide opened this issue Apr 17, 2024 · 6 comments

@sangallidavide
Member

The bug happens when running with GPU support (CUDAF)

Detected on my desktop (nvfortran 24.3, cuda 12.3) and on Leonardo (nvfortran 23.11, cuda 11.8 and 12.3)

Error message

[ERROR] STOP signal received while in[04] Optics
[ERROR] LINEAR ALGEBRA driver [SERIAL_lin_system_gpu]cusolverDnCgetrs failed

Error code is CUSOLVER_STATUS_EXECUTION_FAILED
https://docs.nvidia.com/cuda/cusolver/index.html

(Sometimes it fails even earlier, at cusolverDnCreate)
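When reproducing this outside yambo, it can help to tag each cuSOLVER return status with the call that produced it, so that a failure at cusolverDnCreate is told apart from one at cusolverDnCgetrs. The following is a minimal sketch in plain Fortran, not yambo's actual error handling. The module and routine names are hypothetical, and the numeric values are assumed to follow the `cusolverStatus_t` enum in `cusolver_common.h` (0 = success, 6 = execution failed); they should be checked against the installed header.

```fortran
! Hypothetical helper, not part of yambo or of the cuSOLVER Fortran interface.
module cusolver_check_sketch
  implicit none
  private
  public :: cusolver_status_name, cusolver_failed

contains

  ! Map an integer status to its symbolic name.
  ! Values assumed from the cusolverStatus_t enum (cusolver_common.h).
  pure function cusolver_status_name(istat) result(name)
    integer, intent(in) :: istat
    character(len=:), allocatable :: name

    select case (istat)
    case (0)
      name = 'CUSOLVER_STATUS_SUCCESS'
    case (1)
      name = 'CUSOLVER_STATUS_NOT_INITIALIZED'
    case (2)
      name = 'CUSOLVER_STATUS_ALLOC_FAILED'
    case (3)
      name = 'CUSOLVER_STATUS_INVALID_VALUE'
    case (4)
      name = 'CUSOLVER_STATUS_ARCH_MISMATCH'
    case (5)
      name = 'CUSOLVER_STATUS_MAPPING_ERROR'
    case (6)
      name = 'CUSOLVER_STATUS_EXECUTION_FAILED'
    case (7)
      name = 'CUSOLVER_STATUS_INTERNAL_ERROR'
    case (8)
      name = 'CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED'
    case default
      name = 'UNKNOWN_CUSOLVER_STATUS'
    end select
  end function cusolver_status_name

  ! Return .true. and print a tagged message when the status is not success.
  function cusolver_failed(istat, call_name) result(failed)
    integer, intent(in)          :: istat
    character(len=*), intent(in) :: call_name
    logical :: failed

    failed = (istat /= 0)
    if (failed) then
      write(*,'(4a)') '[ERROR] ', call_name, ' failed: ', cusolver_status_name(istat)
    end if
  end function cusolver_failed

end module cusolver_check_sketch
```

In a reproducer, the integer returned by each cuSOLVER call would be passed to `cusolver_failed` together with the name of that call, for example `cusolver_failed(istat, 'cusolverDnCgetrs')`.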

sangallidavide added a commit that referenced this issue Apr 19, 2024
MODIFIED *  configure include/version/version.m4 modules/mod_X.F pol_function/X_redux.F

Bugs:
- [yambo] Fix for issue #76

Patch sent by:  Davide Sangalli
sangallidavide added a commit that referenced this issue Apr 19, 2024
MODIFIED *  include/version/version.m4 modules/mod_X.F pol_function/X_redux.F

Bugs:
- [yambo] Fix for issue #76
  imported in branch 5.2

Patch sent by:  Davide Sangalli
sangallidavide added a commit that referenced this issue Apr 19, 2024
MODIFIED *  configure include/version/version.m4 modules/mod_X.F pol_function/X_redux.F

Bugs:
- [yambo] Fix for issue #76

Patch sent by:  Davide Sangalli
@sangallidavide
Member Author

Bug fixed by turning the contained subroutine in X_redux.F into an independent subroutine.
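For readers unfamiliar with the distinction, here is a minimal sketch of that kind of refactoring in plain Fortran. It is not the code from X_redux.F: every name is hypothetical, and a trivial division by the diagonal stands in for the real GPU linear solve. The "before" routine does its work in an internal procedure that reaches the caller's data through host association; the "after" routine calls a standalone subroutine that receives everything as explicit arguments.

```fortran
! Hypothetical sketch: names and the toy "solve" are illustrative only,
! they are not taken from yambo's X_redux.F.
module redux_sketch
  implicit none
  private
  public :: redux_contained, redux_independent, solve_step_independent

contains

  ! BEFORE: the work is done by an internal (contained) procedure that
  ! sees n, A and b through host association instead of arguments.
  subroutine redux_contained(n, A, b)
    integer, intent(in)    :: n
    complex, intent(in)    :: A(n,n)
    complex, intent(inout) :: b(n)

    call solve_step()

  contains

    subroutine solve_step()
      integer :: i
      ! Toy stand-in for the linear solve: divide by the diagonal of A.
      do i = 1, n
        b(i) = b(i) / A(i,i)
      end do
    end subroutine solve_step

  end subroutine redux_contained

  ! AFTER: the caller passes everything explicitly to a standalone routine.
  subroutine redux_independent(n, A, b)
    integer, intent(in)    :: n
    complex, intent(in)    :: A(n,n)
    complex, intent(inout) :: b(n)

    call solve_step_independent(n, A, b)
  end subroutine redux_independent

  ! Same toy operation, but with no host association: all data arrive
  ! through the argument list.
  subroutine solve_step_independent(n, A, b)
    integer, intent(in)    :: n
    complex, intent(in)    :: A(n,n)
    complex, intent(inout) :: b(n)
    integer :: i

    do i = 1, n
      b(i) = b(i) / A(i,i)
    end do
  end subroutine solve_step_independent

end module redux_sketch
```

Both entry points compute the same result on the host; they differ only in how the inner routine receives its data. The thread does not explain why the contained form triggers the failure with nvfortran >= 23.11, so the sketch shows the structural change only.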

@andreamarini
Member

I am having the same problem. In which branch did you split X_redux?

@andreamarini andreamarini reopened this Apr 22, 2024
@sangallidavide
Member Author

The original branch is https://github.com/yambo-code/yambo-devel/tree/tech/devel-gpu
However, that branch is well ahead of develop. It is probably best to look at the GPL master.

This is the commit:
7197a33

@andreamarini
Member

I realized that all past runs on eliud and mo with CUDA failed not because of a buggy compilation but precisely because of a cuSolver crash.

https://media.yambo-code.eu/robots/develop/eliud.kipchoge.2_develop_1_error.php

If these failures are connected to this bug, then the fix should be introduced ASAP in the bug-fixes.

@sangallidavide
Member Author

The cuSolver error does not affect tests like Al111/04_HF,
so the situation on eliud is different.

@sangallidavide
Member Author

Here the failures were likely due to cuSolver:
https://media.yambo-code.eu/robots/develop/mo.farah.4_develop_1_error.php

As you can see, for Al111, 02_eels fails, while 04_HF is ok
