Out of memory bug in DCU calculation (Stress Memory) #3710

Closed
16 tasks
pxlxingliang opened this issue Mar 14, 2024 · 5 comments · Fixed by #4047
Labels: GPU & DCU & HPC (GPU, DCU, and HPC related issues)

Comments

@pxlxingliang
Collaborator

Describe the bug

I used a Sugon DCU to run an SCF calculation on a system of 216 Si atoms. When calculating the stress, ABACUS stopped and threw the error below:

009_216Si.zip

Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@pxlxingliang pxlxingliang added the Bugs Bugs that only solvable with sufficient knowledge of DFT label Mar 14, 2024
@WHUweiqingzhou WHUweiqingzhou self-assigned this Mar 18, 2024
@WHUweiqingzhou
Collaborator

@denghuilu could you have a look and leave a suggestion?

@denghuilu
Member

The error encountered appears to be an Out of Memory (OOM) issue, as indicated by the program's output. The computation of stress typically demands additional device memory, which may lead to this problem, especially when dealing with a significantly large system.
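
To make the memory argument concrete, here is a rough back-of-envelope sketch of how the temporary buffers needed for a stress calculation can push a run past the device capacity. The array shapes, the 16 GiB capacity, and the helper names `buffer_gib` and `fits_on_device` are hypothetical placeholders chosen for illustration; they are not taken from ABACUS or from the 216 Si input.

```python
# Back-of-envelope estimate of device memory for dense complex arrays.
# All sizes used here are hypothetical, not values from the 216 Si run.

BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte floating-point numbers


def buffer_gib(n_rows: int, n_cols: int,
               bytes_per_elem: int = BYTES_PER_COMPLEX_DOUBLE) -> float:
    """Return the size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_elem / 1024**3


def fits_on_device(resident_gib: float, extra_gib: float,
                   capacity_gib: float) -> bool:
    """True if resident data plus extra temporaries fit in device memory."""
    return resident_gib + extra_gib <= capacity_gib


if __name__ == "__main__":
    # Hypothetical: 500 bands x 200,000 plane waves, already on the device.
    resident = buffer_gib(500, 200_000)
    # Hypothetical: the stress step needs several work arrays of the same shape.
    extra = 12 * buffer_gib(500, 200_000)
    capacity = 16.0  # hypothetical device memory in GiB

    print(f"resident: {resident:.2f} GiB, extra for stress: {extra:.2f} GiB")
    print("fits on device:", fits_on_device(resident, extra, capacity))
```

The point of the sketch is only that the SCF step can fit comfortably while the additional work arrays for the stress step do not, which would be consistent with the failure showing up only at the stress stage.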

@WHUweiqingzhou
Collaborator

@dyzheng could you have a look?

@dyzheng
Collaborator

dyzheng commented Mar 25, 2024

@mohanchen mohanchen assigned Qianruipku and unassigned WHUweiqingzhou and dyzheng Apr 9, 2024
@mohanchen mohanchen added GPU & DCU & HPC GPU and DCU and HPC related any issues and removed Bugs Bugs that only solvable with sufficient knowledge of DFT labels May 5, 2024
@WHUweiqingzhou WHUweiqingzhou assigned dyzheng and unassigned Qianruipku May 8, 2024
@dyzheng
Collaborator

dyzheng commented May 8, 2024

#4047 will solve this issue, possibly this week.

@WHUweiqingzhou WHUweiqingzhou changed the title Out of memory bug in DCU calculation Out of memory bug in DCU calculation (Stress Memory) May 10, 2024
6 participants