GPU memory issue while executing lmp #4182

Closed
mayankaditya opened this issue Oct 4, 2024 · 6 comments · Fixed by #4190

Comments

@mayankaditya

Summary

Dear DeePMD developers,

I am running machine-learned MD using the DeePMD-kit package from Docker. I am facing a memory issue while executing the LAMMPS simulations (with the lmp executable provided with the DeePMD-kit package) on a GPU server.
I am getting the following error:
2024-10-04 10:22:31.041258: W tensorflow/core/common_runtime/bfc_allocator.cc:474] _____***********************
2024-10-04 10:22:31.041306: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at matmul_op_impl.h:681 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[28800000,50] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
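For scale, a rough estimate of that single allocation (double precision, 8 bytes per element): 28,800,000 × 50 × 8 bytes ≈ 11.5 GB, before counting any other intermediate tensors in the graph.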

Please take a look at the attached script file and advise.

DeePMD-kit Version

2.1.1

Backend and its version

TensorFlow Version 2.8.0

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

No response


@mayankaditya
Author

I am using four GPUs, each with 80 GB of GPU RAM.

@njzjz
Member

njzjz commented Oct 5, 2024

Please take a look at the attached script file and advise.

Where is the attachment?

@mayankaditya
Author

mayankaditya commented Oct 6, 2024 via email

@mayankaditya
Author

After submitting the post, I realized the attachment was missing or not visible; I don't know why. I have sent the same file through email ([email protected]). Please have a look and advise on how to tackle the issue.

Thanks,
Mayank

@njzjz
Member

njzjz commented Oct 7, 2024

I am using four GPUs, each with 80 GB of GPU RAM.

It seems you only use 1 MPI rank. Each MPI rank can only use one card.
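
For illustration, a minimal sketch of launching LAMMPS over several MPI ranks (assuming the input script is named in.lammps and an mpirun launcher is available; adjust to your environment and scheduler):

mpirun -np 4 lmp -in in.lammps

With one rank per card, each rank handles a smaller part of the system, which also reduces the per-GPU tensor sizes.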

@njzjz
Copy link
Member

njzjz commented Oct 7, 2024

Each MPI rank can only use one card.

I realize this is not documented anywhere.

njzjz added Docs and removed wontfix labels Oct 7, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 7, 2024
njzjz linked a pull request Oct 7, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 8, 2024
Fix #4182.


## Summary by CodeRabbit

- **Documentation**
  - Updated `lammps-command.md` to clarify GPU usage and unit handling in LAMMPS.
  - Enhanced `howtoset_num_nodes.md` with new sections on MPI and multiprocessing for TensorFlow and PyTorch, improving clarity and usability.
  - Added guidance on GPU resource allocation for parallel processes.


Signed-off-by: Jinzhe Zeng <[email protected]>
njzjz closed this as completed Oct 15, 2024