GPU memory issue while executing lmp #4182
Comments
I am using four GPUs, each with 80 GB of GPU RAM.
Where is the attachment?
Sorry, I missed the attachment. Please take a look at the attached job script and output.
After submitting the post, I realized the attachment was missing or not visible; I am not sure why. I have sent the same files by email ([email protected]). Please have a look and advise how to tackle the issue. Thanks.
It seems you are only using 1 MPI rank. Each MPI rank can only use one GPU card.
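A minimal sketch of what a multi-rank launch could look like, assuming four GPUs on one node and that the LAMMPS binary and input file are called `lmp` and `in.lammps` (both names are placeholders for whatever the actual job script uses):

```bash
# Start one MPI rank per GPU so each rank can be served by its own card;
# with a single rank, everything lands on GPU 0 and the other cards sit idle.
mpirun -np 4 lmp -in in.lammps
```

Because LAMMPS decomposes the system across ranks, the per-rank evaluation (and its GPU memory footprint) shrinks roughly in proportion to the number of ranks.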
I realize this is not documented anywhere. |
Fix #4182. Signed-off-by: Jinzhe Zeng <[email protected]>

Summary by CodeRabbit
- Documentation
  - Updated `lammps-command.md` to clarify GPU usage and unit handling in LAMMPS.
  - Enhanced `howtoset_num_nodes.md` with new sections on MPI and multiprocessing for TensorFlow and PyTorch, improving clarity and usability.
  - Added guidance on GPU resource allocation for parallel processes.
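As a general illustration of pinning one GPU per rank (a common HPC pattern, not taken from the updated documentation; the wrapper file name is hypothetical and `OMPI_COMM_WORLD_LOCAL_RANK` assumes Open MPI):

```bash
#!/bin/bash
# bind_gpu.sh (hypothetical wrapper): expose exactly one GPU to each MPI rank.
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK; under Slurm's srun, SLURM_LOCALID
# plays the same role.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
```

Invoked as, for example, `mpirun -np 4 ./bind_gpu.sh lmp -in in.lammps`, so that rank *i* only ever sees card *i*.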
Summary
Dear DeePMD developers,
I am running a machine-learned MD simulation using the DeePMD package from Docker. I am facing a memory issue while executing the LAMMPS simulations (provided with the DeePMD package) on the GPU server.
I am getting the following error:
2024-10-04 10:22:31.041258: W tensorflow/core/common_runtime/bfc_allocator.cc:474] _____***********************
2024-10-04 10:22:31.041306: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at matmul_op_impl.h:681 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[28800000,50] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Please take a look at the attached script file and advise.
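For scale, a single double-precision tensor of shape [28800000, 50] occupies about 28.8 × 10⁶ × 50 × 8 B ≈ 11.5 GB, and it is only one of the buffers held at that point in the graph, so a run confined to a single card can run out of memory even on an 80 GB GPU.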
DeePMD-kit Version
2.1.1
Backend and its version
TensorFlow Version 2.8.0
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
No response