GPU memory issue while executing lmp #4182

Closed
mayankaditya opened this issue Oct 4, 2024 · 6 comments · Fixed by #4190

Comments

@mayankaditya

Summary

Dear DeePMD developers,

I am running machine-learned MD using the DeePMD-kit package from Docker. I am facing a memory issue while executing the LAMMPS simulations (with the lmp executable provided with the DeePMD-kit package) on a GPU server.
I am getting the following error:
2024-10-04 10:22:31.041258: W tensorflow/core/common_runtime/bfc_allocator.cc:474] _____***********************
2024-10-04 10:22:31.041306: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at matmul_op_impl.h:681 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[28800000,50] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
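For scale, a rough estimate of that single allocation (double precision, 8 bytes per element): 28,800,000 × 50 × 8 bytes ≈ 11.5 GB, before counting any other intermediate tensors in the graph.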

Please take a look at the attached script file and advise.

DeePMD-kit Version

2.1.1

Backend and its version

TensorFlow Version 2.8.0

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

No response


@mayankaditya
Author

I am using four GPUs, each with 80 GB of GPU RAM.

@njzjz
Member

njzjz commented Oct 5, 2024

Please take a look at the attached script file and advise.

Where is the attachment?

@mayankaditya
Author

mayankaditya commented Oct 6, 2024 via email

@mayankaditya
Author

After submitting the post, I realized the attachment was missing or not visible; I don't know why. I have sent the same file through email ([email protected]). Please have a look and advise on how to tackle the issue.

Thanks,
Mayank

@njzjz
Member

njzjz commented Oct 7, 2024

I am using four GPUs, each with 80 GB of GPU RAM.

It seems you only use 1 MPI rank. Each MPI rank can only use one card.
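
For illustration, a minimal sketch of launching LAMMPS over several MPI ranks (assuming the input script is named in.lammps and an mpirun launcher is available; adjust to your environment and scheduler):

mpirun -np 4 lmp -in in.lammps

With one rank per card, each rank handles a smaller part of the system, which also reduces the per-GPU tensor sizes.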

@njzjz
Copy link
Member

njzjz commented Oct 7, 2024

Each MPI rank can only use one card.

I realize this is not documented anywhere.

njzjz added Docs and removed wontfix labels Oct 7, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 7, 2024
njzjz linked a pull request Oct 7, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 8, 2024
Fix #4182.


## Summary by CodeRabbit

- **Documentation**
  - Updated `lammps-command.md` to clarify GPU usage and unit handling in LAMMPS.
  - Enhanced `howtoset_num_nodes.md` with new sections on MPI and multiprocessing for TensorFlow and PyTorch, improving clarity and usability.
  - Added guidance on GPU resource allocation for parallel processes.


Signed-off-by: Jinzhe Zeng <[email protected]>
njzjz closed this as completed Oct 15, 2024