
CUDA_ERROR_OUT_OF_MEMORY: out of memory Deepmd-kit V3.0.1 #4604

Open
Manyi-Yang opened this issue Feb 18, 2025 · 4 comments

@Manyi-Yang

Bug summary

When I run dp train (descriptor "se_e2_a") with DeePMD-kit V3.0.1, I encounter the following memory problem.
Using the same input file and computational resources, but switching DeePMD-kit from V3.0.1 to V2.2.9, it works well.

[2025-02-18 07:20:30,704] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-02-18 07:22:41,365] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-02-18 07:04:22.712642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92498 MB memory: -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2025-02-18 07:04:22.717361: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2025-02-18 07:04:22.843798: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaab004f1070
[2025-02-18 07:04:24,676] DEEPMD INFO Adjust batch size from 1024 to 2048
[2025-02-18 07:04:24,729] DEEPMD INFO Adjust batch size from 2048 to 4096
[2025-02-18 07:04:24,786] DEEPMD INFO Adjust batch size from 4096 to 8192
[2025-02-18 07:04:24,995] DEEPMD INFO Adjust batch size from 8192 to 16384
[2025-02-18 07:04:25,642] DEEPMD INFO Adjust batch size from 16384 to 32768
[2025-02-18 07:04:26,180] DEEPMD INFO Adjust batch size from 32768 to 65536
[2025-02-18 07:04:27,249] DEEPMD INFO Adjust batch size from 65536 to 131072
2025-02-18 07:04:27.405366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28270264320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2025-02-18 07:04:27.598927: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2025-02-18 07:04:27.599079: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1

DeePMD-kit Version

Deepmd-kit V3.0.1

Backend and its version

TensorFlow

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

"descriptor": { "type": "se_e2_a", "_sel": [90, 80, 120, 12], "rcut_smth": 3.0, "rcut": 4.0, "neuron": [ 32, 64, 128 ], "resnet_dt": false, "axis_neuron": 16, "seed": 7232, "_comment": " that's all" }, "fitting_net": { "neuron": [ 240, 240, 240, 240

Steps to Reproduce

dp train --mpi-log=master input.json >> train_gpu.log 2>&1

Further Information, Files, and Links

No response

@Yi-FanLi
Collaborator

Yi-FanLi commented Feb 19, 2025

Hi @Manyi-Yang, we encountered this issue previously, and we thought it was due to a TensorFlow update. That is why we added information about setting the batch size to the printed message: #3822, #4283. Could you please try setting export DP_INFER_BATCH_SIZE=65536, or an even smaller value, and see if the error still appears?
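
For example, reusing the same training command from the issue (65536 is just a starting point; lower it further if the error persists):

# Set the inference batch size (nframes * natoms) before launching training,
# then run dp train exactly as before.
export DP_INFER_BATCH_SIZE=65536
dp train --mpi-log=master input.json >> train_gpu.log 2>&1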

@Manyi-Yang
Author

Hello YiFan,

Thank you very much for your reply.
I have tested it with export DP_INFER_BATCH_SIZE=65536, but it does not seem to work here. I've also tried smaller values such as 500, but that didn't work either.

> [2025-02-20 02:44:13,838] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
> [2025-02-20 02:46:39,386] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
> 2025-02-20 02:46:39.660054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92505 MB memory:  -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
> 2025-02-20 02:46:39.664742: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
> 2025-02-20 02:46:39.819202: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaad9c11df0
> [2025-02-20 02:46:42,876] DEEPMD INFO    Adjust batch size from 1024 to 2048
> [2025-02-20 02:46:42,928] DEEPMD INFO    Adjust batch size from 2048 to 4096
> [2025-02-20 02:46:42,986] DEEPMD INFO    Adjust batch size from 4096 to 8192
> [2025-02-20 02:46:43,196] DEEPMD INFO    Adjust batch size from 8192 to 16384
> [2025-02-20 02:46:43,865] DEEPMD INFO    Adjust batch size from 16384 to 32768
> [2025-02-20 02:46:44,422] DEEPMD INFO    Adjust batch size from 32768 to 65536
> [2025-02-20 02:46:45,507] DEEPMD INFO    Adjust batch size from 65536 to 131072
> 2025-02-20 02:46:45.874190: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28277473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
> 2025-02-20 02:46:45.875366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 23.70GiB (25449725952 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.877794: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 21.33GiB (22904752128 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879037: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 19.20GiB (20614277120 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879084: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 17.28GiB (18552848384 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879103: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 15.55GiB (16697563136 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879121: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 14.00GiB (15027806208 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880518: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 12.60GiB (13525024768 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880546: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 11.34GiB (12172521472 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880573: W external/local_tsl/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
> 2025-02-20 02:46:45.880600: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x4003e4000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880631: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400400000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880666: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400420000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880883: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400460000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.882674: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400580000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.884786: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400720000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887071: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887606: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.887768: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916206: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916241: W external/local_tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 10.37GiB (rounded to 11132284160)requested by op Sum
> If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
> Current allocation summary follows.
> 2025-02-20 02:46:55.916259: I external/local_tsl/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
> 2025-02-20 02:46:55.916278: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (256):        Total Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 52B client-requested in use in bin.
> 2025-02-20 02:46:55.916289: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (512):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916300: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (1024):       Total Chunks: 2, Chunks in use: 2. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 2.0KiB client-requested in use in bin.
> 2025-02-20 02:46:55.916309: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (2048):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916316: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (4096):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916323: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (8192):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916329: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (16384):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916335: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (32768):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916341: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (65536):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916355: F external/local_tsl/tsl/framework/bfc_allocator.cc:1043] Check failed: b->free_chunks.size() == bin_info.total_chunks_in_bin - bin_info.total_chunks_in_use (1 vs. 2)

@Yi-FanLi
Collaborator

Hi @Manyi-Yang, it seems that your environment variable is not taking effect. I ran the following comparison test:

If I add export DP_INFER_BATCH_SIZE=512 to my Slurm script and submit the job, the output looks like this:

2025-02-23 00:08:36.396489: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740287316.734467   60773 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740287316.853829   60773 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1740287322.651444   60773 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78763 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:dd:00.0, compute capability: 9.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1740287322.703260   60773 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
I0000 00:00:1740287323.072218   60773 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78763 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:dd:00.0, compute capability: 9.0
[2025-02-23 00:08:43,212] DEEPMD INFO    # ---------------output of dp test--------------- 

In contrast, if I do not have this line, my output will look like this:

2025-02-23 00:09:24.353888: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740287364.502249   60912 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740287364.508846   60912 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1740287369.269889   60912 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78763 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:dd:00.0, compute capability: 9.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1740287369.322719   60912 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
I0000 00:00:1740287369.687459   60912 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78763 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:dd:00.0, compute capability: 9.0
[2025-02-23 00:09:29,782] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms). 
[2025-02-23 00:09:29,832] DEEPMD INFO    # ---------------output of dp test--------------- 
[2025-02-23 00:09:29,832] DEEPMD INFO    # testing system : validation_data_reformat/atomic_system
[2025-02-23 00:09:32,288] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2025-02-23 00:09:34,175] DEEPMD INFO    Adjust batch size from 2048 to 4096
[2025-02-23 00:09:38,130] DEEPMD INFO    Adjust batch size from 4096 to 8192

As shown above, if the environment variable takes effect, there are two key differences:

  1. The line DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms). does not appear.
  2. The lines DEEPMD INFO Adjust batch size from xxx to xxx do not appear.

They do not appear because the automatic batch-size adjustment code is not executed when the batch size is specified by the user.
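
A quick way to check whether the variable actually reaches the job environment is to print it right before the dp command in your job script, for example:

# Illustrative check: this should print the value you exported;
# an empty value means the export did not take effect in this environment.
echo "DP_INFER_BATCH_SIZE=${DP_INFER_BATCH_SIZE}"
dp train --mpi-log=master input.json >> train_gpu.log 2>&1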

Can you tell me how your batch size is set?

@Manyi-Yang
Author

Hi @Yi-FanLi,
Sorry, I found that I had made a typo when exporting DP_INFER_BATCH_SIZE.
With your help, my problem is now solved. Thank you very much!
