Add a reminder for the illegal memory error (#3822)
When using the GPU version of the neighbor stat code, one may encounter
the following error, which aborts the training:

```
[2024-05-24 23:00:42,027] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-05-24 23:00:42,139] DEEPMD INFO    Adjust batch size from 2048 to 4096
[2024-05-24 23:00:42,285] DEEPMD INFO    Adjust batch size from 4096 to 8192
[2024-05-24 23:00:42,628] DEEPMD INFO    Adjust batch size from 8192 to 16384
[2024-05-24 23:00:43,180] DEEPMD INFO    Adjust batch size from 16384 to 32768
[2024-05-24 23:00:44,341] DEEPMD INFO    Adjust batch size from 32768 to 65536
[2024-05-24 23:00:46,713] DEEPMD INFO    Adjust batch size from 65536 to 131072
2024-05-24 23:00:52.071120: E tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-05-24 23:00:52.075435: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
/bin/sh: line 1: 1397100 Aborted
```

This is likely caused by an issue in TensorFlow. Setting the
environment variable `DP_INFER_BATCH_SIZE` works around it.

This PR reminds the user to set a small `DP_INFER_BATCH_SIZE` to avoid
this issue.
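
As a rough illustration of picking a value, the sketch below reads the `Adjust batch size from X to Y` lines from a log like the one above and suggests the size used just before the last adjustment. The helper name `suggest_infer_batch_size` is hypothetical, and treating the previous size as safe is an assumption, not something the log guarantees; a smaller value may still be needed.

```python
import re
from typing import Optional

# Matches the DEEPMD INFO lines shown in the log above.
ADJUST_RE = re.compile(r"Adjust batch size from (\d+) to (\d+)")


def suggest_infer_batch_size(log_text: str) -> Optional[int]:
    """Return the batch size in use before the last adjustment, or None.

    Hypothetical helper: it assumes the crash happened at the size named
    last in the log, so the size before it is a candidate upper bound.
    """
    matches = ADJUST_RE.findall(log_text)
    if not matches:
        return None
    previous, _last = matches[-1]
    return int(previous)


sample_log = """\
[2024-05-24 23:00:44,341] DEEPMD INFO    Adjust batch size from 32768 to 65536
[2024-05-24 23:00:46,713] DEEPMD INFO    Adjust batch size from 65536 to 131072
"""

if __name__ == "__main__":
    value = suggest_infer_batch_size(sample_log)
    if value is not None:
        # Print a shell line to run before restarting the job.
        print(f"export DP_INFER_BATCH_SIZE={value}")
```

The variable has to be set in the environment of the process that runs the neighbor stat code, so exporting it in the shell before launching the job is the simplest route.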

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

- **Bug Fixes**
  - Added a log message to guide users on setting the
    `DP_INFER_BATCH_SIZE` environment variable to avoid TensorFlow illegal
    memory access issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Yifan Li李一帆 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
(cherry picked from commit d754672)
Signed-off-by: Jinzhe Zeng <[email protected]>
3 people authored and njzjz committed Jul 3, 2024
1 parent 1c3ac97 commit 4c57e4a
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions deepmd_utils/utils/batch_size.py
@@ -62,6 +62,11 @@ def __init__(self, initial_batch_size: int = 1024, factor: float = 2.0) -> None:
             self.maximum_working_batch_size = initial_batch_size
             if self.is_gpu_available():
                 self.minimal_not_working_batch_size = 2**31
+                log.info(
+                    "If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. "
+                    "To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. "
+                    "The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms). "
+                )
             else:
                 self.minimal_not_working_batch_size = (
                     self.maximum_working_batch_size + 1
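
To make the effect of a cap concrete, here is a simplified sketch of a batch size search that grows by `factor` after each success. It is not the code in `batch_size.py`: the class `BatchSizeSearch`, its `record_success` method, and the way `DP_INFER_BATCH_SIZE` pins the size are assumptions for illustration. Only the attribute names, the defaults, and the `2**31` GPU bound come from the diff above.

```python
import os
from typing import Mapping, Optional


class BatchSizeSearch:
    """Hypothetical stand-in for an automatic batch size helper."""

    def __init__(
        self,
        initial_batch_size: int = 1024,
        factor: float = 2.0,
        environ: Optional[Mapping[str, str]] = None,
    ) -> None:
        env = os.environ if environ is None else environ
        self.factor = factor
        # Assumption: a positive DP_INFER_BATCH_SIZE fixes the batch size.
        cap = int(env.get("DP_INFER_BATCH_SIZE", 0))
        if cap > 0:
            self.current_batch_size = cap
            self.maximum_working_batch_size = cap
            self.minimal_not_working_batch_size = cap + 1
        else:
            self.current_batch_size = initial_batch_size
            self.maximum_working_batch_size = initial_batch_size
            # Mirrors the GPU branch in the diff: no failing size known yet.
            self.minimal_not_working_batch_size = 2**31

    def record_success(self) -> None:
        """Grow the batch size unless that would reach a failing size."""
        self.maximum_working_batch_size = max(
            self.maximum_working_batch_size, self.current_batch_size
        )
        grown = int(self.current_batch_size * self.factor)
        if grown < self.minimal_not_working_batch_size:
            self.current_batch_size = grown


if __name__ == "__main__":
    free = BatchSizeSearch(environ={})
    free.record_success()
    print(free.current_batch_size)  # grows from 1024 to 2048

    capped = BatchSizeSearch(environ={"DP_INFER_BATCH_SIZE": "4096"})
    capped.record_success()
    print(capped.current_batch_size)  # stays at 4096
```

Under these assumptions, an uncapped search keeps doubling toward the `2**31` bound, which is how the log above reaches 131072, while a capped search never leaves the configured value.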