Commit
Add a reminder for the illegal memory error (#3822)
When using the GPU version of the neighbor stat code, one may encounter the following error, after which training stops:

```
[2024-05-24 23:00:42,027] DEEPMD INFO Adjust batch size from 1024 to 2048
[2024-05-24 23:00:42,139] DEEPMD INFO Adjust batch size from 2048 to 4096
[2024-05-24 23:00:42,285] DEEPMD INFO Adjust batch size from 4096 to 8192
[2024-05-24 23:00:42,628] DEEPMD INFO Adjust batch size from 8192 to 16384
[2024-05-24 23:00:43,180] DEEPMD INFO Adjust batch size from 16384 to 32768
[2024-05-24 23:00:44,341] DEEPMD INFO Adjust batch size from 32768 to 65536
[2024-05-24 23:00:46,713] DEEPMD INFO Adjust batch size from 65536 to 131072
2024-05-24 23:00:52.071120: E tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-05-24 23:00:52.075435: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
/bin/sh: line 1: 1397100 Aborted
```

This is likely caused by an issue in TensorFlow. Setting the environment variable `DP_INFER_BATCH_SIZE` works around it. This PR adds a reminder telling the user to set a small `DP_INFER_BATCH_SIZE` when the error occurs.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

- **Bug Fixes**
  - Added a log message to guide users on setting the `DP_INFER_BATCH_SIZE` environment variable to avoid TensorFlow illegal memory access issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Yifan Li李一帆 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
(cherry picked from commit d754672)
Signed-off-by: Jinzhe Zeng <[email protected]>
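To illustrate the batch-size growth seen in the log and how a limit read from `DP_INFER_BATCH_SIZE` could stop it early, here is a minimal sketch. It is not the DeePMD-kit implementation: `read_batch_cap` and `doubling_schedule` are hypothetical helpers, and treating the variable as an upper bound on the doubling is an assumption made for illustration only.

```python
import os


def read_batch_cap(env=None):
    """Return DP_INFER_BATCH_SIZE as a positive int, or None if unset or invalid.

    Hypothetical helper: only the variable name comes from the commit message.
    """
    env = os.environ if env is None else env
    raw = env.get("DP_INFER_BATCH_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return None
    return value if value > 0 else None


def doubling_schedule(start, n_adjustments, cap=None):
    """List the batch sizes tried when doubling from `start`.

    Stops early if the next size would exceed `cap` (assumed semantics).
    """
    sizes = [start]
    for _ in range(n_adjustments):
        nxt = sizes[-1] * 2
        if cap is not None and nxt > cap:
            break
        sizes.append(nxt)
    return sizes


if __name__ == "__main__":
    # Seven doublings from 1024, as in the log above, end at 131072.
    print(doubling_schedule(1024, 7))
    # With a limit taken from the environment, growth stops at that limit.
    print(doubling_schedule(1024, 7, cap=read_batch_cap()))
```

In practice the variable would be set in the shell before launching training, for example `export DP_INFER_BATCH_SIZE=1024`; the value 1024 is only an example, and a suitable value depends on the GPU and the system size.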