[BUG]Segmentation fault while running DPLR #2895

Kehan-Cai-nanako · 2023-10-03T15:17:56Z

Bug summary

While running a DPLR molecular dynamics simulation in LAMMPS with a well-trained model, the program occasionally terminates with a segmentation fault. The problem does not occur every time, but it is more likely when the system is larger (~10000 to 30000 real atoms in DPLR) and when the simulation is longer (~100 ps to 1 ns). It has been reproduced on at least two clusters: in a user-created Conda environment on the Della cluster at Princeton University, and in a container image on the Perlmutter cluster at NERSC. I therefore believe it is not specific to one environment.

DeePMD-kit Version

2.2.4

TensorFlow Version

2.9.0

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

unlink: cannot unlink 'frozen_model.pb': No such file or directory
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit: Successfully load libcudart.so
1: DeePMD-kit: Successfully load libcudart.so
2: DeePMD-kit: Successfully load libcudart.so
3: DeePMD-kit: Successfully load libcudart.so
0: 2023-10-02 07:20:25.563709: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
1: 2023-10-02 07:20:25.659166: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
3: 2023-10-02 07:20:25.830325: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2: 2023-10-02 07:20:26.021502: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
3: 2023-10-02 07:20:38.509113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
0: 2023-10-02 07:20:38.509285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
2: 2023-10-02 07:20:38.511058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
1: 2023-10-02 07:20:38.520851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
1: 2023-10-02 07:20:38.533009: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2: 2023-10-02 07:20:38.533008: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
3: 2023-10-02 07:20:38.533018: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
0: 2023-10-02 07:20:38.543769: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
1: 2023-10-02 07:20:38.653084: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
0: 2023-10-02 07:20:38.653566: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
3: 2023-10-02 07:20:38.653682: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2: 2023-10-02 07:20:38.659171: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: 2023-10-02 07:20:38.961150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
2: 2023-10-02 07:20:38.961174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
0: 2023-10-02 07:20:38.961410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
3: 2023-10-02 07:20:38.961338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: 2023-10-02 07:20:39.211466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: 2023-10-02 07:20:39.224662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
3: 2023-10-02 07:20:39.227708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
1: 2023-10-02 07:20:39.229050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
1: /opt/deepmd-kit/bin/lmp: line 11: 2288062 Segmentation fault /opt/deepmd-kit/bin/_lmp "$@"
srun: error: nid008668: task 1: Exited with exit code 139
srun: Terminating StepId=16480142.0
0: slurmstepd: error: *** STEP 16480142.0 ON nid008668 CANCELLED AT 2023-10-02T11:01:10 ***
srun: error: nid008668: tasks 0,2-3: Terminated
srun: Force Terminated StepId=16480142.0

give_yifan.zip

Steps to Reproduce

The neural network model is too large to attach above. Please feel free to ask for the model in order to reproduce the error.

Run the command:

sbatch run.slurm

Further Information, Files, and Links

No response

When debugging deepmodeling#3103, I noticed a segfault in `~Region` that prevents the actual error message from being shown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finally restores the original pointers, which can be deleted. However, `deepmd::copy_coord_gpu` might throw a CUDA error for any reason, in which case the pointers are not restored. `~Region` then tries to delete a pointer that it doesn't own, causing the segfault. The CUDA error message is not visible because of the segfault. The segfault in deepmodeling#2895 may also be caused by this. This PR adds a new constructor to `Region` that accepts external pointers. `~Region` will delete `boxt` and `rec_boxt` only when the pointers are not external. We still need to figure out the reason for the error in `_norm_copy_coord_gpu` behind the segfault. Signed-off-by: Jinzhe Zeng <[email protected]>
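The ownership fix described above can be illustrated with a minimal sketch. This is not the actual `deepmd::Region` code: `RegionSketch`, its `external` flag, `failing_copy`, and `error_is_reported` are hypothetical names, and plain stack arrays stand in for GPU memory. It only shows the idea that a destructor should skip freeing pointers it was handed from outside, so that an exception thrown in between can propagate instead of being masked by a crash.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for deepmd::Region; the real class has more members.
template <typename T>
struct RegionSketch {
  T* boxt;
  T* rec_boxt;
  bool external;  // true when the buffers are owned by the caller

  // Owning constructor: allocates its own buffers and frees them later.
  RegionSketch() : boxt(new T[9]()), rec_boxt(new T[9]()), external(false) {}

  // Non-owning constructor: wraps caller-provided buffers.
  RegionSketch(T* extern_boxt, T* extern_rec_boxt)
      : boxt(extern_boxt), rec_boxt(extern_rec_boxt), external(true) {}

  // Copying would duplicate raw owning pointers, so it is disabled here.
  RegionSketch(const RegionSketch&) = delete;
  RegionSketch& operator=(const RegionSketch&) = delete;

  ~RegionSketch() {
    // Only free memory that this object allocated itself.
    if (!external) {
      delete[] boxt;
      delete[] rec_boxt;
    }
  }
};

// Simulates a device call that fails (an assumption for illustration only).
inline void failing_copy(const RegionSketch<double>&) {
  throw std::runtime_error("simulated CUDA error");
}

// Returns true when the simulated error reaches the caller as an exception.
inline bool error_is_reported() {
  double device_like_box[9] = {0};  // stand-in for GPU memory
  double device_like_rec_box[9] = {0};
  try {
    RegionSketch<double> region(device_like_box, device_like_rec_box);
    failing_copy(region);
  } catch (const std::runtime_error&) {
    // The destructor ran during stack unwinding and left the external
    // buffers alone, so the original error is still visible here.
    return true;
  }
  return false;
}
```

With this pattern there is no need to swap the pointers back before the destructor runs, which is the step that was skipped when the copy threw.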


[A segfault](https://github.com/deepmodeling/deepmd-kit/actions/runs/7782245372/job/21218255452) sometimes appears in the tests after #3223. The reason is that the required shape of the electric field is set to nall * 3 in the C interface, but a vector of nloc * 3 is supplied by the tests and LAMMPS and is what the C++ interface uses. It was not caught before because the program didn't write to these addresses (reading alone usually won't cause a segfault unless the address is invalid). This may fix #2895. Signed-off-by: Jinzhe Zeng <[email protected]>
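The shape mismatch can be made concrete with a small sketch. Here nloc is the number of local atoms and nall is the number of local plus ghost atoms, so a loop over nall * 3 entries reads past the end of a buffer that holds only nloc * 3 values. The helper `sum_local_efield` is hypothetical and is not part of the DeePMD-kit API; it only shows a size check that would turn such a silent out-of-bounds read into a visible error.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: sums the electric-field components of local atoms.
// It rejects any buffer whose length is not exactly nloc * 3.
inline double sum_local_efield(const std::vector<double>& efield,
                               std::size_t nloc) {
  if (efield.size() != nloc * 3) {
    throw std::invalid_argument("efield must have nloc * 3 entries");
  }
  double total = 0.0;
  // Iterate over nloc * 3 entries only; using nall * 3 here would read
  // beyond the end of the caller's buffer.
  for (std::size_t i = 0; i < nloc * 3; ++i) {
    total += efield[i];
  }
  return total;
}
```

For example, calling `sum_local_efield` with six values and `nloc = 2` succeeds, while passing a nine-entry buffer with `nloc = 2` throws `std::invalid_argument` instead of reading unrelated memory.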

njzjz · 2024-02-06T22:06:17Z

I'll close the issue since #3108 and #3237 fixed possible segfaults in two different places. The current issue does not have enough information to locate the problem, but if the problem still exists, feel free to reopen the issue.


Kehan-Cai-nanako added the bug label Oct 3, 2023

njzjz added this to Bugfixes for DeePMD-kit Nov 8, 2023

github-project-automation bot moved this to Todo in Bugfixes for DeePMD-kit Nov 8, 2023

njzjz mentioned this issue Jan 5, 2024

fix segfault in ~Region #3108

Merged

njzjz mentioned this issue Feb 5, 2024

c: change the required shape of electric field to nloc * 3 #3237

Merged

njzjz closed this as completed Feb 6, 2024

github-project-automation bot moved this from Todo to Done in Bugfixes for DeePMD-kit Feb 6, 2024
