[BUG]Segmentation fault while running DPLR #2895

Closed
Kehan-Cai-nanako opened this issue Oct 3, 2023 · 1 comment
@Kehan-Cai-nanako

Bug summary

While running a DPLR molecular dynamics simulation with LAMMPS using a well-trained model, the program occasionally terminates with a segmentation fault. The problem does not occur every time, but it becomes more likely when the system is larger (~10,000 to 30,000 real atoms in DPLR) and when the simulation is longer (~100 ps to 1 ns). The problem has been reproduced on at least two clusters: running the DPLR MD simulation in a user-created Conda environment on the Della cluster at Princeton University, and in an image container on the Perlmutter cluster at NERSC. I therefore believe it is a general problem rather than one specific to a single machine.

DeePMD-kit Version

2.2.4

TensorFlow Version

2.9.0

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

unlink: cannot unlink 'frozen_model.pb': No such file or directory
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit: Successfully load libcudart.so
1: DeePMD-kit: Successfully load libcudart.so
2: DeePMD-kit: Successfully load libcudart.so
3: DeePMD-kit: Successfully load libcudart.so
0: 2023-10-02 07:20:25.563709: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
1: 2023-10-02 07:20:25.659166: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
3: 2023-10-02 07:20:25.830325: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2: 2023-10-02 07:20:26.021502: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
3: 2023-10-02 07:20:38.509113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
0: 2023-10-02 07:20:38.509285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
2: 2023-10-02 07:20:38.511058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
1: 2023-10-02 07:20:38.520851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
1: 2023-10-02 07:20:38.533009: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2: 2023-10-02 07:20:38.533008: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
3: 2023-10-02 07:20:38.533018: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
0: 2023-10-02 07:20:38.543769: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
1: 2023-10-02 07:20:38.653084: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
0: 2023-10-02 07:20:38.653566: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
3: 2023-10-02 07:20:38.653682: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2: 2023-10-02 07:20:38.659171: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: 2023-10-02 07:20:38.961150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
2: 2023-10-02 07:20:38.961174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
0: 2023-10-02 07:20:38.961410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
3: 2023-10-02 07:20:38.961338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
2: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2: 2023-10-02 07:20:39.211466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:82:00.0, compute capability: 8.0
0: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
3: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
1: DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
0: 2023-10-02 07:20:39.224662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:03:00.0, compute capability: 8.0
3: 2023-10-02 07:20:39.227708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
1: 2023-10-02 07:20:39.229050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 72948 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0
1: /opt/deepmd-kit/bin/lmp: line 11: 2288062 Segmentation fault /opt/deepmd-kit/bin/_lmp "$@"
srun: error: nid008668: task 1: Exited with exit code 139
srun: Terminating StepId=16480142.0
0: slurmstepd: error: *** STEP 16480142.0 ON nid008668 CANCELLED AT 2023-10-02T11:01:10 ***
srun: error: nid008668: tasks 0,2-3: Terminated
srun: Force Terminated StepId=16480142.0
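
The exit code 139 reported by `srun` above is consistent with the `Segmentation fault` line printed by the `lmp` wrapper script: shells conventionally report a process killed by signal N as exit status 128 + N, and 139 - 128 = 11, which is `SIGSEGV` on Linux. A minimal sketch of that decoding follows; the function names are hypothetical and are not part of DeePMD-kit, LAMMPS, or Slurm.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical helper: returns the signal number encoded in a shell-style
// exit status (128 + signal), or 0 if the status does not look like a
// death by signal.
int signal_from_exit_code(int code) {
  assert(code >= 0 && code < 256);  // exit statuses fit in one byte
  return (code > 128) ? code - 128 : 0;
}

// True when the exit status corresponds to a segmentation fault.
bool is_segfault(int code) {
  return signal_from_exit_code(code) == SIGSEGV;
}
```

Calling `is_segfault(139)` on a Linux system identifies the failed task as a segmentation fault, while a normal exit status such as 0 or 1 does not.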

give_yifan.zip

Steps to Reproduce

The neural network model is too large to attach above. Please feel free to ask for the model in order to reproduce the error.

Run the command:

sbatch run.slurm

Further Information, Files, and Links

No response

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 5, 2024
When debugging deepmodeling#3103, I noticed a segfault in `~Region` that prevented the actual error message from being shown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finally restores the original pointers, which can be deleted. However, `deepmd::copy_coord_gpu` might throw CUDA errors for any reason, in which case the pointers are not restored. `~Region` then tries to delete a pointer that it doesn't own, causing the segfault. The CUDA error message is not visible because of the segfault. The segfault in deepmodeling#2895 may also be caused by this.
This PR adds a new constructor to `Region` that accepts external pointers. `~Region` will delete `boxt` and `rec_boxt` only when the pointers are not external.
We still need to figure out the reason for the error of `_norm_copy_coord_gpu` behind the segfault.

Signed-off-by: Jinzhe Zeng <[email protected]>
wanghan-iapcm pushed a commit that referenced this issue Jan 5, 2024
When debugging #3103, I noticed a segfault in `~Region` that prevented
the actual error message from being shown. `_norm_copy_coord_gpu`
replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs
`deepmd::copy_coord_gpu`, and finally restores the original pointers,
which can be deleted. However, `deepmd::copy_coord_gpu` might throw CUDA
errors for any reason, in which case the pointers are not restored.
`~Region` then tries to delete a pointer that it doesn't own, causing the
segfault. The CUDA error message is not visible because of the segfault.
The segfault in #2895 may also be caused by this.
This PR adds a new constructor to `Region` that accepts external
pointers. `~Region` will delete `boxt` and `rec_boxt` only when the
pointers are not external.
We still need to figure out the reason for the error of `copy_coord_gpu`
behind the segfault.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
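
The ownership fix described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual DeePMD-kit code: `RegionSketch`, `failing_copy`, and `error_is_reported` are hypothetical names, the 9-element box size is assumed, and the CUDA failure is simulated with an ordinary C++ exception. The idea is that a non-owning constructor records that the pointers are external, so the destructor skips the delete even when it runs during stack unwinding.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the real Region class; names are illustrative.
template <typename T>
struct RegionSketch {
  T* boxt;
  T* rec_boxt;
  bool external;  // true when the pointers are owned by someone else

  // Owning constructor: allocates storage that the destructor frees.
  RegionSketch() : boxt(new T[9]()), rec_boxt(new T[9]()), external(false) {}

  // Non-owning constructor: wraps pointers managed elsewhere
  // (for example, device memory).
  RegionSketch(T* extern_boxt, T* extern_rec_boxt)
      : boxt(extern_boxt), rec_boxt(extern_rec_boxt), external(true) {
    assert(extern_boxt != nullptr && extern_rec_boxt != nullptr);
  }

  // Copying an owning instance would double-free, so forbid copies.
  RegionSketch(const RegionSketch&) = delete;
  RegionSketch& operator=(const RegionSketch&) = delete;

  ~RegionSketch() {
    if (!external) {  // only free memory this object allocated itself
      delete[] boxt;
      delete[] rec_boxt;
    }
  }
};

// Hypothetical wrapper that always fails, standing in for a CUDA error.
template <typename T>
void failing_copy(const RegionSketch<T>&) {
  throw std::runtime_error("simulated CUDA error");
}

// Returns true if the simulated error reaches the caller as an exception.
bool error_is_reported() {
  double box[9] = {0};
  double rec_box[9] = {0};
  try {
    RegionSketch<double> view(box, rec_box);  // non-owning view
    failing_copy(view);
  } catch (const std::runtime_error&) {
    return true;  // destructor ran during unwinding without freeing box
  }
  return false;
}
```

With this design there is no pointer swap to undo after the call, so an exception thrown midway leaves nothing for the destructor to free incorrectly, and the original error message stays visible to the caller.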
wanghan-iapcm pushed a commit that referenced this issue Feb 6, 2024
[A segfault](https://github.com/deepmodeling/deepmd-kit/actions/runs/7782245372/job/21218255452)
sometimes appears in the tests after #3223. The reason is that the
required shape of the electric field is set to nall * 3 in the C
interface, but a vector of nloc * 3 is given in the tests and LAMMPS and
used in the C++ interface. It was not caught before, as the program
didn't write to these addresses (only reading them usually won't cause a
segfault, unless the address is invalid).
Perhaps fix #2895.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
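
The size mismatch described in the commit message above can be illustrated with a small sketch. It assumes that nall is the number of local atoms (nloc) plus ghost atoms, and it uses hypothetical function names that do not appear in DeePMD-kit. The callee expects nall * 3 values and checks the length before touching the data, so a caller that passes only nloc * 3 values gets a clear error instead of an out-of-bounds read.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical callee: sums the field components of all nall atoms, so it
// needs nall * 3 values and verifies that before reading the buffer.
double sum_field_checked(const std::vector<double>& efield, std::size_t nall) {
  const std::size_t required = nall * 3;
  if (efield.size() != required) {
    throw std::invalid_argument(
        "efield has " + std::to_string(efield.size()) +
        " values, expected " + std::to_string(required));
  }
  double total = 0.0;
  for (std::size_t i = 0; i < required; ++i) {
    total += efield[i];
  }
  return total;
}

// Returns true when a buffer sized for nloc atoms is rejected by a callee
// that expects nall atoms (assumption: nall = nloc + nghost).
bool short_buffer_is_rejected(std::size_t nloc, std::size_t nghost) {
  assert(nghost > 0);  // with no ghost atoms the two sizes coincide
  const std::size_t nall = nloc + nghost;
  std::vector<double> efield(nloc * 3, 0.0);  // too short by nghost * 3
  try {
    sum_field_checked(efield, nall);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```

Without such a check, reading past the end of the shorter buffer is undefined behavior. It often goes unnoticed when the extra addresses happen to be mapped, which matches the intermittent failures described above.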
@njzjz
Member

njzjz commented Feb 6, 2024

I'll close the issue since #3108 and #3237 fixed possible segfaults in two different places. The current issue does not contain enough information to locate the problem, but if the problem still exists, feel free to reopen the issue.

@njzjz njzjz closed this as completed Feb 6, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Bugfixes for DeePMD-kit Feb 6, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Apr 6, 2024
…ing#3237)

[A segfault](https://github.com/deepmodeling/deepmd-kit/actions/runs/7782245372/job/21218255452)
sometimes appears in the tests after deepmodeling#3223. The reason is that the
required shape of the electric field is set to nall * 3 in the C
interface, but a vector of nloc * 3 is given in the tests and LAMMPS and
used in the C++ interface. It was not caught before, as the program
didn't write to these addresses (only reading them usually won't cause a
segfault, unless the address is invalid).
Perhaps fix deepmodeling#2895.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
(cherry picked from commit e281a8c)
njzjz added a commit that referenced this issue Apr 6, 2024
[A segfault](https://github.com/deepmodeling/deepmd-kit/actions/runs/7782245372/job/21218255452)
sometimes appears in the tests after #3223. The reason is that the
required shape of the electric field is set to nall * 3 in the C
interface, but a vector of nloc * 3 is given in the tests and LAMMPS and
used in the C++ interface. It was not caught before, as the program
didn't write to these addresses (only reading them usually won't cause a
segfault, unless the address is invalid).
Perhaps fix #2895.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
(cherry picked from commit e281a8c)