[BUG] The Error from Neighbor Statistics in dpmd v2.2.10 #3960
Comments
Cannot reproduce.
|
Do you use the CPU, the GPU, or any special hardware? |
Ok I tested again.
GPU
2024-07-09 23:21:31.892816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-09 23:21:31.892969: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-09 23:21:31.902558: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-09 23:21:38.686760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94857 MB memory: -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-09 23:21:38.711388: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 23:21:38.837613: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaed73a920
Traceback (most recent call last):
File "/usr/local/bin/dp", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 90, in main
neighbor_stat(**dict_args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/neighbor_stat.py", line 61, in neighbor_stat
min_nbor_dist, max_nbor_size = nei.get_stat(data)
File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 220, in get_stat
raise RuntimeError(
RuntimeError: Some atoms are overlapping in data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.
CPU
2024-07-09 23:21:49.235437: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-09 23:21:49.235475: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-09 23:21:49.236018: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-09 23:21:50.966455: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-07-09 23:21:50.966497: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:129] retrieving CUDA diagnostic information for host: nid007017
2024-07-09 23:21:50.966503: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:136] hostname: nid007017
2024-07-09 23:21:50.966774: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:159] libcuda reported version is: 550.54.15
2024-07-09 23:21:50.966795: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:163] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for aarch64 535.129.03 Release Build (dvs-builder@U16-I3-D03-3-2) Thu Oct 19 19:04:08 UTC 2023
GCC version: gcc version 7.5.0 (SUSE Linux)
"
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:deepmd_utils.utils.batch_size:You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
2024-07-09 23:21:51.070038: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
DEEPMD INFO training data with min nbor dist: 0.8696584261076299
DEEPMD INFO training data with max nbor size: [13 68 54 14]
DEEPMD INFO min_nbor_dist: 0.869658
DEEPMD INFO max_nbor_size: [13 68 54 14]
|
Could you print deepmd-kit/deepmd/utils/neighbor_stat.py Lines 216 to 223 in f8a0b31
Also, could you try setting If the CPU and the GPU give different results, it may be a bug from TensorFlow. |
I know TensorFlow has some issues when processing a large batch of data, as indicated in #3822 (although I have never encountered this issue myself). |
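One way to run the CPU/GPU comparison discussed above from a single script is sketched below. It is only a sketch under stated assumptions: `build_env` and `run_neighbor_stat` are hypothetical helper names, hiding the GPU through an empty `CUDA_VISIBLE_DEVICES` is a general CUDA convention rather than something this thread confirms was used, and `DP_INFER_BATCH_SIZE` is taken from the warning in the logs. The setting that was actually suggested is not preserved in this thread, and the value 256 in the usage note is purely illustrative.

```python
import os
import subprocess

# Command taken from the "Steps to Reproduce" section of this issue.
CMD = ["dp", "neighbor-stat", "-s", "data_set_cll_v1",
       "-t", "Bi", "H", "O", "V", "-r", "6.0"]


def build_env(use_gpu, infer_batch_size=None, base_env=None):
    """Return a copy of the environment for one test run (hypothetical helper).

    use_gpu=False hides all CUDA devices so that TensorFlow falls back to CPU.
    infer_batch_size, if given, is exported as DP_INFER_BATCH_SIZE (assumption:
    this is the relevant knob for the large-batch hypothesis).
    """
    env = dict(os.environ if base_env is None else base_env)
    if not use_gpu:
        env["CUDA_VISIBLE_DEVICES"] = ""
    if infer_batch_size is not None:
        env["DP_INFER_BATCH_SIZE"] = str(infer_batch_size)
    return env


def run_neighbor_stat(env):
    """Run the reproduction command with the given environment."""
    return subprocess.run(CMD, env=env, capture_output=True, text=True)
```

For example, `run_neighbor_stat(build_env(use_gpu=False))` and `run_neighbor_stat(build_env(use_gpu=True, infer_batch_size=256))` would give two runs whose reported `min_nbor_dist` can be compared directly.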
I printed dt:
min_nbor_dist = 100.0
max_nbor_size = np.zeros(1 if self.mixed_type else self.ntypes, dtype=int)
for mn, dt, jj in self.iterator(data):
    if np.isinf(dt):
        log.warning(
            "Atoms with no neighbors found in %s. Please make sure it's what you expected."
            % jj
        )
    if dt < min_nbor_dist:
        print("dt inside the condition", dt)
        if math.isclose(dt, 0.0, rel_tol=1e-6):
            # it's unexpected that the distance between two atoms is zero
            # zero distance will cause nan (#874)
            raise RuntimeError(
                "Some atoms are overlapping in %s. Please check your"
                " training data to remove duplicated atoms." % jj
            )
        min_nbor_dist = dt
    max_nbor_size = np.maximum(mn, max_nbor_size)
# do sqrt in the final
min_nbor_dist = math.sqrt(min_nbor_dist)
log.info("training data with min nbor dist: " + str(min_nbor_dist))
log.info("training data with max nbor size: " + str(max_nbor_size))
return min_nbor_dist, max_nbor_size
Summary
CPU: works
Details
for CPU
2024-07-10 09:44:48.686184: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:44:48.686217: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:44:48.686730: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:44:50.401311: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-07-10 09:44:50.401360: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:129] retrieving CUDA diagnostic information for host: nid006786
2024-07-10 09:44:50.401366: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:136] hostname: nid006786
2024-07-10 09:44:50.401628: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:159] libcuda reported version is: 550.54.15
2024-07-10 09:44:50.401651: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:163] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for aarch64 535.129.03 Release Build (dvs-builder@U16-I3-D03-3-2) Thu Oct 19 19:04:08 UTC 2023
GCC version: gcc version 7.5.0 (SUSE Linux)
"
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:deepmd_utils.utils.batch_size:You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
2024-07-10 09:44:50.501442: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
dt inside the condition 0.7563057780999999
DEEPMD INFO training data with min nbor dist: 0.8696584261076299
DEEPMD INFO training data with max nbor size: [13 68 54 14]
DEEPMD INFO min_nbor_dist: 0.869658
DEEPMD INFO max_nbor_size: [13 68 54 14]
for GPU
2024-07-10 09:46:11.127051: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:46:11.127088: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:46:11.127604: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:46:13.216417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94763 MB memory: -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-10 09:46:13.239287: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-10 09:46:13.292631: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaabcdf9f20
dt inside the condition 0.0
Traceback (most recent call last):
File "/usr/local/bin/dp", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 90, in main
neighbor_stat(**dict_args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/neighbor_stat.py", line 61, in neighbor_stat
min_nbor_dist, max_nbor_size = nei.get_stat(data)
File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 221, in get_stat
raise RuntimeError(
RuntimeError: Some atoms are overlapping in data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.
GPU with
2024-07-10 09:47:03.995265: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:47:03.995298: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:47:03.995794: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:47:06.056207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94763 MB memory: -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-10 09:47:06.079145: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-10 09:47:06.131845: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaf8995c20
dt inside the condition 0.7563057780999999
DEEPMD INFO training data with min nbor dist: 0.8696584261076299
DEEPMD INFO training data with max nbor size: [13 68 54 14]
DEEPMD INFO min_nbor_dist: 0.869658
DEEPMD INFO max_nbor_size: [13 68 54 14] |
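The printed values can be read against the check in the snippet above. Because `math.isclose` is called with a reference value of 0.0 and only `rel_tol` (so `abs_tol` stays at its default of 0.0), the comparison is true only when `dt` is exactly zero. The failing GPU run therefore returned a squared distance of exactly 0.0, not merely a small one. A minimal sketch, where `is_flagged_as_overlap` is a hypothetical name used only for illustration:

```python
import math


def is_flagged_as_overlap(dt):
    # Same comparison as in the get_stat snippet quoted in this thread.
    # With b == 0.0 and abs_tol left at 0.0, this holds only for dt == 0.0.
    return math.isclose(dt, 0.0, rel_tol=1e-6)


# Squared distances: the two values printed in the logs plus two tiny nonzero ones.
for dt in (0.0, 1e-300, 1e-12, 0.7563057780999999):
    print(dt, is_flagged_as_overlap(dt))

# The working runs print dt = 0.7563057780999999 (a squared distance);
# its square root is the min_nbor_dist reported in the logs.
print(math.sqrt(0.7563057780999999))
```

Only the first value is flagged, which is consistent with the logs: the runs that print `dt inside the condition 0.7563057780999999` finish normally, while the run that prints `0.0` raises the error.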
I don't know the root cause, but the evidence suggests that the unexpected behavior comes from upstream. |
For the time being, I am closing this issue. |
Bug summary
With the Dockerfile provided by the CSCS engineer, I compiled dpmd version 2.2.10 on the new cluster. When I tried to test dp train with my dataset, it threw an error.
From my own validation, no atoms in the dataset overlap with each other. The dataset and the Python script used to check for overlaps are provided below.
When I change only the dpmd version number to v2.2.9 in the Dockerfile, the neighbor statistics work well.
I believe this is not fixed by updating to dpmd v2.2.11, because the only change to deepmd/utils/neighbor_stat.py between v2.2.10 and v2.2.11 is f-string formatting.
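An independent overlap check of the kind described in this summary could look roughly like the sketch below. It is not the script attached to this issue. `min_pair_distance` is a hypothetical helper that computes the smallest interatomic distance in one frame by brute force, optionally including the 27 nearest periodic images. It assumes the cell matrix stores lattice vectors as rows and that the cell is not so skewed that farther images matter.

```python
import itertools

import numpy as np


def min_pair_distance(coord, box=None):
    """Smallest interatomic distance in a single frame (hypothetical helper).

    coord: (natoms, 3) Cartesian positions.
    box:   optional 3x3 cell (or flat length-9 array), lattice vectors as rows.
           If given, the 27 nearest periodic images are checked as well.
    """
    coord = np.asarray(coord, dtype=np.float64).reshape(-1, 3)
    if box is None:
        shifts = np.zeros((1, 3))
    else:
        cell = np.asarray(box, dtype=np.float64).reshape(3, 3)
        images = np.array(list(itertools.product((-1, 0, 1), repeat=3)))
        shifts = images @ cell  # Cartesian translation of each image
    best = np.inf
    for shift in shifts:
        diff = coord[:, None, :] - (coord[None, :, :] + shift)
        dist2 = np.einsum("ijk,ijk->ij", diff, diff)
        if not shift.any():
            # In the home cell, ignore the zero distance of an atom to itself.
            np.fill_diagonal(dist2, np.inf)
        best = min(best, dist2.min())
    return float(np.sqrt(best))
```

Assuming the usual DeePMD-kit NumPy layout (an assumption here: `set.000/coord.npy` with shape `(nframes, natoms * 3)` and `set.000/box.npy` with shape `(nframes, 9)`), each frame can be passed as `min_pair_distance(coord[i], box[i])`. A result of 0.0 for any frame would indicate genuinely duplicated atoms.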
DeePMD-kit Version
2.2.10 and 2.2.9
Backend and its version
Not provided
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
A homemade Docker image
Steps to Reproduce
dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0
for version 2.2.9
Further Information, Files, and Links
check.tgz