
[BUG] The Error from Neighbor Statistics in dpmd v2.2.10 #3960

Closed
robinzyb opened this issue Jul 9, 2024 · 8 comments

Comments

@robinzyb
Contributor

robinzyb commented Jul 9, 2024

Bug summary

Using the Dockerfile provided by the CSCS engineer, I compiled dpmd version 2.2.10 on the new cluster. However, when I tried to run dp train with my dataset, it threw an error.

WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/usr/local/lib/python3.10/dist-packages/deepmd_utils/utils/compat.py:362: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2024-07-09 14:03:12.068634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94677 MB memory:  -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-09 14:03:12.089546: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 14:03:12.143871: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaf5979450
Traceback (most recent call last):
  File "/usr/local/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 74, in main
    train_dp(**dict_args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 149, in train
    jdata = update_sel(jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 512, in update_sel
    jdata_cpy["model"] = Model.update_sel(jdata, jdata["model"])
  File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 566, in update_sel
    return cls.update_sel(global_jdata, local_jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 723, in update_sel
    local_jdata_cpy["descriptor"] = Descriptor.update_sel(
  File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/descriptor.py", line 511, in update_sel
    return cls.update_sel(global_jdata, local_jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/se.py", line 162, in update_sel
    return update_one_sel(global_jdata, local_jdata_cpy, False)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 479, in update_one_sel
    tmp_sel = get_sel(
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 440, in get_sel
    _, max_nbor_size = get_nbor_stat(jdata, rcut, one_type=one_type)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 425, in get_nbor_stat
    min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 220, in get_stat
    raise RuntimeError(
RuntimeError: Some atoms are overlapping in /capstor/scratch/cscs/zyongbin/dpmd_test/all_data_set/data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.

From my own validation, no atoms in the data set overlap with each other. The dataset and a Python script to check for overlaps are provided below.
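A minimal sketch of the kind of overlap check I mean (the actual script is in the attached check.tgz) could look like this; it assumes the standard deepmd-kit set layout with coord.npy shaped (nframes, natoms * 3) and, for brevity, ignores periodic images:

import numpy as np

# The path below follows the set.000 directory named in the error message.
coords = np.load("data_set_cll_v1/set.000/coord.npy")
coords = coords.reshape(coords.shape[0], -1, 3)   # (nframes, natoms, 3)

min_dist = np.inf
for frame in coords:
    diff = frame[:, None, :] - frame[None, :, :]  # pairwise difference vectors
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                # ignore self-distances
    min_dist = min(min_dist, float(dist.min()))

print("minimum interatomic distance (no PBC):", min_dist)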

When I only change the dpmd version number in the Dockerfile to v2.2.9, the neighbor statistics work fine:

DeePMD-kit v2.2.9
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO    training data with min nbor dist: 0.8696584261076291
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

I believe this is not fixed by updating to dpmd v2.2.11, because the only change from v2.2.10 to v2.2.11 in deepmd/utils/neighbor_stat.py is the f-string formatting.

DeePMD-kit Version

2.2.10 and 2.2.9

Backend and its version

Not provided

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

A home-made Docker image

Steps to Reproduce

  1. Download my data set.
  2. Run dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0 with version 2.2.9.
  3. Run the same command with version 2.2.10.

Further Information, Files, and Links

check.tgz

@robinzyb robinzyb added the bug label Jul 9, 2024
@robinzyb robinzyb changed the title [BUG] Error from Neighbor Statistics in dpmd v2.2.10 [BUG] The Error from Neighbor Statistics in dpmd v2.2.10 Jul 9, 2024
@njzjz
Member

njzjz commented Jul 9, 2024

Cannot reproduce.

(/home/jz748/deepmd-kit-2.2.10) [jz748@localhost all_data_set]$ dp neighbor-stat -s data_set_cll_v1  -t Bi H O V -r 6.0
2024-07-09 16:43:18.475249: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /home/jz748/deepmd-kit-2.2.10/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:98: disable_resource_variables (from tensorflow.python.ops.resource_variables_toggle) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-09 16:43:20.632160: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.632336: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.757110: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.757288: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.757422: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.757554: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.977423: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.977652: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.977810: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.977968: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.978123: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.978244: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.990836: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.991052: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.991199: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.991330: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.991463: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.991560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5506 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5
2024-07-09 16:43:20.991950: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-09 16:43:20.992109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 6804 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:02:00.0, compute capability: 7.5
2024-07-09 16:43:21.005108: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 16:43:21.087263: I tensorflow/core/util/cuda_solvers.cc:178] Creating GpuSolver handles for stream 0x563206373190
2024-07-09 16:43:23.097430: W external/local_tsl/tsl/framework/bfc_allocator.cc:368] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-07-09 16:43:34.165682: W external/local_tsl/tsl/framework/bfc_allocator.cc:487] Allocator (GPU_0_bfc) ran out of memory trying to allocate 6.17GiB (rounded to 6623285760)requested by op sub
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2024-07-09 16:43:34.165708: I external/local_tsl/tsl/framework/bfc_allocator.cc:1044] BFCAllocator dump for GPU_0_bfc
2024-07-09 16:43:34.165715: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (256): 	Total Chunks: 5, Chunks in use: 5. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 26B client-requested in use in bin.
2024-07-09 16:43:34.165720: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165725: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (1024): 	Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2024-07-09 16:43:34.165729: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (2048): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165733: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (4096): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165741: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (8192): 	Total Chunks: 1, Chunks in use: 0. 11.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165749: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (16384): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165772: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (32768): 	Total Chunks: 1, Chunks in use: 1. 32.0KiB allocated for chunks. 32.0KiB in use in bin. 32.0KiB client-requested in use in bin.
2024-07-09 16:43:34.165777: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (65536): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165783: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (131072): 	Total Chunks: 1, Chunks in use: 1. 128.0KiB allocated for chunks. 128.0KiB in use in bin. 128.0KiB client-requested in use in bin.
2024-07-09 16:43:34.165789: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (262144): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165809: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (524288): 	Total Chunks: 4, Chunks in use: 4. 3.29MiB allocated for chunks. 3.29MiB in use in bin. 3.28MiB client-requested in use in bin.
2024-07-09 16:43:34.165814: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (1048576): 	Total Chunks: 1, Chunks in use: 1. 1.07MiB allocated for chunks. 1.07MiB in use in bin. 863.8KiB client-requested in use in bin.
2024-07-09 16:43:34.165837: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165857: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (4194304): 	Total Chunks: 1, Chunks in use: 0. 5.47MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165861: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165885: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (16777216): 	Total Chunks: 1, Chunks in use: 1. 20.25MiB allocated for chunks. 20.25MiB in use in bin. 20.24MiB client-requested in use in bin.
2024-07-09 16:43:34.165890: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165895: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165900: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (134217728): 	Total Chunks: 1, Chunks in use: 0. 235.75MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 16:43:34.165920: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (268435456): 	Total Chunks: 2, Chunks in use: 1. 5.12GiB allocated for chunks. 263.19MiB in use in bin. 263.19MiB client-requested in use in bin.
2024-07-09 16:43:34.165925: I external/local_tsl/tsl/framework/bfc_allocator.cc:1067] Bin for 6.17GiB was 256.00MiB, Chunk State: 
2024-07-09 16:43:34.165950: I external/local_tsl/tsl/framework/bfc_allocator.cc:1073]   Size: 4.86GiB | Requested Size: 66.4KiB | in_use: 0 | bin_num: 20, prev:   Size: 263.19MiB | Requested Size: 263.19MiB | in_use: 1 | bin_num: -1
2024-07-09 16:43:34.165971: I external/local_tsl/tsl/framework/bfc_allocator.cc:1080] Next region of size 5494603776
2024-07-09 16:43:34.165975: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151a08000000 of size 275970304 next 19
2024-07-09 16:43:34.165995: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] Free  at 151a1872f900 of size 5218633472 next 18446744073709551615
2024-07-09 16:43:34.166000: I external/local_tsl/tsl/framework/bfc_allocator.cc:1080] Next region of size 268435456
2024-07-09 16:43:34.166004: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151d24000000 of size 21228544 next 10
2024-07-09 16:43:34.166008: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] Free  at 151d2543ec00 of size 247206912 next 18446744073709551615
2024-07-09 16:43:34.166012: I external/local_tsl/tsl/framework/bfc_allocator.cc:1080] Next region of size 8388608
2024-07-09 16:43:34.166015: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151db0800000 of size 884736 next 20
2024-07-09 16:43:34.166019: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151db08d8000 of size 884736 next 24
2024-07-09 16:43:34.166024: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151db09b0000 of size 884736 next 27
2024-07-09 16:43:34.166028: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] Free  at 151db0a88000 of size 5734400 next 18446744073709551615
2024-07-09 16:43:34.166032: I external/local_tsl/tsl/framework/bfc_allocator.cc:1080] Next region of size 2097152
2024-07-09 16:43:34.166035: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00000 of size 256 next 1
2024-07-09 16:43:34.166039: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00100 of size 256 next 2
2024-07-09 16:43:34.166042: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00200 of size 256 next 3
2024-07-09 16:43:34.166046: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00300 of size 256 next 4
2024-07-09 16:43:34.166049: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00400 of size 256 next 5
2024-07-09 16:43:34.166053: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00500 of size 1280 next 6
2024-07-09 16:43:34.166056: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58e00a00 of size 794368 next 8
2024-07-09 16:43:34.166061: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58ec2900 of size 32768 next 23
2024-07-09 16:43:34.166065: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] Free  at 151e58eca900 of size 11264 next 11
2024-07-09 16:43:34.166069: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58ecd500 of size 131072 next 9
2024-07-09 16:43:34.166073: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 151e58eed500 of size 1125120 next 18446744073709551615
2024-07-09 16:43:34.166077: I external/local_tsl/tsl/framework/bfc_allocator.cc:1105]      Summary of in-use Chunks by size: 
2024-07-09 16:43:34.166081: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 5 Chunks of size 256 totalling 1.2KiB
2024-07-09 16:43:34.166085: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 1280 totalling 1.2KiB
2024-07-09 16:43:34.166089: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 32768 totalling 32.0KiB
2024-07-09 16:43:34.166093: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 131072 totalling 128.0KiB
2024-07-09 16:43:34.166098: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 794368 totalling 775.8KiB
2024-07-09 16:43:34.166102: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 3 Chunks of size 884736 totalling 2.53MiB
2024-07-09 16:43:34.166106: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 1125120 totalling 1.07MiB
2024-07-09 16:43:34.166110: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 21228544 totalling 20.25MiB
2024-07-09 16:43:34.166115: I external/local_tsl/tsl/framework/bfc_allocator.cc:1108] 1 Chunks of size 275970304 totalling 263.19MiB
2024-07-09 16:43:34.166120: I external/local_tsl/tsl/framework/bfc_allocator.cc:1112] Sum Total of in-use chunks: 287.95MiB
2024-07-09 16:43:34.166124: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Total bytes in pool: 5773524992 memory_limit_: 5773524992 available bytes: 0 curr_region_allocation_bytes_: 17179869184
2024-07-09 16:43:34.166131: I external/local_tsl/tsl/framework/bfc_allocator.cc:1119] Stats: 
Limit:                      5773524992
InUse:                       301938944
MaxInUse:                   5763106816
NumAllocs:                         359
MaxAllocSize:               3280103424
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-07-09 16:43:34.166141: W external/local_tsl/tsl/framework/bfc_allocator.cc:499] *****__________________________________________________________________________________________*___*
2024-07-09 16:43:34.166155: W tensorflow/core/framework/op_kernel.cc:1827] RESOURCE_EXHAUSTED: failed to allocate memory
2024-07-09 16:43:34.166172: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: RESOURCE_EXHAUSTED: failed to allocate memory
	 [[{{node sub}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

2024-07-09 16:43:34.166183: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: RESOURCE_EXHAUSTED: failed to allocate memory
	 [[{{node sub}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[Min/_53]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

2024-07-09 16:43:34.166197: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 10630027486316472407
2024-07-09 16:43:34.166210: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 4783488273807447279
2024-07-09 16:43:34.166234: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: ABORTED: Stopping remaining executors.
DEEPMD INFO    training data with min nbor dist: 0.8696584261076299
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

@njzjz
Member

njzjz commented Jul 9, 2024

Do you use the CPU, the GPU, or any special hardware?

@robinzyb
Contributor Author

robinzyb commented Jul 9, 2024

OK, I tested again: it works on the CPU but fails on the GPU.


GPU
Command: CUDA_VISIBLE_DEVICES=0 dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0

2024-07-09 23:21:31.892816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-09 23:21:31.892969: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-09 23:21:31.902558: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-09 23:21:38.686760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94857 MB memory:  -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-09 23:21:38.711388: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 23:21:38.837613: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaed73a920
Traceback (most recent call last):
  File "/usr/local/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 90, in main
    neighbor_stat(**dict_args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/neighbor_stat.py", line 61, in neighbor_stat
    min_nbor_dist, max_nbor_size = nei.get_stat(data)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 220, in get_stat
    raise RuntimeError(
RuntimeError: Some atoms are overlapping in data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.

CPU
Command: CUDA_VISIBLE_DEVICES="" dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0

2024-07-09 23:21:49.235437: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-09 23:21:49.235475: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-09 23:21:49.236018: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-09 23:21:50.966455: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-07-09 23:21:50.966497: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:129] retrieving CUDA diagnostic information for host: nid007017
2024-07-09 23:21:50.966503: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:136] hostname: nid007017
2024-07-09 23:21:50.966774: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:159] libcuda reported version is: 550.54.15
2024-07-09 23:21:50.966795: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:163] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  535.129.03  Release Build  (dvs-builder@U16-I3-D03-3-2)  Thu Oct 19 19:04:08 UTC 2023
GCC version:  gcc version 7.5.0 (SUSE Linux) 
"
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:deepmd_utils.utils.batch_size:You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
2024-07-09 23:21:51.070038: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
DEEPMD INFO    training data with min nbor dist: 0.8696584261076299
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

@njzjz
Member

njzjz commented Jul 9, 2024

Could you print dt in the snippet below?

if dt < min_nbor_dist:
    if math.isclose(dt, 0.0, rel_tol=1e-6):
        # it's unexpected that the distance between two atoms is zero
        # zero distance will cause nan (#874)
        raise RuntimeError(
            "Some atoms are overlapping in %s. Please check your"
            " training data to remove duplicated atoms." % jj
        )

Also, could you try setting export DP_INFER_BATCH_SIZE=1024?

If the CPU and the GPU give different results, it may be a bug in TensorFlow.
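As a quick sanity check (just a sketch, independent of the deepmd-kit code path), one can compare a plain pairwise-distance reduction on the CPU and the GPU directly in TensorFlow; if the two devices already disagree on such a simple reduction, the problem is upstream:

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
n = 2000
coords = rng.random((n, 3))                           # arbitrary test coordinates
inf_diag = tf.constant(np.diag(np.full(n, np.inf)))   # masks the zero self-distances

def min_sq_dist(device):
    with tf.device(device):
        x = tf.constant(coords)
        diff = x[:, None, :] - x[None, :, :]          # (n, n, 3) pairwise differences
        d2 = tf.reduce_sum(diff * diff, axis=-1) + inf_diag
        return float(tf.reduce_min(d2))

print("CPU:", min_sq_dist("/CPU:0"))
print("GPU:", min_sq_dist("/GPU:0"))                  # requires a visible GPU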

@njzjz
Member

njzjz commented Jul 9, 2024

Also, could you try setting export DP_INFER_BATCH_SIZE=1024?

I know TensorFlow has some issues when using a large batch of data, as indicated in #3822 (although I have never encountered this issue myself).

@robinzyb
Contributor Author

I printed dt by adding a print statement inside the condition:

        min_nbor_dist = 100.0
        max_nbor_size = np.zeros(1 if self.mixed_type else self.ntypes, dtype=int)

        for mn, dt, jj in self.iterator(data):
            if np.isinf(dt):
                log.warning(
                    "Atoms with no neighbors found in %s. Please make sure it's what you expected."
                    % jj
                )
            if dt < min_nbor_dist:
                print("dt inside the condition", dt)
                if math.isclose(dt, 0.0, rel_tol=1e-6):
                    # it's unexpected that the distance between two atoms is zero
                    # zero distance will cause nan (#874)
                    raise RuntimeError(
                        "Some atoms are overlapping in %s. Please check your"
                        " training data to remove duplicated atoms." % jj
                    )
                min_nbor_dist = dt
            max_nbor_size = np.maximum(mn, max_nbor_size)

        # do sqrt in the final
        min_nbor_dist = math.sqrt(min_nbor_dist)
        log.info("training data with min nbor dist: " + str(min_nbor_dist))
        log.info("training data with max nbor size: " + str(max_nbor_size))
        return min_nbor_dist, max_nbor_size

Summary

CPU: works
GPU: fails
GPU with export DP_INFER_BATCH_SIZE=1024: works (a wrapper sketch reproducing these three cases follows this list)
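The three cases can be reproduced with a small wrapper like the one below; it is only a sketch, not part of deepmd-kit, and simply re-runs the quoted command with different environment variables.

import os
import subprocess

cmd = ["dp", "neighbor-stat", "-s", "data_set_cll_v1", "-t", "Bi", "H", "O", "V", "-r", "6.0"]

cases = {
    "CPU": {"CUDA_VISIBLE_DEVICES": ""},
    "GPU": {"CUDA_VISIBLE_DEVICES": "0"},
    "GPU + DP_INFER_BATCH_SIZE=1024": {"CUDA_VISIBLE_DEVICES": "0", "DP_INFER_BATCH_SIZE": "1024"},
}

for name, extra_env in cases.items():
    result = subprocess.run(cmd, env={**os.environ, **extra_env}, capture_output=True, text=True)
    print(name, "-> exit code", result.returncode)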

Details

for CPU

2024-07-10 09:44:48.686184: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:44:48.686217: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:44:48.686730: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:44:50.401311: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-07-10 09:44:50.401360: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:129] retrieving CUDA diagnostic information for host: nid006786
2024-07-10 09:44:50.401366: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:136] hostname: nid006786
2024-07-10 09:44:50.401628: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:159] libcuda reported version is: 550.54.15
2024-07-10 09:44:50.401651: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:163] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  535.129.03  Release Build  (dvs-builder@U16-I3-D03-3-2)  Thu Oct 19 19:04:08 UTC 2023
GCC version:  gcc version 7.5.0 (SUSE Linux) 
"
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/deepmd/utils/batch_size.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:deepmd_utils.utils.batch_size:You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
2024-07-10 09:44:50.501442: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
dt inside the condition 0.7563057780999999
DEEPMD INFO    training data with min nbor dist: 0.8696584261076299
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

for GPU

2024-07-10 09:46:11.127051: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:46:11.127088: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:46:11.127604: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:46:13.216417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94763 MB memory:  -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-10 09:46:13.239287: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-10 09:46:13.292631: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaabcdf9f20
dt inside the condition 0.0
Traceback (most recent call last):
  File "/usr/local/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 90, in main
    neighbor_stat(**dict_args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/neighbor_stat.py", line 61, in neighbor_stat
    min_nbor_dist, max_nbor_size = nei.get_stat(data)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 221, in get_stat
    raise RuntimeError(
RuntimeError: Some atoms are overlapping in data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.

GPU with export DP_INFER_BATCH_SIZE=1024

2024-07-10 09:47:03.995265: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-10 09:47:03.995298: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-10 09:47:03.995794: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-07-10 09:47:06.056207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94763 MB memory:  -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-10 09:47:06.079145: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-10 09:47:06.131845: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaf8995c20
dt inside the condition 0.7563057780999999
DEEPMD INFO    training data with min nbor dist: 0.8696584261076299
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

@njzjz
Member

njzjz commented Jul 10, 2024

I don't know the root cause, but the evidence shows that the unexpected behavior comes from upstream.

@robinzyb
Contributor Author

For the time being, I am closing this issue.
