[BUG] Training does not converge with the se_atten_v2 descriptor #3103

Open
h840473807 opened this issue Jan 3, 2024 · 3 comments
Labels: bug, help wanted, reproduced (This bug has been reproduced by developers)

Comments

@h840473807

Bug summary

When I use the "se_atten_v2" descriptor to train a model (I have successfully trained one with se_e2_a), training randomly fails to converge (only in 00.train/002). With the same input file (all parameters identical, including the seeds), I ran the task 5 times and it converged only twice. The output lcurve.out is shown in the figures below. In Figure 1 (non-convergent), rmse_f_trn oscillates around 1.0 (too high), while in Figure 2 (convergent), rmse_f_trn drops rapidly to below 0.1 (the same as with se_e2_a). Are there any bugs or problems with my input file?
[Figures 1 and 2: lcurve.out curves for a non-convergent run and a convergent run]
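
Since the screenshots do not survive in text form, the same information can be read back from lcurve.out. Below is a minimal sketch for plotting the training force error; it assumes the standard lcurve.out header line that names the columns (the exact column set depends on the loss settings) and that numpy and matplotlib are available.

import numpy as np
import matplotlib.pyplot as plt

# Read the column names from the "#" header line of lcurve.out,
# then plot rmse_f_trn against the training step on a log scale.
with open("lcurve.out") as f:
    header = f.readline().lstrip("#").split()
data = np.genfromtxt("lcurve.out", names=header)

plt.semilogy(data["step"], data["rmse_f_trn"])
plt.xlabel("step")
plt.ylabel("rmse_f_trn")
plt.savefig("rmse_f_trn.png")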

DeePMD-kit Version

deepmd-kit-2.2.7

TensorFlow Version

/opt/deepmd-kit-2.2.7/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

This is my input file in 00.train/002:
{
  "model": {
    "type_map": [
      "C",
      "H",
      "O",
      "P",
      "F",
      "Na"
    ],
    "descriptor": {
      "type": "se_atten_v2",
      "sel": "auto",
      "rcut_smth": 0.5,
      "rcut": 6.0,
      "neuron": [
        25,
        50,
        100
      ],
      "resnet_dt": false,
      "axis_neuron": 16,
      "seed": 347445250,
      "attn_layer": 0,
      "attn_mask": false,
      "attn_dotr": true,
      "type_one_side": false,
      "precision": "default",
      "trainable": true,
      "exclude_types": [],
      "set_davg_zero": true,
      "attn": 128,
      "activation_function": "gelu"
    },
    "type_embedding": {
      "neuron": [
        8
      ],
      "resnet_dt": false,
      "seed": 1374335067,
      "activation_function": "tanh",
      "precision": "default",
      "trainable": true
    },
    "fitting_net": {
      "neuron": [
        240,
        240,
        240
      ],
      "resnet_dt": true,
      "seed": 587928124,
      "activation_function": "tanh"
    }
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "decay_steps": 2000,
    "decay_rate": 0.95
  },
  "loss": {
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "_start_pref_v": 0,
    "_limit_pref_v": 1
  },
  "training": {
    "set_prefix": "set",
    "stop_batch": 400000,
    "batch_size": [
      1,
      1,
      1,
      1
    ],
    "seed": 1867518886,
    "disp_file": "lcurve.out",
    "disp_freq": 2000,
    "save_freq": 2000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all",
    "systems": [
      "../data.init/init/deepmd/0.5M",
      "../data.init/init/deepmd/1.0M",
      "../data.init/init/deepmd/1.5M",
      "../data.init/init/deepmd/2.0M"
    ]
  }
}

Steps to Reproduce

The attached file is my working directory generated by DP-GEN. Enter the directory 00.train/002 and run the command "dp train input.json" to start the task. The first 10,000–50,000 steps are enough to tell whether training will converge. Non-convergence can be observed after running the task several times, as in the sketch below.
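
To make the random non-convergence easier to spot, the run can be repeated and the final force error checked automatically. The following is a sketch only: it assumes dp is on the PATH, that the DP-GEN layout above is used (copies are placed next to 00.train/002 so the relative ../data.init paths still resolve), that 00.train/002 contains no checkpoints from a previous run, and it borrows the 0.1 threshold from the bug summary. Reducing stop_batch in input.json to around 50000 keeps the test short, as noted above.

import shutil
import subprocess

import numpy as np

N_RUNS = 5  # the bug summary reports 2 converged runs out of 5

for i in range(N_RUNS):
    # Each repetition starts from a fresh copy of 00.train/002,
    # placed alongside it so that ../data.init still resolves.
    workdir = f"00.train/run_{i}"
    shutil.copytree("00.train/002", workdir)
    subprocess.run(["dp", "train", "input.json"], cwd=workdir, check=True)

    # Read the last reported rmse_f_trn from lcurve.out.
    with open(f"{workdir}/lcurve.out") as f:
        header = f.readline().lstrip("#").split()
    data = np.genfromtxt(f"{workdir}/lcurve.out", names=header)
    final = data["rmse_f_trn"][-1]
    print(workdir, "converged" if final < 0.1 else "NOT converged", final)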

Further Information, Files, and Links

302Na_atten.zip

@h840473807 h840473807 added the bug label Jan 3, 2024
@h840473807
Author

I ran the task on Bohrium. The image address is registry.dp.tech/dptech/deepmd-kit:2.2.7-cuda11.6

@wanghan-iapcm
Collaborator

Please try "set_davg_zero": false
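
For reference, this flips a single key in the descriptor block of the input.json shown above (fragment only; the surrounding descriptor keys stay unchanged):

"descriptor": {
  "type": "se_atten_v2",
  "set_davg_zero": false
}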

@njzjz njzjz added the reproduced (This bug has been reproduced by developers) label Jan 4, 2024
@njzjz
Member

njzjz commented Jan 4, 2024

For the same input file (all parameters are the same, including seeds)

This is the actual problem, which I can reproduce. I also found something interesting: when I run it multiple times, it randomly shows nan in lcurve.out, which needs to be investigated.

First run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92           nan           nan           nan    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03

Second run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92      3.66e+01      5.83e-02      1.16e+00    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03
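
A quick way to locate the first step at which two supposedly identical runs diverge is to compare their lcurve.out files line by line. The file names lcurve_run1.out and lcurve_run2.out below are hypothetical copies of the two runs' outputs.

# Report the first training step whose logged values differ between runs.
with open("lcurve_run1.out") as f1, open("lcurve_run2.out") as f2:
    for line1, line2 in zip(f1, f2):
        if line1.startswith("#"):
            continue  # skip the header line
        if line1.split() != line2.split():
            print("first divergence at step", line1.split()[0])
            print("run 1:", line1.strip())
            print("run 2:", line2.strip())
            break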

Update: Some things I found:

  1. The CPU version does not have this issue;
  2. The NN parameters may become nan, but this cannot be stably reproduced;
  3. The issue persists even when training on energy only.
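
One way to check finding 2 directly is to scan the saved checkpoint for non-finite parameters. A minimal sketch, run inside 00.train/002 and assuming the model.ckpt files written per the training section above plus an importable TensorFlow:

import numpy as np
import tensorflow as tf

# Locate the latest checkpoint written by "dp train" and report any
# floating-point variable that contains NaN or Inf values.
ckpt = tf.train.latest_checkpoint(".")
reader = tf.train.load_checkpoint(ckpt)
for name in reader.get_variable_to_shape_map():
    tensor = np.asarray(reader.get_tensor(name))
    if np.issubdtype(tensor.dtype, np.floating) and not np.isfinite(tensor).all():
        print("non-finite values in", name)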

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 5, 2024
While debugging deepmodeling#3103, I noticed a segfault in `~Region` that prevented the actual error message from being thrown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finally restores the original pointers, which can then be deleted. However, `deepmd::copy_coord_gpu` might throw CUDA errors for any reason, in which case the pointers are not restored. `~Region` then tries to delete a pointer that it doesn't own, causing the segfault, and the CUDA error message is not visible because of it. The segfault in deepmodeling#2895 may also be caused by this.
This PR adds a new constructor to `Region` that accepts external pointers. `~Region` will delete `boxt` and `rec_boxt` only when the pointers are not external.
We still need to figure out the reason for the error in `_norm_copy_coord_gpu` behind the segfault.

Signed-off-by: Jinzhe Zeng <[email protected]>
wanghan-iapcm pushed a commit that referenced this issue Jan 5, 2024
(Same commit message as above.)