[BUG] Training does not converge with the se_atten_v2 descriptor #3103

Open
h840473807 opened this issue Jan 3, 2024 · 3 comments
Labels: bug, help wanted, reproduced (This bug has been reproduced by developers)

Comments

@h840473807

Bug summary

When I use the "se_atten_v2" descriptor to train a model (I have successfully trained one with se_e2_a), training randomly fails to converge (only in 00.train/002). With the same input file (all parameters identical, including the seeds), I ran the task 5 times and it converged only twice. The output lcurve.out is shown in the figures below. In Figure 1 (non-convergent), rmse_f_trn oscillates around 1.0 (too high), while in Figure 2 (convergent), rmse_f_trn drops rapidly to below 0.1 (the same as with se_e2_a). Are there any bugs or problems with my input file?
[Figures 1 and 2: lcurve.out curves for a non-convergent run and a convergent run]
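
Since the screenshots do not survive in text form, the same information can be read back from lcurve.out. Below is a minimal sketch for plotting the training force error; it assumes the standard lcurve.out header line that names the columns (the exact column set depends on the loss settings) and that numpy and matplotlib are available.

import numpy as np
import matplotlib.pyplot as plt

# Read the column names from the "#" header line of lcurve.out,
# then plot rmse_f_trn against the training step on a log scale.
with open("lcurve.out") as f:
    header = f.readline().lstrip("#").split()
data = np.genfromtxt("lcurve.out", names=header)

plt.semilogy(data["step"], data["rmse_f_trn"])
plt.xlabel("step")
plt.ylabel("rmse_f_trn")
plt.savefig("rmse_f_trn.png")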

DeePMD-kit Version

deepmd-kit-2.2.7

TensorFlow Version

/opt/deepmd-kit-2.2.7/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

This is my input file in 00.train/002:
{
  "model": {
    "type_map": [
      "C",
      "H",
      "O",
      "P",
      "F",
      "Na"
    ],
    "descriptor": {
      "type": "se_atten_v2",
      "sel": "auto",
      "rcut_smth": 0.5,
      "rcut": 6.0,
      "neuron": [
        25,
        50,
        100
      ],
      "resnet_dt": false,
      "axis_neuron": 16,
      "seed": 347445250,
      "attn_layer": 0,
      "attn_mask": false,
      "attn_dotr": true,
      "type_one_side": false,
      "precision": "default",
      "trainable": true,
      "exclude_types": [],
      "set_davg_zero": true,
      "attn": 128,
      "activation_function": "gelu"
    },
    "type_embedding": {
      "neuron": [
        8
      ],
      "resnet_dt": false,
      "seed": 1374335067,
      "activation_function": "tanh",
      "precision": "default",
      "trainable": true
    },
    "fitting_net": {
      "neuron": [
        240,
        240,
        240
      ],
      "resnet_dt": true,
      "seed": 587928124,
      "activation_function": "tanh"
    }
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "decay_steps": 2000,
    "decay_rate": 0.95
  },
  "loss": {
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "_start_pref_v": 0,
    "_limit_pref_v": 1
  },
  "training": {
    "set_prefix": "set",
    "stop_batch": 400000,
    "batch_size": [
      1,
      1,
      1,
      1
    ],
    "seed": 1867518886,
    "disp_file": "lcurve.out",
    "disp_freq": 2000,
    "save_freq": 2000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all",
    "systems": [
      "../data.init/init/deepmd/0.5M",
      "../data.init/init/deepmd/1.0M",
      "../data.init/init/deepmd/1.5M",
      "../data.init/init/deepmd/2.0M"
    ]
  }
}

Steps to Reproduce

The attached file is my working directory generated by DP-GEN. Enter the directory 00.train/002 and run the command "dp train input.json" to start the task. The first 10,000–50,000 steps are enough to tell whether training will converge. Non-convergence can be observed after running the task several times, as in the sketch below.
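
To make the random non-convergence easier to spot, the run can be repeated and the final force error checked automatically. The following is a sketch only: it assumes dp is on the PATH, that the DP-GEN layout above is used (copies are placed next to 00.train/002 so the relative ../data.init paths still resolve), that 00.train/002 contains no checkpoints from a previous run, and it borrows the 0.1 threshold from the bug summary. Reducing stop_batch in input.json to around 50000 keeps the test short, as noted above.

import shutil
import subprocess

import numpy as np

N_RUNS = 5  # the bug summary reports 2 converged runs out of 5

for i in range(N_RUNS):
    # Each repetition starts from a fresh copy of 00.train/002,
    # placed alongside it so that ../data.init still resolves.
    workdir = f"00.train/run_{i}"
    shutil.copytree("00.train/002", workdir)
    subprocess.run(["dp", "train", "input.json"], cwd=workdir, check=True)

    # Read the last reported rmse_f_trn from lcurve.out.
    with open(f"{workdir}/lcurve.out") as f:
        header = f.readline().lstrip("#").split()
    data = np.genfromtxt(f"{workdir}/lcurve.out", names=header)
    final = data["rmse_f_trn"][-1]
    print(workdir, "converged" if final < 0.1 else "NOT converged", final)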

Further Information, Files, and Links

302Na_atten.zip

@h840473807 h840473807 added the bug label Jan 3, 2024
@h840473807
Author

I ran the task on Bohrium. The image address is registry.dp.tech/dptech/deepmd-kit:2.2.7-cuda11.6

@wanghan-iapcm
Collaborator

Please try "set_davg_zero": false
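
For reference, this flips a single key in the descriptor block of the input.json shown above (fragment only; the surrounding descriptor keys stay unchanged):

"descriptor": {
  "type": "se_atten_v2",
  "set_davg_zero": false
}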

@njzjz njzjz added the reproduced (This bug has been reproduced by developers) label Jan 4, 2024
@njzjz
Member

njzjz commented Jan 4, 2024

For the same input file (all parameters are the same, including seeds)

This is the actual problem, which I can reproduce. I also found something interesting: when I run it multiple times, it randomly shows nan in lcurve.out, which needs to be investigated.

First run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92           nan           nan           nan    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03

Second run:

     91      3.25e+01      1.52e-02      1.03e+00    1.0e-03
     92      3.66e+01      5.83e-02      1.16e+00    1.0e-03
     93      3.60e+01      6.42e-02      1.14e+00    1.0e-03
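
A quick way to locate the first step at which two supposedly identical runs diverge is to compare their lcurve.out files line by line. The file names lcurve_run1.out and lcurve_run2.out below are hypothetical copies of the two runs' outputs.

# Report the first training step whose logged values differ between runs.
with open("lcurve_run1.out") as f1, open("lcurve_run2.out") as f2:
    for line1, line2 in zip(f1, f2):
        if line1.startswith("#"):
            continue  # skip the header line
        if line1.split() != line2.split():
            print("first divergence at step", line1.split()[0])
            print("run 1:", line1.strip())
            print("run 2:", line2.strip())
            break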

Update: Some things I found:

  1. The CPU version does not have this issue;
  2. The NN parameters may become nan, but this cannot be stably reproduced;
  3. The issue persists even when training on energy only.
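
One way to check finding 2 directly is to scan the saved checkpoint for non-finite parameters. A minimal sketch, run inside 00.train/002 and assuming the model.ckpt files written per the training section above plus an importable TensorFlow:

import numpy as np
import tensorflow as tf

# Locate the latest checkpoint written by "dp train" and report any
# floating-point variable that contains NaN or Inf values.
ckpt = tf.train.latest_checkpoint(".")
reader = tf.train.load_checkpoint(ckpt)
for name in reader.get_variable_to_shape_map():
    tensor = np.asarray(reader.get_tensor(name))
    if np.issubdtype(tensor.dtype, np.floating) and not np.isfinite(tensor).all():
        print("non-finite values in", name)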

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 5, 2024
While debugging deepmodeling#3103, I noticed a segfault in `~Region` that prevented the actual error message from being thrown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finally restores the original pointers, which can then be deleted. However, `deepmd::copy_coord_gpu` might throw CUDA errors for any reason, in which case the pointers are not restored. `~Region` then tries to delete a pointer that it doesn't own, causing the segfault, and the CUDA error message is not visible because of it. The segfault in deepmodeling#2895 may also be caused by this.
This PR adds a new constructor to `Region` that accepts external pointers. `~Region` will delete `boxt` and `rec_boxt` only when the pointers are not external.
We still need to figure out the reason for the error in `_norm_copy_coord_gpu` behind the segfault.

Signed-off-by: Jinzhe Zeng <[email protected]>
wanghan-iapcm pushed a commit that referenced this issue Jan 5, 2024
(Same commit message as above.)