[BUG] Training does not converge with the se_atten_v2 descriptor #3103
Comments
I ran the task in Bohrium. The mirror address is registry.dp.tech/dptech/deepmd-kit:2.2.7-cuda11.6.
Please try …
This is the actual problem, which I can reproduce. I also found something interesting: when I run it multiple times, the result differs randomly between runs.
First run: (screenshot not preserved)
Second run: (screenshot not preserved)
Update: some things I found: (details not preserved)
When debugging #3103, I noticed a segfault in `~Region` that prevented the actual error message from being thrown. `_norm_copy_coord_gpu` replaces `boxt` and `rec_boxt` of `Region` with GPU pointers, runs `deepmd::copy_coord_gpu`, and finally restores the original pointers, which can safely be deleted. However, `deepmd::copy_coord_gpu` might throw a CUDA error for any reason, in which case the pointers are not restored. `~Region` then tries to delete a pointer it doesn't own, causing the segfault, and the CUDA error message is never printed. The segfault in #2895 may also be caused by this. This PR adds a new constructor to `Region` that accepts external pointers; `~Region` deletes `boxt` and `rec_boxt` only when the pointers are not external. We still need to figure out the reason for the error in `copy_coord_gpu` behind the segfault.

Signed-off-by: Jinzhe Zeng <[email protected]>
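A minimal sketch of the ownership fix described above (a hypothetical simplification for illustration; the actual `Region` struct in the deepmd-kit C++ sources differs in detail):

```cpp
// Sketch of the pointer-ownership pattern described in the commit message.
// Names and layout are simplified; only the ownership logic is the point.
template <typename FPTYPE>
struct Region {
  FPTYPE* boxt;
  FPTYPE* rec_boxt;
  bool self_allocated;  // true when this Region owns the buffers

  // Default constructor: allocate and own the buffers.
  Region()
      : boxt(new FPTYPE[9]), rec_boxt(new FPTYPE[9]), self_allocated(true) {}

  // New constructor: wrap external (e.g. GPU) pointers without taking
  // ownership, so the destructor leaves them alone.
  Region(FPTYPE* boxt_, FPTYPE* rec_boxt_)
      : boxt(boxt_), rec_boxt(rec_boxt_), self_allocated(false) {}

  // Only delete buffers this Region actually owns. Before the fix, the
  // destructor deleted unconditionally: if a CUDA call threw after boxt and
  // rec_boxt had been swapped to device pointers, ~Region freed memory it
  // did not own and segfaulted, hiding the original CUDA error message.
  ~Region() {
    if (self_allocated) {
      delete[] boxt;
      delete[] rec_boxt;
    }
  }
};

int main() {
  Region<double> owning;            // allocates and later frees its buffers
  double* ext_boxt = owning.boxt;   // stand-ins for GPU pointers in this sketch
  double* ext_rec_boxt = owning.rec_boxt;
  Region<double> non_owning(ext_boxt, ext_rec_boxt);  // will not free them
  return 0;
}
```

With the non-owning constructor, `_norm_copy_coord_gpu` can wrap the device pointers in a temporary `Region` instead of swapping them into an owning one, so an exception thrown by `deepmd::copy_coord_gpu` no longer leaves the destructor to free memory it does not own, and the underlying CUDA error surfaces instead of a segfault.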
Bug summary
When I use "se_atten_v2" descriptor to train a model (I have successfully trained it by using se_e2_a), I found that training step will not converge randomly (only in 00.train/002). For the same input file (all parameters are same including seeds), I run the task for 5 times, it converge for only 2 times. The output lcurve.out is shown in the figure. In Figure 1 (not convergent), the value of rmse_f_trn oscillates around 1.0 (too high), while in the Figure 2 (convergent), the value of rmse_f_trn reduces rapidly to below 0.1(same as using se_e2_a). Are there any bugs or problems with my input file?
DeePMD-kit Version
deepmd-kit-2.2.7
TensorFlow Version
/opt/deepmd-kit-2.2.7/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
This is my input file in 00.train/002:
{
"model": {
"type_map": [
"C",
"H",
"O",
"P",
"F",
"Na"
],
"descriptor": {
"type": "se_atten_v2",
"sel": "auto",
"rcut_smth": 0.5,
"rcut": 6.0,
"neuron": [
25,
50,
100
],
"resnet_dt": false,
"axis_neuron": 16,
"seed": 347445250,
"attn_layer": 0,
"attn_mask": false,
"attn_dotr": true,
"type_one_side": false,
"precision": "default",
"trainable": true,
"exclude_types": [],
"set_davg_zero": true,
"attn": 128,
"activation_function": "gelu"
},
"type_embedding": {
"neuron": [
8
],
"resnet_dt": false,
"seed": 1374335067,
"activation_function": "tanh",
"precision": "default",
"trainable": true
},
"fitting_net": {
"neuron": [
240,
240,
240
],
"resnet_dt": true,
"seed": 587928124,
"activation_function": "tanh"
}
},
"learning_rate": {
"type": "exp",
"start_lr": 0.001,
"decay_steps": 2000,
"decay_rate": 0.95
},
"loss": {
"start_pref_e": 0.02,
"limit_pref_e": 1,
"start_pref_f": 1000,
"limit_pref_f": 1,
"_start_pref_v": 0,
"_limit_pref_v": 1
},
"training": {
"set_prefix": "set",
"stop_batch": 400000,
"batch_size": [
1,
1,
1,
1
],
"seed": 1867518886,
"disp_file": "lcurve.out",
"disp_freq": 2000,
"save_freq": 2000,
"save_ckpt": "model.ckpt",
"disp_training": true,
"time_training": true,
"profiling": false,
"profiling_file": "timeline.json",
"_comment": "that's all",
"systems": [
"../data.init/init/deepmd/0.5M",
"../data.init/init/deepmd/1.0M",
"../data.init/init/deepmd/1.5M",
"../data.init/init/deepmd/2.0M"
]
}
}
Steps to Reproduce
The attached file is my working directory generated by DP-GEN. Enter the directory 00.train/002 and run the command "dp train input.json" to start the task. 10,000 to 50,000 steps are enough to tell whether training will converge; non-convergence can be observed after running the task several times.
Further Information, Files, and Links
302Na_atten.zip