"RuntimeError: can not compress model when attention layer is not 0." when using se_atten_v2 #3643

user-ting · 2024-04-04T03:14:45Z

Summary

I was training a model with "type": "se_atten_v2" in DeePMD-kit 2.2.7. The documentation says that "Descriptors with se_e2_a, se_e3, se_e2_r and se_atten_v2 types are supported by the model compression feature." However, I still got this error when compressing the model: "RuntimeError: can not compress model when attention layer is not 0."
Is there anything wrong with my training settings? Thank you.

DeePMD-kit Version

DeePMD-kit v2.2.7

TensorFlow Version

2.9.0

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

python 3.10.13; gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

Details

I trained the model in a dpgen workflow.
The para.json was:
{
"type_map": [
"C"
],
"mass_map": [
12
],
"init_data_prefix": "/home1/zhangzt/mlPotential_DP/dpgen_test_attention/init/",
"init_data_sys": [
"density_varying_init/1.0_3000K/deepmd_data",
"density_varying_init/1.5_3000K/deepmd_data",
"density_varying_init/2.0_3000K/deepmd_data",
"density_varying_init/2.4_3000K/deepmd_data",
"density_varying_init/2.8_3000K/deepmd_data",
"density_varying_init/3.0_4000K/deepmd_data",
"density_varying_init/3.0_5000K/deepmd_data",
"density_varying_init/3.2_3000K/deepmd_data",
"density_varying_init/3.4_4000K/deepmd_data",
"density_varying_init/3.4_5000K/deepmd_data",
"density_varying_init/3.6_3000K/deepmd_data",
"density_varying_init/4.0_5000K/deepmd_data",
"density_varying_init/4.4_5000K/deepmd_data"
],
"init_batch_size": [10,10,10,10,10,10,10,10,10,10,10,10,10],
"sys_configs": [
["../mdconfigs/dense_1.0/00*/POSCAR"],
["../mdconfigs/dense_1.5/00*/POSCAR"],
["../mdconfigs/dense_2.0/00*/POSCAR"],
["../mdconfigs/dense_2.4/00*/POSCAR"],
["../mdconfigs/dense_2.8/00*/POSCAR"],
["../mdconfigs/dense_3.0/00*/POSCAR"],
["../mdconfigs/dense_3.2/00*/POSCAR"],
["../mdconfigs/dense_3.4/00*/POSCAR"],
["../mdconfigs/dense_3.6/00*/POSCAR"],
["../mdconfigs/dense_4.0/00*/POSCAR"],
["../mdconfigs/dense_4.4/00*/POSCAR"]
],
"sys_batch_size": [8,8,8,8,8,8,8,8,8,8,8],
"numb_models": 4,
"default_training_param": {
"model": {
"_comment": " model parameters",
"type_map": [
"C"
],
"descriptor": {
"type": "se_atten_v2",
"sel": [
80
],
"rcut": 4.0,
"rcut_smth": 3.5,
"neuron": [
10,
20,
40
],
"resnet_dt": false,
"seed": 17
},
"fitting_net": {
"type": "ener",
"neuron": [
120,
120,
120
],
"resnet_dt": false,
"seed": 9
}
},
"learning_rate": {
"type": "exp",
"start_lr": 0.001,
"stop_lr": 1e-08,
"decay_steps": 500
},
"loss": {
"type": "ener",
"start_pref_e": 0.02,
"limit_pref_e": 2.0,
"start_pref_f": 1000,
"limit_pref_f": 1.0,
"start_pref_v": 0,
"limit_pref_v": 0
},
"training": {
"numb_steps": 10000,
"seed": 11,
"disp_file": "lcurve.out",
"disp_freq": 1,
"save_freq": 40,
"save_ckpt": "model.ckpt",
"disp_training": true,
"time_training": true,
"profiling": true,
"profiling_file": "timeline.json"

}
},
"dp_compress": true,
"fp_task_max": 100,
"fp_task_min": 10,
"model_devi_engine": "lammps",
"model_devi_jobs": [
{
"sys_idx": [0,3,
4,5,8,
9,10],
"temps": [300,1000],
"trj_freq": 10,
"nsteps": 2000,
"ensemble": "nvt"
},
{
"sys_idx": [1,2,
4,6,7,
9,10],
"temps": [300,2000],
"trj_freq": 10,
"nsteps": 2000,
"ensemble": "nvt"
}
],
"model_devi_dt": 0.002,
"model_devi_skip": 0,
"model_devi_f_trust_lo": 0.01,
"model_devi_f_trust_hi": 1,
"model_devi_clean_traj": false,
"shuffle_poscar": false,
"fp_style": "vasp",
"fp_pp_path": "../ff",
"fp_pp_files": [
"POTCAR"
],
"fp_incar": "../incar/INCAR"
}
And the input.json in the work path was:
{
"model": {
"type_map": [
"C"
],
"descriptor": {
"type": "se_atten_v2",
"sel": 80,
"rcut": 4.0,
"rcut_smth": 3.5,
"neuron": [
10,
20,
40
],
"resnet_dt": false,
"seed": 1509306854,
"axis_neuron": 4,
"activation_function": "tanh",
"type_one_side": false,
"precision": "default",
"trainable": true,
"exclude_types": [],
"attn": 128,
"attn_layer": 2,
"attn_dotr": true,
"attn_mask": false,
"set_davg_zero": false
},
"fitting_net": {
"type": "ener",
"neuron": [
120,
120,
120
],
"resnet_dt": false,
"seed": 3358502526,
"numb_fparam": 0,
"numb_aparam": 0,
"activation_function": "tanh",
"precision": "default",
"trainable": true,
"rcond": null,
"atom_ener": [],
"use_aparam_as_mask": false
},
"data_stat_nbatch": 10,
"data_stat_protect": 0.01,
"data_bias_nsample": 10,
"srtab_add_bias": true,
"type": "standard",
"compress": {
"model_file": "frozen_model.pb",
"min_nbor_dist": 1.0073871479227832,
"table_config": [
5,
0.01,
0.1,
-1
],
"type": "se_e2_a"
}
},
"learning_rate": {
"type": "exp",
"start_lr": 0.001,
"stop_lr": 1e-08,
"decay_steps": 500,
"scale_by_worker": "linear"
},
"loss": {
"type": "ener",
"start_pref_e": 0.02,
"limit_pref_e": 2.0,
"start_pref_f": 1000,
"limit_pref_f": 1.0,
"start_pref_v": 0,
"limit_pref_v": 0,
"start_pref_ae": 0.0,
"limit_pref_ae": 0.0,
"start_pref_pf": 0.0,
"limit_pref_pf": 0.0,
"enable_atom_ener_coeff": false,
"start_pref_gf": 0.0,
"limit_pref_gf": 0.0,
"numb_generalized_coord": 0
},
"training": {
"numb_steps": 10000,
"seed": 992369420,
"disp_file": "lcurve.out",
"disp_freq": 1,
"save_freq": 40,
"save_ckpt": "model-compression/model.ckpt",
"disp_training": true,
"time_training": true,
"profiling": true,
"profiling_file": "timeline.json",
"training_data": {
"systems": [
"../data.init/density_varying_init/1.0_3000K/deepmd_data",
"../data.init/density_varying_init/1.5_3000K/deepmd_data",
"../data.init/density_varying_init/2.0_3000K/deepmd_data",
"../data.init/density_varying_init/2.4_3000K/deepmd_data",
"../data.init/density_varying_init/2.8_3000K/deepmd_data",
"../data.init/density_varying_init/3.0_4000K/deepmd_data",
"../data.init/density_varying_init/3.0_5000K/deepmd_data",
"../data.init/density_varying_init/3.2_3000K/deepmd_data",
"../data.init/density_varying_init/3.4_4000K/deepmd_data",
"../data.init/density_varying_init/3.4_5000K/deepmd_data",
"../data.init/density_varying_init/3.6_3000K/deepmd_data",
"../data.init/density_varying_init/4.0_5000K/deepmd_data",
"../data.init/density_varying_init/4.4_5000K/deepmd_data"
],
"batch_size": [
10,
10,
10,
10,
10,
10,
10,
10,
10,
10,
10,
10,
10
],
"set_prefix": "set",
"auto_prob": "prob_sys_size",
"sys_probs": null
},
"validation_data": null,
"enable_profiler": false,
"tensorboard": false,
"tensorboard_log_dir": "log",
"tensorboard_freq": 1
}
}
However, I got the following error message in train.log when compressing the model:

DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO average training time: 4.8215 s/batch (exclude first 1 batches)
DEEPMD INFO finished training
DEEPMD INFO wall time: 65748.233 s
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD WARNING The following nodes are not in the graph: {'fitting_attr/aparam_nall', 'spin_attr/ntypes_spin'}. Skip freezeing these nodes. You may be freezing a checkpoint generated by an old version.
DEEPMD INFO The following nodes will be frozen: ['o_energy', 'model_attr/model_version', 'fitting_attr/dfparam', 'model_attr/tmap', 'o_virial', 'train_attr/min_nbor_dist', 'o_atom_energy', 'descrpt_attr/rcut', 'model_attr/model_type', 't_mesh', 'o_force', 'o_atom_virial', 'model_type', 'fitting_attr/daparam', 'train_attr/training_script', 'descrpt_attr/ntypes']
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
DEEPMD INFO 1640 ops in the final graph.
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO

DEEPMD INFO stage 1: compress the model
DEEPMD WARNING Switch to serial execution due to lack of horovod module.
DEEPMD INFO [DeePMD-kit ASCII-art banner]
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD INFO installed to: /home1/zhangzt/anaconda3/envs/deepmd
DEEPMD INFO source : v2.2.7
DEEPMD INFO source brach: HEAD
DEEPMD INFO source commit: 839f4fe
DEEPMD INFO source commit at: 2023-10-27 21:10:24 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build variant: cpu
DEEPMD INFO build with tf inc: /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/include;/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/../../../../include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: n0125
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
DEEPMD INFO training without frame parameter
Traceback (most recent call last):
  File "/home1/zhangzt/anaconda3/envs/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    deepmd_main(args)
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 82, in main
    compress(**dict_args)
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 150, in compress
    train(
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 280, in _do_work
    model.build(train_data, stop_batch, origin_type_map=origin_type_map)
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 298, in build
    self.model.enable_compression()
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/model/model.py", line 618, in enable_compression
    self.descrpt.enable_compression(
  File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/descriptor/se_atten.py", line 391, in enable_compression
    raise RuntimeError("can not compress model when attention layer is not 0.")
RuntimeError: can not compress model when attention layer is not 0.


njzjz · 2024-04-04T20:10:12Z

The error message is expected. The documentation needs to be clarified. (cc @nahso: please improve the documentation)
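The condition behind this error can be checked before a long training run. The following is a minimal sketch, not DeePMD-kit code: the function name `attention_blocks_compression` is hypothetical. It assumes, based only on the error message and traceback above, that the `se_atten` and `se_atten_v2` descriptors can be compressed only when `attn_layer` is 0. The fallback value of 2 for a missing key is taken from the generated input.json above, where `attn_layer` was filled in as 2 although para.json did not set it.

```python
import json

# Descriptor types assumed to hit the check in deepmd/descriptor/se_atten.py.
ATTENTION_TYPES = {"se_atten", "se_atten_v2"}


def attention_blocks_compression(descriptor: dict) -> bool:
    """Return True if an attention descriptor has a non-zero attn_layer."""
    if descriptor.get("type") not in ATTENTION_TYPES:
        return False  # this sketch only looks at attention descriptors
    # Assumption: a missing key means 2, as seen in the generated input.json.
    return descriptor.get("attn_layer", 2) != 0


config = json.loads(
    '{"model": {"descriptor": {"type": "se_atten_v2", "sel": 80, "attn_layer": 2}}}'
)
if attention_blocks_compression(config["model"]["descriptor"]):
    print("compression is expected to fail: attn_layer is not 0")
```

Running such a check on the training input before starting a dpgen iteration with "dp_compress": true would surface the problem before the training time is spent.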

user-ting · 2024-04-05T07:56:37Z

Thank you very much! So now I have to set "attn_layer": 0 in param.json to compress the model. But does that mean the attention mechanism is not used at all in this case? Is the performance of this model similar to a model with "type": "se_e2_a"?
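The change described here can be sketched as a small edit to the dpgen parameter dictionary. This is an illustration under assumptions: the helper name `disable_attention_layers` is hypothetical, and the key path `default_training_param -> model -> descriptor` is copied from the para.json shown earlier in this thread. It only makes the configuration compressible; it says nothing about the accuracy of the resulting model.

```python
import copy


def disable_attention_layers(dpgen_param: dict) -> dict:
    """Return a copy of a dpgen parameter dict with attn_layer set to 0."""
    updated = copy.deepcopy(dpgen_param)  # leave the caller's dict untouched
    descriptor = updated["default_training_param"]["model"]["descriptor"]
    descriptor["attn_layer"] = 0
    return updated


# Trimmed-down stand-in for the para.json from this thread.
param = {
    "dp_compress": True,
    "default_training_param": {
        "model": {
            "descriptor": {"type": "se_atten_v2", "sel": [80], "rcut": 4.0}
        }
    },
}
patched = disable_attention_layers(param)
print(patched["default_training_param"]["model"]["descriptor"])
```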

user-ting · 2024-04-05T16:19:15Z

Also, there are always errors like this:
2024-04-06 00:00:55,781 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 25659 terminated; fail_cout is 1; resubmitting job
2024-04-06 00:01:03,593 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a re-submit after terminated; new job_id is 26299
2024-04-06 00:01:05,772 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a job_id:26299 after re-submitting; the state now is <JobStatus.terminated: 4>
2024-04-06 00:01:05,772 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26299 terminated; fail_cout is 2; resubmitting job
2024-04-06 00:01:13,796 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a re-submit after terminated; new job_id is 26300
2024-04-06 00:01:15,950 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a job_id:26300 after re-submitting; the state now is <JobStatus.terminated: 4>
2024-04-06 00:01:15,951 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26300 terminated; fail_cout is 3; resubmitting job.
When a job has terminated 3 times, it is not resubmitted again, so the dpgen workflow terminates as well. I then have to rerun the dpgen run command again and again, hoping that it does not terminate, which is exhausting. Is there a solution to this problem?
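To see which jobs keep failing, the log lines above can be summarized per job hash. This is a rough sketch that assumes the line format stays exactly as in the excerpt (including the `fail_cout` spelling); the function name `failure_counts` is hypothetical. It does not address why the jobs terminate, which this thread does not establish.

```python
import re

# Matches e.g. "job: <hash> <job_id> terminated; fail_cout is 2"
TERMINATED = re.compile(
    r"job: ?(?P<job_hash>[0-9a-f]+) (?P<job_id>\d+) terminated; "
    r"fail_cout is (?P<count>\d+)"
)


def failure_counts(log_text: str) -> dict:
    """Map each job hash to the highest failure count seen in the log."""
    counts = {}
    for match in TERMINATED.finditer(log_text):
        job_hash = match["job_hash"]
        counts[job_hash] = max(counts.get(job_hash, 0), int(match["count"]))
    return counts


log = """\
2024-04-06 00:00:55,781 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 25659 terminated; fail_cout is 1; resubmitting job
2024-04-06 00:01:03,593 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a re-submit after terminated; new job_id is 26299
2024-04-06 00:01:05,772 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26299 terminated; fail_cout is 2; resubmitting job
2024-04-06 00:01:15,951 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26300 terminated; fail_cout is 3; resubmitting job.
"""
print(failure_counts(log))
```

A job that terminates within a few seconds of each resubmission, as in the timestamps above, usually points to a problem in the job itself rather than in the scheduler, so the job's own output files are the next place to look.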

njzjz · 2024-04-10T06:46:12Z

> But does that mean the attention mechanism is not used at all in this case? Is the performance of this model similar to a model with "type": "se_e2_a"?

See #3603

#3643

## Summary by CodeRabbit

- **Documentation**
  - Simplified the description for the number of attention layers in the code documentation.
  - Added a notice about model compression compatibility for the `se_atten_v2` descriptor in the documentation.

Signed-off-by: Jinzhe Zeng
Co-authored-by: Jinzhe Zeng

deepmodeling#3643

- **Documentation**
  - Simplified the description for the number of attention layers in the code documentation.
  - Added a notice about model compression compatibility for the `se_atten_v2` descriptor in the documentation.

Signed-off-by: Jinzhe Zeng
Co-authored-by: Jinzhe Zeng
(cherry picked from commit 62832e8)


user-ting added the wontfix label Apr 4, 2024

njzjz added Docs and removed wontfix labels Apr 4, 2024

njzjz assigned nahso Apr 4, 2024

nahso mentioned this issue Apr 30, 2024

Clarifiy se_atten_v2 compression doc #3727

Merged

njzjz linked a pull request Apr 30, 2024 that will close this issue

Clarifiy se_atten_v2 compression doc #3727

Merged

njzjz closed this as completed May 5, 2024
