Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"RuntimeError: can not compress model when attention layer is not 0." when using se_atten_v2 #3643

Closed
user-ting opened this issue Apr 4, 2024 · 4 comments · Fixed by #3727
Assignees
Labels

Comments

@user-ting
Copy link

Summary

I was training a model with "type": "se_atten_v2" in deepmd-kit 2.2.7, I believe that the documentation says that "Descriptors with se_e2_a, se_e3, se_e2_r and se_atten_v2 types are supported by the model compression feature.", however, i still got the error when compressing the model: "RuntimeError: can not compress model when attention layer is not 0."
Is there anying wrong with my training settings, thank you

DeePMD-kit Version

DeePMD-kit v2.2.7

TensorFlow Version

2.9.0

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

python 3.10.13; gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

Details

  1. I train the model in a dpgen workflow.
    The para.json was:
    {
    "type_map": [
    "C"
    ],
    "mass_map": [
    12
    ],
    "init_data_prefix": "/home1/zhangzt/mlPotential_DP/dpgen_test_attention/init/",
    "init_data_sys": [
    "density_varying_init/1.0_3000K/deepmd_data",
    "density_varying_init/1.5_3000K/deepmd_data",
    "density_varying_init/2.0_3000K/deepmd_data",
    "density_varying_init/2.4_3000K/deepmd_data",
    "density_varying_init/2.8_3000K/deepmd_data",
    "density_varying_init/3.0_4000K/deepmd_data",
    "density_varying_init/3.0_5000K/deepmd_data",
    "density_varying_init/3.2_3000K/deepmd_data",
    "density_varying_init/3.4_4000K/deepmd_data",
    "density_varying_init/3.4_5000K/deepmd_data",
    "density_varying_init/3.6_3000K/deepmd_data",
    "density_varying_init/4.0_5000K/deepmd_data",
    "density_varying_init/4.4_5000K/deepmd_data"
    ],
    "init_batch_size": [10,10,10,10,10,10,10,10,10,10,10,10,10],
    "sys_configs": [
    ["../mdconfigs/dense_1.0/00*/POSCAR"],
    ["../mdconfigs/dense_1.5/00*/POSCAR"],
    ["../mdconfigs/dense_2.0/00*/POSCAR"],
    ["../mdconfigs/dense_2.4/00*/POSCAR"],
    ["../mdconfigs/dense_2.8/00*/POSCAR"],
    ["../mdconfigs/dense_3.0/00*/POSCAR"],
    ["../mdconfigs/dense_3.2/00*/POSCAR"],
    ["../mdconfigs/dense_3.4/00*/POSCAR"],
    ["../mdconfigs/dense_3.6/00*/POSCAR"],
    ["../mdconfigs/dense_4.0/00*/POSCAR"],
    ["../mdconfigs/dense_4.4/00*/POSCAR"]
    ],
    "sys_batch_size": [8,8,8,8,8,8,8,8,8,8,8],
    "numb_models": 4,
    "default_training_param": {
    "model": {
    "_comment": " model parameters",
    "type_map": [
    "C"
    ],
    "descriptor": {
    "type": "se_atten_v2",
    "sel": [
    80
    ],
    "rcut": 4.0,
    "rcut_smth": 3.5,
    "neuron": [
    10,
    20,
    40
    ],
    "resnet_dt": false,
    "seed": 17
    },
    "fitting_net": {
    "type": "ener",
    "neuron": [
    120,
    120,
    120
    ],
    "resnet_dt": false,
    "seed": 9
    }
    },
    "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-08,
    "decay_steps": 500
    },
    "loss": {
    "type": "ener",
    "start_pref_e": 0.02,
    "limit_pref_e": 2.0,
    "start_pref_f": 1000,
    "limit_pref_f": 1.0,
    "start_pref_v": 0,
    "limit_pref_v": 0
    },
    "training": {
    "numb_steps": 10000,
    "seed": 11,
    "disp_file": "lcurve.out",
    "disp_freq": 1,
    "save_freq": 40,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": true,
    "profiling_file": "timeline.json"

    }
    },
    "dp_compress": true,
    "fp_task_max": 100,
    "fp_task_min": 10,
    "model_devi_engine": "lammps",
    "model_devi_jobs": [
    {
    "sys_idx": [0,3,
    4,5,8,
    9,10],
    "temps": [300,1000],
    "trj_freq": 10,
    "nsteps": 2000,
    "ensemble": "nvt"
    },
    {
    "sys_idx": [1,2,
    4,6,7,
    9,10],
    "temps": [300,2000],
    "trj_freq": 10,
    "nsteps": 2000,
    "ensemble": "nvt"
    }
    ],
    "model_devi_dt": 0.002,
    "model_devi_skip": 0,
    "model_devi_f_trust_lo": 0.01,
    "model_devi_f_trust_hi": 1,
    "model_devi_clean_traj": false,
    "shuffle_poscar": false,
    "fp_style": "vasp",
    "fp_pp_path": "../ff",
    "fp_pp_files": [
    "POTCAR"
    ],
    "fp_incar": "../incar/INCAR"
    }

  2. And the input.json in the work path was:
    {
    "model": {
    "type_map": [
    "C"
    ],
    "descriptor": {
    "type": "se_atten_v2",
    "sel": 80,
    "rcut": 4.0,
    "rcut_smth": 3.5,
    "neuron": [
    10,
    20,
    40
    ],
    "resnet_dt": false,
    "seed": 1509306854,
    "axis_neuron": 4,
    "activation_function": "tanh",
    "type_one_side": false,
    "precision": "default",
    "trainable": true,
    "exclude_types": [],
    "attn": 128,
    "attn_layer": 2,
    "attn_dotr": true,
    "attn_mask": false,
    "set_davg_zero": false
    },
    "fitting_net": {
    "type": "ener",
    "neuron": [
    120,
    120,
    120
    ],
    "resnet_dt": false,
    "seed": 3358502526,
    "numb_fparam": 0,
    "numb_aparam": 0,
    "activation_function": "tanh",
    "precision": "default",
    "trainable": true,
    "rcond": null,
    "atom_ener": [],
    "use_aparam_as_mask": false
    },
    "data_stat_nbatch": 10,
    "data_stat_protect": 0.01,
    "data_bias_nsample": 10,
    "srtab_add_bias": true,
    "type": "standard",
    "compress": {
    "model_file": "frozen_model.pb",
    "min_nbor_dist": 1.0073871479227832,
    "table_config": [
    5,
    0.01,
    0.1,
    -1
    ],
    "type": "se_e2_a"
    }
    },
    "learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-08,
    "decay_steps": 500,
    "scale_by_worker": "linear"
    },
    "loss": {
    "type": "ener",
    "start_pref_e": 0.02,
    "limit_pref_e": 2.0,
    "start_pref_f": 1000,
    "limit_pref_f": 1.0,
    "start_pref_v": 0,
    "limit_pref_v": 0,
    "start_pref_ae": 0.0,
    "limit_pref_ae": 0.0,
    "start_pref_pf": 0.0,
    "limit_pref_pf": 0.0,
    "enable_atom_ener_coeff": false,
    "start_pref_gf": 0.0,
    "limit_pref_gf": 0.0,
    "numb_generalized_coord": 0
    },
    "training": {
    "numb_steps": 10000,
    "seed": 992369420,
    "disp_file": "lcurve.out",
    "disp_freq": 1,
    "save_freq": 40,
    "save_ckpt": "model-compression/model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": true,
    "profiling_file": "timeline.json",
    "training_data": {
    "systems": [
    "../data.init/density_varying_init/1.0_3000K/deepmd_data",
    "../data.init/density_varying_init/1.5_3000K/deepmd_data",
    "../data.init/density_varying_init/2.0_3000K/deepmd_data",
    "../data.init/density_varying_init/2.4_3000K/deepmd_data",
    "../data.init/density_varying_init/2.8_3000K/deepmd_data",
    "../data.init/density_varying_init/3.0_4000K/deepmd_data",
    "../data.init/density_varying_init/3.0_5000K/deepmd_data",
    "../data.init/density_varying_init/3.2_3000K/deepmd_data",
    "../data.init/density_varying_init/3.4_4000K/deepmd_data",
    "../data.init/density_varying_init/3.4_5000K/deepmd_data",
    "../data.init/density_varying_init/3.6_3000K/deepmd_data",
    "../data.init/density_varying_init/4.0_5000K/deepmd_data",
    "../data.init/density_varying_init/4.4_5000K/deepmd_data"
    ],
    "batch_size": [
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10,
    10
    ],
    "set_prefix": "set",
    "auto_prob": "prob_sys_size",
    "sys_probs": null
    },
    "validation_data": null,
    "enable_profiler": false,
    "tensorboard": false,
    "tensorboard_log_dir": "log",
    "tensorboard_freq": 1
    }
    }

  3. However, i got the error message in the train.log when compressing the model:

DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO average training time: 4.8215 s/batch (exclude first 1 batches)
DEEPMD INFO finished training
DEEPMD INFO wall time: 65748.233 s
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD WARNING The following nodes are not in the graph: {'fitting_attr/aparam_nall', 'spin_attr/ntypes_spin'}. Skip freezeing these nodes. You may be freezing a checkpoint generated by an old version.
DEEPMD INFO The following nodes will be frozen: ['o_energy', 'model_attr/model_version', 'fitting_attr/dfparam', 'model_attr/tmap', 'o_virial', 'train_attr/min_nbor_dist', 'o_atom_energy', 'descrpt_attr/rcut', 'model_attr/model_type', 't_mesh', 'o_force', 'o_atom_virial', 'model_type', 'fitting_attr/daparam', 'train_attr/training_script', 'descrpt_attr/ntypes']
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
DEEPMD INFO 1640 ops in the final graph.
WARNING:tensorflow:From /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO

DEEPMD INFO stage 1: compress the model
DEEPMD WARNING Switch to serial execution due to lack of horovod module.
DEEPMD INFO _____ _____ __ __ _____ _ _ _
DEEPMD INFO | __ \ | __ \ | / || __ \ | | ()| |
DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |

DEEPMD INFO | | | | / _ \ / _ | / | |/| || | | |||| |/ /| || |
DEEPMD INFO | || || /| /| | | | | || || | | < | || |
DEEPMD INFO |
/ _| _||| || |_||____/ ||_|| __|
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD INFO installed to: /home1/zhangzt/anaconda3/envs/deepmd
DEEPMD INFO source : v2.2.7
DEEPMD INFO source brach: HEAD
DEEPMD INFO source commit: 839f4fe
DEEPMD INFO source commit at: 2023-10-27 21:10:24 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build variant: cpu
DEEPMD INFO build with tf inc: /home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/include;/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/../../../../include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: n0125
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
DEEPMD INFO training without frame parameter
Traceback (most recent call last):
File "/home1/zhangzt/anaconda3/envs/deepmd/bin/dp", line 10, in
sys.exit(main())
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
deepmd_main(args)
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 82, in main
compress(**dict_args)
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 150, in compress
train(
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
_do_work(jdata, run_opt, is_compress)
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 280, in _do_work
model.build(train_data, stop_batch, origin_type_map=origin_type_map)
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 298, in build
self.model.enable_compression()
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/model/model.py", line 618, in enable_compression
self.descrpt.enable_compression(
File "/home1/zhangzt/anaconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/descriptor/se_atten.py", line 391, in enable_compression
raise RuntimeError("can not compress model when attention layer is not 0.")
RuntimeError: can not compress model when attention layer is not 0.

@njzjz njzjz added Docs and removed wontfix labels Apr 4, 2024
@njzjz
Copy link
Member

njzjz commented Apr 4, 2024

The error message is expected. The documentation needs to be clarified. (cc @nahso: please improve the documentation)

@user-ting
Copy link
Author

user-ting commented Apr 5, 2024

Thank you very much! So now i have to set "attn_layer": 0 in param.json to compress the model. But does it mean that the attention mechanism is not used at all in this case? Is the performance of this model similar to the model with "type": "se_e2_a"?

@user-ting
Copy link
Author

Also, there are always errors like this:
2024-04-06 00:00:55,781 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 25659 terminated; fail_cout is 1; resubmitting job
2024-04-06 00:01:03,593 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a re-submit after terminated; new job_id is 26299
2024-04-06 00:01:05,772 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a job_id:26299 after re-submitting; the state now is <JobStatus.terminated: 4>
2024-04-06 00:01:05,772 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26299 terminated; fail_cout is 2; resubmitting job
2024-04-06 00:01:13,796 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a re-submit after terminated; new job_id is 26300
2024-04-06 00:01:15,950 - INFO : job:fc025e2356b34117afc0f7ae0925bf77d734dd7a job_id:26300 after re-submitting; the state now is <JobStatus.terminated: 4>
2024-04-06 00:01:15,951 - INFO : job: fc025e2356b34117afc0f7ae0925bf77d734dd7a 26300 terminated; fail_cout is 3; resubmitting job.
When a job terminated for 3 times, it won't re-submit again and thus the dpgen workflow also terminated. It requires resubmitting the dpgen run command again and again, if fortunately it won't terminate, which is exhuasting. Is there any solution for this problem?

@njzjz
Copy link
Member

njzjz commented Apr 10, 2024

But does it mean that the attention mechanism is not used at all in this case? Is the performance of this model similar to the model with "type": "se_e2_a"?

See #3603

@njzjz njzjz linked a pull request Apr 30, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue May 5, 2024
#3643

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **Documentation**
- Simplified the description for the number of attention layers in the
code documentation.
- Added a notice about model compression compatibility for `se_atten_v2`
descriptor in the documentation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
@njzjz njzjz closed this as completed May 5, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jul 2, 2024
deepmodeling#3643

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

- **Documentation**
- Simplified the description for the number of attention layers in the
code documentation.
- Added a notice about model compression compatibility for `se_atten_v2`
descriptor in the documentation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
(cherry picked from commit 62832e8)
Signed-off-by: Jinzhe Zeng <[email protected]>
njzjz added a commit that referenced this issue Jul 3, 2024
#3643

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

- **Documentation**
- Simplified the description for the number of attention layers in the
code documentation.
- Added a notice about model compression compatibility for `se_atten_v2`
descriptor in the documentation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
(cherry picked from commit 62832e8)
Signed-off-by: Jinzhe Zeng <[email protected]>
mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this issue Sep 18, 2024
deepmodeling#3643

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **Documentation**
- Simplified the description for the number of attention layers in the
code documentation.
- Added a notice about model compression compatibility for `se_atten_v2`
descriptor in the documentation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants