
On zero initialization and the placement of the expanded layers #28

Open
ouyanxi1125 opened this issue Jun 7, 2024 · 4 comments

Comments

@ouyanxi1125

  1. Regarding the placement of the expanded layers: in the source code they are distributed uniformly at equal intervals. Is there any theoretical or experimental basis for this?
  2. Regarding zero initialization: the code selects the modules matching 'down_proj' in k or 'o_proj' in k, i.e. the last linear layer of the attention block and of the MLP. Is there any theoretical or experimental basis for this choice?
    Thanks in advance for your answer!
@hills-code
Collaborator

  1. For the interleaved expansion we followed Figure (c) of "Automated Progressive Learning for Efficient Training of Vision Transformers". The expansion positions used here are not necessarily optimal; Yi-9B and SOLAR use different expansion positions, so this is worth exploring further (see the sketch after this comment for the basic mechanics).
  2. For zero initialization, an earlier paper, "Staged Training for Transformer Language Models", proposed zero-initializing the LayerNorm. We found that in the LLaMA setting, setting the LayerNorm weights to zero makes the gradients zero, so the new layers cannot be trained. We analyzed this in our paper and therefore switched to zero-initializing down_proj and o_proj instead.
    (attached image: the gradient analysis from the paper)
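For readers who want to see the mechanics, here is a minimal sketch (not the repository's actual script) of interleaved block expansion with zero-initialized output projections. It assumes a Hugging Face LlamaForCausalLM; the helper name `expand_interleaved` and the parameter `num_new_blocks` are hypothetical, while `self_attn.o_proj` and `mlp.down_proj` are the standard LLaMA module names.

```python
import copy

import torch
from transformers import LlamaForCausalLM


def expand_interleaved(model: LlamaForCausalLM, num_new_blocks: int) -> LlamaForCausalLM:
    """Hypothetical helper: insert copies of existing decoder layers at equal
    intervals and zero the last linear layer of attention and of the MLP in
    each copy, so every new block initially acts as an identity (the residual
    stream passes through unchanged)."""
    layers = model.model.layers
    n = len(layers)
    group = n // num_new_blocks  # one copy after each group of original layers

    new_layers = torch.nn.ModuleList()
    for i, layer in enumerate(layers):
        new_layers.append(layer)
        copies_so_far = len(new_layers) - (i + 1)
        if (i + 1) % group == 0 and copies_so_far < num_new_blocks:
            new_layer = copy.deepcopy(layer)
            # Zero-init the output projections of the copied block
            # (o_proj and down_proj carry no bias in LLaMA).
            new_layer.self_attn.o_proj.weight.data.zero_()
            new_layer.mlp.down_proj.weight.data.zero_()
            # Note: depending on the transformers version, the copied layer's
            # layer_idx may also need updating for correct KV-cache behavior.
            new_layers.append(new_layer)

    model.model.layers = new_layers
    model.config.num_hidden_layers = len(new_layers)
    return model
```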

@ouyanxi1125
Author

Thanks a lot! On the second point I read the paper and understand it well, but I still wonder why it has to be down_proj and o_proj specifically. Would up_proj also work?

@hills-code
Collaborator

We followed the adapter approach of zeroing at the output. I haven't worked out what happens when up_proj is zeroed, so I'm not sure whether it still receives gradients; you could try it (a quick check is sketched below).
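One quick way to try this empirically (a sketch, not from this repository) is to zero a single projection in a small LLaMA-style checkpoint, run one forward/backward pass, and inspect the gradient on the zeroed weight. The checkpoint name below is just an example; swapping `up_proj` for `o_proj`, `down_proj`, or `gate_proj` lets you compare the cases.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any small LLaMA-style checkpoint
model = AutoModelForCausalLM.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

# Zero one projection in the last decoder layer.
layer = model.model.layers[-1]
layer.mlp.up_proj.weight.data.zero_()

# One forward/backward pass on a dummy batch.
inputs = tok("hello world", return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()

# If this prints a nonzero value, the zeroed projection still receives gradients.
grad = layer.mlp.up_proj.weight.grad
print("max |grad| of zeroed up_proj:", grad.abs().max().item())
```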

@JiwenJ

JiwenJ commented Jul 26, 2024

Hi, if down_proj and o_proj are initialized to zero, do o_proj and down_proj still receive gradients?
