[Hackathon No.40] Add the ASGD API to Paddle #58834
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
Sorry to inform you that afcc16f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Sorry to inform you that 1f3b0bd's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
@@ -191,6 +191,19 @@
    backward : as_strided_grad
    no_need_buffer : input

- op : asgd_
  args : (Tensor param, Tensor learning_rate, Tensor grad, Tensor d, Tensor y, Tensor n, Tensor master_param, bool multi_precision=false)
  output : Tensor(param_out), Tensor(d_out), Tensor(y_out), Tensor(master_param_out)
Keep the argument order consistent with the other optimizers; put grad in the second position.
Done
  func : asgd
  data_type : param
  data_transform :
    support_trans_dtype : learning_rate, n
Why does this need to be specified explicitly here?
This follows the Paddle SGD definition, but in ops.yaml only SGD has this entry. In my tests, removing it does not affect the results, so I suspect it is a historical leftover. Removed.
My earlier reply was wrong: after deleting this entry you need to re-run cmake, which I had not done before, sorry.
In practice a dtype-conversion error is raised; I am still investigating the exact cause.
This entry should not be removed.
The reason is explained below.
                   phi::dtype::bfloat16,
                   float,
                   double) {
  if (kernel_key.dtype() == phi::DataType::FLOAT16 ||
Isn't this already done in infermeta? Is it still necessary here?
This follows the Paddle SGD implementation, but judging from both the logic and the actual results it is indeed redundant. Removed.
python/paddle/optimizer/asgd.py
Outdated
        name=None,
    ):
        if learning_rate is None:
            raise ValueError("learning_rate is not set")
This error message is a bit ambiguous. If the user does not set learning_rate, it falls back to a default value; if the user explicitly passes None, the message should say "learning_rate should not be None".
Done
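For reference, a minimal standalone sketch of the adjusted check (the helper name `check_learning_rate` is hypothetical; the real check lives in `__init__`, where learning_rate defaults to 0.001, so the error only fires when None is passed explicitly):

```python
def check_learning_rate(learning_rate=0.001):
    # hypothetical stand-in for the __init__ check: the default value means
    # this branch is only reached when the caller explicitly passes None
    if learning_rate is None:
        raise ValueError("learning_rate should not be None")
    return learning_rate

check_learning_rate()        # fine, uses the default
# check_learning_rate(None)  # would raise: learning_rate should not be None
```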
python/paddle/optimizer/asgd.py
Outdated
        m = self._get_accumulator_master(self._m_acc_str, param_and_grad[0])

        ys = self._get_accumulator_master(self._y_acc_str, param_and_grad[0])
        y = paddle.assign(ys[paddle.mod(m, self._n_tensor).item()])
Doesn't using mod mean the selection is not actually random?
This follows the logic of PyTorch's SGD.
While reading the SGD source code I noticed that it has no explicit sampling step; I later found the answer here.
Roughly speaking, the optimizer itself is not responsible for sampling; the "randomness" comes from the data the user feeds in.
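To make the point concrete, here is a minimal standalone sketch of the slot selection (names such as ys, m and batch_num are illustrative stand-ins for the accumulators above, not the PR's exact code):

```python
import paddle

batch_num = 4                                    # number of gradient-history slots
ys = paddle.zeros([batch_num, 3])                # per-slot gradient history (toy shape)
n_tensor = paddle.to_tensor([batch_num], dtype='int64')
m = paddle.to_tensor([0], dtype='int64')         # step counter

for step in range(6):
    index = paddle.mod(m, n_tensor).item()
    # the slot cycles 0, 1, 2, 3, 0, 1, ... -- the optimizer itself does not
    # sample randomly; the "randomness" comes from the shuffled input batches
    y = paddle.assign(ys[index])
    m = m + 1
```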
Is assign used here? The y belonging to ys is supposed to be modified, but since it is assigned out here, will the corresponding y inside ys actually be changed?
The corresponding y inside ys should be, and is, modified.
In dynamic graph mode you can break here with pdb and inspect it; I have tried this and it is modified correctly.
In static graph mode, there is a unit test checking that the static and dynamic graph results agree, see here.
I spent a long time debugging this part; correctness is guaranteed.
    def __init__(
        self,
        learning_rate=0.001,
        batch_num=1,
What is the difference between batch_num and batch_size?
A batch is a slice cut from the full sample set, batch_size is the number of samples in each batch, and batch_num is the number of batches processed to complete one epoch.
For example, with a dataset of 200 samples (the full sample set) and a batch_size of 5, batch_num is 40, as the snippet below shows.
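In other words (using the hypothetical numbers from the example above):

```python
total_samples = 200                        # size of the full dataset
batch_size = 5                             # samples per batch
batch_num = total_samples // batch_size    # batches processed in one epoch
assert batch_num == 40
```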
If that is what it means, then within one epoch this ASGD optimizer has to store data for all samples? Then why hasn't torch implemented this version of ASGD?
Yes, within one epoch ASGD needs to keep the gradient information of all samples, because it needs it to:
- compute the average used to update the weight parameters;
- replace an old sample's gradient with the gradient of the newly "randomly sampled" sample at every iteration.
This is described in detail in the RFC.
As for why torch does not implement it this way: this is analyzed in the RFC. Judging from both the code logic and the existing comments, the current torch version has serious problems. As for why torch implemented a flawed optimizer, I dug through Baidu and this site bit by bit, and it really can no longer be traced.
To add to that: the SGD here is effectively BGD (the same holds for torch, which can be verified from that site), so only one gradient copy is kept per batch. E.g. in the example above we do not need to store gradients for 200 samples, only 40 gradient copies, as sketched below.
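A rough illustration of what that history costs (the shapes are assumptions for illustration, not the actual accumulator layout in the PR):

```python
import paddle

batch_num = 40
param = paddle.zeros([128, 64])
# one gradient snapshot is kept per batch, not per sample, so the history
# buffer holds batch_num copies of the parameter-shaped gradient
grad_history = paddle.zeros([batch_num] + list(param.shape))
print(grad_history.shape)   # [40, 128, 64]
```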
An optimization idea just occurred to me: batch_num does not have to equal total_samples / batch_size. When GPU memory is tight, batch_num can be set smaller, i.e. we still keep historical gradients, just fewer of them. E.g. if total_samples / batch_size is 1000 and batch_num = 1000 runs out of memory, the user can set batch_num to 100 or 10 instead.
Then I feel there is little difference from batch_size. We might as well use batch_size directly and keep one extra batch_size worth of gradient data as the historical gradients each time.
I don't think that works well.
Whatever the input argument is, the computation still has to know what batch_num is.
Passing only batch_size is not enough to derive batch_num; an extra argument for the total sample count would also be needed.
Let's keep it this way for now; the meaning of the batch_num parameter needs to be explained in detail in the documentation.
Done
        lr = self._create_param_lr(param_and_grad)

        if in_dynamic_or_pir_mode():
            _C_ops.asgd_(
To answer the question about the "explicit specification" in ops.yaml:
support_trans_dtype enables automatic dtype conversion. E.g. here lr has dtype float32, while the C++ learning_rate argument has dtype float16 (at the kernel level learning_rate shares param's dtype). If support_trans_dtype is not enabled for learning_rate, a dtype-mismatch error is raised.
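A small sketch of the dtype situation being described (illustrative only; the cast itself happens on the C++ side, driven by the ops.yaml entry):

```python
import paddle

# with multi_precision, param (and therefore the selected kernel, via
# `data_type : param`) is float16, while the Python-side lr tensor is float32
param = paddle.zeros([4], dtype='float16')
lr = paddle.to_tensor(0.001, dtype='float32')

# without `support_trans_dtype : learning_rate, n`, handing this fp32 lr to a
# kernel that expects learning_rate in fp16 raises a dtype-mismatch error;
# with it, the framework casts lr automatically before the kernel runs
```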
@GGBond8488 Please review again when you have time.
Could you review it first? The deadline is at 12 o'clock. @GGBond8488
@GGBond8488 Please take a look when you are free.
python/paddle/optimizer/asgd.py
Outdated
            assign_value
        )

    def update_sign(self, value):
What is the purpose of this?
sign marks whether intermediate variables are recorded. Recording them requires extra storage, so it defaults to false. If the intermediate variables need to be tested, this function can be used to change sign to true.
Which intermediate variables are saved, and what are they used for? If this is only for testing, add it as a patch in the test script instead of here; if it is not actually needed, do not add it.
OK, I will fix this later. Please look at the other question first when you have time, since it affects this part.
For these intermediate-variable checks, it is enough to confirm they are correct during development. What should actually go into the tests is whether the param update matches expectations, using the full model optimization flow rather than calling the single op directly.
I have revised the code; please check whether it meets the requirements.
            find_master,
        )

        self._assign_accumulator_master(
Does this approach assign the value out of the accumulator, modify it, and then assign it back? The asgd op above is already marked inplace, so it should be enough to just take the value out directly, right?
The static graph does not support inplace operations.
During testing I found that sometimes the dynamic graph updates correctly while the static graph fails to update.
With assign, both the dynamic and the static graph update correctly.
Is there a better way to write this?
The static graph does support inplace operations; you just need to pass the same variable as both the input and the output.
I don't understand what this dynamic/static inconsistency error means; the way you call _C_ops.asgd_ is dynamic graph anyway.
You said earlier that inplace can be achieved by "passing the same variable as both the input and the output". Here ys[index] is passed: it works fine in dynamic graph mode, but in static graph mode nothing is updated, and I can't figure out why yet.
Besides that, m + 1 needs an add call, but add_ carries the decorator @inplace_apis_in_dygraph_only. I tried calling _C_ops.add(x, y) directly and that did not work either.
Does this really have to be inplace?
Just distinguish between the dynamic graph and the static graph here.
In the dynamic graph, add inplace is simply m.add_(paddle.to_tensor([1])).
In the static graph, inplace is:
block.append_op(
    type="add",
    inputs={"x": m},
    outputs={"out": m},
)
The other inplace case below is the same; these are just the two different usages.
Understood. I have revised this part; the rough outline is as follows (see the sketch after this list):
- It is split into three cases: dynamic graph, static graph under PIR, and static graph without PIR.
- Dynamic graph: m and ys are updated inplace via add_ and _C_ops.asgd_.
- Static graph under PIR: handled via _assign_accumulator_master (the earlier approach); inplace updates are not supported. Calling add and _C_ops.asgd_ here does not update the data, and calling block.append_op raises an error outright.
- Static graph without PIR: m is updated inplace via block.append_op, while an inplace update of ys is not supported, so it is handled via paddle.static.setitem together with _assign_accumulator_master. Using paddle.static.setitem alone produces a strange bug: ys is updated correctly in the current step but reverts to the original values (all zeros) in the next step.
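A condensed sketch of that branching, using a hypothetical helper (the helper name, the flags, and the op type/keys follow the discussion above rather than the PR's exact code):

```python
import paddle

def update_counter(m, block=None, dygraph=True, use_pir=False):
    """Hypothetical illustration of the three code paths described above."""
    if dygraph:
        # dynamic graph: a genuine inplace update of the accumulator tensor
        m.add_(paddle.to_tensor([1], dtype=m.dtype))
        return m
    if use_pir:
        # static graph under PIR: inplace is not available, so a new value is
        # computed here and later written back via the accumulator helper
        return paddle.add(m, paddle.to_tensor([1], dtype=m.dtype))
    # legacy static graph: appending an op whose input and output are the same
    # variable behaves as an inplace update (op type/keys as in the review sketch)
    block.append_op(
        type="add",
        inputs={"x": m},
        outputs={"out": m},
    )
    return m
```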
python/paddle/optimizer/asgd.py
Outdated
        index = paddle.mod(m, self._n_tensor).item()
        y = paddle.assign(ys[index])

        m = paddle.assign(paddle.add(m, to_tensor([1], dtype=m.dtype)))
The add op already produces a new tensor by itself; is assigning it back meant to achieve an inplace-like effect? In that case, just use add's inplace API paddle.add_.
@GGBond8488 I have replied to everything; please review again.
            find_master,
        )

        self._assign_accumulator_master(
The static graph does support inplace operations; you just need to pass the same variable as both the input and the output.

        outputs = {
            "param_out": param_and_grad[0],
            "d_out": d,
What you have here is already inplace.
If ys itself were passed here, there would be no problem. But ys[index] is passed, which always leads to dynamic/static inconsistency; assign works around this. Is there a better way?
python/paddle/optimizer/asgd.py
Outdated
            assign_value
        )

    def update_sign(self, value):
Which intermediate variables are saved, and what are they used for? If this is only for testing, add it as a patch in the test script instead of here; if it is not actually needed, do not add it.
@GGBond8488 I have replied; please take a look when you are free. The patch change is not hard, but I don't have a good solution for the other issue yet.
@GGBond8488 The ys[index] issue is the old one; there is no good solution for now. Apart from that, are there any other problems? Please point them all out.
There should be no other problems.
@GGBond8488 Please take a look when you are not busy.
@GGBond8488 Please review again when you have time.
LGTM
        }

        add_outputs = {
            "Out": m,
This isn't really inplace either, is it? Just use add directly. And does m really need an inplace update?
- Is this not inplace because the input was to_tensor(m) rather than m? I have changed it to m; is it OK now?
- Here m needs to be updated so that it replaces the previously stored value. Since this is the static graph, add_ cannot be used; compared with assigning the value out and then assigning it back, using append_op seemed a bit better, which is why this implementation was chosen.
The CI cannot be rerun; this has already been reported.
@GGBond8488 The CI has passed and the m part has been revised; is it OK now?
LGTM
LGTM for docs
Please add links to the RFC and the Chinese documentation in the description above, and ensure that the content of the RFC is consistent with the code.
@jeff41404 Done
There is a difference between the API signature in the RFC and the code, which has been commented on in the RFC. According to the usual process, the RFC should be merged first and then the code.
The RFC has been corrected. The comments in the code also need to be modified. Can I still push here? Will it disrupt the "approve" state, or should I open a new PR?
Please still push here.
Sorry, the comments in the code already include "multi_precision".
Done
LGTM
PR types
Others
PR changes
APIs
Description
Add the ASGD API to Paddle
Related links
Current progress