Distributed stability testing #11289

kolinwei · 2018-06-07T11:49:38Z

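1. Functional verification

Verify that multi-machine training works correctly under different parameters and runtime conditions. The following dimensions need to be considered:

(1) Models

    Functional checks need to run quickly, so one CNN model and one RNN model are enough.
    resnet50 model, flowers dataset.
    seq2seq model, wmt14 dataset.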

(2) Multi-machine training scale

    Two scales: 1 ps with 2 trainers; 2 ps with 4 trainers. nccl uses 4 trainers.

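(3) Training configuration

   Synchronous and asynchronous
   pserver and nccl2
   CPU and GPU
   Other training parameters can follow #11206

Test combinations of the dimensions above to verify functionality, focusing mainly on training speed and convergence.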
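The combinations described above can be enumerated mechanically. The following is a minimal sketch, not the project's actual CE configuration: the names `MODELS`, `SCALES`, and `build_matrix` are hypothetical, and the rule that nccl2 cases run only on GPU in synchronous mode is an assumption made here to prune combinations that are unlikely to be meaningful.

```python
from itertools import product

# Dimensions copied from the plan; all identifiers here are illustrative.
MODELS = ["resnet50/flowers", "seq2seq/wmt14"]
DEVICES = ["CPU", "GPU"]
SYNC_MODES = ["sync", "async"]
# update method -> list of (num_pservers, num_trainers)
SCALES = {
    "pserver": [(1, 2), (2, 4)],
    "nccl2": [(0, 4)],  # nccl2 uses trainers only
}


def build_matrix(skip_invalid=True):
    """Return one dict per test case in the combined matrix."""
    cases = []
    for model, device, sync_mode, method in product(
        MODELS, DEVICES, SYNC_MODES, SCALES
    ):
        # Assumption: nccl2 is exercised only on GPU and only synchronously.
        if skip_invalid and method == "nccl2":
            if device != "GPU" or sync_mode != "sync":
                continue
        for pservers, trainers in SCALES[method]:
            cases.append({
                "model": model,
                "device": device,
                "sync_mode": sync_mode,
                "update_method": method,
                "pservers": pservers,
                "trainers": trainers,
            })
    return cases


if __name__ == "__main__":
    for case in build_matrix():
        print(case)
```

Listing the cases explicitly makes it easy to see how quickly the matrix grows, which is relevant when deciding what belongs in CE and what belongs in CI.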

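2. Long-running stability verification

Verify that training stays stable when larger models are trained continuously.

(1) Models

   Two models: SE-ResNeXt (ImageNet dataset) and transformer.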

(2) Training scale

   2 ps with 4 trainers; nccl uses 4 trainers.

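(3) Training configuration

   Synchronous and asynchronous
   CPU and GPU
   pserver and nccl2

Focus mainly on training convergence, speed, and memory usage. Training should be able to run stably for an extended period (1-2 days).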
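One way to turn "convergence" and "memory usage" into pass/fail signals for a long run is to fit a trend line to periodically sampled values. This is only a sketch under assumptions: the sampling itself, the function names `slope` and `check_run`, and the threshold `max_mem_slope_mb` are all hypothetical and not part of the plan above.

```python
def slope(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def check_run(loss, memory_mb, max_mem_slope_mb=1.0):
    """Summarize a run from sampled loss and memory readings.

    max_mem_slope_mb is an illustrative threshold: the allowed memory
    growth in MB per sampling interval before the run is flagged.
    """
    return {
        "loss_decreasing": slope(loss) < 0,
        "memory_stable": slope(memory_mb) <= max_mem_slope_mb,
    }


if __name__ == "__main__":
    print(check_run(loss=[3.0, 2.0, 1.5, 1.2],
                    memory_mb=[100, 100, 101, 100]))
```

A steadily positive memory slope over a 1-2 day run is a typical sign of a leak, while a non-negative loss slope suggests the run is not converging; the right thresholds would depend on the model and sampling interval.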


typhoonzero · 2018-06-08T02:37:17Z

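"Other training parameters can follow #11206"

Let CI handle the other configurations; running them in CE would make a single run take too long. Among the models being verified, if seq2seq includes a sparse embedding, it can also cover the sparse use case.

Distributed sparse scenarios also need to be verified later.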

panyx0718 · 2018-06-08T05:48:55Z

Doesn't this issue include aligning the final accuracy?

velconia · 2018-06-08T06:33:43Z

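I think we could simply use 2 ps and 2 trainers, with 2 trainers for nccl; that scale is good enough.

It covers the multi-pserver, multi-trainer case while also saving CE resources and speeding up the tests.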

shanyi15 · 2018-08-15T10:23:00Z

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

panyx0718 self-assigned this Jun 7, 2018

gongweibao self-assigned this Jul 10, 2018

shanyi15 closed this as completed Aug 15, 2018
