"accelerate elementwise_add_grad, add reduce functor" #7961

dzhwinter · 2018-01-30T02:24:59Z

tonyyang-svail · 2018-02-01T05:05:26Z

@dzhwinter could you briefly explain why the original version is slow?

dzhwinter · 2018-02-02T11:37:40Z

The previous version used broadcast, which is a fairly inefficient operation.
After this optimization, the profile below shows that elementwise_add_grad now takes less time than the convolution ops.

------------------------->     Profiling Report     <-------------------------

Place: CPU	Total Time:41762.5ms	Total Memory:17689.2MB	Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.        Total Mem.  Min Mem.    Max Mem.
thread0::conv2d_grad             13          15602.1     276.188     5248.83     1200.16     9069.18     0.0078125   392.148
thread0::conv2d                  13          8975.5      219.889     3366.09     690.423     338.152     12.2539     392.004
thread0::dropout                 10          3036.16     0.300329    1213.68     303.616     1906.18     0.132812    784.008
thread0::elementwise_add_grad    16          2279.28     0.030858    544.775     142.455     8960.14     0.0195312   392.008
thread0::batch_norm_grad         14          1926.31     0.390424    462.51      137.594     8961.79     0.0742188   392.012
thread0::relu_grad               14          1689.34     0.087101    397.12      120.667     8961.71     0.0664062   392.004
thread0::pool2d_grad             5           1247.38     22.8108     617.521     249.475     9020.14     12.2539     392.004
thread0::batch_norm              14          1125.46     0.435152    264.698     80.39       1122.16     0.0742188   392.012
thread0::elementwise_add         16          991.63      0.031354    248.922     61.9769     730.156     0.015625    392.004
thread0::dropout_grad            10          808.702     0.048284    379.893     80.8702     8961.64     0.0664062   392.004
thread0::relu                    14          795.466     0.042214    189.24      56.819      1514.17     0.0664062   392.004
thread0::adam                    60          516.043     0.013363    238.108     8.60072     17688.7     0           0
thread0::fill_zeros_like         66          445.392     0.003328    185.057     6.74837     8961.57     0.00390625  392.004
thread0::pool2d                  5           342.792     7.17049     180.343     68.5584     4258.21     3.06641     98.0039
thread0::mul_grad                3           74.9344     0.593411    71.8969     24.9781     8960.16     0.269531    52.0703
thread0::mul                     3           44.4806     0.180829    43.4023     14.8269     8959.48     0.015625    0.0664062
thread0::elementwise_mul         60          0.363544    0.004298    0.022641    0.00605907  17688.7     0.00390625  0.00390625
thread0::softmax                 1           0.278477    0.278477    0.278477    0.278477    8960.05     0.015625    0.015625
thread0::fill_constant           61          0.260437    0.002431    0.025904    0.00426946  8960.11     0.00390625  0.00390625
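
To make the broadcast-versus-reduce point concrete: when the forward pass adds a smaller operand that is broadcast over a larger one, the gradient for the smaller operand is the sum of the output gradient over the broadcast axis. The following is a minimal sketch, assuming a row-major 2-D gradient of shape (rows, cols) and a 1-D operand of shape (cols). The function name `ReduceRowsToBias` is hypothetical and is not the functor added in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed forward pass: out[i][j] = x[i][j] + y[j], where y is broadcast
// along the row axis. The backward pass then needs dy[j] = sum_i dout[i][j].
// Hypothetical helper for illustration only, not the PR's reduce functor.
std::vector<float> ReduceRowsToBias(const std::vector<float>& dout,
                                    std::size_t rows, std::size_t cols) {
  assert(dout.size() == rows * cols);  // dout is stored row-major
  std::vector<float> dy(cols, 0.0f);
  for (std::size_t i = 0; i < rows; ++i) {
    const float* row = dout.data() + i * cols;
    // Accumulate one contiguous row at a time, so no intermediate
    // broadcast-shaped buffer is materialized.
    for (std::size_t j = 0; j < cols; ++j) {
      dy[j] += row[j];
    }
  }
  return dy;
}
```

Calling `ReduceRowsToBias(dout, 2, 3)` on a 2 x 3 gradient returns a 3-element vector of column sums. The gradient for the larger operand needs no reduction, since it is simply a copy of `dout`.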

dzhwinter · 2018-02-24T05:44:59Z

This solution relies on a reduce primitive. The CPU version is easy to implement, and the test results above show that it performs well. The GPU kernel, however, is hard to implement because GPU threads can overwrite each other's results.

I dug into the reduce kernel in https://github.com/zchee/cuda-sample/blob/master/6_Advanced/reduction/reduction_kernel.cu , but on my local machine my reduce kernel still does not work correctly, which is why this PR has been delayed for so long.

This issue has now been partly fixed without using the reduce kernel. Please see #8402 for details.
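
For readers unfamiliar with the overwriting problem, here is a rough sketch of the tree-reduction pattern that GPU reduction kernels are commonly built on, emulated serially in C++ so it can run anywhere. It is an illustration under stated assumptions, not the kernel from this PR or from the linked sample: the name `TreeReduceSum` is hypothetical, and each loop index `tid` stands in for one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial emulation of a block-level tree reduction (hypothetical helper).
// Each emulated thread `tid` owns slot `tid` of a scratch buffer. In every
// step, thread tid < stride adds slot tid + stride into its own slot, so no
// two threads ever write the same location within a step.
float TreeReduceSum(const std::vector<float>& input) {
  if (input.empty()) return 0.0f;

  // Pad the scratch buffer with zeros up to the next power of two.
  std::size_t n = 1;
  while (n < input.size()) n <<= 1;
  std::vector<float> scratch(n, 0.0f);
  for (std::size_t i = 0; i < input.size(); ++i) scratch[i] = input[i];

  for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
    // On a GPU, a barrier such as __syncthreads() would separate the steps
    // so that reads in one step see the writes of the previous step.
    for (std::size_t tid = 0; tid < stride; ++tid) {
      scratch[tid] += scratch[tid + stride];
    }
  }
  return scratch[0];
}
```

The design point is that writes go only to slots below `stride` while reads of other threads' data come only from slots at or above it, which is what keeps concurrent threads from clobbering each other. A naive kernel in which every thread accumulates directly into one shared output has no such separation.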

paddle-bot-old · 2020-05-22T06:40:10Z

Since you haven't replied for a long time, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.

dzhwinter added 2 commits January 30, 2018 00:38

"add reduce functor"

f13a2d6

"fix compile"

d993802

dzhwinter changed the title ~~"add reduce functor"~~ "accelerate elementwise_add_grad, add reduce functor" Jan 30, 2018

dzhwinter added 4 commits February 1, 2018 22:01

"test cpu speed"

43638ee

Merge remote-tracking branch 'origin/develop' into enhance/elememnt

d6b2363

try make copy in thrust binary op

769e84d

"try to use wrapper of iterator"

0d30456

paddle-bot-old bot closed this May 22, 2020
