"accelerate elementwise_add_grad, add reduce functor" #7961

Closed
wants to merge 6 commits into from


dzhwinter
Contributor

fix #7862

@dzhwinter dzhwinter changed the title "add reduce functor" "accelerate elementwise_add_grad, add reduce functor" Jan 30, 2018
@tonyyang-svail

@dzhwinter could you briefly explain why the original version is slow?

@dzhwinter
Contributor Author

dzhwinter commented Feb 2, 2018

The previous version used broadcast, which is a rather inefficient operation.
After this change, the profiling result shows that elementwise_add_grad now takes less time than the convolution ops.

------------------------->     Profiling Report     <-------------------------

Place: CPU	Total Time:41762.5ms	Total Memory:17689.2MB	Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.        Total Mem.  Min Mem.    Max Mem.
thread0::conv2d_grad             13          15602.1     276.188     5248.83     1200.16     9069.18     0.0078125   392.148
thread0::conv2d                  13          8975.5      219.889     3366.09     690.423     338.152     12.2539     392.004
thread0::dropout                 10          3036.16     0.300329    1213.68     303.616     1906.18     0.132812    784.008
thread0::elementwise_add_grad    16          2279.28     0.030858    544.775     142.455     8960.14     0.0195312   392.008
thread0::batch_norm_grad         14          1926.31     0.390424    462.51      137.594     8961.79     0.0742188   392.012
thread0::relu_grad               14          1689.34     0.087101    397.12      120.667     8961.71     0.0664062   392.004
thread0::pool2d_grad             5           1247.38     22.8108     617.521     249.475     9020.14     12.2539     392.004
thread0::batch_norm              14          1125.46     0.435152    264.698     80.39       1122.16     0.0742188   392.012
thread0::elementwise_add         16          991.63      0.031354    248.922     61.9769     730.156     0.015625    392.004
thread0::dropout_grad            10          808.702     0.048284    379.893     80.8702     8961.64     0.0664062   392.004
thread0::relu                    14          795.466     0.042214    189.24      56.819      1514.17     0.0664062   392.004
thread0::adam                    60          516.043     0.013363    238.108     8.60072     17688.7     0           0
thread0::fill_zeros_like         66          445.392     0.003328    185.057     6.74837     8961.57     0.00390625  392.004
thread0::pool2d                  5           342.792     7.17049     180.343     68.5584     4258.21     3.06641     98.0039
thread0::mul_grad                3           74.9344     0.593411    71.8969     24.9781     8960.16     0.269531    52.0703
thread0::mul                     3           44.4806     0.180829    43.4023     14.8269     8959.48     0.015625    0.0664062
thread0::elementwise_mul         60          0.363544    0.004298    0.022641    0.00605907  17688.7     0.00390625  0.00390625
thread0::softmax                 1           0.278477    0.278477    0.278477    0.278477    8960.05     0.015625    0.015625
thread0::fill_constant           61          0.260437    0.002431    0.025904    0.00426946  8960.11     0.00390625  0.00390625
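
The broadcast-versus-reduce point above can be illustrated with a minimal sketch. It assumes the simplest case, where `X` has shape `[n, c]` and `Y` has shape `[c]` and is broadcast over the rows. The gradient for `Y` is then just a column-wise sum of `dOut`, so no broadcast tensor needs to be materialized. The names `AddGrad` and `ElementwiseAddGradByReduce` are hypothetical and are not taken from this PR or from PaddlePaddle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not PaddlePaddle's API.
// Forward:  Out[i][j] = X[i][j] + Y[j], with Y broadcast over the n rows.
// Backward: dX = dOut, and dY[j] = sum over i of dOut[i][j].
struct AddGrad {
  std::vector<float> dx;  // shape [n, c], row-major
  std::vector<float> dy;  // shape [c]
};

AddGrad ElementwiseAddGradByReduce(const std::vector<float>& dout,
                                   std::size_t n, std::size_t c) {
  assert(dout.size() == n * c);
  AddGrad g;
  g.dx = dout;           // gradient w.r.t. X is a plain copy of dOut
  g.dy.assign(c, 0.0f);  // gradient w.r.t. Y is a reduction over rows
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < c; ++j) {
      g.dy[j] += dout[i * c + j];
    }
  }
  return g;
}
```

On the CPU this is a single pass over `dOut`, which is consistent with the comment that the CPU version of the reduce is easy to write.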

@dzhwinter
Contributor Author

dzhwinter commented Feb 24, 2018

This solution relies on a reduce primitive. The CPU version is quite easy to implement, and the test result above shows that it performs well. The GPU kernel, however, is hard to implement because of the GPU thread overwriting problem.

I dug into the reduce kernel in https://github.com/zchee/cuda-sample/blob/master/6_Advanced/reduction/reduction_kernel.cu , but on my local machine the reduce kernel is still not implemented correctly, which is why this PR has been delayed for so long.
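
For readers unfamiliar with the overwriting problem: it arises when many threads accumulate into the same output slot at once. A common way around it is a tree reduction, in which each step assigns every thread its own slot to write. The sketch below simulates that pattern sequentially in plain C++. It is not the kernel from this PR or from the linked sample, and `TreeReduceSum` is a hypothetical name. On a real GPU each pass of the inner loop would run in parallel and be followed by a barrier.

```cpp
#include <cstddef>
#include <vector>

// Sequential simulation of a block-level tree reduction (illustrative only).
// `tid` plays the role of a GPU thread index within one block.
float TreeReduceSum(std::vector<float> buf) {
  if (buf.empty()) return 0.0f;
  // Pad with zeros to a power of two so that halving always pairs up slots.
  std::size_t size = 1;
  while (size < buf.size()) size *= 2;
  buf.resize(size, 0.0f);
  for (std::size_t stride = size / 2; stride > 0; stride /= 2) {
    for (std::size_t tid = 0; tid < stride; ++tid) {
      // "Thread" tid writes only buf[tid] and reads buf[tid + stride].
      // Writes land in [0, stride) and cross-thread reads in
      // [stride, 2 * stride), so no thread overwrites another's input.
      buf[tid] += buf[tid + stride];
    }
  }
  return buf[0];
}
```

The design choice here is that the write set and the read set of each step are disjoint, which is what removes the need for atomic updates within a block.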

This issue has since been partly fixed without using the reduce kernel. See #8402 for details.

@paddle-bot-old paddle-bot-old bot closed this May 22, 2020
@paddle-bot-old

Since you haven't replied for a long time, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.

Successfully merging this pull request may close these issues.

elementwise_add_grad should be optimized
2 participants