"accelerate elementwise_add_grad, add reduce functor" #7961
@dzhwinter could you briefly explain why the original version is slow?
The previous version uses broadcast, which is quite an inefficient operation.
-------------------------> Profiling Report <-------------------------
Place: CPU    Total Time: 41762.5 ms    Total Memory: 17689.2 MB
Sorted by total time in descending order in the same thread
Event                          Calls  Total     Min.      Max.      Ave.        Total Memory.  Min Memory.  Max Memory.
thread0::conv2d_grad           13     15602.1   276.188   5248.83   1200.16     9069.18        0.0078125    392.148
thread0::conv2d                13     8975.5    219.889   3366.09   690.423     338.152        12.2539      392.004
thread0::dropout               10     3036.16   0.300329  1213.68   303.616     1906.18        0.132812     784.008
thread0::elementwise_add_grad  16     2279.28   0.030858  544.775   142.455     8960.14        0.0195312    392.008
thread0::batch_norm_grad       14     1926.31   0.390424  462.51    137.594     8961.79        0.0742188    392.012
thread0::relu_grad             14     1689.34   0.087101  397.12    120.667     8961.71        0.0664062    392.004
thread0::pool2d_grad           5      1247.38   22.8108   617.521   249.475     9020.14        12.2539      392.004
thread0::batch_norm            14     1125.46   0.435152  264.698   80.39       1122.16        0.0742188    392.012
thread0::elementwise_add       16     991.63    0.031354  248.922   61.9769     730.156        0.015625     392.004
thread0::dropout_grad          10     808.702   0.048284  379.893   80.8702     8961.64        0.0664062    392.004
thread0::relu                  14     795.466   0.042214  189.24    56.819      1514.17        0.0664062    392.004
thread0::adam                  60     516.043   0.013363  238.108   8.60072     17688.7        0            0
thread0::fill_zeros_like       66     445.392   0.003328  185.057   6.74837     8961.57        0.00390625   392.004
thread0::pool2d                5      342.792   7.17049   180.343   68.5584     4258.21        3.06641      98.0039
thread0::mul_grad              3      74.9344   0.593411  71.8969   24.9781     8960.16        0.269531     52.0703
thread0::mul                   3      44.4806   0.180829  43.4023   14.8269     8959.48        0.015625     0.0664062
thread0::elementwise_mul       60     0.363544  0.004298  0.022641  0.00605907  17688.7        0.00390625   0.00390625
thread0::softmax               1      0.278477  0.278477  0.278477  0.278477    8960.05        0.015625     0.015625
thread0::fill_constant         61     0.260437  0.002431  0.025904  0.00426946  8960.11        0.00390625   0.00390625
This solution relies on a reduce primitive. The CPU version is quite easy to implement, and the test results shown above demonstrate its performance. The GPU kernel, however, is hard to implement because of the problem of GPU threads overwriting each other's results. I dug into the reduce kernel in https://github.com/zchee/cuda-sample/blob/master/6_Advanced/reduction/reduction_kernel.cu , but on my local machine the reduce kernel is still not implemented correctly, which is why this PR has been delayed for so long. For now, this issue has been partly fixed without using the reduce kernel. Please check the details in #8402.
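For readers following the thread, here is a rough CPU-only sketch of what a reduce-based gradient for the broadcast operand might look like. It is not the functor from this PR. The function name `ReduceGradForBroadcastOperand`, the `[pre, n, post]` layout, and the assumption that `Y` of shape `[n]` was broadcast along the middle axis of `X` are all hypothetical, chosen only for illustration. Under those assumptions, `dX` is just a copy of `dOut`, and `dY[j]` is the sum of `dOut[i][j][k]` over `i` and `k`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not Paddle's actual functor.
// Assumes Out = X + Y, where X (and therefore dOut) is a row-major tensor of
// shape [pre, n, post] and Y has shape [n], broadcast along the middle axis.
// Returns dY, where dY[j] is the sum of dOut[i][j][k] over all i and k.
std::vector<float> ReduceGradForBroadcastOperand(const std::vector<float>& dout,
                                                 std::size_t pre,
                                                 std::size_t n,
                                                 std::size_t post) {
  std::vector<float> dy(n, 0.0f);
  for (std::size_t i = 0; i < pre; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      // Start of the contiguous run of `post` elements that share index j.
      const float* run = dout.data() + (i * n + j) * post;
      float acc = 0.0f;
      for (std::size_t k = 0; k < post; ++k) {
        acc += run[k];
      }
      // A single thread owns dy[j] here, so nothing gets overwritten.
      dy[j] += acc;
    }
  }
  return dy;
}
```

For example, calling it with `dout` holding the values 1 through 12 and `pre = 2`, `n = 3`, `post = 2` should give `dy = {18, 26, 34}`. On the CPU the loop is sequential, so the accumulation into `dy[j]` is safe. A naive GPU port that lets many threads add to the same `dy[j]` would race, which is presumably the overwriting problem mentioned above; it would need atomics or a tree-style reduction instead.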
Since you haven't replied for a long time, we have closed this issue/PR.
fix #7862