Speed/sequence op1 #9217
Conversation
22f414c to 2524003
Please add benchmark details between these two versions.
Is the 8x speedup for the GPU, while CPU performance stays the same?
if (i == index[tid]) {
  in_grad[item_dim * i + tid] = out_grad[tid];
} else {
  in_grad[item_dim * i + tid] = static_cast<T>(0);
}
If we first set everything to 0, and then, based on the if condition, only do in_grad[item_dim * i + tid] = out_grad[tid], would that be even faster? LastPool and FirstPool are similar.
No, because zero-filling everything first would require an extra CUDA kernel call. Reducing the number of kernel calls greatly speeds up execution.
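
To illustrate the point, here is a minimal CUDA sketch (toy names and shapes, not the PR's actual code) contrasting the two strategies: the fused kernel writes either the gradient or zero in a single launch, while the zero-fill-first variant costs an extra memset/kernel launch every call.

#include <cuda_runtime.h>

// Fused: one launch writes out_grad at the max position and zero elsewhere.
template <typename T>
__global__ void MaxGradFused(const T* out_grad, const int* index, T* in_grad,
                             int seq_len, int item_dim) {
  for (int tid = threadIdx.x; tid < item_dim; tid += blockDim.x) {
    for (int i = 0; i < seq_len; ++i) {
      in_grad[item_dim * i + tid] =
          (i == index[tid]) ? out_grad[tid] : static_cast<T>(0);
    }
  }
}

// Alternative: scatter only, which requires in_grad to be zeroed beforehand
// (an extra memset/kernel launch before this one).
template <typename T>
__global__ void MaxGradScatter(const T* out_grad, const int* index, T* in_grad,
                               int item_dim) {
  for (int tid = threadIdx.x; tid < item_dim; tid += blockDim.x) {
    in_grad[item_dim * index[tid] + tid] = out_grad[tid];
  }
}

int main() {
  const int seq_len = 4, item_dim = 8;
  float *out_grad, *in_grad;
  int *index;
  cudaMalloc(&out_grad, item_dim * sizeof(float));
  cudaMalloc(&in_grad, seq_len * item_dim * sizeof(float));
  cudaMalloc(&index, item_dim * sizeof(int));
  cudaMemset(out_grad, 0, item_dim * sizeof(float));
  cudaMemset(index, 0, item_dim * sizeof(int));  // all max positions = row 0

  // Strategy 1: one launch, no prior zero-fill.
  MaxGradFused<float><<<1, 128>>>(out_grad, index, in_grad, seq_len, item_dim);

  // Strategy 2: two launches (zero-fill, then scatter).
  cudaMemset(in_grad, 0, seq_len * item_dim * sizeof(float));
  MaxGradScatter<float><<<1, 128>>>(out_grad, index, in_grad, item_dim);

  cudaDeviceSynchronize();
  cudaFree(out_grad);
  cudaFree(in_grad);
  cudaFree(index);
  return 0;
}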
# return x, lod, out

# def compute(self, x, lod, out):
#     self.attrs = {'pooltype': "FIRST"}
The unused code on lines 32-42 can be removed.
done.
self.attrs = {'pooltype': "SUM"}
for i in range(4):
    sub_x = x[lod[0][i]:lod[0][i + 1], :]
    out[i] = sub_x.sum(axis=0)
Does test_seq_pool.py just reorder some of the unit tests?
Yes.
T, MaxPoolGradFunctor<T>><<<grid, threads, 0, context.stream()>>>(
    MaxPoolGradFunctor<T>(), out_grad.data<T>(),
    lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
    in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());
Could lod.CUDAData(context.GetPlace()) and in_grad->mutable_data<T>(context.GetPlace()) etc. be assigned to temporary variables before the if condition? That would make the code on lines 303-338 shorter. Lines 141-178 are similar.
I don't think that is a good way to save lines of code. One rule in the Google style guide is to keep a declaration as close as possible to the place where it is used. If we forward-declare these values as temporary variables, readers have to look back for them when reading the following code.
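
For illustration, a toy sketch of the two styles under discussion (hypothetical ToyGradKernel, not Paddle's real code): Option A keeps each argument expression at the launch site, while Option B hoists temporaries before the branch, shortening repeated launch statements at the cost of moving definitions away from their use.

#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
__global__ void ToyGradKernel(const T* out_grad, const size_t* lod,
                              size_t lod_size, T* in_grad) {}

// Option A (as in the PR): arguments spelled out inline at each launch site.
void LaunchInline(const float* out_grad, const size_t* lod, size_t lod_size,
                  float* in_grad, cudaStream_t stream) {
  ToyGradKernel<float><<<1, 128, 0, stream>>>(out_grad, lod, lod_size,
                                              in_grad);
}

// Option B (reviewer's suggestion): hoist common expressions into
// temporaries once, before the if/else ladder that picks the functor.
void LaunchWithTemps(const float* out_grad, const size_t* lod, size_t lod_size,
                     float* in_grad, cudaStream_t stream) {
  const size_t* lod_data = lod;   // stands in for lod.CUDAData(place)
  float* in_grad_data = in_grad;  // stands in for mutable_data<T>(place)
  ToyGradKernel<float><<<1, 128, 0, stream>>>(out_grad, lod_data, lod_size,
                                              in_grad_data);
}

int main() {
  LaunchInline(nullptr, nullptr, 0, nullptr, 0);
  LaunchWithTemps(nullptr, nullptr, 0, nullptr, 0);
  cudaDeviceSynchronize();
  return 0;
}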
LGTM. @qingqing01, do you have any suggestions?
* commit '33b8b3d22034423455a493712955e419aac7b19b': (251 commits)
  Remove redundant commands in build.sh and build_doc.sh
  Add dependencies
  Move v2/api/fluid to fluid/api and Adjust doc build commands
  Plain LRN op throws an exception when is_test is set in backward pass
  fix compiler error of profiler_test in ONLY_CPU mode
  fix server shutdown
  Translation for Model Configuration (PaddlePaddle#9513)
  Fix data transform when inplace (PaddlePaddle#9450)
  refine parallel
  add FAQ (PaddlePaddle#9494)
  Fix dist error with lr decay layer (PaddlePaddle#9489)
  add prefetch_op (PaddlePaddle#9495)
  Fix some errors (PaddlePaddle#9403)
  hookup WITH_FLUID_ONLY in TeamCity build.sh (PaddlePaddle#9509)
  Fix the order of reads and write from buffered channel (PaddlePaddle#9423)
  change WITH_FLUID to WITH_FLUID_ONLY (PaddlePaddle#9427)
  fix block num
  Revert "make append activation in place by default (PaddlePaddle#9417)"
  Speed/sequence op1 (PaddlePaddle#9217)
  fix a compile error
  ...
fix #9099
For every minibatch, the sequence_pool and sequence_pool_grad operators get a ~8x speedup.
For example:
sequence_pool improved from 0.815583 to 0.119373
sequence_pool_grad improved from 0.579614 to 0.0830757
[before optimization: profiler output]
[after optimization: profiler output]
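
As a hedged sketch of how per-call GPU timings like the ones above can be collected, here is a minimal CUDA-event micro-benchmark; dummy_sequence_pool and run_sequence_pool are hypothetical stand-ins for launching the real operator kernels.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_sequence_pool() {}  // stand-in for the real kernels

static void run_sequence_pool() { dummy_sequence_pool<<<1, 1>>>(); }

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm up so one-time initialization does not pollute the measurement.
  run_sequence_pool();
  cudaDeviceSynchronize();

  const int iters = 1000;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) run_sequence_pool();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg time per call: %f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}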