
Speed/sequence op1 #9217

Merged 10 commits into PaddlePaddle:develop on Mar 29, 2018

Conversation

@dzhwinter dzhwinter commented Mar 19, 2018

fix #9099
In every minibatch, the sequence_pool and sequence_pool_grad operators run roughly 8x faster.
For example, the average time of sequence_pool improved from 0.815583 to 0.119373, and sequence_pool_grad improved from 0.579614 to 0.0830757.

Before optimization:

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sum                              72772       6579.74     0.013088    3.43046     0.0904158
thread0::mul_grad                         25928       4135.29     0.049952    4.4888      0.159491
thread0::sequence_softmax_grad            2344        3067.88     0.05872     95.0493     1.30882
thread0::sequence_softmax                 2344        2617.72     0.04976     17.122      1.11677
thread0::mul                              25928       2260.75     0.038624    8.36944     0.0871933
thread0::sequence_pool                    2380        1941.09     0.045984    89.9217     0.815583
thread0::sequence_expand_grad             2344        1730.34     0.05296     8.10054     0.738201
thread0::sequence_pool_grad               2380        1379.48     0.03824     137.793     0.579614

After optimization:

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::sigmoid_grad                     7035        304.461     0.024032    89.5161     0.0432781
thread0::sequence_pool                    2381        284.226     0.053984    56.5243     0.119373
thread0::sigmoid                          7035        214.732     0.02448     3.57146     0.0305233
thread0::tanh                             7071        214.441     0.023712    1.59734     0.0303269
thread0::tanh_grad                        7071        206.762     0.023328    0.1432      0.0292408
thread0::sequence_pool_grad               2381        197.803     0.057408    0.934464    0.0830757
thread0::adam                             936         187.986     0.024384    1.28653     0.20084

QiJune commented Mar 26, 2018

Please add benchmark details comparing these two versions.

@luotao1 luotao1 left a comment


Is the 8x speedup for the GPU, with CPU performance staying unchanged?

// Every element of in_grad is written exactly once in this pass: the pooled
// gradient where the recorded index matches, zero everywhere else.
if (i == index[tid]) {
  in_grad[item_dim * i + tid] = out_grad[tid];
} else {
  in_grad[item_dim * i + tid] = static_cast<T>(0);
Contributor

Would it be even faster to first set everything to 0 and then, based on the if condition, only assign in_grad[item_dim * i + tid] = out_grad[tid]? LastPool and FirstPool are similar.

Contributor Author

No. Zero-filling everything first would require one extra CUDA kernel call, and reducing the number of kernel calls speeds up execution significantly.
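
A minimal sketch of the trade-off being discussed (the kernel names, the float element type, the lod layout, and the launch configuration below are illustrative assumptions, not the actual PaddlePaddle code): the one-pass kernel writes every element of in_grad exactly once, either the pooled gradient or zero, while the zero-fill-first variant needs a cudaMemsetAsync plus a second scatter kernel, i.e. one extra launch per call.

#include <cuda_runtime.h>

// One pass: every element of in_grad is written exactly once (pooled gradient
// for the last row of each sequence, zero everywhere else), so no separate
// zero-filling kernel is required.
__global__ void LastPoolGradOnePass(const float* out_grad, const size_t* lod,
                                    int num_seq, int item_dim, float* in_grad) {
  int seq = blockIdx.x;  // one block per sequence
  if (seq >= num_seq) return;
  size_t begin = lod[seq], end = lod[seq + 1];
  for (size_t row = begin; row < end; ++row) {
    for (int d = threadIdx.x; d < item_dim; d += blockDim.x) {
      in_grad[row * item_dim + d] =
          (row == end - 1) ? out_grad[seq * item_dim + d] : 0.f;
    }
  }
}

// The alternative suggested above: zero-fill first, then scatter the gradient
// into the last row of each sequence. This costs two launches per call.
__global__ void ScatterIntoLastRow(const float* out_grad, const size_t* lod,
                                   int num_seq, int item_dim, float* in_grad) {
  int seq = blockIdx.x;
  if (seq >= num_seq) return;
  size_t last = lod[seq + 1] - 1;
  for (int d = threadIdx.x; d < item_dim; d += blockDim.x) {
    in_grad[last * item_dim + d] = out_grad[seq * item_dim + d];
  }
}

void LastPoolGradTwoPass(const float* out_grad, const size_t* lod, int num_seq,
                         int item_dim, size_t in_rows, float* in_grad,
                         cudaStream_t stream) {
  cudaMemsetAsync(in_grad, 0, in_rows * item_dim * sizeof(float), stream);
  ScatterIntoLastRow<<<num_seq, 256, 0, stream>>>(out_grad, lod, num_seq,
                                                  item_dim, in_grad);
}

Under these assumptions the single-pass version issues one launch and touches each element once, while the two-pass version issues two launches and rewrites the last rows; the reduced launch count is the saving the author describes.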

# return x, lod, out

# def compute(self, x, lod, out):
# self.attrs = {'pooltype': "FIRST"}
Contributor

The unused code in lines 32-42 can be deleted.

Contributor Author

done.

self.attrs = {'pooltype': "SUM"}
for i in range(4):
    # sum-pool the i-th LoD segment x[lod[0][i]:lod[0][i + 1]] along the time axis
    sub_x = x[lod[0][i]:lod[0][i + 1], :]
    out[i] = sub_x.sum(axis=0)
Contributor

Does test_seq_pool.py only change the order of some of the unit tests?

Contributor Author

Yes.

T, MaxPoolGradFunctor<T>><<<grid, threads, 0, context.stream()>>>(
MaxPoolGradFunctor<T>(), out_grad.data<T>(),
lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());
Contributor

Could expressions such as lod.CUDAData(context.GetPlace()) and in_grad->mutable_data<T>(context.GetPlace()) be assigned to temporary variables declared before the if condition? That would make the code in lines 303-338 shorter. Lines 141-178 are similar.

Contributor Author

I don't think that is a good way to save lines of code. One rule in the Google style guide is to keep a declaration as close as possible to where it is used. If we forward-declare a temporary variable, the reader has to go back and find it while reading the code that follows.
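
For illustration of the style point (a sketch only; the kernel name sequence_pool_grad_kernel, the pooltype check, and the surrounding control flow are assumptions mirroring the snippet above, not the exact code in this PR):

// (a) Reviewer's suggestion: hoist the repeated expressions into temporaries
// declared before the branch, shortening each launch statement.
auto* lod_data = lod.CUDAData(context.GetPlace());
auto* in_grad_data = in_grad->mutable_data<T>(context.GetPlace());
if (pooltype == "MAX") {
  sequence_pool_grad_kernel<T, MaxPoolGradFunctor<T>>
      <<<grid, threads, 0, context.stream()>>>(
          MaxPoolGradFunctor<T>(), out_grad.data<T>(), lod_data, lod.size(),
          item_dim, in_grad_data, index->data<int>());
}

// (b) Author's preference: keep each expression at its point of use, so the
// reader sees where every kernel argument comes from at the call site.
if (pooltype == "MAX") {
  sequence_pool_grad_kernel<T, MaxPoolGradFunctor<T>>
      <<<grid, threads, 0, context.stream()>>>(
          MaxPoolGradFunctor<T>(), out_grad.data<T>(),
          lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
          in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());
}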


luotao1 commented Mar 29, 2018

LGTM. @qingqing01, do you have any suggestions?

@dzhwinter dzhwinter merged commit 8425c2c into PaddlePaddle:develop Mar 29, 2018
mikeseven added a commit to mikeseven/Paddle that referenced this pull request Mar 30, 2018
blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/Paddle that referenced this pull request Nov 22, 2024

Successfully merging this pull request may close these issues.

[Speed] sequence_pool op need to be enhanced
3 participants