Speed/sequence op1 #9217
Conversation
22f414c to 2524003
Please add benchmark details between these two versions.
Is the 8x speedup for the GPU, while CPU performance stays the same?
if (i == index[tid]) {
  in_grad[item_dim * i + tid] = out_grad[tid];
} else {
  in_grad[item_dim * i + tid] = static_cast<T>(0);
}
If we first set everything to 0, and then, based on the if condition, only do in_grad[item_dim * i + tid] = out_grad[tid], would that be even faster? LastPool and FirstPool are similar.
No, because zero-filling everything first would require an extra CUDA kernel call. Reducing the number of kernel calls greatly speeds up execution.
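
To illustrate the point, here is a minimal CUDA sketch (toy names and shapes, not the PR's actual code) contrasting the two strategies: the fused kernel writes either the gradient or zero in a single launch, while the zero-fill-first variant costs an extra memset/kernel launch every call.

#include <cuda_runtime.h>

// Fused: one launch writes out_grad at the max position and zero elsewhere.
template <typename T>
__global__ void MaxGradFused(const T* out_grad, const int* index, T* in_grad,
                             int seq_len, int item_dim) {
  for (int tid = threadIdx.x; tid < item_dim; tid += blockDim.x) {
    for (int i = 0; i < seq_len; ++i) {
      in_grad[item_dim * i + tid] =
          (i == index[tid]) ? out_grad[tid] : static_cast<T>(0);
    }
  }
}

// Alternative: scatter only, which requires in_grad to be zeroed beforehand
// (an extra memset/kernel launch before this one).
template <typename T>
__global__ void MaxGradScatter(const T* out_grad, const int* index, T* in_grad,
                               int item_dim) {
  for (int tid = threadIdx.x; tid < item_dim; tid += blockDim.x) {
    in_grad[item_dim * index[tid] + tid] = out_grad[tid];
  }
}

int main() {
  const int seq_len = 4, item_dim = 8;
  float *out_grad, *in_grad;
  int *index;
  cudaMalloc(&out_grad, item_dim * sizeof(float));
  cudaMalloc(&in_grad, seq_len * item_dim * sizeof(float));
  cudaMalloc(&index, item_dim * sizeof(int));
  cudaMemset(out_grad, 0, item_dim * sizeof(float));
  cudaMemset(index, 0, item_dim * sizeof(int));  // all max positions = row 0

  // Strategy 1: one launch, no prior zero-fill.
  MaxGradFused<float><<<1, 128>>>(out_grad, index, in_grad, seq_len, item_dim);

  // Strategy 2: two launches (zero-fill, then scatter).
  cudaMemset(in_grad, 0, seq_len * item_dim * sizeof(float));
  MaxGradScatter<float><<<1, 128>>>(out_grad, index, in_grad, item_dim);

  cudaDeviceSynchronize();
  cudaFree(out_grad);
  cudaFree(in_grad);
  cudaFree(index);
  return 0;
}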
# return x, lod, out

# def compute(self, x, lod, out):
#     self.attrs = {'pooltype': "FIRST"}
The unused code on lines 32-42 can be removed.
done.
self.attrs = {'pooltype': "SUM"}
for i in range(4):
    sub_x = x[lod[0][i]:lod[0][i + 1], :]
    out[i] = sub_x.sum(axis=0)
Does test_seq_pool.py just reorder some of the unit tests?
Yes.
T, MaxPoolGradFunctor<T>><<<grid, threads, 0, context.stream()>>>(
    MaxPoolGradFunctor<T>(), out_grad.data<T>(),
    lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
    in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());
Could lod.CUDAData(context.GetPlace()) and in_grad->mutable_data<T>(context.GetPlace()) etc. be assigned to temporary variables before the if condition? That would make the code on lines 303-338 shorter. Lines 141-178 are similar.
I don't think that is a good way to save lines of code. One rule in the Google style guide is to keep a declaration as close as possible to the place where it is used. If we forward-declare these values as temporary variables, readers have to look back for them when reading the following code.
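
For illustration, a toy sketch of the two styles under discussion (hypothetical ToyGradKernel, not Paddle's real code): Option A keeps each argument expression at the launch site, while Option B hoists temporaries before the branch, shortening repeated launch statements at the cost of moving definitions away from their use.

#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
__global__ void ToyGradKernel(const T* out_grad, const size_t* lod,
                              size_t lod_size, T* in_grad) {}

// Option A (as in the PR): arguments spelled out inline at each launch site.
void LaunchInline(const float* out_grad, const size_t* lod, size_t lod_size,
                  float* in_grad, cudaStream_t stream) {
  ToyGradKernel<float><<<1, 128, 0, stream>>>(out_grad, lod, lod_size,
                                              in_grad);
}

// Option B (reviewer's suggestion): hoist common expressions into
// temporaries once, before the if/else ladder that picks the functor.
void LaunchWithTemps(const float* out_grad, const size_t* lod, size_t lod_size,
                     float* in_grad, cudaStream_t stream) {
  const size_t* lod_data = lod;   // stands in for lod.CUDAData(place)
  float* in_grad_data = in_grad;  // stands in for mutable_data<T>(place)
  ToyGradKernel<float><<<1, 128, 0, stream>>>(out_grad, lod_data, lod_size,
                                              in_grad_data);
}

int main() {
  LaunchInline(nullptr, nullptr, 0, nullptr, 0);
  LaunchWithTemps(nullptr, nullptr, 0, nullptr, 0);
  cudaDeviceSynchronize();
  return 0;
}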
LGTM. @qingqing01, do you have any suggestions?
* commit '33b8b3d22034423455a493712955e419aac7b19b': (251 commits)
  Remove redundant commands in build.sh and build_doc.sh
  Add dependencies
  Move v2/api/fluid to fluid/api and Adjust doc build commands
  Plain LRN op throws an exception when is_test is set in backward pass
  fix compiler error of profiler_test in ONLY_CPU mode
  fix server shutdown
  Translation for Model Configuration (PaddlePaddle#9513)
  Fix data transform when inplace (PaddlePaddle#9450)
  refine parallel
  add FAQ (PaddlePaddle#9494)
  Fix dist error with lr decay layer (PaddlePaddle#9489)
  add prefetch_op (PaddlePaddle#9495)
  Fix some errors (PaddlePaddle#9403)
  hookup WITH_FLUID_ONLY in TeamCity build.sh (PaddlePaddle#9509)
  Fix the order of reads and write from buffered channel (PaddlePaddle#9423)
  change WITH_FLUID to WITH_FLUID_ONLY (PaddlePaddle#9427)
  fix block num
  Revert "make append activation in place by default (PaddlePaddle#9417)"
  Speed/sequence op1 (PaddlePaddle#9217)
  fix a compile error
  ...
fix #9099
For every minibatch, the sequence_pool and sequence_pool_grad operators get a ~8x speedup.
For example:
sequence_pool improved from 0.815583 to 0.119373
sequence_pool_grad improved from 0.579614 to 0.0830757
[before optimization: profiler output]
[after optimization: profiler output]
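
As a hedged sketch of how per-call GPU timings like the ones above can be collected, here is a minimal CUDA-event micro-benchmark; dummy_sequence_pool and run_sequence_pool are hypothetical stand-ins for launching the real operator kernels.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_sequence_pool() {}  // stand-in for the real kernels

static void run_sequence_pool() { dummy_sequence_pool<<<1, 1>>>(); }

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm up so one-time initialization does not pollute the measurement.
  run_sequence_pool();
  cudaDeviceSynchronize();

  const int iters = 1000;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) run_sequence_pool();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg time per call: %f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}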