[OpenGL] [perf] Utilize glDispatchComputeIndirect to prevent sync when dynamic ranges are used #2007
Conversation
Force-pushed from 00e7feb to 8654287
Codecov Report
@@            Coverage Diff             @@
##           master    #2007      +/-   ##
==========================================
+ Coverage   42.57%   43.51%   +0.94%
==========================================
  Files          45       45
  Lines        6474     6264     -210
  Branches     1109     1109
==========================================
- Hits         2756     2726      -30
+ Misses       3544     3365     -179
+ Partials      174      173       -1
Continue to review full report at Codecov.
LGTM, thank you!
This is already a sign that OpenGL officially recommends that we not keep the iteration body small, but instead scale the work group count dynamically.
Interesting. IIUC, the saving still comes mostly from removing the extra sync rather than from the matched size. So even if the work group count matched perfectly and were significantly larger than the static grid-stride-loop size, it would still be bounded by the hardware, which divides the work into batches.
Cool, thanks! LGTM.
Related issue = [data expunged]
[Click here for the format server]
What does this PR do?
In master we sync back the gtmp buffer to compute num_groups, and then invoke glDispatchCompute with it. In this PR we use glDispatchComputeIndirect to let OpenGL fetch num_groups directly from GPU memory. This avoids a forced sync when dynamic range-for is used, resulting in better performance. A rough sketch of the two dispatch paths is shown below.
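For context, here is a minimal illustrative sketch of the two paths, assuming the group count lives at the start of an SSBO as three GLuints; the function and buffer names are made up for illustration and this is not the actual Taichi OpenGL backend code:

```cpp
// Hypothetical sketch of the two dispatch paths; not the Taichi backend code.
#include <GL/glew.h>

// master: read num_groups back to the host, which forces a GPU->CPU sync.
void dispatch_with_readback(GLuint gtmp_buf, GLuint program) {
  GLuint num_groups = 0;
  glBindBuffer(GL_SHADER_STORAGE_BUFFER, gtmp_buf);
  // Blocks until every prior kernel that writes gtmp_buf has finished.
  glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, /*offset=*/0,
                     sizeof(GLuint), &num_groups);
  glUseProgram(program);
  glDispatchCompute(num_groups, 1, 1);
}

// this PR: the GL server reads {num_groups_x, num_groups_y, num_groups_z}
// straight from GPU memory, so no host readback is needed.
void dispatch_indirect(GLuint indirect_buf, GLuint program) {
  // Make earlier shader writes visible to the indirect command source.
  glMemoryBarrier(GL_COMMAND_BARRIER_BIT);
  glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirect_buf);
  glUseProgram(program);
  glDispatchComputeIndirect(/*offset=*/0);
}
```

The key difference is that the first path stalls at glGetBufferSubData, while the second keeps the group count entirely on the device.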
Why use glDispatchComputeIndirect instead of a grid-stride loop?
OpenGL compute shaders are quite different from CUDA and Metal.
What OpenGL calls a work group may be implemented differently from a CUDA grid, so a fixed work group count plus a grid-stride loop could harm performance.
Instead, OpenGL provides a dedicated API for a dynamic work group count, which CUDA doesn't have, IMO:
https://www.khronos.org/opengl/wiki/GLAPI/glDispatchComputeIndirect
This is already a sign that OpenGL officially recommends that we not keep the iteration body small, but instead scale the work group count dynamically.
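To make "scale the work group count dynamically" concrete, here is a hypothetical one-thread GLSL pass (wrapped in a C++ string literal) that writes the group count into the indirect-arguments buffer on the GPU; the binding points, buffer names, and THREADS_PER_GROUP value are assumptions for illustration, not the code Taichi actually generates:

```cpp
// Hypothetical GLSL for a one-thread "compute the group count" pass.
static const char *kGroupCountSrc = R"GLSL(
#version 430
layout(local_size_x = 1) in;
// Holds the dynamic range length computed by earlier kernels.
layout(std430, binding = 0) buffer Gtmp { uint range_len; };
// Laid out like the DispatchIndirectCommand consumed by
// glDispatchComputeIndirect: {num_groups_x, num_groups_y, num_groups_z}.
layout(std430, binding = 1) buffer IndirectArgs {
  uint num_groups_x;
  uint num_groups_y;
  uint num_groups_z;
};
void main() {
  const uint THREADS_PER_GROUP = 128u;  // assumed local size of the real kernel
  num_groups_x = (range_len + THREADS_PER_GROUP - 1u) / THREADS_PER_GROUP;
  num_groups_y = 1u;
  num_groups_z = 1u;
}
)GLSL";
```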
I'll merge this as soon as CI passes, to push our ultra-secret project forward.
Test status
All tests passed on my NVIDIA card except test_bit_shl (#1931), which is already a known issue on latest master.
Example
master: 1.0291837109998596
this PR: 0.1320957290008664