[OpenGL] [perf] Utilize glDispatchComputeIndirect to prevent sync when dynamic ranges are used #2007

Merged
merged 10 commits into from
Oct 30, 2020

Collaborator

@archibate archibate commented Oct 30, 2020

Related issue = [data removed]


What does this PR do?

On master, we sync the gtmp buffer back to the host to compute num_groups, then invoke glDispatchCompute with it.
This PR uses glDispatchComputeIndirect instead, so that OpenGL fetches num_groups directly from GPU memory.
This avoids a forced sync whenever a dynamic range-for is used, which results in better performance.
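
For readers unfamiliar with indirect dispatch, here is a minimal CPU-side sketch of what the GPU buffer has to contain. glDispatchComputeIndirect reads three consecutive 32-bit unsigned integers (num_groups_x, num_groups_y, num_groups_z) from the buffer bound to GL_DISPATCH_INDIRECT_BUFFER. GROUP_SIZE, num_groups_for and pack_indirect_command are hypothetical names used only for illustration, not code from this PR; in the PR the values are presumably written on the GPU rather than packed on the host.

```python
import struct

GROUP_SIZE = 128  # assumption: a hypothetical local work group size, not taken from the PR


def num_groups_for(range_len: int, group_size: int = GROUP_SIZE) -> int:
    """Number of work groups needed so every index in [0, range_len) gets one invocation."""
    return (range_len + group_size - 1) // group_size  # ceiling division


def pack_indirect_command(range_len: int, group_size: int = GROUP_SIZE) -> bytes:
    """Bytes of an indirect dispatch command: three uint32 values (x, y, z group counts)."""
    return struct.pack("=3I", num_groups_for(range_len, group_size), 1, 1)


if __name__ == "__main__":
    cmd = pack_indirect_command(1000)
    print(len(cmd), struct.unpack("=3I", cmd))
```

The point of the sketch is that the dispatch size lives in a buffer. When that buffer is filled on the GPU, the host never has to read the dynamic range bound back before launching the next kernel.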

Why use an indirect evaluator instead of a grid-stride loop?

OpenGL compute shaders (CS) are completely different from CUDA and Metal.
So what OpenGL calls a work group may be implemented differently from the CUDA grid.
Using a fixed work group size + grid-stride loop would harm performance.
Instead, OpenGL provides another API for dynamic work group sizes, which CUDA doesn't have, IMO:
https://www.khronos.org/opengl/wiki/GLAPI/glDispatchComputeIndirect
This is already a sign that OpenGL officially recommends not making the iteration body small, and scaling work groups dynamically instead.
I'll merge this as soon as CI passes, to push our ultra-secret project forward.
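
To make the two strategies concrete, here is a small pure-Python simulation of which indices each invocation would visit. It is only a sketch of the general techniques, not the code generated by Taichi; grid_stride_indices and dynamic_dispatch_indices are hypothetical names, and the sizes used are arbitrary.

```python
def grid_stride_indices(n, total_threads):
    """Fixed launch size: thread `tid` visits tid, tid + total_threads, ... below n."""
    return [list(range(tid, n, total_threads)) for tid in range(total_threads)]


def dynamic_dispatch_indices(n, group_size):
    """Launch size scaled to n: one index per invocation, group count = ceil(n / group_size)."""
    num_groups = (n + group_size - 1) // group_size
    visited = []
    for gid in range(num_groups):
        for lid in range(group_size):
            i = gid * group_size + lid
            if i < n:  # bounds check for the partially filled last group
                visited.append(i)
    return num_groups, visited


if __name__ == "__main__":
    print(grid_stride_indices(10, 4))       # each thread loops over several indices
    print(dynamic_dispatch_indices(10, 4))  # group count grows with n instead
```

Both cover every index exactly once. The difference is that the grid-stride version needs no knowledge of n at launch time, while the dynamic version needs the group count, which is what the indirect dispatch supplies from GPU memory.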

Test status

All tests except test_bit_shl (#1931) passed on my NVIDIA card; that failure is already a known issue on the latest master.

Example

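import taichi as ti
import timeit

ti.init(ti.opengl)

N = 2**14

x = ti.field(int, N)

@ti.kernel
def func():
    # Static range: the loop bound is known from the field shape.
    for i in x:
        x[i] = i
    # Dynamic range: the loop bound x[2] is only known on the GPU at run time.
    for i in range(x[2]):
        x[i] = i

stmt = lambda: func()
# The second argument is timeit's setup; it runs func() once (compile/warm-up) before timing.
print(timeit.timeit(stmt, stmt, number=10000))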

master: 1.0291837109998596
this PR: 0.1320957290008664
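
For a quick read of these two numbers, the arithmetic below works out the speedup and the per-call cost. It assumes both figures are the total seconds printed by timeit for number=10000 calls, as in the example above; the variable names are only illustrative.

```python
runs = 10000                      # `number` passed to timeit in the example
master_s = 1.0291837109998596     # total seconds on master
pr_s = 0.1320957290008664         # total seconds with this PR

speedup = master_s / pr_s
per_call_master_us = master_s / runs * 1e6
per_call_pr_us = pr_s / runs * 1e6

print(f"speedup: {speedup:.2f}x")
print(f"per call: {per_call_master_us:.1f} us -> {per_call_pr_us:.1f} us")
```

That is roughly a 7.8x speedup, or on the order of 90 microseconds saved per kernel call on this particular machine.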

@archibate archibate changed the title Indirect dyn for [OpenGL] [perf] Prevent sync when dynamic range used by glDispatchComputeIndirect Oct 30, 2020
@archibate archibate changed the title [OpenGL] [perf] Prevent sync when dynamic range used by glDispatchComputeIndirect [OpenGL] [perf] Prevent sync when dynamic range used by utilizing glDispatchComputeIndirect Oct 30, 2020
@archibate archibate requested review from taichi-gardener and removed request for taichi-gardener October 30, 2020 09:38
codecov bot commented Oct 30, 2020

Codecov Report

Merging #2007 into master will increase coverage by 0.94%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2007      +/-   ##
==========================================
+ Coverage   42.57%   43.51%   +0.94%     
==========================================
  Files          45       45              
  Lines        6474     6264     -210     
  Branches     1109     1109              
==========================================
- Hits         2756     2726      -30     
+ Misses       3544     3365     -179     
+ Partials      174      173       -1     
Impacted Files Coverage Δ
python/taichi/lang/ast_checker.py 70.58% <0.00%> (-1.64%) ⬇️
python/taichi/testing.py 75.00% <0.00%> (-0.72%) ⬇️
python/taichi/lang/linalg.py 89.33% <0.00%> (-0.67%) ⬇️
python/taichi/lang/__init__.py 40.57% <0.00%> (-0.33%) ⬇️
python/taichi/misc/util.py 20.54% <0.00%> (-0.21%) ⬇️
python/taichi/misc/task.py 0.00% <0.00%> (ø)
python/taichi/tools/patterns.py 0.00% <0.00%> (ø)
python/taichi/lang/kernel.py 73.00% <0.00%> (+0.16%) ⬆️
python/taichi/misc/gui.py 8.89% <0.00%> (+0.18%) ⬆️
python/taichi/main.py 22.95% <0.00%> (+0.35%) ⬆️
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d5a00ce...d3ba272.

Member

@k-ye k-ye left a comment

LGTM, thank you!

> This is already a sign that OpenGL officially recommends not making the iteration body small, and scaling work groups dynamically instead.

Interesting. IIUC, the saving still comes mostly from removing the extra sync rather than from the matched size. Even if the work group size matches perfectly and is significantly larger than the static grid-stride-loop size, it's still bounded by the hardware, which divides the work into batches.

Member

@yuanming-hu yuanming-hu left a comment

Cool, thanks! LGTM.

@yuanming-hu yuanming-hu changed the title [OpenGL] [perf] Prevent sync when dynamic range used by utilizing glDispatchComputeIndirect [OpenGL] [perf] Utilize glDispatchComputeIndirect to prevent sync when dynamic ranges are used Oct 30, 2020
@yuanming-hu yuanming-hu merged commit dea88d0 into taichi-dev:master Oct 30, 2020
@yuanming-hu yuanming-hu mentioned this pull request Oct 31, 2020
4 participants