[OpenGL] [perf] Utilize glDispatchComputeIndirect to prevent sync when dynamic ranges are used #2007
Conversation
Force-pushed from 00e7feb to 8654287
Codecov Report
@@            Coverage Diff             @@
##           master    #2007      +/-   ##
==========================================
+ Coverage   42.57%   43.51%   +0.94%
==========================================
  Files          45       45
  Lines        6474     6264     -210
  Branches     1109     1109
==========================================
- Hits         2756     2726      -30
+ Misses       3544     3365     -179
+ Partials      174      173       -1
Continue to review full report at Codecov.
LGTM, thank you!
This is already a sign that OpenGL officially recommends that we not keep the iteration body small, but instead scale the work group count dynamically.
Interesting. IIUC, the saving still comes mostly from removing the extra sync rather than from the matched size. So even if the work group count matched perfectly and were significantly larger than the static grid-stride-loop size, it would still be bounded by the hardware, which divides the work into batches.
Cool, thanks! LGTM.
Related issue = [data expunged]
[Click here for the format server]
What does this PR do?
In master we sync back the gtmp buffer to compute num_groups, and then invoke glDispatchCompute with it. In this PR we use glDispatchComputeIndirect to let OpenGL fetch num_groups directly from GPU memory. This avoids a forced sync when dynamic range-for is used, resulting in better performance. A rough sketch of the two dispatch paths is shown below.
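For context, here is a minimal illustrative sketch of the two paths, assuming the group count lives at the start of an SSBO as three GLuints; the function and buffer names are made up for illustration and this is not the actual Taichi OpenGL backend code:

```cpp
// Hypothetical sketch of the two dispatch paths; not the Taichi backend code.
#include <GL/glew.h>

// master: read num_groups back to the host, which forces a GPU->CPU sync.
void dispatch_with_readback(GLuint gtmp_buf, GLuint program) {
  GLuint num_groups = 0;
  glBindBuffer(GL_SHADER_STORAGE_BUFFER, gtmp_buf);
  // Blocks until every prior kernel that writes gtmp_buf has finished.
  glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, /*offset=*/0,
                     sizeof(GLuint), &num_groups);
  glUseProgram(program);
  glDispatchCompute(num_groups, 1, 1);
}

// this PR: the GL server reads {num_groups_x, num_groups_y, num_groups_z}
// straight from GPU memory, so no host readback is needed.
void dispatch_indirect(GLuint indirect_buf, GLuint program) {
  // Make earlier shader writes visible to the indirect command source.
  glMemoryBarrier(GL_COMMAND_BARRIER_BIT);
  glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirect_buf);
  glUseProgram(program);
  glDispatchComputeIndirect(/*offset=*/0);
}
```

The key difference is that the first path stalls at glGetBufferSubData, while the second keeps the group count entirely on the device.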
Why use glDispatchComputeIndirect instead of a grid-stride loop?
OpenGL compute shaders are quite different from CUDA and Metal.
What OpenGL calls a work group may be implemented differently from a CUDA grid, so a fixed work group count plus a grid-stride loop could harm performance.
Instead, OpenGL provides a dedicated API for a dynamic work group count, which CUDA doesn't have, IMO:
https://www.khronos.org/opengl/wiki/GLAPI/glDispatchComputeIndirect
This is already a sign that OpenGL officially recommends that we not keep the iteration body small, but instead scale the work group count dynamically.
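To make "scale the work group count dynamically" concrete, here is a hypothetical one-thread GLSL pass (wrapped in a C++ string literal) that writes the group count into the indirect-arguments buffer on the GPU; the binding points, buffer names, and THREADS_PER_GROUP value are assumptions for illustration, not the code Taichi actually generates:

```cpp
// Hypothetical GLSL for a one-thread "compute the group count" pass.
static const char *kGroupCountSrc = R"GLSL(
#version 430
layout(local_size_x = 1) in;
// Holds the dynamic range length computed by earlier kernels.
layout(std430, binding = 0) buffer Gtmp { uint range_len; };
// Laid out like the DispatchIndirectCommand consumed by
// glDispatchComputeIndirect: {num_groups_x, num_groups_y, num_groups_z}.
layout(std430, binding = 1) buffer IndirectArgs {
  uint num_groups_x;
  uint num_groups_y;
  uint num_groups_z;
};
void main() {
  const uint THREADS_PER_GROUP = 128u;  // assumed local size of the real kernel
  num_groups_x = (range_len + THREADS_PER_GROUP - 1u) / THREADS_PER_GROUP;
  num_groups_y = 1u;
  num_groups_z = 1u;
}
)GLSL";
```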
I'll merge this as soon as CI passes, to push our ultra-secret project forward.
Test status
All tests passed on my NVIDIA card except test_bit_shl (#1931), which is already a known issue on latest master.
Example
master: 1.0291837109998596
this PR: 0.1320957290008664