[Performance] Improve segment_matmul by reducing launching overheads #213
CUTLASS grouped GEMM requires copying matrix pointers and layouts to device memory, which incurs significant "launch" overhead: concretely, 7 pageable H2D copies per call. This PR assembles the grouped GEMM arguments manually in a pinned CPU buffer and copies them to device memory in a single transfer to reduce this overhead.
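The idea can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name, buffer layout, and argument lists are assumptions; the real implementation lives in pyg-lib's `segment_matmul` kernel.

```cuda
// Hypothetical sketch: instead of issuing one pageable H2D copy per
// argument array (problem sizes, A/B/C pointers, leading dimensions),
// pack everything into a single pinned host buffer and upload it with
// one cudaMemcpyAsync on the current PyTorch CUDA stream.
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <c10/cuda/CUDAException.h>
#include <c10/cuda/CUDAStream.h>
#include <cutlass/gemm_coord.h>

void upload_grouped_gemm_args(
    const std::vector<cutlass::gemm::GemmCoord>& sizes,  // per-group (m, n, k)
    const std::vector<float*>& ptr_a,
    const std::vector<float*>& ptr_b,
    const std::vector<float*>& ptr_c,
    const std::vector<int64_t>& lda,
    const std::vector<int64_t>& ldb,
    const std::vector<int64_t>& ldc,
    void* device_buf) {  // preallocated device buffer of sufficient size
  const size_t n        = sizes.size();
  const size_t sz_coord = n * sizeof(cutlass::gemm::GemmCoord);
  const size_t sz_ptr   = n * sizeof(float*);
  const size_t sz_ld    = n * sizeof(int64_t);
  const size_t total    = sz_coord + 3 * sz_ptr + 3 * sz_ld;

  // Pinned (page-locked) staging memory: async copies from it go straight
  // to the GPU without an extra driver-side staging copy.
  char* host_buf = nullptr;
  C10_CUDA_CHECK(cudaMallocHost(reinterpret_cast<void**>(&host_buf), total));

  // Lay out all seven argument arrays back-to-back in the staging buffer.
  char* p = host_buf;
  std::memcpy(p, sizes.data(), sz_coord); p += sz_coord;
  std::memcpy(p, ptr_a.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, ptr_b.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, ptr_c.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, lda.data(),   sz_ld);    p += sz_ld;
  std::memcpy(p, ldb.data(),   sz_ld);    p += sz_ld;
  std::memcpy(p, ldc.data(),   sz_ld);

  // One async H2D transfer on the stream PyTorch is currently using,
  // replacing the seven separate pageable copies.
  auto stream = at::cuda::getCurrentCUDAStream();
  C10_CUDA_CHECK(cudaMemcpyAsync(device_buf, host_buf, total,
                                 cudaMemcpyHostToDevice, stream));

  // NB: host_buf must outlive the async copy; a real implementation would
  // cache and reuse the pinned buffer rather than allocate per call.
}
```

The grouped-GEMM `Arguments` struct can then point at fixed offsets inside `device_buf` (problem sizes at offset 0, `ptr_A` at `sz_coord`, and so on), since the relative layout is known on the host.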
Other changes include setting the CUDA stream for the grouped GEMM and adding proper `C10_CUDA_CHECK` and `C10_CUDA_KERNEL_LAUNCH_CHECK` calls.

Performance
Benchmarked with the following script, this PR reduces the op time from 0.29 ms to 0.05 ms on my desktop (RTX 3090).
cc @rusty1s @puririshi98