[Optimization] Implicit gemm rewrite #2545
Conversation
It looks awesome! Feels great to reuse a lot of components. There are still some improvements that we can make in our "design paradigm", especially in how we pass around the config. But this is beyond the scope of this PR.
I have a few comments, but it would also be great for @louisfd to review.
```rust
Self::LhsLoader::advance_view(&mut lhs_loader, k_step);
Self::RhsLoader::advance_view(&mut rhs_loader, k_step);
}
```
Somehow, adding a sync_units after the for loop improved performance for the matmul. I think it makes sure all units in a plane are synced after the loop, which improves the execution of the following operations.
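For reference, here is a minimal sketch of the placement being discussed. The helpers are hypothetical no-op stand-ins so the snippet compiles on its own; only the position of the barrier relative to the k-loop is the point, and this is not the actual CubeCL kernel code.

```rust
// Hypothetical stand-in for CubeCL's plane-level barrier (sync_units); the real
// call is a GPU intrinsic, here it is a no-op so the sketch compiles.
fn sync_units() {}

// Stand-in for one k-iteration: load the next lhs/rhs tiles and run the tile matmul.
fn load_and_execute_step(_k: u32) {}

fn matmul_k_loop(num_steps: u32, k_step: u32) {
    let mut k = 0;
    for _ in 0..num_steps {
        load_and_execute_step(k);
        k += k_step; // advance the loader views, as in the loop body above
    }
    // Placement suggested here: one plane-wide sync after the loop, so all units
    // are aligned before the epilogue/writeback that follows.
    sync_units();
}

fn main() {
    matmul_k_loop(4, 16);
}
```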
I'll benchmark it
So the benchmark results are very odd. I tried 4 different ways of syncing; the one I used initially was overall the fastest for CUDA, but adding a sync before the load and after the loop was significantly faster for SPIR-V. Only syncing where absolutely needed was the slowest by far. Very odd behaviour. I'll stick with lots of syncs for now, because it's only 10-15% slower than the current implementation on CUDA (and there's margin of error), but 30% faster on SPIR-V.
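To make the comparison concrete, here is a sketch of the variant that was faster on SPIR-V, again with hypothetical no-op stand-ins rather than the real CubeCL API:

```rust
// No-op stand-ins so the sketch compiles; the real calls are CubeCL intrinsics.
fn sync_units() {}
fn load_and_execute_step(_k: u32) {}

// Variant that benchmarked faster on SPIR-V: a sync before each load, in
// addition to the sync kept after the loop (as in the previous sketch).
fn k_loop_sync_before_load_and_after(num_steps: u32, k_step: u32) {
    let mut k = 0;
    for _ in 0..num_steps {
        sync_units(); // extra barrier before loading the next tiles
        load_and_execute_step(k);
        k += k_step;
    }
    sync_units(); // barrier after the loop, as before
}

fn main() {
    k_loop_sync_before_load_and_after(4, 16);
}
```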
(Resolved review thread on crates/burn-jit/src/kernel/conv/conv2d/gemm/homogeneous/base.rs.)
LGTM, we can merge after the conflicts are resolved!
Done 👍
Pull Request Template
Checklist
The run-checks all script has been executed.
Related Issues/PRs
Requires tracel-ai/cubecl#309 to land first
Changes
Adds a brand new implicit GEMM implementation that uses the matmul primitives in cubecl. This is slower for small k sizes, but much faster for large ones, and more flexible. I'm keeping the current implementation because it's significantly faster for certain sizes and uses a significantly different loader strategy (loading only within each warp, which skips cross-warp syncs).

Adds a number of new convolution benchmarks to test performance with different sizes and characteristics.
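For context on the "k sizes" mentioned above: implicit GEMM treats the convolution as a matmul without materializing the im2col matrix, and the reduction dimension of that matmul is kh * kw * c_in. Below is a minimal sketch of the standard dimension mapping; the struct and function names are illustrative, not code from this PR.

```rust
// Standard implicit-GEMM dimension mapping for a 2D convolution (NHWC-style view).
// These names are illustrative only.
struct Conv2dShape {
    batch: usize,
    c_in: usize,
    c_out: usize,
    kh: usize,
    kw: usize,
    h_out: usize,
    w_out: usize,
}

/// Returns the (m, n, k) of the GEMM problem the convolution is mapped to.
fn implicit_gemm_dims(s: &Conv2dShape) -> (usize, usize, usize) {
    let m = s.batch * s.h_out * s.w_out; // one GEMM row per output position
    let n = s.c_out;                     // one GEMM column per output channel
    let k = s.kh * s.kw * s.c_in;        // reduction over the convolution window
    (m, n, k)
}

fn main() {
    // Example: a 3x3 convolution, 64 -> 128 channels, on a 56x56 output map.
    let s = Conv2dShape { batch: 8, c_in: 64, c_out: 128, kh: 3, kw: 3, h_out: 56, w_out: 56 };
    let (m, n, k) = implicit_gemm_dims(&s);
    println!("m={m} n={n} k={k}"); // m=25088 n=128 k=576
}
```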
Testing
All non-group tests pass, and CRAFT has the expected output with all layers using the new implicit GEMM. This tests many different and relatively large layers. Adds two new regression tests for bugs discovered during implementation.