[Optimization] Implicit gemm rewrite #2545
Conversation
It looks awesome! Feels great to reuse a lot of components. There are still some improvements that we can make in our "design paradigm", especially in how we pass around the config. But this is beyond the scope of this PR.
I have a few comments, but it would also be great for @louisfd to review.
```rust
Self::LhsLoader::advance_view(&mut lhs_loader, k_step);
Self::RhsLoader::advance_view(&mut rhs_loader, k_step);
}
```
Somehow, adding a sync_units after the for loop improved performance for the matmul. I think it makes sure all units in a plane are synced after the loop, which improves the execution of the following operations.
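For reference, here is a minimal sketch of the placement being discussed. The helpers are hypothetical no-op stand-ins so the snippet compiles on its own; only the position of the barrier relative to the k-loop is the point, and this is not the actual CubeCL kernel code.

```rust
// Hypothetical stand-in for CubeCL's plane-level barrier (sync_units); the real
// call is a GPU intrinsic, here it is a no-op so the sketch compiles.
fn sync_units() {}

// Stand-in for one k-iteration: load the next lhs/rhs tiles and run the tile matmul.
fn load_and_execute_step(_k: u32) {}

fn matmul_k_loop(num_steps: u32, k_step: u32) {
    let mut k = 0;
    for _ in 0..num_steps {
        load_and_execute_step(k);
        k += k_step; // advance the loader views, as in the loop body above
    }
    // Placement suggested here: one plane-wide sync after the loop, so all units
    // are aligned before the epilogue/writeback that follows.
    sync_units();
}

fn main() {
    matmul_k_loop(4, 16);
}
```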
I'll benchmark it
So the benchmark results are very odd. I tried 4 different ways of syncing; the one I used initially was overall the fastest for CUDA, but adding a sync before the load and after the loop was significantly faster for SPIR-V. Only syncing where absolutely needed was the slowest by far. Very odd behaviour. I'll stick with lots of syncs for now, because it's only 10-15% slower than the current implementation on CUDA (and there's margin of error), but 30% faster on SPIR-V.
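To make the comparison concrete, here is a sketch of the variant that was faster on SPIR-V, again with hypothetical no-op stand-ins rather than the real CubeCL API:

```rust
// No-op stand-ins so the sketch compiles; the real calls are CubeCL intrinsics.
fn sync_units() {}
fn load_and_execute_step(_k: u32) {}

// Variant that benchmarked faster on SPIR-V: a sync before each load, in
// addition to the sync kept after the loop (as in the previous sketch).
fn k_loop_sync_before_load_and_after(num_steps: u32, k_step: u32) {
    let mut k = 0;
    for _ in 0..num_steps {
        sync_units(); // extra barrier before loading the next tiles
        load_and_execute_step(k);
        k += k_step;
    }
    sync_units(); // barrier after the loop, as before
}

fn main() {
    k_loop_sync_before_load_and_after(4, 16);
}
```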
(Resolved review thread on crates/burn-jit/src/kernel/conv/conv2d/gemm/homogeneous/base.rs.)
LGTM, we can merge after the conflicts are resolved!
Done 👍
Pull Request Template
Checklist
The run-checks all script has been executed.
Related Issues/PRs
Requires tracel-ai/cubecl#309 to land first
Changes
Adds a brand new implicit GEMM implementation that uses the matmul primitives in cubecl. This is slower for small k sizes, but much faster for large ones, and more flexible. I'm keeping the current implementation because it's significantly faster for certain sizes and uses a significantly different loader strategy (loading only within each warp, which skips cross-warp syncs).

Adds a number of new convolution benchmarks to test performance with different sizes and characteristics.
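For context on the "k sizes" mentioned above: implicit GEMM treats the convolution as a matmul without materializing the im2col matrix, and the reduction dimension of that matmul is kh * kw * c_in. Below is a minimal sketch of the standard dimension mapping; the struct and function names are illustrative, not code from this PR.

```rust
// Standard implicit-GEMM dimension mapping for a 2D convolution (NHWC-style view).
// These names are illustrative only.
struct Conv2dShape {
    batch: usize,
    c_in: usize,
    c_out: usize,
    kh: usize,
    kw: usize,
    h_out: usize,
    w_out: usize,
}

/// Returns the (m, n, k) of the GEMM problem the convolution is mapped to.
fn implicit_gemm_dims(s: &Conv2dShape) -> (usize, usize, usize) {
    let m = s.batch * s.h_out * s.w_out; // one GEMM row per output position
    let n = s.c_out;                     // one GEMM column per output channel
    let k = s.kh * s.kw * s.c_in;        // reduction over the convolution window
    (m, n, k)
}

fn main() {
    // Example: a 3x3 convolution, 64 -> 128 channels, on a 56x56 output map.
    let s = Conv2dShape { batch: 8, c_in: 64, c_out: 128, kh: 3, kw: 3, h_out: 56, w_out: 56 };
    let (m, n, k) = implicit_gemm_dims(&s);
    println!("m={m} n={n} k={k}"); // m=25088 n=128 k=576
}
```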
Testing
All non-group tests pass, and CRAFT has the expected output with all layers using the new implicit GEMM. This tests many different and relatively large layers. Adds two new regression tests for bugs discovered during implementation.