Commit
Reorder load and scaling code to allow latency hiding for block-wise scaled GEMMs (#2600)

Summary:
The compiler does not always reorder instructions well enough to hide latency, so I'm tweaking the kernel code here instead.

Previously, in the block-wise scaled GEMM kernel, the scaling logic followed `tl.load`, and the compiler was not able to move that logic ahead of the loads once the loads were pipelined. As a result, the scaling logic was blocked by the load barriers, which is unnecessary because the scaling logic and the loads are independent. Since the barrier is only needed by the `dot` operation, I'm moving the scaling logic before the loads.

{F1640448911}

While we should fix the compiler to be more robust, I'm making a source change as a workaround.

Differential Revision: D57473133
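The reordering described above can be illustrated with a plain-Python model of the K-loop. This is only a sketch under stated assumptions, not the kernel from this commit: `load_tile` and `dot` are hypothetical stand-ins for `tl.load` and `tl.dot`, the per-K-block scalar scales `a_scale` and `b_scale` are an assumed layout, and plain Python has no load barriers, so the two variants differ only in statement order. The point is that the scale computation reads none of the loaded tiles, so it can legally sit ahead of the loads.

```python
# Plain-Python model of one output tile of a block-wise scaled GEMM.
# All names here are hypothetical illustrations, not the FBGEMM kernel code.

def load_tile(matrix, k0, block_k):
    """Stand-in for a pipelined tl.load: take the K-slice [k0, k0 + block_k) of each row."""
    return [row[k0:k0 + block_k] for row in matrix]


def dot(a_tile, b_tile):
    """Stand-in for tl.dot. a_tile is M x BK; b_tile is N x BK (B stored transposed)."""
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b_tile]
            for a_row in a_tile]


def gemm_scale_after_load(a, b_t, a_scale, b_scale, block_k):
    """Old ordering: the scale is computed after the loads are issued."""
    acc = [[0.0] * len(b_t) for _ in a]
    for kb in range(len(a[0]) // block_k):
        a_tile = load_tile(a, kb * block_k, block_k)
        b_tile = load_tile(b_t, kb * block_k, block_k)
        # On a GPU this scalar work would sit behind the load barrier,
        # even though it does not read a_tile or b_tile.
        scale = a_scale[kb] * b_scale[kb]
        partial = dot(a_tile, b_tile)
        for i, row in enumerate(partial):
            for j, value in enumerate(row):
                acc[i][j] += scale * value
    return acc


def gemm_scale_before_load(a, b_t, a_scale, b_scale, block_k):
    """New ordering: the independent scale computation comes first."""
    acc = [[0.0] * len(b_t) for _ in a]
    for kb in range(len(a[0]) // block_k):
        # Independent of the tiles, so it can run while the loads are in flight.
        scale = a_scale[kb] * b_scale[kb]
        a_tile = load_tile(a, kb * block_k, block_k)
        b_tile = load_tile(b_t, kb * block_k, block_k)
        # Only the dot actually needs the loaded data.
        partial = dot(a_tile, b_tile)
        for i, row in enumerate(partial):
            for j, value in enumerate(row):
                acc[i][j] += scale * value
    return acc


# Small example: M = 2, N = 2, K = 4, two K-blocks of size 2.
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B_T = [[1, 0, 1, 0], [0, 1, 0, 1]]
A_SCALE = [0.5, 2.0]
B_SCALE = [2.0, 0.25]

old = gemm_scale_after_load(A, B_T, A_SCALE, B_SCALE, block_k=2)
new = gemm_scale_before_load(A, B_T, A_SCALE, B_SCALE, block_k=2)
```

Both orderings produce the same accumulator, which is why the move is safe as a source-level workaround; any latency benefit depends on how the real compiler schedules the pipelined loads and is not modeled here.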