[TL] Adapt TL Hardware-aware Search Space with Roller #207

LeiWang1999 · 2024-10-01T16:16:44Z

We currently provide a way for users to define a search space for a given TL kernel, but it is still hard and complex to define a precise and efficient search space that covers dynamic shapes and different operators for a given OP and backend.

def get_configs_sm80(self):
    num_stages = 2
    configs = [
        {
            'block_M': 128,
            'block_N': 256,
            'block_K': 32,
            'threads': 128
        },
        {
            'block_M': 256,
            'block_N': 128,
            'block_K': 32,
            'threads': 128
        },
        {
            'block_M': 128,
            'block_N': 128,
            'block_K': 32,
            'threads': 128
        },
    ]
    configs = [{**c, 'num_stages': num_stages} for c in configs]
    return configs

This pull request links the TL search space to our Roller search space.
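As a rough illustration of why a hardware-aware generator can replace a hand-written list, here is a minimal sketch. It is not the BitBLAS Roller API: `generate_configs`, the 48 KiB shared-memory limit, the fp16 operand size, and the scoring rule are all assumptions made for this example.

```python
from itertools import product


def generate_configs(smem_limit_bytes=48 * 1024, dtype_bytes=2, num_stages=2, topk=3):
    """Hypothetical stand-in for a hardware-aware generator (not the Roller API)."""
    candidates = []
    for block_M, block_N, block_K in product((64, 128, 256), (64, 128, 256), (32, 64)):
        # Shared memory needed to stage one A tile and one B tile per pipeline stage.
        smem = (block_M * block_K + block_K * block_N) * dtype_bytes * num_stages
        if smem > smem_limit_bytes:
            continue  # tile does not fit in the assumed shared-memory budget
        # Assumed ranking rule: output elements produced per element loaded.
        score = (block_M * block_N) / (block_M + block_N)
        candidates.append((score, {
            'block_M': block_M,
            'block_N': block_N,
            'block_K': block_K,
            'num_stages': num_stages,
        }))
    # Sort on the score only, so ties never compare the dicts themselves.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [cfg for _, cfg in candidates[:topk]]


configs = generate_configs()
```

Under these assumed limits the three top-ranked tiles have the same block shapes as the hand-written `get_configs_sm80` list, but the list now follows from the hardware budget instead of being maintained by hand.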

However, block-level TL cannot fully utilize the schedule information. For example, TL provides only three dedicated warp scheduling policies:

class GemmWarpPolicy:
    Square = 0
    FullRow = 1
    FullCol = 2
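To make the limitation concrete, the sketch below maps each policy to a warp layout and compares the result with every possible layout. The meaning assigned to each policy (`FullRow` places all warps along M, `FullCol` along N, `Square` balances the per-warp tile) is an assumption for illustration, and `partition_warps` and `all_layouts` are hypothetical helpers, not TL functions.

```python
class GemmWarpPolicy:
    Square = 0
    FullRow = 1
    FullCol = 2


def partition_warps(policy, num_warps, block_M, block_N):
    """Return (warps_m, warps_n) for a policy; hypothetical helper, not a TL API."""
    if policy == GemmWarpPolicy.FullRow:
        return num_warps, 1  # assumed: all warps split the M dimension
    if policy == GemmWarpPolicy.FullCol:
        return 1, num_warps  # assumed: all warps split the N dimension
    if policy != GemmWarpPolicy.Square:
        raise ValueError(f"unknown policy: {policy}")
    best = None
    for warps_m in range(1, num_warps + 1):
        if num_warps % warps_m:
            continue
        warps_n = num_warps // warps_m
        # Prefer the layout whose per-warp tile is closest to square.
        imbalance = abs(block_M / warps_m - block_N / warps_n)
        if best is None or imbalance < best[0]:
            best = (imbalance, warps_m, warps_n)
    return best[1], best[2]


def all_layouts(num_warps):
    """Every (warps_m, warps_n) factorization a tuner could in principle propose."""
    return [(m, num_warps // m) for m in range(1, num_warps + 1) if num_warps % m == 0]


policies = (GemmWarpPolicy.Square, GemmWarpPolicy.FullRow, GemmWarpPolicy.FullCol)
reachable = {partition_warps(p, 8, 128, 128) for p in policies}
unreachable = [layout for layout in all_layouts(8) if layout not in reachable]
```

With eight warps there are four factorizations but only three policies, so at least one layout that a finer-grained schedule might choose cannot be expressed through the policy alone.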

The select_scheduler function in the dense/__init__.py module has been refactored to use a fine-grained interface. This change provides more flexibility and enables the implementation of high-performance kernels.

Update MatmulScheduler class in matmul_tensorcore.py

The MatmulScheduler class in the matmul_tensorcore.py module has been updated to calculate the number of threads based on the block size and warp size. This ensures optimal GPU warp configuration for NVIDIA GPUs.

Improve test_general_matmul_tilelang_kernel.py

The test_general_matmul_tilelang_kernel.py module has been improved to include additional test cases and assertions for correctness.
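A minimal sketch of deriving the thread count from the warp layout, assuming the block is tiled by `block_row_warps` x `block_col_warps` warps. The helper name and parameters are hypothetical and are not copied from matmul_tensorcore.py; the warp size of 32 is the NVIDIA value.

```python
def compute_threads(block_row_warps, block_col_warps, warp_size=32):
    """Threads per block = number of warps in the block * threads per warp."""
    if block_row_warps <= 0 or block_col_warps <= 0:
        raise ValueError("warp counts must be positive")
    return block_row_warps * block_col_warps * warp_size


# A 2 x 2 warp layout gives the 128 threads used by the hand-written configs.
threads = compute_threads(block_row_warps=2, block_col_warps=2)
```

Deriving `threads` this way keeps it consistent with the warp layout, so a tuner only has to choose the layout.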

LeiWang1999 · 2024-10-02T04:56:05Z

BUG Fix:

The InjectThreadSync pass must be applied after the MergeSharedMemoryAllocation pass, because MergeSharedMemoryAllocation modifies the buffer regions, which alters the liveness domain of each buffer.
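The ordering issue can be illustrated with a toy model. This is not the TVM implementation of either pass: `plan_syncs` is a hypothetical, heavily simplified barrier planner, and merging is modeled by mapping two buffers onto one shared region named `smem0`.

```python
def plan_syncs(accesses, region_of):
    """Return the indices of accesses that need a barrier placed before them.

    accesses:  list of (kind, buffer) with kind in {"read", "write"}
    region_of: maps a buffer name to the memory region it occupies
    """
    syncs = []
    pending = {}  # region -> kinds of access seen since the last barrier
    for i, (kind, buf) in enumerate(accesses):
        region = region_of(buf)
        seen = pending.get(region, set())
        # Hazard: a write after any access, or a read after a write.
        hazard = (kind == "write" and bool(seen)) or (kind == "read" and "write" in seen)
        if hazard:
            syncs.append(i)
            pending.clear()  # a barrier resolves everything issued before it
        pending.setdefault(region, set()).add(kind)
    return syncs


accesses = [("write", "A"), ("read", "A"), ("write", "B"), ("read", "B")]

# Planned before merging: A and B look like independent buffers.
syncs_before_merge = plan_syncs(accesses, lambda buf: buf)

# Planned after merging: A and B have disjoint lifetimes and share one region.
syncs_after_merge = plan_syncs(accesses, lambda buf: "smem0")
```

In this model the plan computed before merging has no barrier between the last read of `A` and the first write of `B`. Once both buffers occupy the same region that write can overwrite data another thread is still reading, which is why the synchronization has to be planned on the merged layout.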

LeiWang1999 added 22 commits September 28, 2024 07:43

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f3b1eb9

remove un-implemented code.

730d13e

Implement BaseScheduler to wrap some related items.

8047ee7

lint fix

64db065

test skip

cef04a8

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f1652e9

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

4f6c545

test fix

c485b68

hardware tuning demo

ebe42a6

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

88230ec

remove debug related items.

44246a1

implement tuner and cache fix

bb51e15

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

f42a3b9

lint fix

de7ae18

test case fix.

ef40bd8

Adapt Tuning Space generation with Roller

85f0a5f

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

e9f7db3

lint fix

9e31336

Refactor select_scheduler function for fine-grained interface

f1378d4

Refactor NotImplementedError message in BaseTLHint class

137cce3

Update submodule reference in 3rdparty/tvm

fc19fa2

LeiWang1999 added 2 commits October 2, 2024 07:51

Refactor matmul_finetune function to use topk=20 for hardware-aware finetuning

fe51bb1

Refactor submodule reference in 3rdparty/tvm

79878cb

LeiWang1999 mentioned this pull request Oct 2, 2024

[BUG] TVM PopenPoolExecutor may have some bugs on TL Scripts #211

Open

LeiWang1999 added 2 commits October 2, 2024 14:57

lint fix

0fc7ab9

Refactor test_general_matmul_tilelang_impl.py and test_tilelang_gemm.py

255e925

LeiWang1999 merged commit 33d2fd6 into microsoft:main Oct 2, 2024
5 of 6 checks passed
