feat: torch.compile and custom_op support #554

abcdabcd987 · 2024-10-24T23:25:07Z

Follow-up to #552. This PR adds torch library annotations to all FlashInfer kernels so that torch.compile can recognize them. Most changes are tedious.

I manually ran subsets of the pytest test cases while making these changes, but there are too many of them and some were already failing before my changes, so I cannot guarantee that everything works. To run tests with torch.compile, set the environment variable FLASHINFER_TEST_TORCH_COMPILE=1.

# With torch.compile
FLASHINFER_TEST_TORCH_COMPILE=1 pytest -svx tests/test_norm.py

# Without torch.compile
pytest -svx tests/test_norm.py

Notable changes:

Prefill and decode pybind: These functions used to return Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] depending on return_lse, which causes trouble for torch.compile. I changed the pybind interface to accept maybe_lse: Optional[torch.Tensor] and to return only one tensor. The allocation of the lse tensor is moved to the Python side. The Python API does not change.
chain_speculative_sampling pybind: Move the allocation of accepted and emitted from C++ to Python, because torch.compile doesn't like an input tensor being returned as an output tensor. The Python API does not change.
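
The maybe_lse change can be illustrated with a small sketch. This is not FlashInfer code: plain Python lists stand in for tensors, and `run_kernel`, `run`, `scores`, and `values` are hypothetical names chosen for illustration. It only shows the shape of the interface, where the low-level function always returns a single output and writes the log-sum-exp into a caller-allocated buffer when one is passed.

```python
import math
from typing import List, Optional, Tuple, Union


def run_kernel(
    scores: List[List[float]],
    values: List[float],
    maybe_lse: Optional[List[float]],
) -> List[float]:
    """Hypothetical stand-in for the pybind op: always returns one output.

    If maybe_lse is given, the log-sum-exp of each row is written into it
    in place instead of being returned as a second value.
    """
    out = []
    for i, row in enumerate(scores):
        m = max(row)
        weights = [math.exp(s - m) for s in row]
        denom = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / denom)
        if maybe_lse is not None:
            maybe_lse[i] = m + math.log(denom)
    return out


def run(
    scores: List[List[float]],
    values: List[float],
    return_lse: bool = False,
) -> Union[List[float], Tuple[List[float], List[float]]]:
    """Python-side wrapper: the public return type still depends on return_lse."""
    # The buffer is allocated here, on the Python side, not inside the kernel.
    lse = [0.0] * len(scores) if return_lse else None
    out = run_kernel(scores, values, lse)
    return (out, lse) if return_lse else out


out_only = run([[0.0, 0.0]], [1.0, 3.0])
out_with_lse, lse = run([[0.0, 0.0]], [1.0, 3.0], return_lse=True)
```

The union return type stays in the thin Python wrapper, while the function that would be registered as a custom op has one fixed return type.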

Piggyback changes:

BatchPrefillWithRaggedKVCacheWrapper.plan: Fix a bug where qo_indptr was not on CPU
merge_state: Fix typo in docs
Change run_return_lse(...) to run(..., return_lse=True) because torch.compile does not recognize functools.partial.
In tests, change flashinfer.xxx() to flashinfer.<module>.xxx() so that the monkeypatch works.
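
One plausible reading of the monkeypatch item is the usual Python name-binding behavior, sketched below with stand-in modules. The names `pkg`, `norm`, `rmsnorm`, and `compiled_rmsnorm` are hypothetical and do not come from the FlashInfer test suite. A name re-exported at the package level is bound once at import time, so patching the attribute on the submodule later is only visible to callers that look it up through the submodule.

```python
import types

# Hypothetical stand-in for a submodule such as flashinfer.<module>.
norm = types.ModuleType("norm")


def rmsnorm(x):
    return "eager"


norm.rmsnorm = rmsnorm

# Hypothetical stand-in for the top-level package, which re-exports the
# function the way `from .norm import rmsnorm` would in an __init__.py.
pkg = types.ModuleType("pkg")
pkg.norm = norm
pkg.rmsnorm = norm.rmsnorm  # bound once, at "import" time


def compiled_rmsnorm(x):
    return "compiled"


# A test fixture patches the attribute on the submodule only.
norm.rmsnorm = compiled_rmsnorm

via_package = pkg.rmsnorm(1)      # still calls the original function
via_module = pkg.norm.rmsnorm(1)  # attribute lookup sees the patch
```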

Unsupported for torch.compile:

flashinfer.quantization.segment_packbits: Because it's data dependent.

Untouched:

sparse.py: Tests didn't pass beforehand, so I skipped it. Also, it doesn't seem to need custom_op annotations, as it does not have CUDA kernels.

Failed test cases:

batch_decode non contiguous kv: test_batch_decode_with_paged_kv_cache[False-kv_dtype0-q_dtype0-True-0.0-NONE-NHD-128-4-4-1-54-12]

yzh119

LGTM, thanks for the huge improvement, I have left some tiny suggestions.

Fix bugs introduced in #554:
1. Function signature change for `chain_speculative_sampling()` pybind in AOT mode.
2. `packbits()` uses a str default value, which is not supported by PyTorch 2.4. This PR adds a workaround.
3. For PyTorch < 2.4, the two decorators (`register_custom_op()` and `register_fake_op()`) should return an identity function instead of `None`.
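
The decorator fix can be sketched as follows. This is a minimal illustration and not the FlashInfer implementation: `make_register_custom_op` and `broken_register_custom_op` are hypothetical helpers, and the `custom_op` argument stands in for `torch.library.custom_op`, with `None` modeling an older PyTorch where it is unavailable. The sketch avoids importing torch so it runs anywhere.

```python
from typing import Callable, Optional


def make_register_custom_op(custom_op: Optional[Callable] = None) -> Callable:
    """Build a register_custom_op decorator factory.

    custom_op stands in for torch.library.custom_op (assumption for this
    sketch); pass None to model a PyTorch version without it.
    """

    def register_custom_op(name: str, mutates_args=()) -> Callable:
        if custom_op is None:
            # Fallback: an identity decorator leaves the function unchanged.
            return lambda fn: fn
        return custom_op(name, mutates_args=mutates_args)

    return register_custom_op


def broken_register_custom_op(name: str, mutates_args=()):
    """Models the bug: the factory returns None instead of a decorator."""
    return None


register_custom_op = make_register_custom_op(custom_op=None)


@register_custom_op("demo::add_one", mutates_args=())
def add_one(x: int) -> int:
    return x + 1


result = add_one(41)

# With the broken factory, applying the decorator itself fails, because
# Python tries to call None on the function being defined.
broken_error = None
try:

    @broken_register_custom_op("demo::add_two")
    def add_two(x: int) -> int:
        return x + 2

except TypeError as exc:
    broken_error = exc
```

With the identity fallback, decorated functions remain ordinary Python callables on older PyTorch versions, so the same source file can be imported regardless of whether custom-op registration is available.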

abcdabcd987 requested a review from yzh119 October 24, 2024 23:25

This was referenced Oct 24, 2024

pytorch 2.4 support #395

Open

Runtime error with single_prefill_with_kv_cache while Compilation #541

Open

misc: typing improvement #555

Merged

torchlib

f28464d

abcdabcd987 force-pushed the lequn/1023-torchlib branch from a870e3e to f28464d Compare October 25, 2024 02:25

yzh119 mentioned this pull request Oct 25, 2024

bugfix: fix block sparse wrappers #556

Merged

yzh119 added a commit that referenced this pull request Oct 25, 2024

bugfix: fix block sparse wrappers (#556)

2989556

The block sparse attention unittests failed as noted in #554, this PR fixes the issue.

yzh119 approved these changes Oct 25, 2024

python/flashinfer/decode.py

python/flashinfer/quantization.py

add notes to docs

a423029

yzh119 merged commit 9bf916f into flashinfer-ai:main Oct 25, 2024

github-actions bot mentioned this pull request Oct 25, 2024

chore(main): release 0.2.0 #476

Open

yzh119 mentioned this pull request Oct 26, 2024

bugfix: fix batch_prefill.cu in AOT mode after #554 #559

Merged

yzh119 added a commit that referenced this pull request Oct 26, 2024

bugfix: fix batch_prefill.cu in AOT mode after #554 (#559)

ea86f81

#554 didn't update the `batch_prefill.cu` (which was used in AOT mode) according to the API change. This PR fixes the issue. cc @abcdabcd987

abcdabcd987 mentioned this pull request Oct 26, 2024

bugfix for torch library annotation #562

Merged
