
Support Mixtral quantization using HQT #67

Closed · wants to merge 24 commits

Conversation

dudilester

No description provided.

@dudilester (Author)

- Wrapped the Habana static_fused_moe function in a class.
- Wrapped the MoE matmul calculations in a class as well.
- When running inference with HQT quantization, the transposed weights of the different MoEs are computed once and cached statically, to avoid re-computing them on every forward call.
- The MoeMatmul classes of the StaticFusedMoe instance are patched by HQT to quantize their MoE weights and statistics tensors.
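The caching idea above can be sketched as follows. This is a minimal, hypothetical illustration (the class body and names here are assumptions, not the PR's actual code): each expert's matmul is wrapped in a small nn.Module so a quantization tool like HQT can patch it, and the transposed weight is registered once as a buffer instead of being transposed on every forward call.

```python
import torch
import torch.nn as nn

class MoeMatmul(nn.Module):
    """Sketch of a per-expert matmul wrapper with a statically cached transpose."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # Transpose once at construction time and keep it as a buffer,
        # so no re-transpose happens on each forward call.
        self.register_buffer("w_t", weight.t().contiguous())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.matmul(x, self.w_t)

# Usage: weight stored as (out_features, in_features), as nn.Linear does.
w = torch.randn(8, 4)
layer = MoeMatmul(w)
x = torch.randn(2, 4)
out = layer(x)  # shape (2, 8)
```

Because the wrapper is a distinct nn.Module, a quantization framework can locate and replace these submodules individually, which is the property the patching step relies on.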

```python
        final_hidden_states += current_hidden_states_static

        return final_hidden_states.view(-1, D)

class MoeMatmul(nn.Module):
```

Better to call it MoeLinear, as it acts more like a linear layer than a matmul.


Or just use Linear without bias.
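The reviewer's alternative can be shown in a couple of lines. This is a generic sketch, not code from the PR: a plain nn.Linear with bias=False computes y = x @ W^T, which matches the per-expert matmul without a custom wrapper class.

```python
import torch
import torch.nn as nn

# A bias-free Linear covers the same computation as the custom matmul wrapper.
lin = nn.Linear(4, 8, bias=False)
x = torch.randn(2, 4)
y = lin(x)  # equivalent to torch.matmul(x, lin.weight.t())
```

One trade-off: nn.Linear stores the weight untransposed and transposes inside F.linear, so the static transpose-caching described in this PR would not apply directly.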

@dudilester dudilester force-pushed the dev/dlester/mixtral_hqt branch from 5d9f4de to e814a4a Compare July 1, 2024 09:18
@dudilester dudilester force-pushed the dev/dlester/mixtral_hqt branch from e814a4a to f4f3437 Compare July 7, 2024 09:46
@dudilester dudilester force-pushed the dev/dlester/mixtral_hqt branch from f4f3437 to 87d95ad Compare July 7, 2024 11:06
@dudilester dudilester closed this Jul 24, 2024
@dudilester dudilester deleted the dev/dlester/mixtral_hqt branch July 24, 2024 12:13
michalkuligowski added a commit that referenced this pull request Jan 15, 2025
remove expert_max hard code (#47)
vLLM-Ext: Full enabling of ALiBi (#34)
Add version inference via setuptools-scm (#58)
Revert "vLLM-Ext: Full enabling of ALiBi (#34)" (#59)
Remove punica_hpu.py from vllm_hpu_extension (#66)
Removed previous (not-pipelined) pa implementation (#72)
Add flag to enable running softmax in fp32 (#71)
Update calibration readme link (#73)
allow lm_head quantization in calibration process (#65)
Pad to bmin if value is less (#67)
Update pyproject.toml (#75)

---------

Co-authored-by: Michał Kuligowski <[email protected]>
mfylcek added a commit that referenced this pull request Jan 21, 2025
(Same commit message as the Jan 15, 2025 commit above.)