v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.

@Duyi-Wang released this on 23 Jul 01:25 · faa25f4

Highlights

  • Continuous batching on a single ARC GPU is supported and can be integrated via vllm-xft (see the sketch after this list).
  • Introduce Intel AMX instruction support for the float16 data type.
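
A minimal offline-inference sketch of the continuous-batching highlight, assuming vllm-xft keeps upstream vLLM's Python LLM API; the model path below is a placeholder for a converted xFasterTransformer checkpoint:

```python
from vllm import LLM, SamplingParams  # assumption: vllm-xft reuses the upstream vLLM entry points

# Hypothetical path to a model converted for xFasterTransformer.
llm = LLM(model="/path/to/xft/model")
sampling = SamplingParams(temperature=0.8, max_tokens=64)

# Submitting several prompts at once lets the engine schedule them with
# continuous batching on the single ARC GPU mentioned above.
prompts = [
    "What is continuous batching?",
    "Explain AMX FP16 in one sentence.",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```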

Models

  • Support ChatGLM4 series models.
  • Introduce full BF16/FP16 path support for Qwen series models (see the sketch after this list).
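
A rough sketch of the new Qwen FP16 path, assuming the usual xfastertransformer.AutoModel Python loader; both paths are placeholders for a converted checkpoint and its Hugging Face tokenizer:

```python
import xfastertransformer
from transformers import AutoTokenizer

MODEL_PATH = "/path/to/qwen-xft"   # hypothetical converted xFT checkpoint
TOKEN_PATH = "/path/to/qwen-hf"    # hypothetical HF directory holding the tokenizer

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
# dtype="fp16" exercises the new full FP16 path; "bf16" is handled the same way.
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="fp16")

input_ids = tokenizer("Hello, Qwen!", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```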

Bug Fixes

  • Fixed a memory leak in the oneDNN primitive cache.
  • Fixed an SPR-HBM flat QUAD mode detection issue in the benchmark scripts.
  • Fixed a head-split error in distributed grouped-query attention (GQA).
  • Fixed an issue with the invokeAttentionLLaMA API.

What's Changed

  • [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
  • [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
  • [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
  • [Kernel] Cache oneDNN primitive when M < XFT_PRIMITIVE_CACHE_M, default 256. by @Duyi-Wang in #460 (see the sketch after this list)
  • [Dependency] Pin python requirements.txt version. by @Duyi-Wang in #458
  • [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
  • [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
  • [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
  • [Readme] Add accepted papers by @wenhuanh in #465
  • [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
  • [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
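
The oneDNN primitive cache added in #460 above is gated by the XFT_PRIMITIVE_CACHE_M threshold (default 256). A minimal sketch of raising it, assuming the value is read from the process environment before the library is loaded:

```python
import os

# Assumption: primitives are cached only while M < XFT_PRIMITIVE_CACHE_M
# (default 256 per #460); raise the threshold to cache larger shapes.
os.environ["XFT_PRIMITIVE_CACHE_M"] = "512"

import xfastertransformer  # import after setting the variable, to be safe
```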

Full Changelog: v1.7.3...v1.8.0