Megatron-LM paper walkthrough [Paper Reading series]
https://www.bilibili.com/video/BV1nB4y1R7Yz
FlashAttention:
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness https://arxiv.org/abs/2205.14135
Parameter Server paper, section-by-section walkthrough [Paper Reading series]
https://www.bilibili.com/video/BV1YA4y197G8
GPipe paper walkthrough
https://www.bilibili.com/video/BV1v34y1E7zu
Pathways paper walkthrough
https://www.bilibili.com/video/BV1xB4y1m7Xi
vLLM:
https://github.com/vllm-project/vllm
FasterTransformer:
https://github.com/NVIDIA/FasterTransformer
Transformer inference acceleration techniques include quantization, pruning, and model (knowledge) distillation.
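Of the three techniques above, quantization is the easiest to sketch in isolation. Below is a minimal, self-contained example of symmetric per-tensor int8 post-training quantization of a weight matrix; the function names are illustrative, not from any particular library, and real systems typically quantize per-channel or per-group for better accuracy.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0  # map the largest |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight from the int8 tensor."""
    return q.astype(np.float32) * scale

# Toy weight matrix: quantize, dequantize, measure reconstruction error.
w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.max(np.abs(w - w_hat)))  # bounded by ~scale/2 (one rounding step)
```

The weights shrink 4x (int8 vs. float32) and the worst-case per-element error is about half the quantization step `scale`.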
Some papers on speculative decoding:
Speculative Decoding: Google, ICML 2023
Speculative Sampling: DeepMind, arXiv:2302.01318, 2023
SpecInfer: CMU, arXiv:2305.09781, 2023
Medusa: Princeton
LLM Accelerator: Microsoft, arXiv:2304.04487, 2023
REST: Retrieval-Based Speculative Decoding: Peking University & Princeton, arXiv:2311.08252, 2023
Prompt lookup decoding (PLD):https://github.com/apoorvumang/prompt-lookup-decoding
Lookahead Decoding:https://lmsys.org/blog/2023-11-21-lookahead-decoding/
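The papers above all share a draft-then-verify loop: a cheap draft proposes several tokens, the large target model verifies them in one pass, and correctness is guaranteed because only tokens the target agrees with are kept. Here is a minimal sketch of the greedy variant; `draft` and `target` are toy stand-in next-token functions (my assumption for illustration, not from any of the listed papers), and a real system would batch the k verification calls into one forward pass of the large model.

```python
def draft(ctx):
    # Toy cheap model: just repeats the last token.
    return ctx[-1]

def target(ctx):
    # Toy "large" model: deterministically cycles through a pattern.
    pattern = [1, 2, 3]
    return pattern[len(ctx) % 3]

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    ctx = list(prompt)
    while len(ctx) < len(prompt) + n_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx + proposal))
        # 2) Target verifies: keep the longest prefix it agrees with,
        #    then append one token from the target itself, so every
        #    iteration makes progress even if the draft is always wrong.
        accepted = []
        for tok in proposal:
            if target(ctx + accepted) == tok:
                accepted.append(tok)
            else:
                break
        accepted.append(target(ctx + accepted))
        ctx.extend(accepted)
    return ctx[len(prompt):][:n_new]

print(speculative_decode([1], 5))  # identical to greedy decoding with target alone
```

The key property: the output is exactly what greedy decoding with the target alone would produce, regardless of draft quality; a better draft only raises the number of tokens accepted per target pass. Sampling-based variants (Speculative Sampling) extend this with a rejection-sampling rule that preserves the target's output distribution.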
Some other applications: