Megatron-LM paper walkthrough [Paper Reading series]
https://www.bilibili.com/video/BV1nB4y1R7Yz
FlashAttention:
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness https://arxiv.org/abs/2205.14135
Parameter Server paper, section-by-section walkthrough [Paper Reading series]
https://www.bilibili.com/video/BV1YA4y197G8
GPipe paper walkthrough
https://www.bilibili.com/video/BV1v34y1E7zu
Pathways paper walkthrough
https://www.bilibili.com/video/BV1xB4y1m7Xi
vLLM:
https://github.com/vllm-project/vllm
FasterTransformer:
https://github.com/NVIDIA/FasterTransformer
Transformer inference acceleration techniques include quantization, pruning, and model (knowledge) distillation.
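Of the three techniques above, quantization is the easiest to sketch in isolation. Below is a minimal, self-contained example of symmetric per-tensor int8 post-training quantization of a weight matrix; the function names are illustrative, not from any particular library, and real systems typically quantize per-channel or per-group for better accuracy.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0  # map the largest |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight from the int8 tensor."""
    return q.astype(np.float32) * scale

# Toy weight matrix: quantize, dequantize, measure reconstruction error.
w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.max(np.abs(w - w_hat)))  # bounded by ~scale/2 (one rounding step)
```

The weights shrink 4x (int8 vs. float32) and the worst-case per-element error is about half the quantization step `scale`.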
Some papers on speculative decoding:
Speculative Decoding: Google, ICML 2023
Speculative Sampling: DeepMind, arXiv:2302.01318, 2023
SpecInfer: CMU, arXiv:2305.09781, 2023
Medusa: Princeton
LLM Accelerator: Microsoft, arXiv:2304.04487, 2023
REST: Retrieval-Based Speculative Decoding: Peking University & Princeton, arXiv:2311.08252, 2023
Prompt lookup decoding (PLD):https://github.com/apoorvumang/prompt-lookup-decoding
Lookahead Decoding:https://lmsys.org/blog/2023-11-21-lookahead-decoding/
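The papers above all share a draft-then-verify loop: a cheap draft proposes several tokens, the large target model verifies them in one pass, and correctness is guaranteed because only tokens the target agrees with are kept. Here is a minimal sketch of the greedy variant; `draft` and `target` are toy stand-in next-token functions (my assumption for illustration, not from any of the listed papers), and a real system would batch the k verification calls into one forward pass of the large model.

```python
def draft(ctx):
    # Toy cheap model: just repeats the last token.
    return ctx[-1]

def target(ctx):
    # Toy "large" model: deterministically cycles through a pattern.
    pattern = [1, 2, 3]
    return pattern[len(ctx) % 3]

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    ctx = list(prompt)
    while len(ctx) < len(prompt) + n_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx + proposal))
        # 2) Target verifies: keep the longest prefix it agrees with,
        #    then append one token from the target itself, so every
        #    iteration makes progress even if the draft is always wrong.
        accepted = []
        for tok in proposal:
            if target(ctx + accepted) == tok:
                accepted.append(tok)
            else:
                break
        accepted.append(target(ctx + accepted))
        ctx.extend(accepted)
    return ctx[len(prompt):][:n_new]

print(speculative_decode([1], 5))  # identical to greedy decoding with target alone
```

The key property: the output is exactly what greedy decoding with the target alone would produce, regardless of draft quality; a better draft only raises the number of tokens accepted per target pass. Sampling-based variants (Speculative Sampling) extend this with a rejection-sampling rule that preserves the target's output distribution.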
Some other applications: