💡 Leverages RadixAttention to reuse shared prompt prefixes across requests (up to 5x throughput gains in chat/RAG scenarios).
📈 Accelerates structured output (JSON/XML) generation by 3.1x by skipping deterministic schema elements.
🛠️ Optimized CPU scheduling prevents host-side bottlenecks.
🔥 5,000 tokens/sec on Llama3-8B (single A100) and 10,000 tokens/sec on Llama3-70B (8xH100 cluster).
🌍 Used in production by Meituan, ByteDance, and xAI
🎵 ByteDance runs 70% of its internal NLP pipelines through SGLang, processing 5 PB of data daily.
💰 Reduced xAI's Grok serving costs by 37% via KV cache reuse and optimized scheduling, serving 23M+ chats per day.
🤝 Open-source (Apache 2.0) with an OpenAI-compatible API and a Python API.
⚡ Supports NVIDIA and AMD GPUs and integrates quantization (FP8, INT4).
🧠 Supports all popular models: Llama, Mistral, Gemma, Qwen, DeepSeek, Phi, Granite.
🐋 Best throughput for DeepSeek R1 with Multi-Token Prediction (MTP).
🔜 Coming: FP6 weight/FP8 activation quantization, faster startup times, cross-cloud load balancing.
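The prefix-reuse idea behind RadixAttention can be sketched with a toy radix tree over token IDs (a hypothetical illustration, not SGLang's actual KV-cache implementation): requests that share a prompt prefix, such as a common system prompt, only pay for the tokens beyond the longest cached prefix.

```python
# Toy sketch of RadixAttention-style prefix reuse (hypothetical, not
# SGLang's implementation): a trie keyed by token IDs stands in for the
# KV cache, so a request sharing a cached prefix skips recomputing it.

class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def process(self, tokens):
        """Return how many tokens need fresh KV computation."""
        node, i = self.root, 0
        # Walk the longest cached prefix.
        while i < len(tokens) and tokens[i] in node.children:
            node = node.children[tokens[i]]
            i += 1
        # Insert the remaining suffix; these tokens are "computed".
        for tok in tokens[i:]:
            child = RadixNode()
            node.children[tok] = child
            node = child
        return len(tokens) - i

cache = PrefixCache()
system = [101, 102, 103, 104]          # shared system-prompt tokens
print(cache.process(system + [7, 8]))  # first request: computes all 6
print(cache.process(system + [9]))     # reuses the 4-token prefix: computes 1
```

In chat and RAG workloads, where many requests share long system prompts or retrieved context, this is why the post's claimed throughput gains are plausible: most of the prefill work is amortized across requests.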
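The structured-output speedup can likewise be illustrated with a toy decoder (again a sketch of the idea, not SGLang's grammar-constrained engine): literal tokens fixed by the JSON schema are appended directly with no model forward pass, and only the free-form value slots invoke the model.

```python
# Toy illustration of constrained JSON decoding (not SGLang's engine):
# schema-fixed literals cost nothing; only value slots call the model.

def generate_json(template, model_fill):
    """template: list of ('lit', text) or ('gen', field_name) parts.
    Returns the generated string and the number of model invocations."""
    out, model_calls = [], 0
    for kind, payload in template:
        if kind == "lit":
            out.append(payload)            # deterministic: skip the model
        else:
            out.append(model_fill(payload))
            model_calls += 1               # only value slots hit the model
    return "".join(out), model_calls

# Hypothetical schema: {"name": <string>, "age": <number>}
template = [
    ("lit", '{"name": "'), ("gen", "name"),
    ("lit", '", "age": '), ("gen", "age"),
    ("lit", "}"),
]
stub = lambda field: {"name": "Ada", "age": "36"}[field]  # stand-in model
text, calls = generate_json(template, stub)
print(text)   # {"name": "Ada", "age": 36}
print(calls)  # 2
```

Since the fixed punctuation and key names often dominate a JSON payload, skipping them explains how a roughly 3x speedup on structured output is achievable.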
Source: https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_what-is-sglang-and-why-does-it-matter-sglang-activity-7296907885588959232-BylZ?utm_source=share&utm_medium=member_desktop&rcm=ACoAABcIJRMBxYkMPdgFEnBdvqNuLuI1AgLLIhU
Repo: https://github.com/sgl-project/sglang