💡 Leverages RadixAttention to reuse shared prompt prefixes across requests (up to 5x throughput gains in chat/RAG scenarios).
📈 Accelerates structured output (JSON/XML) generation by 3.1x by skipping deterministic schema elements.
🛠️ Optimized CPU scheduling prevents host-side bottlenecks.
🔥 5,000 tokens/sec on Llama3-8B (single A100) and 10,000 tokens/sec on Llama3-70B (8xH100 cluster).
🌍 Used in production by Meituan, ByteDance, and xAI
🎵 ByteDance runs 70% of its internal NLP pipelines through SGLang, processing 5 PB of data daily.
💰 Reduced xAI's Grok serving costs by 37% via KV cache reuse and optimized scheduling, serving 23M+ chats per day.
🤝 Open-source (Apache 2.0) with an OpenAI-compatible API and a Python API.
⚡ Supports NVIDIA and AMD GPUs and integrates quantization (FP8, INT4).
🧠 Supports all popular models: Llama, Mistral, Gemma, Qwen, DeepSeek, Phi, Granite.
🐋 Best throughput for DeepSeek R1 with Multi-Token Prediction (MTP).
🔜 Coming: FP6 weight/FP8 activation quantization, faster startup times, cross-cloud load balancing.
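The prefix-reuse idea behind RadixAttention can be sketched with a toy radix tree over token IDs (a hypothetical illustration, not SGLang's actual KV-cache implementation): requests that share a prompt prefix, such as a common system prompt, only pay for the tokens beyond the longest cached prefix.

```python
# Toy sketch of RadixAttention-style prefix reuse (hypothetical, not
# SGLang's implementation): a trie keyed by token IDs stands in for the
# KV cache, so a request sharing a cached prefix skips recomputing it.

class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def process(self, tokens):
        """Return how many tokens need fresh KV computation."""
        node, i = self.root, 0
        # Walk the longest cached prefix.
        while i < len(tokens) and tokens[i] in node.children:
            node = node.children[tokens[i]]
            i += 1
        # Insert the remaining suffix; these tokens are "computed".
        for tok in tokens[i:]:
            child = RadixNode()
            node.children[tok] = child
            node = child
        return len(tokens) - i

cache = PrefixCache()
system = [101, 102, 103, 104]          # shared system-prompt tokens
print(cache.process(system + [7, 8]))  # first request: computes all 6
print(cache.process(system + [9]))     # reuses the 4-token prefix: computes 1
```

In chat and RAG workloads, where many requests share long system prompts or retrieved context, this is why the post's claimed throughput gains are plausible: most of the prefill work is amortized across requests.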
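The structured-output speedup can likewise be illustrated with a toy decoder (again a sketch of the idea, not SGLang's grammar-constrained engine): literal tokens fixed by the JSON schema are appended directly with no model forward pass, and only the free-form value slots invoke the model.

```python
# Toy illustration of constrained JSON decoding (not SGLang's engine):
# schema-fixed literals cost nothing; only value slots call the model.

def generate_json(template, model_fill):
    """template: list of ('lit', text) or ('gen', field_name) parts.
    Returns the generated string and the number of model invocations."""
    out, model_calls = [], 0
    for kind, payload in template:
        if kind == "lit":
            out.append(payload)            # deterministic: skip the model
        else:
            out.append(model_fill(payload))
            model_calls += 1               # only value slots hit the model
    return "".join(out), model_calls

# Hypothetical schema: {"name": <string>, "age": <number>}
template = [
    ("lit", '{"name": "'), ("gen", "name"),
    ("lit", '", "age": '), ("gen", "age"),
    ("lit", "}"),
]
stub = lambda field: {"name": "Ada", "age": "36"}[field]  # stand-in model
text, calls = generate_json(template, stub)
print(text)   # {"name": "Ada", "age": 36}
print(calls)  # 2
```

Since the fixed punctuation and key names often dominate a JSON payload, skipping them explains how a roughly 3x speedup on structured output is achievable.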
Source: https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_what-is-sglang-and-why-does-it-matter-sglang-activity-7296907885588959232-BylZ?utm_source=share&utm_medium=member_desktop&rcm=ACoAABcIJRMBxYkMPdgFEnBdvqNuLuI1AgLLIhU
Repo: https://github.com/sgl-project/sglang