Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for the SGLang inference engine #413

Open
nstogner opened this issue Feb 16, 2025 · 0 comments
Open

Add support for the SGLang inference engine #413

nstogner opened this issue Feb 16, 2025 · 0 comments

Comments

@nstogner
Copy link
Contributor

About:

💡Leverages RadixAttention, reuses shared prompt prefixes across requests (5x throughput gains in chat/RAG scenarios).
📈 Accelerates structured output (JSON/XML) by 3.1x by skipping deterministic schema elements.
🛠️Optimized CPU scheduling to prevent host-side bottlenecks
🔥 5,000 tokens/sec on Llama3-8B (A100) and 10K tokens/sec on Llama3-70B (8xH100 clusters).
🌍 Used in production by Meituan, ByteDance, and xAI
🎵 Bytedance runs 70% of internal NLP pipelines through SGLang with 5PB daily processed data
💰 Reduced xAI's Grok serving costs by 37% via KV cache reuse and optimized scheduling, serving 23M+ chats per day.
🤝 Open-source (Apache 2.0) with @openai compatibility and Python API
⚡ Supports NVIDIA, AMD GPUs, and integrates with quantization (FP8, INT4).
🧠 Supports all popular models, Llama, Mistral, Gemma, Qwen, Deepseek, Phi, Granite
🐋 Best Throughput for Deepseek R1 with Multi Token Prediction (MTP).
🔜 FP6 Weight/FP8 Activation quantization, Fast Start Up times, Cross-Cloud Load Balancing

Source: https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_what-is-sglang-and-why-does-it-matter-sglang-activity-7296907885588959232-BylZ?utm_source=share&utm_medium=member_desktop&rcm=ACoAABcIJRMBxYkMPdgFEnBdvqNuLuI1AgLLIhU

Repo: https://github.com/sgl-project/sglang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant