This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Sparsity #1

Merged (12 commits) on Feb 1, 2024
133 changes: 31 additions & 102 deletions README.md
@@ -1,112 +1,41 @@
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
## Neural Magic vLLM

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
Fork of vLLM with sparsity.

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |
### To Run

</p>

---

**The Second vLLM Bay Area Meetup (Jan 31st 5pm-7:30pm PT)**

We are thrilled to announce our second vLLM Meetup!
The vLLM team will share recent updates and roadmap.
We will also have vLLM collaborators from IBM coming up to the stage to discuss their insights on LLM optimizations.
Please register [here](https://lu.ma/ygxbpzhl) and join us!

---

*Latest News* 🔥
- [2023/12] Added ROCm support to vLLM.
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs

vLLM seamlessly supports many Hugging Face models, including the following architectures:

- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
- DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
- Qwen2 (`Qwen/Qwen2-7B-beta`, `Qwen/Qwen-7B-Chat-beta`, etc.)
- StableLM (`stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.)
- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)

Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
Clone and install nm_gpu:

```bash
pip install vllm
git clone https://github.com/neuralmagic/nm_gpu.git
cd nm_gpu
export TORCH_CUDA_ARCH_LIST=8.6
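# 8.6 above targets Ampere-class GPUs (e.g., RTX 30xx / A10); adjust to match your GPU's compute capability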
pip install -e .
```

## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Contributing
Install:
```bash
cd ../
pip install -e .
```

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
### Run Sample

## Citation
Run a 50% sparse model:

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/Llama-2-7b-pruned50-retrained",
    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
    enforce_eager=True,        # Does not work with cudagraphs yet
    dtype="float16",
    tensor_parallel_size=1,
    max_model_len=1024,
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
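The `--sparsity` flag added to `EngineArgs` in this diff (see `vllm/engine/arg_utils.py` below) means the same checkpoint should also be servable through vLLM's OpenAI-compatible server. A minimal, untested sketch assuming the stock `vllm.entrypoints.openai.api_server` entrypoint:

```bash
# Hedged sketch: serve the 50% sparse checkpoint over the OpenAI-compatible API.
# --sparsity / -s is the flag added in this branch; --enforce-eager mirrors the
# Python example above, since the sparse path does not work with CUDA graphs yet.
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/Llama-2-7b-pruned50-retrained \
    --sparsity sparse_w16a16 \
    --enforce-eager \
    --dtype float16 \
    --max-model-len 1024
```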
27 changes: 27 additions & 0 deletions vllm/config.py
@@ -72,6 +72,7 @@ def __init__(
tokenizer_revision: Optional[str] = None,
max_model_len: Optional[int] = None,
quantization: Optional[str] = None,
sparsity: Optional[str] = None,
enforce_eager: bool = False,
max_context_len_to_capture: Optional[int] = None,
) -> None:
@@ -85,6 +86,7 @@ def __init__(
self.revision = revision
self.tokenizer_revision = tokenizer_revision
self.quantization = quantization
self.sparsity = sparsity
self.enforce_eager = enforce_eager
self.max_context_len_to_capture = max_context_len_to_capture

@@ -106,6 +108,7 @@ def __init__(
self._verify_load_format()
self._verify_tokenizer_mode()
self._verify_quantization()
self._verify_sparsity()
self._verify_cuda_graph()

def _verify_load_format(self) -> None:
@@ -144,6 +147,30 @@ def _verify_tokenizer_mode(self) -> None:
"either 'auto' or 'slow'.")
self.tokenizer_mode = tokenizer_mode

def _verify_sparsity(self) -> None:
    supported_sparsity = ["sparse_w16a16"]

    if self.sparsity is not None and self.quantization is not None:
        raise ValueError("Both sparsity and quantization detected. Only "
                         "one or the other is supported at a time.")

    if self.sparsity is not None:
        if self.sparsity not in supported_sparsity:
            raise ValueError(f"Unknown sparse method: {self.sparsity}. "
                             f"Must be one of {supported_sparsity}.")

    hf_sparsity_config = getattr(self.hf_config, "sparsity_config", None)
    if hf_sparsity_config is not None:
        hf_sparsity_method = str(hf_sparsity_config["sparse_method"]).lower()
        if self.sparsity is None:
            self.sparsity = hf_sparsity_method
        elif self.sparsity != hf_sparsity_method:
            raise ValueError(
                "Sparsity method specified in the model config "
                f"({hf_sparsity_method}) does not match the sparsity "
                f"method specified in the `sparsity` argument "
                f"({self.sparsity}).")

def _verify_quantization(self) -> None:
supported_quantization = ["awq", "gptq", "squeezellm"]
rocm_not_supported_quantization = ["awq"]
14 changes: 12 additions & 2 deletions vllm/engine/arg_utils.py
@@ -33,6 +33,7 @@ class EngineArgs:
revision: Optional[str] = None
tokenizer_revision: Optional[str] = None
quantization: Optional[str] = None
sparsity: Optional[str] = None
enforce_eager: bool = False
max_context_len_to_capture: int = 8192
enable_lora: bool = False
@@ -197,6 +198,15 @@ def add_cli_args(
'None, we assume the model weights are not '
'quantized and use `dtype` to determine the data '
'type of the weights.')
parser.add_argument('--sparsity',
'-s',
type=str,
choices=['sparse_w16a16', None],
default=None,
help='Method used to compress sparse weights. If '
'None, we first check the `sparsity_config` attribute '
'in the model config file. If that is None, we assume '
'the model weights are dense.')
parser.add_argument('--enforce-eager',
action='store_true',
help='Always use eager-mode PyTorch. If False, '
@@ -260,8 +270,8 @@ def create_engine_configs(
self.download_dir, self.load_format,
self.dtype, self.seed, self.revision,
self.tokenizer_revision, self.max_model_len,
self.quantization, self.enforce_eager,
self.max_context_len_to_capture)
self.quantization, self.sparsity,
self.enforce_eager, self.max_context_len_to_capture)
cache_config = CacheConfig(self.block_size,
self.gpu_memory_utilization,
self.swap_space,
1 change: 1 addition & 0 deletions vllm/engine/llm_engine.py
@@ -83,6 +83,7 @@ def __init__(
f"load_format={model_config.load_format}, "
f"tensor_parallel_size={parallel_config.tensor_parallel_size}, "
f"quantization={model_config.quantization}, "
f"sparsity={model_config.sparsity}, "
f"enforce_eager={model_config.enforce_eager}, "
f"seed={model_config.seed})")
# TODO(woosuk): Print more configs in debug mode.
7 changes: 7 additions & 0 deletions vllm/entrypoints/llm.py
@@ -43,6 +43,11 @@ class LLM:
the `quantization_config` attribute in the model config file. If
that is None, we assume the model weights are not quantized and use
`dtype` to determine the data type of the weights.
sparsity: The format of the sparse model weights. Currently,
we support "sparse_w16a16". If None, we first check the
`sparsity_config` attribute in the model config file. If that is
None, we assume the model weights are dense and use `dtype` to
determine the data type of the weights.
revision: The specific model version to use. It can be a branch name,
a tag name, or a commit id.
tokenizer_revision: The specific tokenizer version to use. It can be a
@@ -75,6 +80,7 @@ def __init__(
tensor_parallel_size: int = 1,
dtype: str = "auto",
quantization: Optional[str] = None,
sparsity: Optional[str] = None,
revision: Optional[str] = None,
tokenizer_revision: Optional[str] = None,
seed: int = 0,
@@ -94,6 +100,7 @@ def __init__(
tensor_parallel_size=tensor_parallel_size,
dtype=dtype,
quantization=quantization,
sparsity=sparsity,
revision=revision,
tokenizer_revision=tokenizer_revision,
seed=seed,
36 changes: 29 additions & 7 deletions vllm/model_executor/layers/linear.py
@@ -13,10 +13,10 @@
divide, split_tensor_along_last_dim)
from vllm.model_executor.utils import set_weight_attrs
from vllm.logger import init_logger
from vllm.model_executor.layers.parameters import SparseParameter, get_param_data

logger = init_logger(__name__)


class LinearMethodBase(ABC):
"""Base class for different (maybe quantized) linear methods."""

@@ -195,7 +195,8 @@ def __init__(
def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
tp_rank = get_tensor_model_parallel_rank()
output_dim = getattr(param, "output_dim", None)
param_data = param.data
param_data = get_param_data(param)

if output_dim is not None:
shard_size = param_data.shape[output_dim]
start_idx = tp_rank * shard_size
@@ -204,6 +205,10 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
assert param_data.shape == loaded_weight.shape
param_data.copy_(loaded_weight)

# If SparseParameter, repack dense data as sparse.
if isinstance(param, SparseParameter):
param.pack()

def forward(self, input_):
bias = self.bias if not self.skip_bias_add else None

@@ -218,7 +223,6 @@ def forward(self, input_):
output_bias = self.bias if self.skip_bias_add else None
return output, output_bias


class MergedColumnParallelLinear(ColumnParallelLinear):
"""Packed linear layers with column parallelism.

@@ -260,9 +264,13 @@ def weight_loader(self,
param: Parameter,
loaded_weight: torch.Tensor,
loaded_shard_id: Optional[int] = None):
param_data = param.data
param_data = get_param_data(param)
output_dim = getattr(param, "output_dim", None)
if loaded_shard_id is None:
if isinstance(param, SparseParameter):
raise NotImplementedError(
"Passing loaded_shard_id=None not yet supported for SparseParameter")

# Loaded weight is already packed.
if output_dim is None:
assert param_data.shape == loaded_weight.shape
@@ -312,6 +320,9 @@ def weight_loader(self,
assert param_data.shape == loaded_weight.shape
param_data.copy_(loaded_weight)

# If SparseParameter, repack dense data as sparse.
if isinstance(param, SparseParameter):
param.pack()

class QKVParallelLinear(ColumnParallelLinear):
"""Linear layers for the attention's QKV transformation.
@@ -373,10 +384,14 @@ def __init__(
def weight_loader(self,
param: Parameter,
loaded_weight: torch.Tensor,
loaded_shard_id: Optional[str] = None):
param_data = param.data
loaded_shard_id: Optional[str] = None):
param_data = get_param_data(param)
output_dim = getattr(param, "output_dim", None)
if loaded_shard_id is None:
if isinstance(param, SparseParameter):
raise NotImplementedError(
"Passing loaded_shard_id=None not yet supported for SparseParameter")

# Loaded weight is already packed.
if output_dim is None:
assert param_data.shape == loaded_weight.shape
@@ -440,6 +455,9 @@ def weight_loader(self,
assert param_data.shape == loaded_weight.shape
param_data.copy_(loaded_weight)

# If SparseParameter, repack dense data as sparse.
if isinstance(param, SparseParameter):
param.pack()

class RowParallelLinear(torch.nn.Module):
"""Linear layer with row parallelism.
@@ -522,14 +540,18 @@ def __init__(
def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
tp_rank = get_tensor_model_parallel_rank()
input_dim = getattr(param, "input_dim", None)
param_data = param.data
param_data = get_param_data(param)
if input_dim is not None:
shard_size = param_data.shape[input_dim]
start_idx = tp_rank * shard_size
loaded_weight = loaded_weight.narrow(input_dim, start_idx,
shard_size)
assert param_data.shape == loaded_weight.shape
param_data.copy_(loaded_weight)

# If SparseParameter, repack dense data as sparse.
if isinstance(param, SparseParameter):
param.pack()

def forward(self, input_):
# Set up backprop all-reduce.
9 changes: 9 additions & 0 deletions vllm/model_executor/layers/parameters/__init__.py
@@ -0,0 +1,9 @@
import torch
from vllm.model_executor.layers.parameters.sparsity import SparseParameter

def get_param_data(param: torch.nn.Parameter) -> torch.Tensor:
    """Gets parameter data in dense format."""
    if isinstance(param, SparseParameter):
        return param.get_dense_data()
    else:
        return param.data
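For reference, a minimal sketch of the load-then-pack pattern that the `weight_loader` methods in `linear.py` follow when using this helper (assuming `SparseParameter` exposes `get_dense_data()` and `pack()`, as in this diff; the `load_weight` function below is illustrative only):

```python
import torch
from torch.nn import Parameter

from vllm.model_executor.layers.parameters import SparseParameter, get_param_data


def load_weight(param: Parameter, loaded_weight: torch.Tensor) -> None:
    # Materialize a dense view of the (possibly sparse) parameter.
    param_data = get_param_data(param)
    assert param_data.shape == loaded_weight.shape
    # Copy the dense checkpoint tensor into the dense view.
    param_data.copy_(loaded_weight)
    # Re-compress in place if the parameter is stored sparsely.
    if isinstance(param, SparseParameter):
        param.pack()
```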