[Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734
Conversation
Force-pushed from 4e93d2f to 9625d62
An overview of tasks derived from the pull request discussions:

Immediate changes
- core/
- worker/ and executor/
- models/
- Separate model_runner and embedding_model_runner

Further evaluation
- HF embeddings: adhering to the same design as HF's would require further discussion and evaluation of the points mentioned above.
Force-pushed from e935ca6 to c0e21a1
Force-pushed from b897f77 to 0b352b8
Preliminary review, will continue to add comments as I continue reading through this PR. Thanks a lot for all the hard work so far!!!
```
@@ -253,9 +253,14 @@ def __init__(
            self.scheduler_config.max_model_len,
            self.scheduler_config.max_num_batched_tokens)

        version = "v1"
```
very nit: Would it make sense to clarify/rename the version naming scheme? v3 implies an improved version over v1 and/or v2 when in reality it represents a placeholder when we do not need it. This is very nit so feel free to ignore.
"dimensions is currently not supported") | ||
|
||
model_name = request.model | ||
request_id = f"cmpl-{random_uuid()}" |
Should we be prefixing the request_id with something other than `cmpl`? I noticed both chat and completions use this prefix for the request_id, so I'm unsure if we should keep it here as well.
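For illustration only, an embedding-specific prefix could look like the line below (`embd-` is a hypothetical choice, not something settled in this PR; `random_uuid` comes from `vllm.utils`):

```python
from vllm.utils import random_uuid

# Hypothetical: tag embedding requests distinctly instead of reusing "cmpl-".
request_id = f"embd-{random_uuid()}"
```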
Thank you very much for your contribution @CatherineSue!
I took a pass and left some comments and suggestions. Overall the design and implementation are pretty straightforward to me, and thank you for creating the no-op block manager to separate the logic.
Some thoughts I would like to throw out there:
- The current implementation doesn't allow an engine to do generation and embedding at the same time (since `embedding_mode` is passed when initializing the engine), so I wonder if it's actually worth the effort to create a separate `EmbeddingEngine`, since a lot of the logic in `LLMEngine` is completely unneeded, and this separation could also make a lot of higher-level APIs cleaner.
- If the end goal is indeed to have the engine support both at the same time, then we should think about a clean way to support it, as well as supporting other vanilla embedding models.
Happy to discuss and hear your thoughts on this!
vllm/core/block_manager_v3.py
Outdated
```python
class BlockSpaceManagerV3(BlockSpaceManager):
    """A simple version of BlockSpaceManager for use in environments
    where block management is not required.

    This class provides the same interface as BlockSpaceManager, but its
    methods perform no actions or return simple values like True in specific
    actions. It's designed to be used in scenarios where the overhead of
    block management is unnecessary, such as in an embedding environment.
    """
```
The naming here is a bit misleading, since this is essentially a dummy block manager rather than a "v3".
I agree. I had quite a debate on the naming. Is DummyBlockSpaceManager better, or SimpleBlockSpaceManager? A few functions here always return True, such as `can_allocate` and `can_swap_out`, so it doesn't feel that dummy.
Additional thought: if there is a plan to make the Scheduler more extensible, maybe in the future we can simplify the scheduling logic for embedding and thus further slim down this class as well.
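For readers following along, here is a minimal sketch of what such an always-succeeding manager looks like conceptually. It is illustrative only: the method names mirror the `BlockSpaceManager` interface, and the real class may return an `AllocStatus` enum rather than plain booleans.

```python
class NoOpBlockSpaceManager:
    """Illustrative no-op manager: every operation "succeeds" without
    touching block tables, since embedding-only runs keep no KV cache."""

    def can_allocate(self, seq_group) -> bool:
        return True  # always report room; nothing is actually reserved

    def allocate(self, seq_group) -> None:
        pass  # no block tables to build

    def can_append_slots(self, seq_group) -> bool:
        return True

    def append_slots(self, seq) -> dict:
        return {}  # no copy-on-write block mappings to report

    def can_swap_out(self, seq_group) -> bool:
        return True

    def free(self, seq) -> None:
        pass  # nothing to release

    def get_block_table(self, seq) -> list:
        return []  # callers see an empty block table
```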
We could call it `EmbeddingModelBlockSpaceManager` to signify that it is used for embedding models.
The only negative would be if some other type of model wanted to use this in the future.
@ywang96 thanks for the initial review! I have applied the suggestions and resolved some comments.
I agree that having a separate EmbeddingEngine has a lot of benefits. Besides making the design cleaner, it would make it easier to add more embedding functionality and support more embedding models.
@ywang96 A question related to design: intfloat/e5-mistral-7b-instruct requires an eos_token_id appended at the end of each input. See:
I checked a few embedding models, and this seems specific to this model. Since input_ids are processed in ModelRunner._prepare_prompt, is there a good place to inject eos_token_id at the end of each input?
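For illustration, appending the EOS token during prompt preparation could look roughly like the sketch below; the function and variable names are placeholders, not the actual code in `ModelRunner._prepare_prompt`.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")

def prepare_embedding_input(text: str) -> list[int]:
    # The Llama-style tokenizer adds BOS but not EOS by default...
    prompt_token_ids = tokenizer(text, add_special_tokens=True)["input_ids"]
    # ...so explicitly append eos_token_id, since this model's pooling
    # expects the last token of every input to be EOS.
    if prompt_token_ids[-1] != tokenizer.eos_token_id:
        prompt_token_ids.append(tokenizer.eos_token_id)
    return prompt_token_ids
```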
Thanks for the great work on this.
@ywang96 and I are syncing this morning on this topic.
Will follow up with a detailed plan to get this merged.
cc @simon-mo for the plan here

@CatherineSue This is in good shape. We are very close to being ready to merge. @ywang96 and I discussed. There are a few things we want to do for the final implementation, but we want to take the approach of landing something close to the current version and then following up with incremental work. Here's what we want to do to get ready for merge:

Needed For Merge

Tests
This is a big feature that needs end-to-end testing. I would suggest that we focus on the following areas:

Clean up
Follow Ups Post Merge
After we merge this initial implementation, we can refactor in the following way: Replace
@robertgshaw2-neuralmagic thanks for the feedback. Working on resolving the comments and going through the checklist. Will update it soon.
Thanks @CatherineSue. Apologies for the delay in getting this reviewed, and thank you so much for your contribution :)
@robertgshaw2-neuralmagic I resolved all the comments. Here's an overview of the tasks checked in the new commits:
@CatherineSue - awesome! Are you planning to resolve the merge conflicts? Should I review now?
@robertgshaw2-neuralmagic
That works, I'll review tomorrow.
Ping me when ready.
I can take a first pass too whenever it's ready, if @robertgshaw2-neuralmagic doesn't get there before me :)
Thank you so much for your efforts. This is ready to go. Just letting the CI run, and then we'll merge.
@robertgshaw2-neuralmagic thank you so much for all the time and help on reviews and CI!!
@CatherineSue Thanks for the incredible work! Is there a speed comparison between vLLM and hf/sentence-transformers in terms of inference speed?
Are there any updates to the project docs on how to actually use this feature?
Are there any updates to the project docs on how to actually use this feature?
@linuxliker @K-Mistele this feature is ready to use. You can follow examples/openai_embedding_client.py to use it.
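For anyone landing here later, a minimal client-side sketch is below. It assumes a vLLM OpenAI-compatible server is already running with intfloat/e5-mistral-7b-instruct at http://localhost:8000/v1 (launch flags may vary by version); see examples/openai_embedding_client.py for the maintained example.

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
# The api_key is required by the client but ignored by a default vLLM setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["Hello my name is", "vLLM now serves embedding models too"],
)

for item in response.data:
    # Print the index and dimensionality of each returned embedding vector.
    print(item.index, len(item.embedding))
```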
Thanks for the effort @CatherineSue 🙏 There is no model name configured in the example. I skimmed the code, but unfortunately couldn't figure out how to serve an embedding model and a regular model at the same time. Thanks in advance :)
@yannickschuchmann thanks for trying this out. You would need to run two vLLM instances; running one vLLM instance with two models is not supported for now.
If a model's tokenizer has
Key Features of This Pull Request
This PR adds the e5-mistral-7b-instruct model and enables end-to-end (E2E) embedding vector generation.
There are a few open PRs that add support for embedding models. Our PR uniquely addresses the following key issues:
Integration of the OpenAI Server Front End
This PR includes comprehensive end-to-end (E2E) embedding functionality, spanning from the OpenAI Front End to the Back End.
Turn off KV cache with embedding models
This PR introduces the capability to turn off KV cache when operating in embedding mode, which includes the block_tables and cache_engine.
High Level Design
The embedding model can essentially be considered a special type of generative model with max_tokens=1 from an inference perspective. Both embedding and generative models (with max_tokens=1) require a single feedforward calculation without the need to generate any subsequent new tokens. The primary differences are:
The embedding model returns the hidden state, while the generative model takes an extra step to sample and return the first output token. In this sense, the embedding model is a subset of the generative model in terms of the calculations performed during the single feedforward operation on GPUs. They differ in their API specifications concerning request and output formats.
As a consequence, the serving of embedding models is:
Therefore, in this PR, our current implementation focuses on bypassing vLLM's components for subsequent token generation, such as the KV cache and CUDA Graph, to avoid unnecessary GPU memory usage, which in turn helps optimize performance for embedding serving. If these technologies (e.g., CUDA Graph) are enhanced to improve prompt processing, they can be directly applied to embedding models. At a high level, our PR is independent of crucial features such as tensor parallelism, quantization, etc., which are addressed in Milestone 1 below. We have tested it with tensor parallelism > 1 and it functions effectively. We are willing to work together to test/improve these crucial features where needed.
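To make the hidden-state point above concrete, here is a rough sketch of how an embedding can be pooled from the final hidden states (last-token pooling plus L2 normalization, the scheme typically used with e5-mistral-style models; the tensor names are illustrative, not vLLM internals):

```python
import torch
import torch.nn.functional as F

def pool_last_token(hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]."""
    # Position of the last non-padding token in each sequence.
    last_token_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    # Take the hidden state at that position instead of sampling a next token.
    embeddings = hidden_states[batch_idx, last_token_idx]
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(embeddings, p=2, dim=-1)
```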
Benchmarking Results
On 1 H100 GPU, vLLM's serving of the E5 embedding model reaches a max throughput of up to 36.6k tokens/s and remains consistent across sequence lengths from 128 to 32k (when gpu_memory_utilization is 0.8). On 1 A100 GPU, it reaches up to 15.3k tokens/s and also remains consistent across sequence lengths. As a comparison, in a test of the latest ONNX, the speed is up to 12.6k tokens/s when the sequence length is low (256 tokens), drops to 8k tokens/s when the sequence length is 2k, and shows a general trend of getting worse as the sequence length grows.
See the figure for more details of our testing results on H100 and A100
Note: The throughput is measured as "number of sequences in a batch" * "sequence length" / "end-to-end latency".
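As a worked example of that formula (with illustrative numbers, not taken from the runs above): a batch of 64 sequences of 512 tokens each that completes in 0.9 s gives 64 * 512 / 0.9 ≈ 36.4k tokens/s.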
Design and Implementation Details
Add Embedding generation
Add Embedding API to `entrypoints/openai`
- Add `EmbeddingRequest` and `EmbeddingResponse` to protocol.py (see the sketch after this list)
- Add `serving_embedding.py`
- Add `OpenAIServingEmbedding` to api_server
- Make `llm.py` work with embedding

Add Embedding in outputs and sequence
- Refactor `RequestOutput` and `SequenceGroupOutput` into `Completion*Output` and `Embedding*Output` variants
- Add `EmbeddingOutput`, `EmbeddingRequestOutput`, `RequestOutputFactory`, and `EmbeddingSequenceGroupOutput` to support processing the embedding output sequence.
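As a rough sketch of the request/response shapes (field names follow the public OpenAI embeddings API; the actual definitions in protocol.py may differ, e.g., by including usage accounting):

```python
from typing import List, Optional, Union
from pydantic import BaseModel

class EmbeddingRequest(BaseModel):
    model: str
    input: Union[str, List[str]]      # a single prompt or a batch of prompts
    encoding_format: str = "float"    # OpenAI also allows "base64"
    user: Optional[str] = None

class EmbeddingData(BaseModel):
    object: str = "embedding"
    embedding: List[float]            # the pooled vector for one input
    index: int

class EmbeddingResponse(BaseModel):
    object: str = "list"
    data: List[EmbeddingData]
    model: str
```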
Add MistralModel and embedding generation in llama.py
- Add `llama.embedding()` and `load_weights` in LlamaModel to support forward and embedding vector generation when `embedding_mode` is True in model_runner
- Update `load_weights` in `llama.py` to support embedding models

Disable KV Cache for Embedding serving
Skip `slot_mapping` with embedding mode
`slot_mapping` is only used in model_executor/layers/attention.py when kv_cache is not None. In embedding mode, we pass a None kv_cache, so there is no need to process `slot_mapping`.
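For reference, the shape of that guard is roughly as follows; this is a heavily simplified stand-in for the attention path, not the actual vLLM kernel code:

```python
import torch

def attention_forward(query, key, value, kv_cache=None, slot_mapping=None):
    # Embedding mode passes kv_cache=None, so slot_mapping is never consulted
    # and nothing is written into a cache.
    if kv_cache is not None:
        # Simplified cache write: scatter K/V rows into their assigned slots.
        kv_cache[0][slot_mapping] = key
        kv_cache[1][slot_mapping] = value
    scale = query.shape[-1] ** -0.5
    attn = torch.softmax(query @ key.transpose(-1, -2) * scale, dim=-1)
    return attn @ value
```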
Turn off KV cache for embedding mode
The goal is to disable the block_table and cache_engine completely, so we don't consider allocating blocks for the KV cache in embedding mode.
- Add `embedding_mode` to `ModelConfig` and `SchedulerConfig`
- Add `BlockSpaceManagerProxy` to control the block management behavior for embedding mode (a rough sketch of this dispatch follows below)
Update parameters for max batching
- Add `profile_max_batched_tokens_for_embedding` to profile the max_num_batched_tokens for embedding server mode (see the profiling sketch below)
- Update `ray_gpu_executor.py` to support tensor parallelism > 1
Notes

Overlaps with other PRs
This PR overlaps with the following Milestones in Supporting embedding models #3187

Milestone 1: Basic functionality using Hugging Face models [Partially completed]
Note: Instead of using `LLM.encode()`, this PR currently adds `embedding()` to `LlamaModel`, and then keeps `LLM.generate()` for embedding.

G) Update parameters for max batching [Completed]
We introduced `profile_max_batched_tokens_for_embedding` in `gpu_executor.py` to support the maximum number of tokens the GPU can take in one batch.

Milestone 2: Implement the model properly [Completed]
This PR focuses on adding the e5-mistral-7b-instruct model, which can utilize llama.py, so it already uses the vLLM layer primitives.

Milestone 4: Add the OpenAI Server Front End [Completed]
This PR has implemented the Embedding API in `entrypoints/openai`.
Discrepancies with other discussions
This PR didn't implement the following from the discussion in Supporting embedding models #3187:

Move finish sequence logic to check_stop
- Currently, the logic is in `llm_engine._process_model_output()`
- The logic should be in `llm_engine._check_stop`

Automatically detect that the model is an embedding model
- The user should not have to specify that it is an embedding model
- Somewhere in the vllm code, create a registry that selects which models are embedding and which models are decoders

F) Update: Pass `EmbeddingParams` around instead of `SamplingParams` (note: this is going to require a lot of work)
Note: This PR passes `SamplingParams()` to the LLMEngine and disables its use in embedding mode. As separating `EmbeddingParams` and `SamplingParams` requires changes to the UX, it would be easier to discuss and review in a follow-up PR.

Discussion Points and Considerations
Simplifying the Workflow
- Evaluate whether to simplify the current embedding workflow (e.g., `model_runner.py`) or integrate embedding as a subset of generation needs.

Soundness of Profiling
- Whether `profile_max_batched_tokens_for_embedding` is sufficient to support the maximum number of tokens in one batch without causing CUDA OOM.