Support LLM Inference (vLLM, OpenAI format interface) #1607
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1607      +/-   ##
==========================================
+ Coverage   80.33%   80.64%   +0.30%
==========================================
  Files          95      115      +20
  Lines        6602     8154    +1552
==========================================
+ Hits         5304     6576    +1272
- Misses       1298     1578     +280
==========================================
```

View the full report in Codecov by Sentry.
A new PR will be created next week to supplement the usage documentation.
```
@@ -42,12 +42,12 @@ class Listener(Component):
    type_id: t.ClassVar[str] = 'listener'

    def __post_init__(self):
        super().__post_init__()
        if self.identifier is None and self.model is not None:
```
Why do we need to switch this order (of the `super().__post_init__()` call)?
I added a check for `identifier` in the `__post_init__` method of `Component`. So if any subclass wants to handle a `None` identifier, it needs to fix the value of `identifier` first and then call `super().__post_init__()`.
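A minimal sketch of that pattern (class and field names are simplified here, not the exact superduperdb code):

```python
import dataclasses as dc
import typing as t


@dc.dataclass
class Component:
    identifier: t.Optional[str] = None

    def __post_init__(self):
        # The base class now validates that an identifier is set.
        if self.identifier is None:
            raise ValueError('identifier cannot be None')


@dc.dataclass
class Listener(Component):
    model: t.Optional[str] = None

    def __post_init__(self):
        # Subclasses that can derive a default identifier must do so
        # *before* delegating to the base-class check.
        if self.identifier is None and self.model is not None:
            self.identifier = f'listener/{self.model}'
        super().__post_init__()
```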
```
from superduperdb.ext.llm.openai import OpenAI
from superduperdb.ext.llm.vllm import VllmAPI, VllmModel, VllmOpenAI

__all__ = [
```
A general question: how does this fit together with the other OpenAI models which are not of the chat-completion type?
A `chat` parameter handles this. If `chat=False` is set, it will call `_prompt_generate` (`self.client.completions.create`); otherwise it will call `_chat_generate` (`self.client.chat.completions.create`).
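A hedged sketch of that dispatch using the `openai>=1.x` client (the class name and fields here are illustrative, not the PR's implementation):

```python
import dataclasses as dc
from typing import Any

import openai


@dc.dataclass
class OpenAIStyleLLM:
    model: str = 'gpt-3.5-turbo'
    chat: bool = True

    def __post_init__(self):
        # Reads OPENAI_API_KEY (or a compatible base URL) from the environment.
        self.client = openai.OpenAI()

    def _chat_generate(self, prompt: str, **kwargs: Any) -> str:
        completion = self.client.chat.completions.create(
            model=self.model,
            messages=[{'role': 'user', 'content': prompt}],
            **kwargs,
        )
        return completion.choices[0].message.content

    def _prompt_generate(self, prompt: str, **kwargs: Any) -> str:
        completion = self.client.completions.create(
            model=self.model, prompt=prompt, **kwargs
        )
        return completion.choices[0].text

    def generate(self, prompt: str, **kwargs: Any) -> str:
        # `chat` selects between the chat-completion and plain-completion endpoints.
        generate = self._chat_generate if self.chat else self._prompt_generate
        return generate(prompt, **kwargs)
```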
No, I mean we have `superduperdb.ext.openai.*`; what do we do with those models?
Oops, got it. We should continue to maintain these modules as support for OpenAI as a vendor.
```
getLogger("httpx").setLevel(WARNING)


def ensure_initialized(func):
```
Maybe we need this as a general purpose tool (not just for LLMs)?
Yes, but it is only used for LLMs for now, because in the previous design the model was instantiated first and then passed to SuperDuperDB. We can make it a general-purpose tool as part of this plan: #1604
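For reference, a general-purpose version of the idea might look roughly like this (a hypothetical sketch, not the code in this PR):

```python
import functools


def ensure_initialized(func):
    """Lazily run `self.init()` once before the decorated method executes."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not getattr(self, '_is_initialized', False):
            self.init()  # expensive setup, e.g. loading weights or opening a client
            self._is_initialized = True
        return func(self, *args, **kwargs)

    return wrapper
```

Applied to, say, a `predict` method, the expensive `init()` then only runs on first use rather than at construction time.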
```
    prompt_template: str = "{input}"
    prompt_func: Optional[Callable] = dc.field(default=None)
    max_batch_size: Optional[int] = 64
    inference_kwargs: dict = dc.field(default_factory=dict)
```
What is this for? Passed to native model?
Yes. Different frameworks and API providers expose different parameters on their inference/generate functions. Although they are mostly the same, I left a parameter so users can adapt to them; if they don't use it, they can ignore this parameter and the default parameters of the corresponding framework or API provider will be used.
Currently we can pass inference parameters in two ways: this is one, and the other is through `db.predict("llm", max_tokens=100, **other_inference_kwargs)`.
It will be explained in the documentation.
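A hedged usage sketch of the two ways; `inference_kwargs` appears in the diff above, but the other constructor arguments shown here are assumptions:

```python
from superduperdb.ext.llm.vllm import VllmOpenAI

# 1. Model-level defaults, applied to every generation call:
llm = VllmOpenAI(
    identifier='llm',
    inference_kwargs={'temperature': 0.2, 'max_tokens': 256},
)

# 2. Per-call parameters, passed at prediction time (they extend/override the defaults):
# db.predict('llm', 'Tell me a joke', max_tokens=100)
```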
```
        return [self._generate(prompt, **kwargs) for prompt in prompts]

    @ensure_initialized
    def _predict(self, X: Union[str, List[str]], one: bool = False, **kwargs: Any):
```
A point for later: none of the LLM implementations we have seem to support `postprocess` or `preprocess`.
```
@dc.dataclass
class BaseLLMAPI(_BaseLLM):
```
Nice!
```
        )
        return completion.choices[0].message.content

    async def _async_generate(self, semaphore, prompt: str, **kwargs) -> str:
```
Maybe now is the time to remove these `async` methods?
This asynchronous method is used in the underlying implementation for performance optimization when calling the API for batched inference. It will not be exposed to users, which is different from the scenario we discussed before. However, if we do not use it, we could provide the same optimization for batch requests in other ways; otherwise the requests would block one another.
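Roughly, the batching pattern described here (a simplified sketch, not the PR's exact code):

```python
import asyncio


async def _async_generate(semaphore: asyncio.Semaphore, prompt: str) -> str:
    # The semaphore caps concurrent API calls so a large batch doesn't overwhelm the endpoint.
    async with semaphore:
        await asyncio.sleep(0.1)  # stand-in for the real HTTP request
        return f'completion for: {prompt}'


async def batch_generate(prompts, max_concurrency: int = 8):
    semaphore = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(_async_generate(semaphore, p) for p in prompts))


# results = asyncio.run(batch_generate(['prompt 1', 'prompt 2']))
```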
Ok, gotcha.
```
@dc.dataclass
class OpenAI(BaseOpenAI):
```
Question is how to synchronize this with the other models from OpenAI, i.e. what about `ext/openai/model/*`?
In the LLM scenario, I suggest guiding users in the documentation to use the llm module. For example, if a user wants to build a RAG application, they would go to the llm directory to find the corresponding LLM interface and to the embedding directory to find the embedding interface, instead of looking under the openai module. The original OpenAI module can coexist with this.
We will still face this situation in the future, and my suggestion is to unify the underlying interfaces, such as llm and embedding, so that calls to OpenAI are reused from the `ext.openai` module.
I suggest we do this unification after completing the embedding module.
WDYT?
Ok.
```
        ]
        return await asyncio.gather(*tasks)

    def build_post_data(self, prompt: str, **kwargs: dict[str, Any]) -> dict[str, Any]:
```
Please explain this.
It combines the prompt into the request data for the vLLM API. It supports both the model's predefined parameters and the parameters passed at call time, and finally performs parameter filtering, because passing unnecessary parameters (such as the `context` introduced by SuperDuperDB, or others) would cause an error.
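In other words, something along these lines; a sketch of the filtering idea using `inspect.signature`, not the PR's exact helper:

```python
import inspect
from typing import Any, Callable, Dict


def get_kwargs(target: Callable, *kwargs_dicts: Dict[str, Any]) -> Dict[str, Any]:
    """Merge kwargs dicts (later ones win) and keep only the parameters `target`
    accepts, dropping e.g. the `context` argument that SuperDuperDB may pass along."""
    allowed = set(inspect.signature(target).parameters)
    merged: Dict[str, Any] = {}
    for kwargs in kwargs_dicts:
        merged.update(kwargs)
    return {k: v for k, v in merged.items() if k in allowed}


def build_post_data(prompt: str, defaults: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    # `accepted_params` stands in for the real vLLM API schema (e.g. SamplingParams).
    def accepted_params(n=1, temperature=1.0, max_tokens=16):
        ...

    data = get_kwargs(accepted_params, defaults, kwargs)
    data['prompt'] = prompt
    return data


# build_post_data('hello', {'temperature': 0.2}, max_tokens=100, context=['dropped'])
# -> {'temperature': 0.2, 'max_tokens': 100, 'prompt': 'hello'}
```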
```
        self.on_ray = self.on_ray or bool(self.ray_address)
        super().__post_init__()

    def init(self):
```
Maybe we should do this at the module level? I.e. the moment the user tries to import our module, we should test for the vLLM installation.
If users only need the API model, they don't need to install vLLM, so we only need to check this when `VllmModel` is used.
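A sketch of such a deferred check (the class and field names here are placeholders):

```python
class VllmModelSketch:
    model_name: str = 'facebook/opt-125m'

    def init(self):
        try:
            # Only local inference needs vLLM; API-only users never hit this path.
            from vllm import LLM
        except ImportError as e:
            raise ImportError(
                'vllm is required for local inference: pip install vllm'
            ) from e
        self.llm = LLM(model=self.model_name)
```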
Ok
```
        runtime_env = {"pip": ["vllm"]}
        if not ray.is_initialized():
            ray.init(address=self.ray_address, runtime_env=runtime_env)
```
What does this `ray.init` call do under the hood?
It connects to the remote Ray cluster; if there is none, it will start a local Ray cluster.
If we do not need to connect to a remote Ray cluster, we can set `ray_address=None`.
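For example, roughly (assuming the fields shown in the diff above):

```python
import ray

ray_address = None                 # or e.g. 'ray://head-node:10001' to join a remote cluster
runtime_env = {"pip": ["vllm"]}    # ask Ray workers to have vllm available

if not ray.is_initialized():
    # With an address this connects to the existing cluster;
    # with address=None it starts a local Ray instance instead.
    ray.init(address=ray_address, runtime_env=runtime_env)
```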
This is great. Do you think we should add a single notebook where we can try a range of LLMs in different forms:
- OpenAI
- Ray LLM
- API self-hosted LLM with vLLM
- vLLM native
I think we can write it in the documentation and then add a link to that part of the documentation in the notebook, because the variants are almost the same except for the line of code that defines the model.
But if we write a blog post, we can separate them.
Ok
```
from superduperdb.base.datalayer import Datalayer

# Disable httpx info level logging
getLogger("httpx").setLevel(WARNING)
```
Please explain.
The OpenAI package prints INFO-level logs about network requests whenever the interface is called, which is of little significance, so it is disabled here.
Ok
@jieguangzhou this was a great pull request! Thank you.
A few questions and discussion points:
- What should we do with the existing OpenAI models?
- Shall we deprecate the `async` functions now? These may become hard to maintain if we over-invest in them.
- There are a few "TODOs" in the test suite; shall we address those now?
- Is there any trivial/tiny LLM we can use in the integration tests?
Ok.
- For models present in both application-scenario modules (LLMs) and interface modules (the OpenAI module), it's advised to share a common underlying interface, like the OpenAI module. The interface modules should be kept simple and extensible, while application-scenario modules should adapt more to the specific scenario. Taking OpenAI LLMs as an example, I suggest keeping two interface implementations for now, one for the OpenAI module and one for LLMs, and turning LLMs into a separate plug-in in the future.
- On the `async` functions: by the way, we can also use multi-threading to do this (see the sketch below).
- For the TODOs in the test cases, I suggest I do that before we start supporting a large number of AI models and AI API integrations.
- I originally planned to do this and upload a very small model to our Hugging Face repository for the integration tests. However, vLLM installation requires a CUDA environment, so automated testing is not currently possible.
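A hedged sketch of the multi-threaded alternative mentioned above, where a thread pool fans out blocking per-prompt calls (`generate` here is a placeholder for the real request):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def generate(prompt: str) -> str:
    return f'completion for: {prompt}'  # placeholder for a blocking HTTP call


def batch_generate(prompts: List[str], max_workers: int = 8) -> List[str]:
    # Threads overlap the network I/O without requiring any async plumbing.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(generate, prompts))
```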
Great PR
```
            raise Exception("You must install vllm with command 'pip install ray'")

        runtime_env = {"pip": ["vllm"]}
        if not ray.is_initialized():
```
Can we somehow utilise `RayCompute` from `backends/ray/compute.py`?
The situation here is unique because vLLM inherently supports Ray. Therefore, even if SuperDuperDB doesn't use Ray as its compute backend, Ray can still be utilized here; the two are separate.
However, perhaps when SuperDuperDB uses Ray as its computation backend, it could by default use the same Ray instance, but it must be equipped with a GPU.
```
            **self.get_kwargs(SamplingParams, kwargs, self.inference_kwargs)
        )

        if self.on_ray:
```
If not `on_ray`, where does the compute happen?
The model inference will be done locally.
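To make the branch explicit, a very rough sketch (assumed structure only, not the PR's implementation):

```python
import dataclasses as dc


@dc.dataclass
class VllmDispatchSketch:
    on_ray: bool = False

    def _generate_locally(self, prompt: str) -> str:
        return f'local completion for: {prompt}'   # placeholder for the in-process vLLM engine

    def _generate_on_ray(self, prompt: str) -> str:
        return f'remote completion for: {prompt}'  # placeholder for a call dispatched to a Ray actor

    def generate(self, prompt: str) -> str:
        # Without `on_ray`, inference simply runs in the local process.
        return self._generate_on_ray(prompt) if self.on_ray else self._generate_locally(prompt)
```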
```
    assert len(results) == 2


def check_llm_as_listener_model(db, llm):
```
Is this more of a test-suite utils method?
Yes, we need to move this to the test-suite utils in the future.
#1601
#1536
Description

Related Issues

Checklist

- Ran `make unit-testing` and `make integration-testing` successfully?

Additional Notes or Comments