
Support LLM Inference (vLLM, OpenAI format interface) #1607

Merged
merged 15 commits into superduper-io:main from feat/llm-base on Jan 2, 2024

Conversation

@jieguangzhou (Collaborator) commented Dec 27, 2023

#1601
#1536

Description

Related Issues

Checklist

  • Is this code covered by new or existing unit tests or integration tests?
  • Did you run make unit-testing and make integration-testing successfully?
  • Do new classes, functions, methods and parameters all have docstrings?
  • Were existing docstrings updated, if necessary?
  • Was external documentation updated, if necessary?

Additional Notes or Comments

@codecov-commenter commented Dec 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (34830a7) 80.33% compared to head (b14a33d) 80.64%.
Report is 1342 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1607      +/-   ##
==========================================
+ Coverage   80.33%   80.64%   +0.30%     
==========================================
  Files          95      115      +20     
  Lines        6602     8154    +1552     
==========================================
+ Hits         5304     6576    +1272     
- Misses       1298     1578     +280     


@jieguangzhou jieguangzhou changed the title Support vLLM api service Support vLLM Dec 27, 2023
@jieguangzhou jieguangzhou force-pushed the feat/llm-base branch 2 times, most recently from 2febcb6 to 1e6da67 on December 27, 2023 11:14
@jieguangzhou jieguangzhou linked an issue Dec 27, 2023 that may be closed by this pull request
@jieguangzhou jieguangzhou force-pushed the feat/llm-base branch 2 times, most recently from 869d170 to a9ce504 on December 28, 2023 08:46
@jieguangzhou jieguangzhou linked an issue Dec 28, 2023 that may be closed by this pull request
@jieguangzhou jieguangzhou force-pushed the feat/llm-base branch 2 times, most recently from be4306f to 12acb6d on December 29, 2023 14:48
@jieguangzhou jieguangzhou marked this pull request as ready for review December 29, 2023 15:27
@jieguangzhou jieguangzhou changed the title Support vLLM Support LLM Inference (vLLM, OpenAI format interface) Dec 29, 2023
@jieguangzhou (Collaborator Author):

A new PR will be created next week to supplement the usage documentation.

@@ -42,12 +42,12 @@ class Listener(Component):
    type_id: t.ClassVar[str] = 'listener'

    def __post_init__(self):
        super().__post_init__()
        if self.identifier is None and self.model is not None:
@blythed (Collaborator) Dec 30, 2023

Why do we need to switch this order (of the super().__post_init__() call)?

@jieguangzhou (Collaborator Author) Jan 2, 2024

I added a check for identifier in the __post_init__ method of Component. So if a subclass wants to handle a None identifier, it has to set the value of identifier first and then call super().__post_init__().
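For illustration, a minimal sketch of the ordering being described; the class and field names below are simplified assumptions, not the exact SuperDuperDB code:

import dataclasses as dc
import typing as t


@dc.dataclass
class Component:
    identifier: t.Optional[str] = None

    def __post_init__(self):
        # The base class now validates that an identifier was provided.
        if self.identifier is None:
            raise ValueError('identifier cannot be None')


@dc.dataclass
class Listener(Component):
    model: t.Optional[str] = None

    def __post_init__(self):
        # Derive a default identifier *before* delegating to the base class,
        # so that the base-class check passes.
        if self.identifier is None and self.model is not None:
            self.identifier = f'listener/{self.model}'
        super().__post_init__()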

from superduperdb.ext.llm.openai import OpenAI
from superduperdb.ext.llm.vllm import VllmAPI, VllmModel, VllmOpenAI

__all__ = [
Collaborator:

A general question: how does this fit together with the other OpenAI models which are not of the chat-completion type?

@jieguangzhou (Collaborator Author) Jan 2, 2024

A chat parameter handles this. If chat=False, it calls _prompt_generate (self.client.completions.create); otherwise it calls _chat_generate (self.client.chat.completions.create).
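As a rough sketch of that dispatch (the wiring below uses the plain openai package and is an assumption for illustration, not the exact code in this PR):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate(prompt: str, model_name: str, chat: bool = True, **kwargs) -> str:
    # chat=True -> chat-completions endpoint; chat=False -> plain completions.
    if chat:
        completion = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        return completion.choices[0].message.content
    completion = client.completions.create(model=model_name, prompt=prompt, **kwargs)
    return completion.choices[0].text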

Collaborator:

No, I mean we have superduperdb.ext.openai.*; what do we do with those models?

Collaborator Author:

Oops, got it. We should continue to maintain those modules as support for the OpenAI vendor.

getLogger("httpx").setLevel(WARNING)


def ensure_initialized(func):
Collaborator:

Maybe we need this as a general-purpose tool (not just for LLMs)?

Collaborator Author:

Yes, but it is only used for LLMs for now, because in the previous design the model was instantiated first and then passed to SuperDuperDB.
We can make it a general-purpose tool as part of this plan: #1604
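For reference, a minimal sketch of what such a lazy-initialization decorator can look like; this is an illustrative assumption, not necessarily the exact implementation in the PR:

import functools


def ensure_initialized(func):
    """Call self.init() once, lazily, before the wrapped method runs."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "_is_initialized", False):
            self.init()  # e.g. load weights or create the API client
            self._is_initialized = True
        return func(self, *args, **kwargs)

    return wrapper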

prompt_template: str = "{input}"
prompt_func: Optional[Callable] = dc.field(default=None)
max_batch_size: Optional[int] = 64
inference_kwargs: dict = dc.field(default_factory=dict)
Collaborator:

What is this for? Is it passed to the native model?

Collaborator Author:

Yes, different frameworks and API providers accept different parameters for their inference/generate functions, although they are mostly the same.

This parameter lets users adapt to them. If they don't need it, they can ignore it and the default parameters of the corresponding framework or API provider will be used.

Currently, there are two ways to pass parameters when generating text. This is one; the other is db.predict("llm", max_tokens=100, **other_inference_kwargs).

It will be explained in the documentation.
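A hedged usage sketch of the two options; the constructor arguments (api_url in particular) and the exact db.predict call shape are illustrative assumptions:

from superduperdb.ext.llm.vllm import VllmAPI

# db: an already-built superduperdb Datalayer

# Option 1: predefine generation parameters on the model itself.
llm = VllmAPI(
    identifier="llm",
    api_url="http://localhost:8000/generate",
    inference_kwargs={"max_tokens": 100, "temperature": 0.2},
)
db.add(llm)

# Option 2: pass or override parameters at call time.
db.predict("llm", "Tell me about SuperDuperDB", max_tokens=100)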

        return [self._generate(prompt, **kwargs) for prompt in prompts]

    @ensure_initialized
    def _predict(self, X: Union[str, List[str]], one: bool = False, **kwargs: Any):
Collaborator:

A point for later: none of the LLM implementations we have seem to support postprocess or preprocess.



@dc.dataclass
class BaseLLMAPI(_BaseLLM):
Collaborator:

Nice!

        )
        return completion.choices[0].message.content

    async def _async_generate(self, semaphore, prompt: str, **kwargs) -> str:
Collaborator:

Maybe now is the time to remove these async methods?

Collaborator Author:

This asynchronous method is part of the underlying implementation and is used for performance optimization when calling API inference in batches. It will not be exposed to users, which is different from the scenario we discussed before. However, if we do not use it, we can use other methods to optimize performance during batch requests; otherwise the requests will block.
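For context, the bounded-concurrency pattern being described looks roughly like this; the sleep stands in for the real async API call and is purely illustrative:

import asyncio


async def _async_generate(semaphore: asyncio.Semaphore, prompt: str) -> str:
    # The semaphore caps how many API requests are in flight at once.
    async with semaphore:
        await asyncio.sleep(0.1)  # stand-in for the real async API call
        return f"response for: {prompt}"


async def batch_generate(prompts, max_concurrency: int = 8):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [_async_generate(semaphore, p) for p in prompts]
    return await asyncio.gather(*tasks)


# results = asyncio.run(batch_generate(["a", "b", "c"]))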

Collaborator:

Ok, gotcha.



@dc.dataclass
class OpenAI(BaseOpenAI):
Collaborator:

The question is how to synchronize this with the other models from OpenAI, i.e. what about ext/openai/model/*?

Collaborator Author:

In the LLM scenario, I suggest guiding users towards the llm module in the documentation. For example, if a user wants to build a RAG application, they would go to the llm directory to find the corresponding LLM interface and to the embedding directory to find the embedding interface, instead of looking under the openai module. The original OpenAI module can coexist with this.
We will still face this situation in the future, and my suggestion is to unify the underlying interfaces, such as llm and embedding, so that calls to OpenAI are reused from the ext.openai module.

I suggest we do this unification after completing the embedding module.

WDYT?

Collaborator:

Ok.

        ]
        return await asyncio.gather(*tasks)

    def build_post_data(self, prompt: str, **kwargs: dict[str, Any]) -> dict[str, Any]:
Collaborator:

Please explain this.

Collaborator Author:

It combines the prompt with the other parameters to build the request data for the vLLM API. It supports both parameters predefined on the model and parameters passed at call time, and finally performs parameter filtering (if unsupported parameters were passed through, such as the context introduced by SuperDuperDB, an error would otherwise occur).
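A sketch of what such a request builder can look like; the function signature and the allowed_keys subset are assumptions for illustration, not the exact code:

def build_post_data(
    prompt: str,
    predefined_kwargs: dict,
    call_kwargs: dict,
    allowed_keys=("max_tokens", "temperature", "top_p", "n", "stop"),
) -> dict:
    """Merge predefined and call-time parameters, then drop unsupported keys."""
    merged = {**predefined_kwargs, **call_kwargs}
    # Filter out anything the vLLM API server would reject, e.g. the context
    # argument that SuperDuperDB can pass along with the prompt.
    filtered = {k: v for k, v in merged.items() if k in allowed_keys}
    return {"prompt": prompt, **filtered}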

        self.on_ray = self.on_ray or bool(self.ray_address)
        super().__post_init__()

    def init(self):
@blythed (Collaborator) Dec 30, 2023

Maybe we should do this at the module level? I.e. the moment the user tries to import our module, we should test for the vLLM installation.

Collaborator Author:

If users only need the API model, they don't need to install vLLM, so we only need to check this when VllmModel is used.
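In other words, the check can stay inside VllmModel's init and never run for the API-only classes. A rough sketch of such a deferred check (the helper name is hypothetical):

def _require_vllm():
    """Import vLLM lazily; only local inference (VllmModel) needs it installed."""
    try:
        import vllm
    except ImportError as e:
        raise ImportError(
            "vLLM is required for local inference; install it with 'pip install vllm'"
        ) from e
    return vllm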

Collaborator:

Ok


        runtime_env = {"pip": ["vllm"]}
        if not ray.is_initialized():
            ray.init(address=self.ray_address, runtime_env=runtime_env)
Collaborator:

What does this ray.init call do under the hood?

Collaborator Author:

It connects to the remote Ray cluster; otherwise it starts a local Ray cluster.
If we do not need to connect to a remote Ray cluster, we can set ray_address = None.
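For reference, the two behaviours look roughly like this (a sketch, not the PR's exact code; the address is a placeholder):

import ray

# Connect to an existing remote cluster, shipping vLLM to the workers.
ray.init(address="ray://head-node:10001", runtime_env={"pip": ["vllm"]})

# With no address, ray.init() starts a throwaway local cluster instead:
# ray.init(runtime_env={"pip": ["vllm"]})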

Collaborator:

This is great. Do you think we should have a single notebook where we can try a range of LLMs in different forms:

  • OpenAI
  • ray LLM
  • API self-hosted LLM with vLLM
  • vLLM native

@jieguangzhou (Collaborator Author) Jan 2, 2024

I think we can write it in the documentation and then link to that part of the documentation from the notebook,
because the variants are almost the same except for the line of code that defines the model.
But if we are writing a blog post, we can separate them.

Collaborator:

Ok

from superduperdb.base.datalayer import Datalayer

# Disable httpx info level logging
getLogger("httpx").setLevel(WARNING)
Collaborator:

Please explain.

Collaborator Author:

The OpenAI package prints INFO-level logs about network requests when calling the interface, which are of little use, so they are disabled here.

Collaborator:

Ok

@blythed (Collaborator) left a comment

@jieguangzhou this was a great pull request! Thank-you.

A few questions and discussion points:

  • What should we do with the existing OpenAI models?
  • Shall we deprecate the async functions now? These may become hard to maintain if we over-invest in them.
  • There are a few "TODOs" in the test-suite - shall we address this now?
  • Is there any trivial/tiny LLM we can use in the integration tests?

@blythed (Collaborator) commented Dec 30, 2023

A new PR will be created next week to supplement the usage documentation.

Ok

@jieguangzhou (Collaborator Author) commented Jan 2, 2024

@jieguangzhou this was a great pull request! Thank-you.

A few questions and discussion points:

  • What should we do with the existing OpenAI models?

For models present in both application scenario modules (LLMs) and interface modules (OpenAI module), it's advised to share a common underlying interface, like the OpenAI module. The interface modules should be kept simple and extensible, while application scenario modules should adapt more to the specific scenario.

Taking OpenAI LLMs as an example:

  1. Remove the context logic in the OpenAI module's OpenAIChatCompletion.
  2. Let the LLM module's OpenAI reuse it, adding only the separate context logic.
    However, this PR does not achieve this yet, as it has two different implementations.

I suggest keeping two interface implementations for now, one for the OpenAI module and one for LLMs, and turning LLMs into a separate plug-in in the future.

  • Shall we deprecate the async functions now? These may become hard to maintain if we over-invest in them.

These async functions are used in the underlying implementation for performance optimization when calling API inference in batches. They will not be exposed to users, which is different from the scenario we discussed before. However, if we do not use them, we can use other methods to optimize batch requests; otherwise, the requests will block.

BTW, we can also use multithreading to do this.
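As a rough illustration of that multithreaded alternative (generate below is only a stand-in for a blocking, per-prompt API call):

from concurrent.futures import ThreadPoolExecutor


def generate(prompt: str) -> str:
    # Stand-in for a blocking, per-prompt API request.
    return f"response for: {prompt}"


def batch_generate(prompts, max_workers: int = 8):
    # Threads overlap the network wait of the individual API calls,
    # which is where most of the time goes for hosted endpoints.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, prompts))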

  • There are a few "TODOs" in the test-suite - shall we address this now?

For the TODO items in the test cases, I suggest I do them before we start supporting a large number of AI models and AI API integrations, to ensure that all models are compatible with SuperDuperDB:

  1. Add models and correctly execute db.predict
  2. Add models as listeners and correctly handle data in the database
  3. Models as listeners can correctly handle CDC data (perhaps this can be combined with 2)
  • Is there any trivial/tiny LLM we can use in the integration tests?

I originally planned to do this and upload a very small model to our Hugging Face repository for the integration tests. However, installing vLLM requires a CUDA environment, so automated testing is currently not possible.

@jieguangzhou jieguangzhou merged commit e36268b into superduper-io:main Jan 2, 2024
2 checks passed
@kartik4949 (Contributor) left a comment

Great PR

raise Exception("You must install vllm with command 'pip install ray'")

runtime_env = {"pip": ["vllm"]}
if not ray.is_initialized():
Contributor:

Can we somehow utilise RayCompute from backends/ray/compute.py?

Collaborator Author:

The situation here is unique because vLLM inherently supports Ray. Therefore, even if SuperDuperDB doesn't use Ray as its computing backend, Ray can still be utilized here. They are separate entities.

Collaborator Author:

However, perhaps when SuperDuperDB uses Ray as its computation backend, it could by default use the same Ray instance, but it must be equipped with a GPU.

**self.get_kwargs(SamplingParams, kwargs, self.inference_kwargs)
)

if self.on_ray:
Contributor:

If not on_ray, where does the compute happen?

Collaborator Author:

The model inference will be done locally.
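For context, local (non-Ray) inference through vLLM looks roughly like this; the model name is only an example:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # requires a CUDA-capable GPU
params = SamplingParams(max_tokens=64, temperature=0.2)
outputs = llm.generate(["Tell me about SuperDuperDB"], params)
print(outputs[0].outputs[0].text)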

    assert len(results) == 2


def check_llm_as_listener_model(db, llm):
Contributor:

Is this more of a test-suite utils method?

Collaborator Author:

Yes, we need to move this to the test-suite utils in the future.


Successfully merging this pull request may close these issues.

  • [LLM] Bring your own "OpenAI" style deployment
  • [LLM] vLLM API integration