Remove the triton inference server backend "turbomind_backend" #1986
Conversation
May we upgrade the base image to r24.03 in order to avoid the memory leak issue in the Python backend that affects releases earlier than r23.10? This would also address the issue mentioned in this link. @lvhan028 @irexyc @zhulinJulia24 ref
Hi, @zhyncs
OK. I'll change the default base image to r24.03.
Hi @lvhan028 There are some triton/chatbot references left:

- lmdeploy/lmdeploy/serve/client.py, line 37 in 49208aa: `chatbot = Chatbot(tritonserver_addr,`
- `from lmdeploy.serve.turbomind.chatbot import Chatbot`
- lmdeploy/lmdeploy/serve/gradio/app.py, line 50 in 49208aa: `from lmdeploy.serve.gradio.triton_server_backend import \`
- lmdeploy/lmdeploy/cli/serve.py, line 340 in 49208aa: `def triton_client(args):`
Yes. This PR is still under development. Once it is done, I'll remove the WIP label.
@ispobock I've removed them as guided. Please take a look.
@zhyncs I think I'd better open another PR to update the Dockerfile.
ok
Overall LGTM, and I'll verify this on my local dev environment. The size of the wheel will be greatly reduced, and the compilation speed is expected to be much faster as well. Great work!
LGTM
nit: We no longer need to install rapidjson-dev, since it was only required as a dependency of Triton. Perhaps we could also consider updating the guide for building from source at https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md.
`src/turbomind/triton_backend/triton_utils.hpp` seems to be unnecessary now and can be deleted, and the corresponding include in `src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc` needs to be updated.
The `.github/scripts/test_triton_server.py` in the auto tests may also be deleted, and it is worth checking whether the `triton_client` in `autotest/utils/run_client_chat.py` needs to be removed at the same time. cc @zhulinJulia24
set(TRITON_PYTORCH_INCLUDE_PATHS "" CACHE PATH "Paths to Torch includes")
set(TRITON_PYTORCH_LIB_PATHS "" CACHE PATH "Paths to Torch libraries")

set(TRITON_BACKEND_REPO_TAG "r22.12" CACHE STRING "Tag for triton-inference-server/backend repo")
Since we have removed the dependency on triton here, there is no longer any concern about the version or tag of triton. In this case, do we still need to create a new Dockerfile on top of the r24.03 image? ref #1986 (comment)
Thanks. I finished it as suggested.
The include in `src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc`
LGTM
trust_remote_code (bool): Whether or not to allow for custom models
    defined on the Hub in their own modeling files. Defaults to False
We are using this argument. In my test, I had to add `--trust-remote-code` to the `lmdeploy convert` command when converting local models.
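For context, here is a minimal sketch of why the flag matters when converting local models, assuming the conversion path loads the model's configuration and tokenizer through `transformers` (the actual call sites in lmdeploy may differ, and the checkpoint path below is hypothetical):

```python
from transformers import AutoConfig, AutoTokenizer

# Hypothetical local checkpoint path, used only for illustration.
# Models that ship their own modeling/configuration .py files can only
# be loaded when trust_remote_code=True; otherwise transformers refuses
# to run the custom code and raises an error.
model_path = "/path/to/local/internlm2-chat-1_8b"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```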
@irexyc previously suggested removing it.
All right, I can remove this argument.
Simply set `trust-remote-code` to true by default.
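As a rough illustration of that suggestion (not the actual lmdeploy CLI code), an argparse option can default to trusting remote code while still offering an explicit opt-out:

```python
import argparse

parser = argparse.ArgumentParser(description="toy convert-like command")
# Default to True so that local models with custom modeling files convert
# without an extra flag; BooleanOptionalAction (Python 3.9+) also generates
# --no-trust-remote-code as the opt-out.
parser.add_argument(
    "--trust-remote-code",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Allow loading custom modeling code shipped with the model.")

args = parser.parse_args([])
print(args.trust_remote_code)  # True when the flag is omitted
```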
Are downstream repos still using the triton server? Shall we notify them?
As far as I know, the internal downstream projects have switched to
Tested converting internlm2-chat-1_8b OK.
Thanks for your contribution; we appreciate it a lot. Following these instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some items, don't worry; just create the pull request and ask the maintainers for help.
Motivation
We plan to remove the triton inference server backend "turbomind_backend" because the python_backend (#1329) outperforms the turbomind_backend integration.
BC-breaking (Optional)