
Remove the triton inference server backend "turbomind_backend" #1986

Merged
merged 15 commits on Jul 17, 2024

Conversation

lvhan028
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry; just make the pull request and seek help from the maintainers.

Motivation

We plan to remove the Triton Inference Server backend "turbomind_backend" because the python_backend integration (#1329) outperforms it.

BC-breaking (Optional)

  • Test cases should be updated @zhulinJulia24. Please remove the test cases related to the Triton Inference Server with turbomind_backend.

@lvhan028 lvhan028 added the WIP label Jul 10, 2024
@zhyncs
Collaborator

zhyncs commented Jul 10, 2024

May we upgrade the image to r24.03 to avoid the memory leak issue in Python Backend versions earlier than r23.10? This would also address the issue mentioned in the link below. @lvhan028 @irexyc @zhulinJulia24

ref

https://github.com/InternLM/lmdeploy/tree/main/lmdeploy/serve/turbomind/triton_python_backend#step-2-run-the-triton-server

#1363 (comment)

#1371
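For reference, the python_backend route serves a model.py through Triton's Python backend instead of the C++ turbomind_backend. Below is a minimal sketch of such a model.py; the pb_utils calls follow Triton's python_backend API, while the tensor names and the turbomind wiring are only illustrative, not the actual lmdeploy integration.

# model.py -- minimal Triton python_backend sketch (illustrative only)
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):
        # args['model_config'] is a JSON string describing the model's inputs/outputs.
        self.model_config = json.loads(args['model_config'])
        # A real integration would create the turbomind engine here.

    def execute(self, requests):
        responses = []
        for request in requests:
            # 'prompt' and 'text_output' are assumed tensor names for this sketch.
            prompt = pb_utils.get_input_tensor_by_name(request, 'prompt').as_numpy()
            # A real integration would run turbomind inference on `prompt`; here we just echo it back.
            out = pb_utils.Tensor('text_output', prompt.astype(np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Release engine resources here in a real integration.
        pass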

@lvhan028
Collaborator Author

Hi, @zhyncs
What's the CUDA version in r24.03?

@zhyncs
Collaborator

zhyncs commented Jul 11, 2024

@lvhan028
Collaborator Author

OK. I'll change the default base image to nvcr.io/nvidia/tritonserver:24.03-py3.
When we release the docker image, both versions (24.03-py3 and 22.12-py3) will be built.

Contributor

@ispobock ispobock left a comment


Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@lvhan028
Collaborator Author

Yes. This PR is still under development. Once it is done, I'll remove the WIP label.

@lvhan028 lvhan028 added improvement and removed WIP labels Jul 16, 2024
@lvhan028 lvhan028 requested a review from zhulinJulia24 July 16, 2024 15:58
@lvhan028
Collaborator Author

Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@ispobock I've removed them as suggested. Please take a look.

@lvhan028
Collaborator Author

lvhan028 commented Jul 16, 2024

@zhyncs I think I'd better open another PR to update the Dockerfile.

@zhyncs
Collaborator

zhyncs commented Jul 16, 2024

@zhyncs I think I'd better open another PR to update the Dockerfile.

ok

Collaborator

@zhyncs zhyncs left a comment


Overall LGTM, and I'll verify this in my local dev environment. The size of the wheel will be greatly reduced, and the compilation speed is also expected to be much faster. Great work!

@ispobock
Contributor

Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@ispobock I've removed them as suggested. Please take a look.

LGTM

Collaborator

@zhyncs zhyncs left a comment


nit: We no longer need to install rapidjson-dev, since it was only needed as a dependency of Triton.
Perhaps we could also consider updating the guide for building from source at https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md.

Collaborator

@zhyncs zhyncs left a comment


src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Collaborator

@zhyncs zhyncs left a comment


The .github/scripts/test_triton_server.py in the auto test may also be deleted, and it's worth checking whether the triton_client in autotest/utils/run_client_chat.py should be removed at the same time. cc @zhulinJulia24

set(TRITON_PYTORCH_INCLUDE_PATHS "" CACHE PATH "Paths to Torch includes")
set(TRITON_PYTORCH_LIB_PATHS "" CACHE PATH "Paths to Torch libraries")

set(TRITON_BACKEND_REPO_TAG "r22.12" CACHE STRING "Tag for triton-inference-server/backend repo")
Collaborator


Since we have removed the dependency on Triton here, there is no longer any concern about its version or tag. In this case, do we still need to create a new Dockerfile on top of the r24.03 image? ref #1986 (comment)

@lvhan028
Collaborator Author

src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Thanks. I finished it as suggested.

@zhyncs
Collaborator

zhyncs commented Jul 17, 2024

src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Thanks. I finished it as suggested.

The include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc still needs to be updated. ref

#include "src/turbomind/triton_backend/triton_utils.hpp"

Collaborator

@zhyncs zhyncs left a comment


LGTM

Comment on lines -229 to -230
trust_remote_code (bool): Whether or not to allow for custom models
defined on the Hub in their own modeling files. Defaults to False
Collaborator


We are using this argument. In my test, I had to add --trust-remote-code to the lmdeploy convert command when converting local models.
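For context, this flag ultimately controls whether custom modeling code from the Hub may be executed when a model is loaded. A minimal sketch of how it is typically forwarded with the standard transformers API (illustrative, not the exact lmdeploy code path; the local path is hypothetical):

from transformers import AutoConfig, AutoTokenizer

model_path = './internlm2-chat-1_8b'  # hypothetical local model directory
# Without trust_remote_code=True, models whose modeling code lives on the Hub fail to load.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)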

Collaborator Author


@irexyc used to suggest removing it.
All right, I can remove this argument.

Collaborator


Simply set trust-remote-code to true by default.
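If we default it to true, one option is to keep an opt-out switch. A sketch using argparse's BooleanOptionalAction (Python 3.9+); the parser name is illustrative and this is not the actual lmdeploy CLI code:

import argparse

parser = argparse.ArgumentParser('lmdeploy convert')  # name is illustrative
# Enabled by default; users can still opt out with --no-trust-remote-code.
parser.add_argument('--trust-remote-code',
                    action=argparse.BooleanOptionalAction,
                    default=True,
                    help='Allow custom models defined on the Hub in their own modeling files')
args = parser.parse_args()
print(args.trust_remote_code)  # True unless --no-trust-remote-code is given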

Collaborator

@AllentDan AllentDan left a comment


Are downstream repos still using the Triton server? Shall we notify them?

@lvhan028
Collaborator Author

Are downstream repos still using the Triton server? Shall we notify them?

As far as I know, the internal downstream projects have switched to the api_server.

@lvhan028 lvhan028 requested a review from irexyc July 17, 2024 10:52
Collaborator

@AllentDan AllentDan left a comment


Tested converting internlm2-chat-1_8b OK.
