
Remove the triton inference server backend "turbomind_backend" #1986

Merged
merged 15 commits on Jul 17, 2024

Conversation

lvhan028
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry; just make the pull request and seek help from the maintainers.

Motivation

We plan to remove the Triton Inference Server backend "turbomind_backend" because the python_backend integration (#1329) outperforms it.

BC-breaking (Optional)

  • Test cases should be updated @zhulinJulia24. Please remove the test cases related to the Triton Inference Server with turbomind_backend.

@lvhan028 lvhan028 added the WIP label Jul 10, 2024
@zhyncs
Collaborator

zhyncs commented Jul 10, 2024

May we upgrade the image to r24.03 to avoid the memory leak issue in Python Backend versions earlier than r23.10? This would also address the issue mentioned in the link below. @lvhan028 @irexyc @zhulinJulia24

ref

https://github.com/InternLM/lmdeploy/tree/main/lmdeploy/serve/turbomind/triton_python_backend#step-2-run-the-triton-server

#1363 (comment)

#1371
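For reference, the python_backend route serves a model.py through Triton's Python backend instead of the C++ turbomind_backend. Below is a minimal sketch of such a model.py; the pb_utils calls follow Triton's python_backend API, while the tensor names and the turbomind wiring are only illustrative, not the actual lmdeploy integration.

# model.py -- minimal Triton python_backend sketch (illustrative only)
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):
        # args['model_config'] is a JSON string describing the model's inputs/outputs.
        self.model_config = json.loads(args['model_config'])
        # A real integration would create the turbomind engine here.

    def execute(self, requests):
        responses = []
        for request in requests:
            # 'prompt' and 'text_output' are assumed tensor names for this sketch.
            prompt = pb_utils.get_input_tensor_by_name(request, 'prompt').as_numpy()
            # A real integration would run turbomind inference on `prompt`; here we just echo it back.
            out = pb_utils.Tensor('text_output', prompt.astype(np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Release engine resources here in a real integration.
        pass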

@lvhan028
Collaborator Author

Hi, @zhyncs
What's the CUDA version in r24.03?

@zhyncs
Collaborator

zhyncs commented Jul 11, 2024

@lvhan028
Collaborator Author

OK. I'll change the default base image to nvcr.io/nvidia/tritonserver:24.03-py3.
When we release the docker image, both versions (24.03-py3 and 22.12-py3) will be built.

Contributor

@ispobock ispobock left a comment


Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@lvhan028
Collaborator Author

Yes. This PR is still under development. Once it is done, I'll remove the WIP label.

@lvhan028 lvhan028 added improvement and removed WIP labels Jul 16, 2024
@lvhan028 lvhan028 requested a review from zhulinJulia24 July 16, 2024 15:58
@lvhan028
Collaborator Author

Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@ispobock I've removed them as suggested. Please take a look.

@lvhan028
Collaborator Author

lvhan028 commented Jul 16, 2024

@zhyncs I think I'd better open another PR to update the Dockerfile.

@zhyncs
Collaborator

zhyncs commented Jul 16, 2024

@zhyncs I think I'd better open another PR to update the Dockerfile.

ok

Collaborator

@zhyncs zhyncs left a comment


Overall LGTM, and I'll verify this in my local dev environment. The size of the wheel will be greatly reduced, and the compilation speed is also expected to be much faster. Great work!

@ispobock
Contributor

Hi @lvhan028, there are some triton/chatbot references left:

chatbot = Chatbot(tritonserver_addr,

from lmdeploy.serve.turbomind.chatbot import Chatbot

from lmdeploy.serve.gradio.triton_server_backend import \

def triton_client(args):

@ispobock I've removed them as suggested. Please take a look.

LGTM

Collaborator

@zhyncs zhyncs left a comment


nit: We no longer need to install rapidjson-dev, since it was only needed as a dependency of Triton.
Perhaps we could also consider updating the guide for building from source at https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md.

Collaborator

@zhyncs zhyncs left a comment


src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Collaborator

@zhyncs zhyncs left a comment


The .github/scripts/test_triton_server.py in the auto test may also be deleted, and it's worth checking whether the triton_client in autotest/utils/run_client_chat.py should be removed at the same time. cc @zhulinJulia24

set(TRITON_PYTORCH_INCLUDE_PATHS "" CACHE PATH "Paths to Torch includes")
set(TRITON_PYTORCH_LIB_PATHS "" CACHE PATH "Paths to Torch libraries")

set(TRITON_BACKEND_REPO_TAG "r22.12" CACHE STRING "Tag for triton-inference-server/backend repo")
Collaborator


Since we have removed the dependency on Triton here, there is no longer any concern about its version or tag. In this case, do we still need to create a new Dockerfile on top of the r24.03 image? ref #1986 (comment)

@lvhan028
Collaborator Author

src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Thanks. I finished it as suggested.

@zhyncs
Collaborator

zhyncs commented Jul 17, 2024

src/turbomind/triton_backend/triton_utils.hpp seems to be unnecessary now and can be deleted, and the include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc needs to be updated accordingly.

Thanks. I finished it as suggested.

The include in src/turbomind/triton_backend/llama/LlamaTritonModelInstance.cc still needs to be updated. ref

#include "src/turbomind/triton_backend/triton_utils.hpp"

Collaborator

@zhyncs zhyncs left a comment


LGTM

Comment on lines -229 to -230
trust_remote_code (bool): Whether or not to allow for custom models
defined on the Hub in their own modeling files. Defaults to False
Collaborator


We are using this argument. In my test, I had to add --trust-remote-code to the lmdeploy convert command when converting local models.
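For context, this flag ultimately controls whether custom modeling code from the Hub may be executed when a model is loaded. A minimal sketch of how it is typically forwarded with the standard transformers API (illustrative, not the exact lmdeploy code path; the local path is hypothetical):

from transformers import AutoConfig, AutoTokenizer

model_path = './internlm2-chat-1_8b'  # hypothetical local model directory
# Without trust_remote_code=True, models whose modeling code lives on the Hub fail to load.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)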

Collaborator Author


@irexyc used to suggest removing it.
All right, I can remove this argument.

Collaborator


Simply set trust-remote-code to true by default.
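If we default it to true, one option is to keep an opt-out switch. A sketch using argparse's BooleanOptionalAction (Python 3.9+); the parser name is illustrative and this is not the actual lmdeploy CLI code:

import argparse

parser = argparse.ArgumentParser('lmdeploy convert')  # name is illustrative
# Enabled by default; users can still opt out with --no-trust-remote-code.
parser.add_argument('--trust-remote-code',
                    action=argparse.BooleanOptionalAction,
                    default=True,
                    help='Allow custom models defined on the Hub in their own modeling files')
args = parser.parse_args()
print(args.trust_remote_code)  # True unless --no-trust-remote-code is given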

Collaborator

@AllentDan AllentDan left a comment


Are downstream repos still using the Triton server? Shall we notify them?

@lvhan028
Collaborator Author

Are downstream repos still using the Triton server? Shall we notify them?

As far as I know, the internal downstream projects have switched to the api_server.

@lvhan028 lvhan028 requested a review from irexyc July 17, 2024 10:52
Collaborator

@AllentDan AllentDan left a comment


Tested converting internlm2-chat-1_8b OK.
