fix pre-commit
jbkyang-nvi committed Oct 13, 2023
1 parent f21a8e9 commit 017bb8a
Showing 2 changed files with 12 additions and 12 deletions.
20 changes: 10 additions & 10 deletions Popular_Models_Guide/Llama2/trtllm_guide.md
@@ -35,21 +35,21 @@ Clone the repo of the model with weights and tokens [here](https://huggingface.c

## Installation

Launch the Triton docker container with the TensorRT-LLM backend:
```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash
```

Alternatively, you can follow the instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with the TensorRT-LLM backend if you want a specialized container.

Don't forget to allow GPU usage when you launch the container.
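As a quick sanity check (an assumption on our part, not a step in the original guide), you can confirm the GPUs are visible inside the container before proceeding:

```bash
# Should list the GPUs passed through with --gpus all.
nvidia-smi
```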

## Create Engines for each model [skip this step if you already have an engine]
TensorRT-LLM requires each model to be compiled for the configuration you need before running. Before you run your model on Tritonserver for the first time, you will need to create a TensorRT-LLM engine for that configuration. To do so, complete the following steps:

1. Install the TensorRT-LLM python package
```bash
# TensorRT-LLM is required for generating engines.
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
# Make the TensorRT-LLM libraries shipped with the Triton backend visible to the pip-installed package.
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
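# Hypothetical check (not part of the original guide): confirm the package
# imports and reports its version before building engines.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"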
```

@@ -78,7 +78,7 @@ To do so, you will need to complete the following steps:

```bash
--world-size 1
```

> Optional: You can test the output of the model with `run.py`
> located in the same llama examples folder.
>
> ```bash
> ```

@@ -94,24 +94,24 @@ To run our Llama2-7B model, you will need to:
1. Copy over the inflight batcher models repository
```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```
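To confirm the copy (a hypothetical check, not part of the original guide), list the model directories; you should typically see folders such as `preprocessing`, `postprocessing`, `tensorrt_llm`, and `ensemble`:

```bash
ls /opt/tritonserver/inflight_batcher_llm
```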
2. Modify the `config.pbtxt` files for the preprocessing, postprocessing, and tensorrt_llm model steps
```bash
# preprocessing model
sed -i 's#${tokenizer_dir}#/<path to your engine>/1-gpu/#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt

# postprocessing model
sed -i 's#${tokenizer_dir}#/<path to your engine>/1-gpu/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt

# tensorrt_llm model
sed -i 's#${decoupled_mode}#false#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
sed -i 's#${engine_dir}#/<path to your engine>/1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
```
Also, ensure that the `gpt_model_type` parameter in `/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt` is set to `inflight_fused_batching`.
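As a quick, hypothetical sanity check (not part of the original guide), you can confirm that no template placeholders were left unresolved and inspect the `gpt_model_type` setting:

```bash
# Any remaining ${...} placeholders indicate a missed substitution.
grep -n '\${' /opt/tritonserver/inflight_batcher_llm/*/config.pbtxt || echo "No unresolved placeholders"

# Show the gpt_model_type parameter block in the tensorrt_llm model config.
grep -A 3 'gpt_model_type' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
```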

3. Launch Tritonserver

```bash
tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm
```
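Once the server is running, you can verify readiness over Triton's standard HTTP endpoints. This is a hypothetical check, not part of the original guide; the `ensemble` model name is an assumption about the typical contents of the inflight_batcher_llm repository.

```bash
# Overall server readiness (returns HTTP 200 when ready).
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Readiness of a specific model; "ensemble" is an assumed model name.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/ensemble/ready
```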
4 changes: 2 additions & 2 deletions README.md
@@ -15,9 +15,9 @@ The focus of these examples is to demonstrate deployment for models trained with
| --------------- | ------------ | --------------- | --------------- | --------------- |

#### Supported Model Table
The table below lists which formats and backends support each model.
| Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend |
| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: |
| [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:|
| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) |:white_check_mark: |:grey_question: | :white_check_mark: |
| [Falcon-180B](https://falconllm.tii.ae/index.html) |:white_check_mark: |:grey_question: | :white_check_mark: |
