Commit d4afce6

* Update LLM readme
* update readme
* update tgi readme
* rollback requirements.txt
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

Signed-off-by: lvliang-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1 parent 1b654de

Showing 7 changed files with 157 additions and 33 deletions.

5 files renamed without changes.

@@ -0,0 +1,66 @@

# Introduction

[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It is a lightweight, extensible framework for building and running language models on the local machine, providing a simple API for creating, running, and managing models, as well as a library of pre-built models that can easily be used in a variety of applications. This makes it a good choice for deploying large language models locally on an AI PC.

# Get Started

## Setup

Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.

- Download and install Ollama on one of the supported platforms (including Windows).
- Fetch an LLM via `ollama pull <name-of-model>`. View the list of available models in the model library and pull one to use locally, e.g. `ollama pull llama3` (see the example after this list).
- This downloads the default tagged version of the model. Typically, the default points to the latest model with the smallest parameter size.
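
For example, to pull Llama 3 and confirm it is available locally (assuming Ollama is already installed and running):

```bash
# download the default llama3 tag
ollama pull llama3

# list the models that are available locally
ollama list
```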

Note: special settings are necessary to pull models when you are behind a proxy.

```bash
sudo vim /etc/systemd/system/ollama.service
```

Add your proxy settings to the configuration file above:

```ini
[Service]
Environment="http_proxy=${your_proxy}"
Environment="https_proxy=${your_proxy}"
```
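
After editing the unit file, you typically need to reload systemd and restart the service so the proxy settings take effect (standard systemd commands, assuming the service is named `ollama` as above):

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```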

## Usage

Here are a few ways to interact with pulled local models:

### In the terminal

All of your local models are automatically served on `localhost:11434`. Run `ollama run <name-of-model>` to start interacting via the command line directly.
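
For example, to chat with the Llama 3 model pulled above:

```bash
ollama run llama3
```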

### API access

Send an `application/json` request to Ollama's API endpoint to interact with a model:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```
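
By default, `/api/generate` streams the response back as a series of JSON objects. Ollama's API also accepts a `stream` flag if you prefer a single JSON reply; a minimal sketch:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```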

# Build Docker Image

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
```

# Run the Ollama Microservice

```bash
docker run --network host opea/llm-ollama:latest
```

# Consume the Ollama Microservice

```bash
curl http://127.0.0.1:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
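
To receive the whole answer as a single payload instead of a token stream, set `streaming` to `false` in the same request (the TGI microservice README below uses the same schema); a minimal sketch:

```bash
curl http://127.0.0.1:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'
```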

@@ -1,66 +1,124 @@

# TGI LLM Microservice

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

# 🚀1. Start Microservice with Python (Option 1)

To start the LLM microservice, you need to install the required Python packages first.

## 1.1 Install Requirements

```bash
pip install -r requirements.txt
```

## 1.2 Start LLM Service

```bash
export HF_TOKEN=${your_hf_api_token}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```
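
Here `${your_hf_llm_model}` stands for any Hugging Face model ID that TGI supports. As a sketch, you could export one of the validated models listed at the end of this README before running the command above:

```bash
# hypothetical example value; substitute the model you actually want to serve
export your_hf_llm_model="Intel/neural-chat-7b-v3-3"
```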

## 1.3 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```

## 1.4 Start LLM Service with Python Script

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service with Docker as well.

## 2.1 Setup Environment Variables

In order to start the TGI and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```
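
For instance, assuming you want to serve one of the validated models from the table at the end of this README, the model-related variables might look like this (placeholder values; substitute your own):

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"               # your Hugging Face access token
export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"     # any supported model ID
export TGI_LLM_ENDPOINT="http://192.168.1.100:8008" # host running the TGI service
```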

## 2.2 Build Docker Image

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
```
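
Once the container is running, you can sanity-check it with standard Docker commands (assuming the container name used above):

```bash
docker ps --filter name=llm-tgi-server   # confirm the container is up
docker logs -f llm-tgi-server            # follow the microservice logs
```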

## 2.4 Run Docker with Docker Compose (Option B)

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```
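
Similarly, to confirm that the composed services came up (assuming the same compose file):

```bash
docker compose -f docker_compose_llm.yaml ps
```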

# 🚀3. Consume LLM Service

## 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

## 3.2 Consume LLM Service

You can set model parameters such as `max_new_tokens` and `streaming` according to your actual needs.

The `streaming` parameter determines the format of the data returned by the API: with `streaming=false` the API returns a plain text string, while with `streaming=true` it returns a text stream.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```

# 🚀4. Validated Models

| Model                     | TGI-Gaudi |
| ------------------------- | --------- |
| Intel/neural-chat-7b-v3-3 | ✓         |
| Llama-2-7b-chat-hf        | ✓         |
| Llama-2-70b-chat-hf       | ✓         |
| Meta-Llama-3-8B-Instruct  | ✓         |
| Meta-Llama-3-70B-Instruct | ✓         |
| Phi-3                     | x         |