Commit d4afce6

* Update LLM readme
* update readme
* update tgi readme
* rollback requirements.txt
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

Signed-off-by: lvliang-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1 parent 1b654de

Showing 7 changed files with 157 additions and 33 deletions.

5 files renamed without changes.

@@ -0,0 +1,66 @@

# Introduction

[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It is a lightweight, extensible framework for building and running language models on the local machine, providing a simple API for creating, running, and managing models, as well as a library of pre-built models that can easily be used in a variety of applications. This makes it a good choice for deploying large language models locally on an AI PC.

# Get Started

## Setup

Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.

- Download and install Ollama on one of the supported platforms (including Windows).
- Fetch an LLM via `ollama pull <name-of-model>`. View the list of available models in the model library and pull one to use locally, e.g. `ollama pull llama3` (see the example after this list).
- This downloads the default tagged version of the model. Typically, the default points to the latest model with the smallest parameter size.
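
For example, to pull Llama 3 and confirm it is available locally (assuming Ollama is already installed and running):

```bash
# download the default llama3 tag
ollama pull llama3

# list the models that are available locally
ollama list
```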

Note: special settings are necessary to pull models when you are behind a proxy.

```bash
sudo vim /etc/systemd/system/ollama.service
```

Add your proxy settings to the configuration file above:

```ini
[Service]
Environment="http_proxy=${your_proxy}"
Environment="https_proxy=${your_proxy}"
```
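
After editing the unit file, you typically need to reload systemd and restart the service so the proxy settings take effect (standard systemd commands, assuming the service is named `ollama` as above):

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```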

## Usage

Here are a few ways to interact with pulled local models:

### In the terminal

All of your local models are automatically served on `localhost:11434`. Run `ollama run <name-of-model>` to start interacting via the command line directly.
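
For example, to chat with the Llama 3 model pulled above:

```bash
ollama run llama3
```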

### API access

Send an `application/json` request to Ollama's API endpoint to interact with a model:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```
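
By default, `/api/generate` streams the response back as a series of JSON objects. Ollama's API also accepts a `stream` flag if you prefer a single JSON reply; a minimal sketch:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```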

# Build Docker Image

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
```

# Run the Ollama Microservice

```bash
docker run --network host opea/llm-ollama:latest
```

# Consume the Ollama Microservice

```bash
curl http://127.0.0.1:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
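
To receive the whole answer as a single payload instead of a token stream, set `streaming` to `false` in the same request (the TGI microservice README below uses the same schema); a minimal sketch:

```bash
curl http://127.0.0.1:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'
```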

@@ -1,66 +1,124 @@

# TGI LLM Microservice

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

# 🚀1. Start Microservice with Python (Option 1)

To start the LLM microservice, you need to install the required Python packages first.

## 1.1 Install Requirements

```bash
pip install -r requirements.txt
```

## 1.2 Start LLM Service

```bash
export HF_TOKEN=${your_hf_api_token}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```
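
Here `${your_hf_llm_model}` stands for any Hugging Face model ID that TGI supports. As a sketch, you could export one of the validated models listed at the end of this README before running the command above:

```bash
# hypothetical example value; substitute the model you actually want to serve
export your_hf_llm_model="Intel/neural-chat-7b-v3-3"
```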

## 1.3 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```

## 1.4 Start LLM Service with Python Script

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service with Docker as well.

## 2.1 Setup Environment Variables

In order to start the TGI and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```
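
For instance, assuming you want to serve one of the validated models from the table at the end of this README, the model-related variables might look like this (placeholder values; substitute your own):

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"               # your Hugging Face access token
export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"     # any supported model ID
export TGI_LLM_ENDPOINT="http://192.168.1.100:8008" # host running the TGI service
```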

## 2.2 Build Docker Image

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
```
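
Once the container is running, you can sanity-check it with standard Docker commands (assuming the container name used above):

```bash
docker ps --filter name=llm-tgi-server   # confirm the container is up
docker logs -f llm-tgi-server            # follow the microservice logs
```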

## 2.4 Run Docker with Docker Compose (Option B)

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```
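
Similarly, to confirm that the composed services came up (assuming the same compose file):

```bash
docker compose -f docker_compose_llm.yaml ps
```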

# 🚀3. Consume LLM Service

## 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

## 3.2 Consume LLM Service

You can set model parameters such as `max_new_tokens` and `streaming` according to your actual needs.

The `streaming` parameter determines the format of the data returned by the API: with `streaming=false` the API returns a plain text string, while with `streaming=true` it returns a text stream.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```

# 🚀4. Validated Models

| Model                     | TGI-Gaudi |
| ------------------------- | --------- |
| Intel/neural-chat-7b-v3-3 | ✓         |
| Llama-2-7b-chat-hf        | ✓         |
| Llama-2-70b-chat-hf       | ✓         |
| Meta-Llama-3-8B-Instruct  | ✓         |
| Meta-Llama-3-70B-Instruct | ✓         |
| Phi-3                     | x         |