Update LLM readme (#172)
* Update LLM readme

Signed-off-by: lvliang-intel <[email protected]>

* update readme

Signed-off-by: lvliang-intel <[email protected]>

* update tgi readme

Signed-off-by: lvliang-intel <[email protected]>

* rollback requirements.txt

Signed-off-by: lvliang-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: lvliang-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
lvliang-intel and pre-commit-ci[bot] authored Jun 13, 2024
1 parent 1b654de commit d4afce6
Showing 7 changed files with 157 additions and 33 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
66 changes: 66 additions & 0 deletions comps/llms/text-generation/ollama/README.md
# Introduction

[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile. It is a lightweight, extensible framework for building and running language models on the local machine, providing a simple API for creating, running, and managing models, as well as a library of pre-built models that can be used in a variety of applications. This makes it a good choice for deploying large language models locally on an AI PC.

# Get Started

## Setup

Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.

- Download and install Ollama on one of the supported platforms (including Windows).
- Fetch an LLM model via `ollama pull <name-of-model>`. View the list of available models in the model library and pull one to use locally, e.g. with the command `ollama pull llama3` (see the example below).
- This downloads the default tagged version of the model. Typically, the default tag points to the latest model with the smallest parameter size.
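
A minimal sketch of the pull step for the `llama3` example above; `ollama list` then shows which models are available locally:

```bash
# Pull the default llama3 tag from the Ollama model library
ollama pull llama3

# List the models that are now available on this machine
ollama list
```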

Note:
Special settings are necessary to pull models from behind a proxy.

```bash
sudo vim /etc/systemd/system/ollama.service
```

Add your proxy settings to the configuration file above.

```ini
[Service]
Environment="http_proxy=${your_proxy}"
Environment="https_proxy=${your_proxy}"
```
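
After saving the change, reload systemd and restart Ollama so the proxy settings take effect (standard `systemctl` commands; the service name matches the unit file edited above):

```bash
# Reload unit files and restart the Ollama service with the new proxy settings
sudo systemctl daemon-reload
sudo systemctl restart ollama
```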

## Usage

Here are a few ways to interact with pulled local models:

### In the terminal

All of your local models are automatically served on `localhost:11434`. Run `ollama run <name-of-model>` to start interacting via the command line directly, as shown below.
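
For example, to open an interactive session with the `llama3` model pulled earlier:

```bash
# Chat with the llama3 model from the terminal
ollama run llama3
```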

### API access

Send an `application/json` request to Ollama's API endpoint to interact with a model.

```bash
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt":"Why is the sky blue?"
}'
```
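
Ollama streams the generation as a sequence of JSON objects by default; the API also accepts a `stream` flag, so a non-streaming request looks like this:

```bash
# Ask Ollama to return a single JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```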

# Build Docker Image

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
```

# Run the Ollama Microservice

```bash
docker run --network host opea/llm-ollama:latest
```
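
Assuming the Ollama wrapper exposes the same `/v1/health_check` route on port 9000 as the other OPEA LLM microservices (see the TGI README below), a quick liveness check looks like:

```bash
# Hypothetical health check, mirroring the route used by the TGI microservice
curl http://127.0.0.1:9000/v1/health_check \
-X GET \
-H 'Content-Type: application/json'
```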

# Consume the Ollama Microservice

```bash
curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' -H 'Content-Type: application/json'
```
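
As with the TGI microservice documented below, the `streaming` field in the request body controls the response format; setting it to `false` should return a single text response instead of a stream:

```bash
# Same request in (assumed) non-streaming mode
curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' -H 'Content-Type: application/json'
```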
124 changes: 91 additions & 33 deletions comps/llms/text-generation/tgi/README.md
# TGI LLM Microservice

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

# 🚀1. Start Microservice with Python (Option 1)

To start the LLM microservice, you need to install the required Python packages first.

## 1.1 Install Requirements

```bash
pip install -r requirements.txt
```

## 1.2 Start LLM Service

```bash
export HF_TOKEN=${your_hf_api_token}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```

## 1.3 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
```
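
If the request fails, the model may still be loading; you can watch the TGI container start up via its logs (container name `tgi_service`, as set in the `docker run` command above):

```bash
# Follow the TGI container logs until the model finishes loading
docker logs -f tgi_service
```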

## 1.4 Start LLM Service with Python Script

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service in Docker as well.

## 2.1 Setup Environment Variables

In order to start the TGI and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

## 2.2 Build Docker Image

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

To start a docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
```
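
Once the container is up, you can confirm it started correctly by tailing its logs (container name `llm-tgi-server`, as set above):

```bash
# Follow the microservice logs to confirm it started without errors
docker logs -f llm-tgi-server
```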

## 2.4 Run Docker with Docker Compose (Option B)

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```
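
To check the status of the containers started by Compose, or to tear the stack down when you are done, the standard Compose subcommands apply:

```bash
# Show the status of the services defined in docker_compose_llm.yaml
docker compose -f docker_compose_llm.yaml ps

# Stop and remove the containers when finished
docker compose -f docker_compose_llm.yaml down
```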

# 🚀3. Consume LLM Service

## 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
-X GET \
-H 'Content-Type: application/json'
```

## 3.2 Consume LLM Service

You can set the following model parameters according to your actual needs, such as `max_new_tokens` and `streaming`.

The `streaming` parameter determines the format of the data returned by the API: with `streaming=false` the service returns a single text string, while with `streaming=true` it returns a stream of text chunks.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-H 'Content-Type: application/json'
```

# 🚀4. Validated Model

| Model                     | TGI-Gaudi |
| ------------------------- | --------- |
| Intel/neural-chat-7b-v3-3 | ✓         |
| Llama-2-7b-chat-hf        | ✓         |
| Llama-2-70b-chat-hf       | ✓         |
| Meta-Llama-3-8B-Instruct  | ✓         |
| Meta-Llama-3-70B-Instruct | ✓         |
| Phi-3                     | x         |
