diff --git a/comps/llms/text-generation/qwen2/Dockerfile b/comps/llms/text-generation/native/Dockerfile
similarity index 100%
rename from comps/llms/text-generation/qwen2/Dockerfile
rename to comps/llms/text-generation/native/Dockerfile
diff --git a/comps/llms/text-generation/qwen2/llm.py b/comps/llms/text-generation/native/llm.py
similarity index 100%
rename from comps/llms/text-generation/qwen2/llm.py
rename to comps/llms/text-generation/native/llm.py
diff --git a/comps/llms/text-generation/qwen2/qwen2.patch b/comps/llms/text-generation/native/qwen2.patch
similarity index 100%
rename from comps/llms/text-generation/qwen2/qwen2.patch
rename to comps/llms/text-generation/native/qwen2.patch
diff --git a/comps/llms/text-generation/qwen2/requirements.txt b/comps/llms/text-generation/native/requirements.txt
similarity index 100%
rename from comps/llms/text-generation/qwen2/requirements.txt
rename to comps/llms/text-generation/native/requirements.txt
diff --git a/comps/llms/text-generation/qwen2/utils.py b/comps/llms/text-generation/native/utils.py
similarity index 100%
rename from comps/llms/text-generation/qwen2/utils.py
rename to comps/llms/text-generation/native/utils.py
diff --git a/comps/llms/text-generation/ollama/README.md b/comps/llms/text-generation/ollama/README.md
index e69de29bb..a5bd486d6 100644
--- a/comps/llms/text-generation/ollama/README.md
+++ b/comps/llms/text-generation/ollama/README.md
@@ -0,0 +1,66 @@
+# Introduction
+
+[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile. It is a lightweight, extensible framework for building and running language models on the local machine, and it provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can easily be used in a variety of applications. This makes it a good choice for deploying large language models locally on an AI PC.
+
+# Get Started
+
+## Setup
+
+Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.
+
+- Download and install Ollama on one of the supported platforms (including Windows).
+- Fetch an LLM model via `ollama pull <model_name>`. View the list of available models in the model library and pull one to use locally, e.g. `ollama pull llama3`.
+- This downloads the default tagged version of the model. Typically, the default tag points to the latest, smallest-parameter variant of the model.
+
+Note:
+Special settings are necessary to pull models from behind a proxy.
+
+```bash
+sudo vim /etc/systemd/system/ollama.service
+```
+
+Add your proxy settings to the configuration file above.
+
+```ini
+[Service]
+Environment="http_proxy=${your_proxy}"
+Environment="https_proxy=${your_proxy}"
+```
+
+## Usage
+
+Here are a few ways to interact with pulled local models:
+
+### In the terminal
+
+All of your local models are automatically served on localhost:11434. Run `ollama run <model_name>` to start interacting via the command line directly.
+
+### API access
+
+Send an `application/json` request to Ollama's API endpoint to interact with a model.
+
+```bash
+curl http://localhost:11434/api/generate -d '{
+  "model": "llama3",
+  "prompt":"Why is the sky blue?"
+}'
+```
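+
+To see which models have already been pulled and are available locally, you can also query Ollama's model-listing endpoint (part of the upstream Ollama REST API); a quick check might look like this:
+
+```bash
+# List locally available models and their tags
+curl http://localhost:11434/api/tags
+```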
+
+# Build Docker Image
+
+```bash
+cd GenAIComps/
+docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
+```
+
+# Run the Ollama Microservice
+
+```bash
+docker run --network host opea/llm-ollama:latest
+```
+
+# Consume the Ollama Microservice
+
+```bash
+curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' -H 'Content-Type: application/json'
+```
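+
+The request above streams the answer back as it is generated. If you would rather receive a single, complete response, the same endpoint should also accept the payload with `"streaming":false`; only that flag changes compared with the example above:
+
+```bash
+# Same query as above, but returned as one non-streamed response
+curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' -H 'Content-Type: application/json'
+```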
diff --git a/comps/llms/text-generation/tgi/README.md b/comps/llms/text-generation/tgi/README.md
index a5bd486d6..1a2ef8ddc 100644
--- a/comps/llms/text-generation/tgi/README.md
+++ b/comps/llms/text-generation/tgi/README.md
@@ -1,66 +1,124 @@
-# Introduction
+# TGI LLM Microservice

-[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. It's the best choice to deploy large language models on AIPC locally.
+[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

-# Get Started
+# 🚀1. Start Microservice with Python (Option 1)

-## Setup
+To start the LLM microservice, you need to install the required Python packages first.

-Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.
+## 1.1 Install Requirements

-- Download and install Ollama onto the available supported platforms (including Windows)
-- Fetch available LLM model via ollama pull . View a list of available models via the model library and pull to use locally with the command `ollama pull llama3`
-- This will download the default tagged version of the model. Typically, the default points to the latest, smallest sized-parameter model.
+```bash
+pip install -r requirements.txt
+```

-Note:
-Special settings are necessary to pull models behind the proxy.
+## 1.2 Start LLM Service

 ```bash
-sudo vim /etc/systemd/system/ollama.service
+export HF_TOKEN=${your_hf_api_token}
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=${your_langchain_api_key}
+export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
+docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
 ```

-Add your proxy to the above configure file.
+## 1.3 Verify the TGI Service

-```markdown
-[Service]
-Environment="http_proxy=${your_proxy}"
-Environment="https_proxy=${your_proxy}"
+```bash
+curl http://${your_ip}:8008/generate \
+  -X POST \
+  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+  -H 'Content-Type: application/json'
 ```

-## Usage
+## 1.4 Start LLM Service with Python Script

-Here are a few ways to interact with pulled local models:
+```bash
+export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
+python text-generation/tgi/llm.py
+```

-### In the terminal
+# 🚀2. Start Microservice with Docker (Option 2)

-All of your local models are automatically served on localhost:11434. Run ollama run to start interacting via the command line directly.
+If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will also start a TGI/vLLM service in Docker automatically.

-### API access
+## 2.1 Setup Environment Variables

-Send an application/json request to the API endpoint of Ollama to interact.
+To start the TGI and LLM services, you need to set up the following environment variables first.

 ```bash
-curl http://localhost:11434/api/generate -d '{
-  "model": "llama3",
-  "prompt":"Why is the sky blue?"
-}'
+export HF_TOKEN=${your_hf_api_token}
+export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
+export LLM_MODEL_ID=${your_hf_llm_model}
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=${your_langchain_api_key}
+export LANGCHAIN_PROJECT="opea/llms"
 ```

-# Build Docker Image
+## 2.2 Build Docker Image

 ```bash
-cd GenAIComps/
-docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
+cd ../../
+docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
 ```

-# Run the Ollama Microservice
+To start a Docker container, you have two options:
+
+- A. Run Docker with CLI
+- B. Run Docker with Docker Compose
+
+You can choose either one as needed.
+
+## 2.3 Run Docker with CLI (Option A)

 ```bash
-docker run --network host opea/llm-ollama:latest
+docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
 ```

-# Consume the Ollama Microservice
+## 2.4 Run Docker with Docker Compose (Option B)

 ```bash
-curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' -H 'Content-Type: application/json'
+cd text-generation/tgi
+docker compose -f docker_compose_llm.yaml up -d
 ```
+
+# 🚀3. Consume LLM Service
+
+## 3.1 Check Service Status
+
+```bash
+curl http://${your_ip}:9000/v1/health_check \
+  -X GET \
+  -H 'Content-Type: application/json'
+```
+
+## 3.2 Consume LLM Service
+
+You can set model parameters such as `max_new_tokens` and `streaming` according to your needs.
+
+The `streaming` parameter determines the format of the data returned by the API. With `streaming=false` the API returns a complete text string, and with `streaming=true` it returns a stream of text chunks.
+
+```bash
+# non-streaming mode
+curl http://${your_ip}:9000/v1/chat/completions \
+  -X POST \
+  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
+  -H 'Content-Type: application/json'
+
+# streaming mode
+curl http://${your_ip}:9000/v1/chat/completions \
+  -X POST \
+  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
+  -H 'Content-Type: application/json'
+```
+
+# 🚀4. Validated Model
+
+| Model                     | TGI-Gaudi |
+| ------------------------- | --------- |
+| Intel/neural-chat-7b-v3-3 | ✓         |
+| Llama-2-7b-chat-hf        | ✓         |
+| Llama-2-70b-chat-hf       | ✓         |
+| Meta-Llama-3-8B-Instruct  | ✓         |
+| Meta-Llama-3-70B-Instruct | ✓         |
+| Phi-3                     | x         |
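+
+For example, to serve one of the validated models above through Docker Compose, you could point `LLM_MODEL_ID` at it before bringing the services up; this simply reuses the environment variable from section 2.1 and the compose file from section 2.4, with the model name shown only as an illustration:
+
+```bash
+# Choose a model from the table above (illustrative; any validated model works)
+export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+# Start the TGI and LLM microservice containers as in section 2.4
+cd text-generation/tgi
+docker compose -f docker_compose_llm.yaml up -d
+```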