This folder contains a collection of Kubernetes manifest files for deploying the ChatQnA service across scalable nodes. It includes a comprehensive benchmarking tool that enables throughput analysis to assess inference performance.
By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.
We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
The benchmark reports the following metrics:
- Number of Concurrent Requests
- End-to-End Latency: P50, P90, P99 (in milliseconds)
- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
- Average Next Token Latency (in milliseconds)
- Average Token Latency (in milliseconds)
- Requests Per Second (RPS)
- Output Tokens Per Second
- Input Tokens Per Second
Results will be displayed in the terminal and saved as a CSV file named 1_stats.csv for easy export to spreadsheets.
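For a quick look at that CSV directly in the terminal, one option (assuming the util-linux column tool is available) is to pretty-print it before importing it into a spreadsheet:
# Optional: render the comma-separated stats as an aligned table in the terminal
column -s, -t < 1_stats.csv | less -S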
- Install Kubernetes by following this guide.
- Every node has direct internet access.
- Set up kubectl on the master node with access to the Kubernetes cluster.
- Install Python 3.8+ on the master node for running the stress tool.
- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
- Ensure that the container's ulimit can meet the number of requests.
# How to modify the containerd ulimit:
sudo systemctl edit containerd
# Add two lines:
[Service]
LimitNOFILE=65536:1048576
sudo systemctl daemon-reload; sudo systemctl restart containerd
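To confirm the new limit took effect, one optional check (not part of the original steps) is to query the containerd unit settings and, once pods are running, the open-file limit inside a container:
# Verify that systemd picked up the override
systemctl show containerd -p LimitNOFILE
# Check the limit inside a running container (replace <pod-name> with an actual pod)
kubectl exec <pod-name> -- sh -c 'ulimit -n'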
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready control-plane 35d v1.29.6
k8s-work1 Ready <none> 35d v1.29.5
k8s-work2 Ready <none> 35d v1.29.6
k8s-work3 Ready <none> 35d v1.29.6
We have created BKC (Best Known Configuration) manifests for single-node, two-node, and four-node K8s clusters. To apply them, we need to check out the repository and configure a few values.
# on k8s-master node
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/ChatQnA/benchmark
# replace the image tag from latest to v0.9 since we want to test with the v0.9 release
IMAGE_TAG=v0.9
find . -name '*.yaml' -type f -exec sed -i "s#image: opea/\(.*\):latest#image: opea/\1:${IMAGE_TAG}#g" {} \;
# set the huggingface token
HUGGINGFACE_TOKEN=<your token>
find . -name '*.yaml' -type f -exec sed -i "s#\${HF_TOKEN}#${HUGGINGFACE_TOKEN}#g" {} \;
# set models
LLM_MODEL_ID=Intel/neural-chat-7b-v3-3
EMBEDDING_MODEL_ID=BAAI/bge-base-en-v1.5
RERANK_MODEL_ID=BAAI/bge-reranker-base
find . -name '*.yaml' -type f -exec sed -i "s#\$(LLM_MODEL_ID)#${LLM_MODEL_ID}#g" {} \;
find . -name '*.yaml' -type f -exec sed -i "s#\$(EMBEDDING_MODEL_ID)#${EMBEDDING_MODEL_ID}#g" {} \;
find . -name '*.yaml' -type f -exec sed -i "s#\$(RERANK_MODEL_ID)#${RERANK_MODEL_ID}#g" {} \;
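As an optional sanity check (not in the original steps), grep for any remaining :latest tags or unsubstituted placeholders before applying the manifests; both commands should print nothing:
# Both greps should return no matches once all substitutions succeeded
grep -rn "image: opea/.*:latest" .
grep -rn -F -e '${HF_TOKEN}' -e '$(LLM_MODEL_ID)' -e '$(EMBEDDING_MODEL_ID)' -e '$(RERANK_MODEL_ID)' .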
The test uses the benchmark tool to run the performance test. Set up the benchmark tool on the master node of the Kubernetes cluster, which is k8s-master.
# on k8s-master node
git clone https://github.com/opea-project/GenAIEval.git
cd GenAIEval
python3 -m venv stress_venv
source stress_venv/bin/activate
pip install -r requirements.txt
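Optionally, a couple of quick checks (not from the original guide) can confirm the environment is ready:
# Confirm the virtual environment's Python meets the 3.8+ requirement
python --version
# Confirm the installed dependencies have no conflicting requirements
pip check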
Workload configuration:
Key | Value |
---|---|
Workload | ChatQnA |
Tag | v0.9 |
Models configuration
Key | Value |
---|---|
Embedding | BAAI/bge-base-en-v1.5 |
Reranking | BAAI/bge-reranker-base |
Inference | Intel/neural-chat-7b-v3-3 |
Benchmark parameters
Key | Value |
---|---|
LLM input tokens | 1024 |
LLM output tokens | 128 |
Number of test requests for different numbers of scheduled nodes:
Node count | Concurrency | Query number |
---|---|---|
1 | 128 | 640 |
2 | 256 | 1280 |
4 | 512 | 2560 |
More detailed configuration can be found in the configuration file benchmark.yaml.
We add a label to one Kubernetes node to make sure all pods are scheduled to that node:
kubectl label nodes k8s-worker1 node-type=chatqna-opea
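A quick optional check is to filter the nodes by the new label:
# Should list exactly the node(s) labeled for the ChatQnA pods
kubectl get nodes -l node-type=chatqna-opea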
Go to the BKC manifest directory and apply it to K8s.
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/single_gaudi
kubectl apply -f .
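Before moving on, it may help to wait until all ChatQnA pods report Ready (an optional check, not in the original steps):
# Watch pod placement and wait for readiness; adjust the timeout to your environment
kubectl get pods -o wide
kubectl wait --for=condition=Ready pod --all --timeout=15m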
Before running tests, upload a specified file to make sure the LLM input has a token length of 1K.
Run the following command to check the cluster IP of the dataprep service:
kubectl get svc
The output will include a line like:
dataprep-svc ClusterIP xx.xx.xx.xx <none> 6007/TCP 5m app=dataprep-deploy
Substitute ${cluster_ip} in the commands below with the actual cluster IP of the dataprep microservice.
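Alternatively, the IP can be captured in one step, assuming the service is named dataprep-svc as in the sample output above:
# Read the ClusterIP directly from the service object
cluster_ip=$(kubectl get svc dataprep-svc -o jsonpath='{.spec.clusterIP}')
echo ${cluster_ip}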
Run the cURL command to upload the file:
cd GenAIEval/evals/benchmark/data
# RAG with Rerank
curl -X POST "http://${cluster_ip}:6007/v1/dataprep" \
-H "Content-Type: multipart/form-data" \
-F "files=@./upload_file.txt" \
-F "chunk_size=3800"
# RAG without Rerank
curl -X POST "http://${cluster_ip}:6007/v1/dataprep" \
-H "Content-Type: multipart/form-data" \
-F "files=@./upload_file_no_rerank.txt"
We copy the configuration file benchmark.yaml to GenAIEval/evals/benchmark/benchmark.yaml and set test_suite_config.user_queries and test_suite_config.test_output_dir:
export USER_QUERIES="[4, 8, 16, 640]"
export TEST_OUTPUT_DIR="/home/sdp/benchmark_output/node_1"
envsubst < ./benchmark.yaml > GenAIEval/evals/benchmark/benchmark.yaml
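Optionally, confirm that envsubst filled in both fields (the key names follow test_suite_config.user_queries and test_suite_config.test_output_dir mentioned above):
# The generated file should now contain the literal query list and output directory
grep -n -e "user_queries" -e "test_output_dir" GenAIEval/evals/benchmark/benchmark.yaml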
Then run the benchmark tool:
cd GenAIEval/evals/benchmark
python benchmark.py
All the test results will be saved to the folder /home/sdp/benchmark_output/node_1, configured by the environment variable TEST_OUTPUT_DIR in the previous steps.
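Once the run finishes, the generated files (for example, the 1_stats.csv mentioned earlier) can be listed directly from the configured output directory:
# Inspect the benchmark output folder configured via TEST_OUTPUT_DIR
ls -l /home/sdp/benchmark_output/node_1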
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/single_gaudi
kubectl delete -f .
kubectl label nodes k8s-worker1 node-type-
We add a label to two Kubernetes nodes to make sure all pods are scheduled to these nodes:
kubectl label nodes k8s-worker1 k8s-worker2 node-type=chatqna-opea
Go to the BKC manifest directory and apply it to K8s.
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/two_gaudi
kubectl apply -f .
We copy the configuration file benchmark.yaml to GenAIEval/evals/benchmark/benchmark.yaml and set test_suite_config.user_queries and test_suite_config.test_output_dir:
export USER_QUERIES="[4, 8, 16, 1280]"
export TEST_OUTPUT_DIR="/home/sdp/benchmark_output/node_2"
envsubst < ./benchmark.yaml > GenAIEval/evals/benchmark/benchmark.yaml
Then run the benchmark tool:
cd GenAIEval/evals/benchmark
python benchmark.py
All the test results will be saved to the folder /home/sdp/benchmark_output/node_2, configured by the environment variable TEST_OUTPUT_DIR in the previous steps.
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/two_gaudi
kubectl delete -f .
kubectl label nodes k8s-worker1 k8s-worker2 node-type-
We add a label to four Kubernetes nodes (the master and three workers) to make sure all pods are scheduled to these nodes:
kubectl label nodes k8s-master k8s-worker1 k8s-worker2 k8s-worker3 node-type=chatqna-opea
Go to the BKC manifest directory and apply it to K8s.
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/four_gaudi
kubectl apply -f .
We copy the configuration file benchmark.yaml to GenAIEval/evals/benchmark/benchmark.yaml and set test_suite_config.user_queries and test_suite_config.test_output_dir:
export USER_QUERIES="[4, 8, 16, 2560]"
export TEST_OUTPUT_DIR="/home/sdp/benchmark_output/node_4"
envsubst < ./benchmark.yaml > GenAIEval/evals/benchmark/benchmark.yaml
Then run the benchmark tool:
cd GenAIEval/evals/benchmark
python benchmark.py
All the test results will be saved to the folder /home/sdp/benchmark_output/node_4, configured by the environment variable TEST_OUTPUT_DIR in the previous steps.
# on k8s-master node
cd GenAIExamples/ChatQnA/benchmark/tuned/with_rerank/four_gaudi
kubectl delete -f .
kubectl label nodes k8s-master k8s-worker1 k8s-worker2 k8s-worker3 node-type-
Check the OOB (out-of-the-box) performance data here; the tuned performance data will be released soon.