diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 04d5c2eb..2a5e36b4 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -10,6 +10,7 @@ repos: files: (.*\.(py|md|rst|yaml|yml|json|ts|js|html|svelte|sh))$ - id: check-json - id: check-yaml + args: [--allow-multiple-documents] - id: debug-statements - id: requirements-txt-fixer - id: trailing-whitespace diff --git a/evals/auto_tuning/README.md b/evals/auto_tuning/README.md new file mode 100644 index 00000000..6b7d6e4e --- /dev/null +++ b/evals/auto_tuning/README.md @@ -0,0 +1,151 @@ +# Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes + +This document describes the Auto-Tuning framework, a tool designed to streamline deployment strategies for resource-intensive services, particularly in ChatQnA environments. It leverages Kubernetes for container orchestration and integrates experimental data with our prior knowledge to fine-tune deployments for optimal performance. + +## Key Features +* Hardware Efficiency: Focuses on adjusting replica counts and maximizing the utilization of CPU and HPU (Habana Processing Unit) resources. + +* Theoretical and Experimental Optimization: Integrates theoretical best practices with our prior knowledge to ensure optimal resource allocation for services. + +## Usage + +To generate the strategy.json configuration file for deployment, use the following command: + +```bash +# Kubernetes Deployment +python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml + +# Note: Add --config_only to output deployment configs only. +``` + +## Configuration Files +1. hardware_info_gaudi.json: Specifies the hardware details (CPU, HPU, etc.). + +2. chatqna_neuralchat_rerank_latest.yaml: Contains service deployment information. + +3. replica_tuning_config.json: Customizes tuning parameters for replica counts and granularity. + +### hardware_info_gaudi.json +This file lists only the hardware devices to be used in the deployment. + +```json +{ + "device_0": { + "ip": ["10.239.1.5", "10.239.10.6"], + "type": "hpu", + "sockets": 2, + "cores_per_socket": 64, + "num_cards": 8 + } +} +``` +Please refer to `hardware_info_gaudi.json` for more details. + +### chatqna_neuralchat_rerank_latest.yaml +This file includes all services that will be deployed. +```yaml +opea_micro_services: + data_prep: + ... ... + embedding: + ... ... + + reranking: + ... ... + + llm: + opea/llm-tgi: + tag: latest + type: cpu + dependency: + ghcr.io/huggingface/tgi-gaudi: + tag: 2.0.4 + type: hpu + requirements: + model_id: "Intel/neural-chat-7b-v3-3" + +opea_mega_service: + opea/chatqna: + tag: latest + type: cpu +``` +Please refer to `chatqna_neuralchat_rerank_latest.yaml` for more details. + +### Tuning Config Parameters + +`embedding_replicas_granularity = 1`: This defines the step size for scaling the number of replicas for the embedding server. +* Value (1): Each scaling operation increases or decreases the number of replicas by 1 at a time. + +`embedding_replicas_min = 1`: This sets the minimum number of replicas allowed for the embedding server. +* Value (1): The service always has at least 1 replica running, ensuring availability. + +`embedding_replicas_max = 4`: This defines the maximum number of replicas allowed for the embedding server. +* Value (4): The service can be scaled up to a maximum of 4 replicas, limiting resource consumption and avoiding over-provisioning.
+ +`microservice_replicas_granularity = 1`: This specifies the scaling step size for other microservices (such as retrieval, dataprep, etc.). +* Value (1): As with `embedding_replicas_granularity`, the number of replicas for these microservices scales by 1 replica at a time. + +`microservice_replicas_min = 1`: This parameter sets the minimum number of replicas for these microservices. +* Value (1): Ensures that each microservice always has at least 1 replica running. + +`microservice_replicas_max = 4`: This defines the upper limit for scaling replicas for these microservices. + +* Value (4): The maximum number of replicas allowed for the microservices is 4. + +To adjust the default tuning parameters, create a replica_tuning_config.json file. For example: + +```json +{ + "embedding_replicas_granularity": 1, + "embedding_replicas_min": 1, + "embedding_replicas_max": 4, + + "microservice_replicas_granularity": 1, + "microservice_replicas_min": 1, + "microservice_replicas_max": 4 +} +``` +Please refer to `replica_tuning_config.json` for more details. + +## Output + +The output of the auto-tuning process includes two key components: +1. strategy_files: Contains optimized configurations for deploying services, such as replica counts and hardware resource allocations. + +2. K8S manifests: Provides the Kubernetes deployment specifications, including pod definitions and resource limits, ready for deployment. + +Example of a strategy file: +```json +{ + "embedding-dependency": { + "type": "cpu", + "image": "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5", + "model_id": "BAAI/bge-base-en-v1.5", + "replica": 1 + }, + "llm-microservice": { + "type": "cpu", + "image": "opea/llm-tgi:latest", + "replica": 4 + }, + + ... ... + "reranking-dependency": { + "type": "hpu", + "image": "opea/tei-gaudi:latest", + "model_id": "BAAI/bge-reranker-base", + "replica": 1, + "cards": 1 + }, + "chatqna_mega_service": { + "image": "opea/chatqna:latest", + "type": "cpu", + "replica": 4 + } +} +``` + +Both the K8S manifests and strategy files are generated in the current directory, providing everything needed for deployment. + +To deploy, simply run `kubectl apply -f` on the newly generated *_run.yaml files and the chatqna_config_map, as shown in the sketch below.
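
As an illustration, here is a minimal deployment sketch. It assumes the tuner wrote its output to the current directory and that the config map file is named `chatqna_config_map.yaml`, as in the baseline manifests in this PR; check the file names against the actual tuning output before running:

```bash
# Apply the config map first so the services can resolve their
# endpoints and model IDs from the qna-config ConfigMap.
kubectl apply -f chatqna_config_map.yaml

# Then apply every generated manifest in the current directory.
# (Generated file names are assumed to follow the *_run.yaml pattern.)
for f in ./*_run.yaml; do
  kubectl apply -f "$f"
done
```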
diff --git a/evals/auto_tuning/baseline/chatqna_config_map.yaml b/evals/auto_tuning/baseline/chatqna_config_map.yaml new file mode 100644 index 00000000..368c800e --- /dev/null +++ b/evals/auto_tuning/baseline/chatqna_config_map.yaml @@ -0,0 +1,23 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: v1 +kind: ConfigMap +metadata: + name: qna-config + namespace: default +data: + EMBEDDING_MODEL_ID: BAAI/bge-base-en-v1.5 + RERANK_MODEL_ID: BAAI/bge-reranker-base + LLM_MODEL_ID: Intel/neural-chat-7b-v3-3 + TEI_EMBEDDING_ENDPOINT: http://embedding-dependency-svc.default.svc.cluster.local:6006 + TEI_RERANKING_ENDPOINT: http://reranking-dependency-svc.default.svc.cluster.local:8808 + TGI_LLM_ENDPOINT: http://llm-dependency-svc.default.svc.cluster.local:9009 + REDIS_URL: redis://vector-db.default.svc.cluster.local:6379 + INDEX_NAME: rag-redis + HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN} + EMBEDDING_SERVICE_HOST_IP: embedding-svc + RETRIEVER_SERVICE_HOST_IP: retriever-svc + RERANK_SERVICE_HOST_IP: reranking-svc + NODE_SELECTOR: chatqna-opea + LLM_SERVICE_HOST_IP: llm-svc diff --git a/evals/auto_tuning/baseline/chatqna_mega_service.yaml b/evals/auto_tuning/baseline/chatqna_mega_service.yaml new file mode 100644 index 00000000..98422525 --- /dev/null +++ b/evals/auto_tuning/baseline/chatqna_mega_service.yaml @@ -0,0 +1,55 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: chatqna-backend-server-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: chatqna-backend-server-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: chatqna-backend-server-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: chatqna-backend-server-deploy + hostIPC: true + containers: + - envFrom: + - configMapRef: + name: qna-config + image: opea/chatqna:latest + imagePullPolicy: IfNotPresent + name: chatqna-backend-server-deploy + args: null + ports: + - containerPort: 8888 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: chatqna-backend-server-svc +spec: + type: NodePort + selector: + app: chatqna-backend-server-deploy + ports: + - name: service + port: 8888 + targetPort: 8888 + nodePort: 30888 diff --git a/evals/auto_tuning/baseline/dataprep-microservice.yaml b/evals/auto_tuning/baseline/dataprep-microservice.yaml new file mode 100644 index 00000000..cc00b08b --- /dev/null +++ b/evals/auto_tuning/baseline/dataprep-microservice.yaml @@ -0,0 +1,76 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: dataprep-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: dataprep-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: dataprep-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: dataprep-deploy + hostIPC: true + containers: + - env: + - name: REDIS_URL + valueFrom: + configMapKeyRef: + name: qna-config + key: REDIS_URL + - name: TEI_ENDPOINT + valueFrom: + configMapKeyRef: + 
name: qna-config + key: TEI_EMBEDDING_ENDPOINT + - name: INDEX_NAME + valueFrom: + configMapKeyRef: + name: qna-config + key: INDEX_NAME + image: opea/dataprep-redis:latest + imagePullPolicy: IfNotPresent + name: dataprep-deploy + args: null + ports: + - containerPort: 6007 + - containerPort: 6008 + - containerPort: 6009 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: dataprep-svc +spec: + type: ClusterIP + selector: + app: dataprep-deploy + ports: + - name: port1 + port: 6007 + targetPort: 6007 + - name: port2 + port: 6008 + targetPort: 6008 + - name: port3 + port: 6009 + targetPort: 6009 diff --git a/evals/auto_tuning/baseline/embedding-dependency.yaml b/evals/auto_tuning/baseline/embedding-dependency.yaml new file mode 100644 index 00000000..f5bfd023 --- /dev/null +++ b/evals/auto_tuning/baseline/embedding-dependency.yaml @@ -0,0 +1,63 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: embedding-dependency-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: embedding-dependency-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: embedding-dependency-deploy + spec: + nodeSelector: + node-type: chatqna-opea + containers: + - envFrom: + - configMapRef: + name: qna-config + image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 + name: embedding-dependency-deploy + args: + - --model-id + - $(EMBEDDING_MODEL_ID) + - --auto-truncate + volumeMounts: + - mountPath: /data + name: model-volume + - mountPath: /dev/shm + name: shm + ports: + - containerPort: 80 + serviceAccountName: default + volumes: + - name: model-volume + hostPath: + path: /mnt/models + type: Directory + - name: shm + emptyDir: + medium: Memory + sizeLimit: 1Gi +--- +kind: Service +apiVersion: v1 +metadata: + name: embedding-dependency-svc +spec: + type: ClusterIP + selector: + app: embedding-dependency-deploy + ports: + - name: service + port: 6006 + targetPort: 80 diff --git a/evals/auto_tuning/baseline/embedding-microservice.yaml b/evals/auto_tuning/baseline/embedding-microservice.yaml new file mode 100644 index 00000000..cbd4e624 --- /dev/null +++ b/evals/auto_tuning/baseline/embedding-microservice.yaml @@ -0,0 +1,55 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: embedding-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: embedding-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: embedding-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: embedding-deploy + hostIPC: true + containers: + - envFrom: + - configMapRef: + name: qna-config + image: opea/embedding-tei:latest + imagePullPolicy: IfNotPresent + name: embedding-deploy + args: null + ports: + - containerPort: 6000 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: embedding-svc +spec: + type: ClusterIP + selector: + app: embedding-deploy + ports: + - name: service + port: 6000 + targetPort: 6000 diff --git a/evals/auto_tuning/baseline/llm-dependency.yaml b/evals/auto_tuning/baseline/llm-dependency.yaml new file mode 100644 index 00000000..32c8e1cc --- 
/dev/null +++ b/evals/auto_tuning/baseline/llm-dependency.yaml @@ -0,0 +1,71 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llm-dependency-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: llm-dependency-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: llm-dependency-deploy + spec: + nodeSelector: + node-type: chatqna-opea + hostIPC: true + containers: + - envFrom: + - configMapRef: + name: qna-config + image: ghcr.io/huggingface/text-generation-inference:1.4 + name: llm-dependency-deploy-demo + securityContext: + capabilities: + add: + - SYS_NICE + args: + - --model-id + - $(LLM_MODEL_ID) + - --max-input-length + - '2048' + - --max-total-tokens + - '4096' + volumeMounts: + - mountPath: /data + name: model-volume + - mountPath: /dev/shm + name: shm + ports: + - containerPort: 80 + serviceAccountName: default + volumes: + - name: model-volume + hostPath: + path: /mnt/models + type: Directory + - name: shm + emptyDir: + medium: Memory + sizeLimit: 1Gi +--- +kind: Service +apiVersion: v1 +metadata: + name: llm-dependency-svc +spec: + type: ClusterIP + selector: + app: llm-dependency-deploy + ports: + - name: service + port: 9009 + targetPort: 80 diff --git a/evals/auto_tuning/baseline/llm-microservice.yaml b/evals/auto_tuning/baseline/llm-microservice.yaml new file mode 100644 index 00000000..15bee44a --- /dev/null +++ b/evals/auto_tuning/baseline/llm-microservice.yaml @@ -0,0 +1,55 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llm-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: llm-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: llm-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: llm-deploy + hostIPC: true + containers: + - envFrom: + - configMapRef: + name: qna-config + image: opea/llm-tgi:latest + imagePullPolicy: IfNotPresent + name: llm-deploy + args: null + ports: + - containerPort: 9000 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: llm-svc +spec: + type: ClusterIP + selector: + app: llm-deploy + ports: + - name: service + port: 9000 + targetPort: 9000 diff --git a/evals/auto_tuning/baseline/reranking-dependency.yaml b/evals/auto_tuning/baseline/reranking-dependency.yaml new file mode 100644 index 00000000..58eb592e --- /dev/null +++ b/evals/auto_tuning/baseline/reranking-dependency.yaml @@ -0,0 +1,70 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: reranking-dependency-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: reranking-dependency-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: reranking-dependency-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: reranking-dependency-deploy + containers: + - envFrom: + - configMapRef: + name: 
qna-config + image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 + name: reranking-dependency-deploy + args: + - --model-id + - $(RERANK_MODEL_ID) + - --auto-truncate + volumeMounts: + - mountPath: /data + name: model-volume + - mountPath: /dev/shm + name: shm + ports: + - containerPort: 80 + serviceAccountName: default + volumes: + - name: model-volume + hostPath: + path: /mnt/models + type: Directory + - name: shm + emptyDir: + medium: Memory + sizeLimit: 1Gi +--- +kind: Service +apiVersion: v1 +metadata: + name: reranking-dependency-svc +spec: + type: ClusterIP + selector: + app: reranking-dependency-deploy + ports: + - name: service + port: 8808 + targetPort: 80 diff --git a/evals/auto_tuning/baseline/reranking-microservice.yaml b/evals/auto_tuning/baseline/reranking-microservice.yaml new file mode 100644 index 00000000..d742663b --- /dev/null +++ b/evals/auto_tuning/baseline/reranking-microservice.yaml @@ -0,0 +1,55 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: reranking-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: reranking-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: "true" + labels: + app: reranking-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: reranking-deploy + hostIPC: true + containers: + - envFrom: + - configMapRef: + name: qna-config + image: opea/reranking-tei:latest + imagePullPolicy: IfNotPresent + name: reranking-deploy + args: + ports: + - containerPort: 8000 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: reranking-svc +spec: + type: ClusterIP + selector: + app: reranking-deploy + ports: + - name: service + port: 8000 + targetPort: 8000 diff --git a/evals/auto_tuning/baseline/retrieval-microservice.yaml b/evals/auto_tuning/baseline/retrieval-microservice.yaml new file mode 100644 index 00000000..4d532be0 --- /dev/null +++ b/evals/auto_tuning/baseline/retrieval-microservice.yaml @@ -0,0 +1,73 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: retriever-deploy + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: retriever-deploy + template: + metadata: + annotations: + sidecar.istio.io/rewriteAppHTTPProbers: 'true' + labels: + app: retriever-deploy + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: retriever-deploy + hostIPC: true + containers: + - env: + - name: REDIS_URL + valueFrom: + configMapKeyRef: + name: qna-config + key: REDIS_URL + - name: TEI_EMBEDDING_ENDPOINT + valueFrom: + configMapKeyRef: + name: qna-config + key: TEI_EMBEDDING_ENDPOINT + - name: HUGGINGFACEHUB_API_TOKEN + valueFrom: + configMapKeyRef: + name: qna-config + key: HUGGINGFACEHUB_API_TOKEN + - name: INDEX_NAME + valueFrom: + configMapKeyRef: + name: qna-config + key: INDEX_NAME + image: opea/retriever-redis:latest + imagePullPolicy: IfNotPresent + name: retriever-deploy + args: null + ports: + - containerPort: 7000 + serviceAccountName: default +--- +kind: Service +apiVersion: v1 +metadata: + name: retriever-svc +spec: + type: ClusterIP 
+ selector: + app: retriever-deploy + ports: + - name: service + port: 7000 + targetPort: 7000 diff --git a/evals/auto_tuning/baseline/vector-db.yaml b/evals/auto_tuning/baseline/vector-db.yaml new file mode 100644 index 00000000..be934f3f --- /dev/null +++ b/evals/auto_tuning/baseline/vector-db.yaml @@ -0,0 +1,49 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vector-db +spec: + replicas: 1 + selector: + matchLabels: + app: vector-db + template: + metadata: + labels: + app: vector-db + spec: + nodeSelector: + node-type: chatqna-opea + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: vector-db + containers: + - name: vector-db + image: redis/redis-stack:7.2.0-v9 + ports: + - containerPort: 6379 + - containerPort: 8001 +--- +apiVersion: v1 +kind: Service +metadata: + name: vector-db +spec: + type: ClusterIP + selector: + app: vector-db + ports: + - name: vector-db-service + port: 6379 + targetPort: 6379 + - name: vector-db-insight + port: 8001 + targetPort: 8001 diff --git a/evals/auto_tuning/benchmark.py b/evals/auto_tuning/benchmark.py new file mode 100644 index 00000000..d63798ae --- /dev/null +++ b/evals/auto_tuning/benchmark.py @@ -0,0 +1,1138 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import concurrent.futures +import json +import random +import time + +import msgspec +import numpy +import requests + +# from transformers import AutoModelForCausalLM, AutoTokenizer + +# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct") + +query_128 = "In a world where technology has advanced beyond our wildest dreams, humanity stands on the brink of a new era. The year is 2050, and artificial intelligence has become an integral part of everyday life. Autonomous vehicles zip through the streets, drones deliver packages with pinpoint accuracy, and smart homes anticipate every need of their inhabitants. But with these advancements come new challenges and ethical dilemmas. As society grapples with the implications of AI, questions about privacy, security, and the nature of consciousness itself come to the forefront. Amidst this backdrop, a new breakthrough in quantum computing promises to revolutionize the field even further." 
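+# NOTE: the fixed prompts below are sized to approximate specific token counts
+# (query_128 ~128 tokens, prompt_2k ~2K, prompt_3k ~3K, query ~512), so each
+# benchmark task exercises a known input length; the commented-out tokenizer
+# code above was used to verify these lengths.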
+ +query_128_zh = "请为我总结以下内容:在古老的东方,有一个神秘而美丽的国家,这里有着悠久的历史和丰富的文化。国家的山川河流孕育了智慧勤劳的人民,他们用双手和智慧创造了辉煌的文明。古老的城市保存着许多历史悠久的建筑和文物,展现了国家的辉煌历史。在这个国家,人们热爱生活,追求幸福,传统节日里欢聚一堂,共享美好时光。这里的教育传统深厚,科技发展迅猛,人民努力追求卓越,文化和美食也非常独特。国家的自然风光和科技进步都引人注目。请详细地总结这些信息的主要内容项。" + +prompt_2k = """ +### 你将扮演一个乐于助人、尊重他人并诚实的助手,你的目标是帮助用户解答问题。有效地利用来自本地知识库的搜索结果。确保你的回答中只包含相关信息。如果你不确定问题的答案,请避免分享不准确的信息。 +### 搜索结果:在遥远的东方,有一个古老而神秘的国家,这个国家有着悠久的历史和丰富的文化。这里的山川河流孕育了无数代勤劳智慧的人民,他们用自己的双手和智慧创造了辉煌的文明。在这片土地上,有着许多美丽的传说和动人的故事。相传很久以前,这里住着一个年轻的樵夫,他每天都到深山里砍柴,虽然生活艰辛,但他始终保持着乐观的态度。一天,樵夫在山中偶然发现了一棵长满奇异果实的树,他小心翼翼地摘下一个果子,尝了一口,顿时感到浑身充满了力量。从那天起,他的生活发生了巨大的变化。这个国家不仅有美丽的自然风光,还有丰富的文化遗产。古老的城市中,保存着许多历史悠久的建筑和文物,它们见证了这个国家的辉煌历史。在这里,你可以看到雄伟的宫殿、精致的园林和宏伟的庙宇,每一处都散发着浓厚的历史气息。走在古老的街道上,仿佛能够听到历史的回声,感受到古人的智慧和创造力。这个国家的人们勤劳勇敢,善良友爱。他们热爱生活,追求幸福。无论是丰收的季节,还是节日的庆典,人们总是欢聚一堂,尽情享受美好的时光。传统的节日里,人们穿上节日的盛装,载歌载舞,互相祝福,分享美食。特别是春节,这是这个国家最重要的节日,家家户户都会张灯结彩,迎接新年的到来。孩子们最喜欢这个时候,因为他们可以收到长辈们的红包,穿上新衣服,吃到各种美味的年货。这个国家还有着深厚的教育传统。古代的学者们孜孜不倦地追求知识,他们留下了许多宝贵的文献和著作,对后世产生了深远的影响。如今,这里的学校和大学依然是培养人才的摇篮,许多年轻人怀揣着梦想,努力学习,期望将来能够为国家和社会做出贡献。科学技术的飞速发展,也为这个国家带来了新的机遇和挑战。人们不断创新,追求卓越,力争在国际舞台上占据一席之地。这个国家的美食也是一大特色。无论是南方的清淡口味,还是北方的浓郁口感,都能找到独特的风味。街头巷尾的小吃摊,饭馆里的家常菜,无不让人垂涎欲滴。特别是火锅,作为这里的代表性美食,深受人们的喜爱。无论寒冬腊月,还是炎炎夏日,围坐在火锅旁,大家一起涮菜,一起聊天,其乐融融。在这个国家的广袤大地上,还有许多值得探索的地方。壮丽的高山、广袤的草原、秀美的江南水乡,每一处都让人流连忘返。探寻自然的奥秘,感受大自然的神奇,是许多人心中的梦想。无论是徒步旅行,还是自驾游,都能在这里找到属于自己的乐趣。这个国家的科技发展也令人瞩目。从古代的四大发明到现代的高科技产业,这里的科技进步始终走在世界前列。许多科技公司和研究机构在这里落户,吸引了大量的科技人才。创新和研发成为推动经济发展的重要力量。尤其是在人工智能、量子计算、生物技术等领域,这个国家取得了显著的成就,为全球科技进步贡献了自己的力量。这个国家的人民非常注重教育和文化传承。从小,孩子们就被教导要尊重知识,热爱学习。学校教育不仅关注学术成绩,还注重培养学生的综合素质和创新能力。各种文化活动和社会实践让学生们在学习的同时,了解社会,增长见识。这个国家的图书馆、博物馆和文化中心遍布各地,为人们提供了丰富的文化资源和学习机会。这个国家的艺术和文学也具有独特的魅力。古代的诗人和画家留下了许多不朽的作品,现代的艺术家们则不断探索新的表现形式和创作方法。无论是传统的书法绘画,还是现代的电影音乐,都展现出这个国家深厚的文化底蕴和创新精神。每年,这里都会举办各种艺术节和文化展览,吸引了大批艺术爱好者和游客。这个国家的体育事业也蓬勃发展。无论是传统的武术,还是现代的竞技体育,都有着广泛的群众基础。每当有大型体育赛事,人们都会热情地支持自己的运动员,为他们加油助威。这个国家的运动员们在国际比赛中屡创佳绩,为国家争光。体育不仅是强身健体的方式,也是增进友谊和团结的纽带。总之,这个古老而又充满活力的国家,正以崭新的姿态走向未来。她的人民团结一心,勇往直前,为实现梦想而努力奋斗。每一个人都是这个国家的一部分,每一个梦想都值得被尊重和珍惜。未来的路还很长,但只要大家携手并进,就一定能够迎来更加美好的明天。在这片充满希望的土地上,还有许多未被发掘的宝藏等待着人们去发现。历史的长河流淌不息,新时代的篇章也在不断书写。这个国家的人民用他们的智慧和双手,创造着一个又一个奇迹。无论是科技的突破,还是文化的传承,每一个领域都在焕发着新的活力。每一位平凡的劳动者,每一位辛勤的学者,每一位勇敢的创新者,都是这个国家走向繁荣与强大的基石。在这片土地上,人们懂得感恩与珍惜。他们感恩祖先留下的宝贵遗产,珍惜现在所拥有的一切。无论生活多么忙碌,人们总会抽出时间,陪伴家人,朋友,共同度过那些美好时光。家庭的温暖,友情的珍贵,这些都是人生中最宝贵的财富。面对未来,这个国家的人民充满信心。他们深知,只有团结一致,才能克服一切困难。无论前方的路多么崎岖,只要大家心往一处想,劲往一处使,就一定能够迎来光明的明天。每一个人的梦想,都是这个国家梦想的一部分。每一个人的努力,都是这个国家前进的动力。无论身处何地,心系祖国,这份情感,永不改变。这个国家的故事还在继续,每一天都是新的篇章。未来的道路上,充满了无限的可能。这个国家的人民,将继续用他们的智慧和努力,书写更加辉煌的历史。无论风雨,始终前行,因为他们相信,光明就在前方,梦想终会实现。在这片充满希望的土地上,还有许多未被发掘的宝藏等待着人们去发现。历史的长河流淌不息,新时代的篇章也在不断书写。这个国家的人民用他们的智慧和双手,创造着一个又一个奇迹。无论是科技的突破,还是文化的传承,每一个领域都在焕发着新的活力。每一位平凡的劳动者,每一位辛勤的学者,每一位勇敢的创新者,都是这个国家走向繁荣与强大的基石。在这片土地上,人们懂得感恩与珍惜。他们感恩祖先留下的宝贵遗产,珍惜现在所拥有的一切。无论生活多么忙碌,人们总会抽出时间,陪伴家人,朋友,共同度过那些美好时光。家庭的温暖,友情的珍贵,这些都是人生中最宝贵的财富。面对未来,这个国家的人民充满信心。他们深知,只有团结一致,才能克服一切困难。无论前方的路多么崎岖,只要大家心往一处想,劲往一处使,就一定能够迎来光明的明天。每一个人的梦想,都是这个国家梦想的一部分。每一个人的努力,都是这个国家前进的动力。无论身处何地,心系祖国,这份情感,永不改变。这个国家的故事还在继续,每一天都是新的篇章。未来的道路上,充满了无限的可能。这个国家的人民,将继续用他们的智慧和努力,书写更加辉煌的历史。无论风雨,始终前行,因为他们相信,光明就在前方,梦想终会实现。这个国家的艺术和文化也在不断发展。传统与现代在这里交融,古老的技艺与新的创意相得益彰。无论是传统的书法绘画,还是现代的电影音乐,这里都有无数的艺术瑰宝等待人们去发现和欣赏。艺术节、文化展览遍布全国,每一处都吸引着大量的游客和艺术爱好者。这里的博物馆收藏了无数珍贵的文物,展示了这个国家丰富的历史和文化。体育也是这个国家人民生活中不可或缺的一部分。每到大型体育赛事,举国上下都充满了激情和活力。运动员们用他们的拼搏精神和优异成绩,为国家赢得了无数荣誉。无论是传统的武术,还是现代的竞技体育,都有着广泛的群众基础。体育不仅是强身健体的方式,也是增进友谊和团结的纽带。这个国家的科技发展也是全球瞩目的。人工智能、量子计算、生物技术等领域,这个国家的研究机构和企业都取得了显著的成就。许多科技公司在国际上享有盛誉,吸引了大量的科技人才前来创业和工作。创新和研发成为推动经济发展的重要力量。教育也是这个国家的优先发展领域。从幼儿园到大学,教育体系不断完善,为每一个孩子提供了公平的教育机会。学校不仅关注学生的学术成绩,还注重他们的综合素质和创新能力的培养。各种课外活动和社会实践让学生们在学习的同时,了解社会,增长见识。教育的不断进步,为国家的发展提供了源源不断的人才支持。旅游业也是这个国家的重要
产业。这里有壮丽的自然景观和丰富的人文景观,每年吸引着大量的游客前来观光。无论是高山、草原、湖泊,还是古城、庙宇、园林,每一处都让人流连忘返。探寻自然的奥秘,感受历史的厚重,是许多人心中的梦想。这个国家的农业也在不断现代化。通过科技的应用,农业生产效率大大提高,农民的生活水平也得到了显著改善。绿色农业和有机农业的发展,不仅保护了环境,也为人们提供了健康的食品。农村的基础设施建设也在不断推进,乡村旅游成为新的经济增长点。在这个国家的每一个角落,人们都在为美好的生活而努力奋斗。无论是城市还是乡村,每一个地方都焕发着新的生机。真棒。 +### 问题:请总结这篇文章。 +### 回答: +""" + +prompt_3k = """ +### 你将扮演一个乐于助人、尊重他人并诚实的助手,你的目标是帮助用户解答问题。有效地利用来自本地知识库的搜索结果。确保你的回答中只包含相关信息。如果你不确定问题的答案,请避免分享不准确的信息。 +### 搜索结果:在遥远的东方,有一个古老而神秘的国家,这个国家有着悠久的历史和丰富的文化。这里的山川河流孕育了无数代勤劳智慧的人民,他们用自己的双手和智慧创造了辉煌的文明。在这片土地上,有着许多美丽的传说和动人的故事。相传很久以前,这里住着一个年轻的樵夫,他每天都到深山里砍柴,虽然生活艰辛,但他始终保持着乐观的态度。一天,樵夫在山中偶然发现了一棵长满奇异果实的树,他小心翼翼地摘下一个果子,尝了一口,顿时感到浑身充满了力量。从那天起,他的生活发生了巨大的变化。这个国家不仅有美丽的自然风光,还有丰富的文化遗产。古老的城市中,保存着许多历史悠久的建筑和文物,它们见证了这个国家的辉煌历史。在这里,你可以看到雄伟的宫殿、精致的园林和宏伟的庙宇,每一处都散发着浓厚的历史气息。走在古老的街道上,仿佛能够听到历史的回声,感受到古人的智慧和创造力。这个国家的人们勤劳勇敢,善良友爱。他们热爱生活,追求幸福。无论是丰收的季节,还是节日的庆典,人们总是欢聚一堂,尽情享受美好的时光。传统的节日里,人们穿上节日的盛装,载歌载舞,互相祝福,分享美食。特别是春节,这是这个国家最重要的节日,家家户户都会张灯结彩,迎接新年的到来。孩子们最喜欢这个时候,因为他们可以收到长辈们的红包,穿上新衣服,吃到各种美味的年货。这个国家还有着深厚的教育传统。古代的学者们孜孜不倦地追求知识,他们留下了许多宝贵的文献和著作,对后世产生了深远的影响。如今,这里的学校和大学依然是培养人才的摇篮,许多年轻人怀揣着梦想,努力学习,期望将来能够为国家和社会做出贡献。科学技术的飞速发展,也为这个国家带来了新的机遇和挑战。人们不断创新,追求卓越,力争在国际舞台上占据一席之地。这个国家的美食也是一大特色。无论是南方的清淡口味,还是北方的浓郁口感,都能找到独特的风味。街头巷尾的小吃摊,饭馆里的家常菜,无不让人垂涎欲滴。特别是火锅,作为这里的代表性美食,深受人们的喜爱。无论寒冬腊月,还是炎炎夏日,围坐在火锅旁,大家一起涮菜,一起聊天,其乐融融。在这个国家的广袤大地上,还有许多值得探索的地方。壮丽的高山、广袤的草原、秀美的江南水乡,每一处都让人流连忘返。探寻自然的奥秘,感受大自然的神奇,是许多人心中的梦想。无论是徒步旅行,还是自驾游,都能在这里找到属于自己的乐趣。这个国家的科技发展也令人瞩目。从古代的四大发明到现代的高科技产业,这里的科技进步始终走在世界前列。许多科技公司和研究机构在这里落户,吸引了大量的科技人才。创新和研发成为推动经济发展的重要力量。尤其是在人工智能、量子计算、生物技术等领域,这个国家取得了显著的成就,为全球科技进步贡献了自己的力量。这个国家的人民非常注重教育和文化传承。从小,孩子们就被教导要尊重知识,热爱学习。学校教育不仅关注学术成绩,还注重培养学生的综合素质和创新能力。各种文化活动和社会实践让学生们在学习的同时,了解社会,增长见识。这个国家的图书馆、博物馆和文化中心遍布各地,为人们提供了丰富的文化资源和学习机会。这个国家的艺术和文学也具有独特的魅力。古代的诗人和画家留下了许多不朽的作品,现代的艺术家们则不断探索新的表现形式和创作方法。无论是传统的书法绘画,还是现代的电影音乐,都展现出这个国家深厚的文化底蕴和创新精神。每年,这里都会举办各种艺术节和文化展览,吸引了大批艺术爱好者和游客。这个国家的体育事业也蓬勃发展。无论是传统的武术,还是现代的竞技体育,都有着广泛的群众基础。每当有大型体育赛事,人们都会热情地支持自己的运动员,为他们加油助威。这个国家的运动员们在国际比赛中屡创佳绩,为国家争光。体育不仅是强身健体的方式,也是增进友谊和团结的纽带。总之,这个古老而又充满活力的国家,正以崭新的姿态走向未来。她的人民团结一心,勇往直前,为实现梦想而努力奋斗。每一个人都是这个国家的一部分,每一个梦想都值得被尊重和珍惜。未来的路还很长,但只要大家携手并进,就一定能够迎来更加美好的明天。在这片充满希望的土地上,还有许多未被发掘的宝藏等待着人们去发现。历史的长河流淌不息,新时代的篇章也在不断书写。这个国家的人民用他们的智慧和双手,创造着一个又一个奇迹。无论是科技的突破,还是文化的传承,每一个领域都在焕发着新的活力。每一位平凡的劳动者,每一位辛勤的学者,每一位勇敢的创新者,都是这个国家走向繁荣与强大的基石。在这片土地上,人们懂得感恩与珍惜。他们感恩祖先留下的宝贵遗产,珍惜现在所拥有的一切。无论生活多么忙碌,人们总会抽出时间,陪伴家人,朋友,共同度过那些美好时光。家庭的温暖,友情的珍贵,这些都是人生中最宝贵的财富。面对未来,这个国家的人民充满信心。他们深知,只有团结一致,才能克服一切困难。无论前方的路多么崎岖,只要大家心往一处想,劲往一处使,就一定能够迎来光明的明天。每一个人的梦想,都是这个国家梦想的一部分。每一个人的努力,都是这个国家前进的动力。无论身处何地,心系祖国,这份情感,永不改变。这个国家的故事还在继续,每一天都是新的篇章。未来的道路上,充满了无限的可能。这个国家的人民,将继续用他们的智慧和努力,书写更加辉煌的历史。无论风雨,始终前行,因为他们相信,光明就在前方,梦想终会实现。在这片充满希望的土地上,还有许多未被发掘的宝藏等待着人们去发现。历史的长河流淌不息,新时代的篇章也在不断书写。这个国家的人民用他们的智慧和双手,创造着一个又一个奇迹。无论是科技的突破,还是文化的传承,每一个领域都在焕发着新的活力。每一位平凡的劳动者,每一位辛勤的学者,每一位勇敢的创新者,都是这个国家走向繁荣与强大的基石。在这片土地上,人们懂得感恩与珍惜。他们感恩祖先留下的宝贵遗产,珍惜现在所拥有的一切。无论生活多么忙碌,人们总会抽出时间,陪伴家人,朋友,共同度过那些美好时光。家庭的温暖,友情的珍贵,这些都是人生中最宝贵的财富。面对未来,这个国家的人民充满信心。他们深知,只有团结一致,才能克服一切困难。无论前方的路多么崎岖,只要大家心往一处想,劲往一处使,就一定能够迎来光明的明天。每一个人的梦想,都是这个国家梦想的一部分。每一个人的努力,都是这个国家前进的动力。无论身处何地,心系祖国,这份情感,永不改变。这个国家的故事还在继续,每一天都是新的篇章。未来的道路上,充满了无限的可能。这个国家的人民,将继续用他们的智慧和努力,书写更加辉煌的历史。无论风雨,始终前行,因为他们相信,光明就在前方,梦想终会实现。这个国家的艺术和文化也在不断发展。传统与现代在这里交融,古老的技艺与新的创意相得益彰。无论是传统的书法绘画,还是现代的电影音乐,这里都有无数的艺术瑰宝等待人们去发现和欣赏。艺术节、文化展览遍布全国,每一处都吸引着大量的游客和艺术爱好者。这里的博物馆收藏了无数珍贵的文物,展示了这个国家丰富的历史和文化。体育也是这个国家人民生活中不可或缺的一部分。每到大型体育赛事,举国上下都充满了激情和活力。运动员们用他们的拼搏精神和优异成绩,为国家赢得了无数荣誉。无论是传统的武术,还是现代的竞技体育,都有着广泛的群众基础。体育不仅是强身健体的方式,也是增进友谊和团结的纽带。这个国家的科技发展也是全球瞩目的。人工智能、量子计算、生物技术等领域,这个国家的研究机构和企业都取得了显著的成就。许多科技公司在国际上享有盛誉,吸引了大量的科技人才前来创业和工作。创新和研发成为推动经济发展的重要力量。教育也是这个国家的优先发展领域。从幼儿园到大学,教育体系不断完善,为每一个孩子提供了公平的教育机会。学校不仅关注学生的学术成绩,还注重他们的综合素质和创新能力的培养。各种课外活动和社会实践让学
生们在学习的同时,了解社会,增长见识。教育的不断进步,为国家的发展提供了源源不断的人才支持。旅游业也是这个国家的重要产业。这里有壮丽的自然景观和丰富的人文景观,每年吸引着大量的游客前来观光。无论是高山、草原、湖泊,还是古城、庙宇、园林,每一处都让人流连忘返。探寻自然的奥秘,感受历史的厚重,是许多人心中的梦想。这个国家的农业也在不断现代化。通过科技的应用,农业生产效率大大提高,农民的生活水平也得到了显著改善。绿色农业和有机农业的发展,不仅保护了环境,也为人们提供了健康的食品。农村的基础设施建设也在不断推进,乡村旅游成为新的经济增长点。在这个国家的每一个角落,人们都在为美好的生活而努力奋斗。无论是城市还是乡村,每一个地方都焕发着新的生机。政府的政策支持和人民的共同努力,让这个国家不断向前发展。未来,这个国家将继续保持开放与包容,吸引更多的人才和资源,推动经济和社会的全面进步。这个国家的历史悠久而丰富,跨越了数千年的风风雨雨。从古代的封建社会到现代的民主制度,这里的历史发展经历了许多波折和变迁。每一个历史时期都留下了深刻的印记,塑造了今天这个国家的面貌。古代的帝国、王朝和 dynasties 都在这里留下了丰 富的文化遗产,今天的人们仍然能够从这些历史遗迹中感受到当时的辉煌与荣耀。古老的皇宫、寺庙和城堡,每一处都讲述着一段段动人的历史故事。现代化进程中,这个国家也经历了巨大的变革。工业化和城市化的快速发展改变了人们的生活方式,也带来了许多新的挑战。传统的农业社会逐渐转变为现代化的工业社会,科技和经济的发展推动了社会的进步。高楼大厦、繁忙的街道和现代化的交通系统,都展示了这个国家的蓬勃发展和强大实力。在这些变化中,人们不断调整自己的生活节奏,适应新的社会环境,同时也在努力保持和传承那些宝贵的传统和文化。这个国家的社会结构也在不断演变。随着教育水平的提高和社会意识的增强,人们对平等和公正的要求越来越高。政府在推动社会进步的同时,也在不断完善法律制度和社会保障体系。平等的机会、社会福利和人权保护,成为了这个国家发展的重要方向。人们的生活质量逐渐提高,社会的整体和谐也得到了改善。各种社会活动和公益事业的兴起,让人们在追求物质财富的同时,也更加关注社会责任和人际关系的和谐。科技的飞速发展对社会各个领域产生了深远的影响。互联网的普及、智能设备的广泛应用,改变了人们的沟通方式和生活习惯。信息技术的进步推动了各行各业的变革,从医疗、教育到金融、娱乐,每个领域都在经历着数字化和智能化的浪潮。电子商务的兴起让购物变得更加方便快捷,在线教育的普及让学习变得更加灵活多样。科技的发展不仅提高了生产效率,也丰富了人们的生活方式。在国际事务中,这个国家也发挥着重要作用。作为全球经济和政治的重要一员,它在国际组织和多边合作中积极参与,推动全球治理和国际合作。无论是应对气候变化、解决国际冲突,还是促进全球贸易和经济发展,这个国家都在为全球的和平与繁荣做出贡献。它与世界各国建立了广泛的外交关系,通过合作与交流,共同应对全球性挑战。这个国家的国际形象也因其在全球事务中的积极作用而得到了提升,赢得了国际社会的尊重和认可。文化交流也是这个国家对外开放的重要组成部分。通过文化交流活动,人民能够更加深入地了解其他国家的风俗习惯和文化背景,同时也向世界展示自己独特的文化魅力。各种文化交流项目、艺术展览和国际比赛,为不同国家和地区的人民提供了交流和学习的机会。通过这些活动,这个国家的文化影响力不断扩大,国际社会对其文化的认同度也在不断提高。人们通过艺术和文化的交流,不仅加深了对其他国家文化的理解,也促进了国际间的友谊和合作。教育的全球化也在不断推进。这个国家的教育体系不仅吸引了大量的国际学生,也在积极参与全球教育合作。国际化的教育项目和跨国交流项目,让学生们能够在多元文化的环境中成长。通过与其他国家的教育机构合作,分享教育资源和教学经验,推动教育领域的共同进步。教育的全球化不仅提升了国家的教育水平,也增强了国际间的文化理解和合作。未来,这个国家将继续在教育领域发挥重要作用,为全球教育的发展做出贡献。这个国家的科技创新不仅在国内产生了深远的影响,也在全球范围内引起了广泛关注。从传统的制造业到现代的高科技产业,这个国家一直致力于推动技术进步和创新。科技园区和研究机构的建立,为科技人才提供了良好的发展平台。政府的政策支持和投资推动了科技领域的快速发展。人工智能、大数据、区块链等前沿技术在这里得到了广泛应用,推动了各个行业的变革。无论是智能家居、自动驾驶还是医疗科技,这些创新技术都在不断提升人们的生活质量和生产效率。与此同时,这个国家也在注重科技伦理和社会责任。科技的进步带来了许多机遇,但也伴随着一些挑战。数据隐私、安全问题和技术滥用等问题,引发了社会各界的广泛关注。政府和科技企业正在积极探索制定相关法规和标准,确保科技的健康发展。在推动科技创新的同时,重视科技对社会的影响,推动科技与社会的和谐发展,成为了国家发展战略的重要组成部分。在环境保护方面,这个国家也采取了积极的措施。面对全球气候变化和环境污染的问题,国家制定了一系列环境保护政策。绿色能源、可再生资源的开发利用,减少碳排放和环境污染,成为了国家发展的重要目标。各种环保项目和绿色技术的应用,有效地改善了空气和水质,保护了生态环境。公众的环保意识也在不断提高,越来越多的人参与到环保行动中,共同推动可持续发展的实现。文化创意产业也是这个国家经济的重要组成部分。影视制作、音乐、文学等领域蓬勃发展,涌现出了一大批优秀的文化作品和艺术家。国家对文化创意产业的支持力度不断加大。真棒。 +### 问题:请总结这篇文章。 +### 回答: +""" + +# tokens = tokenizer.encode(query_128) +# num_tokens = len(tokens) +# #print(num_tokens) + +# tokens of query is 512 +query = "In a world where technology has advanced beyond our wildest dreams, humanity stands on the brink of a new era. The year is 2050, and artificial intelligence has become an integral part of everyday life. Autonomous vehicles zip through the streets, drones deliver packages with pinpoint accuracy, and smart homes anticipate every need of their inhabitants. But with these advancements come new challenges and ethical dilemmas. As society grapples with the implications of AI, questions about privacy, security, and the nature of consciousness itself come to the forefront. Amidst this backdrop, a new breakthrough in quantum computing promises to revolutionize the field even further. Scientists have developed a quantum processor capable of performing calculations at speeds previously thought impossible. This leap in technology opens the door to solving problems that have long stumped researchers, from predicting climate change patterns with unprecedented accuracy to unraveling the mysteries of the human genome. 
However, the power of this new technology also raises concerns about its potential misuse. Governments and corporations race to secure their own quantum capabilities, sparking a new kind of arms race. Meanwhile, a group of rogue programmers, known as the Shadow Collective, seeks to exploit the technology for their own ends. As tensions rise, a young scientist named Dr. Evelyn Zhang finds herself at the center of this unfolding drama. She has discovered a way to harness quantum computing to create a true artificial general intelligence (AGI), a machine capable of independent thought and reasoning. Dr. Zhang's creation, named Athena, possesses the potential to either save humanity from its own worst impulses or to become the ultimate instrument of control. As she navigates the treacherous waters of corporate espionage, government intrigue, and ethical quandaries, Dr. Zhang must decide the fate of her creation and, with it, the future of humanity. Will Athena be a benevolent guardian or a malevolent dictator? The answer lies in the choices made by those who wield its power. The world watches with bated breath as the next chapter in the saga of human and machine unfolds. In the midst of these global tensions, everyday life continues. Children attend schools where AI tutors provide personalized learning experiences. Hospitals use advanced algorithms to diagnose and treat patients with greater accuracy than ever before. The entertainment industry is transformed by virtual reality experiences that are indistinguishable from real life." +my_query = "What is Deep Learning?" +query_rerank_1 = """Deep learning is a subset of machine learning, which itself is a branch of artificial intelligence (AI). It involves the use of neural networks with many layers—hence "deep." These networks are capable of learning from data in a way that mimics human cognition to some extent. The key idea is to create a system that can process inputs through multiple layers where each layer learns to transform its input data into a slightly more abstract and composite representation. In a typical deep learning model, the input layer receives the raw data, similar to the way our senses work. This data is then passed through multiple hidden layers, each of which transforms the incoming data using weights that are adjusted during training. These layers might be specialized to recognize certain types of features in the data, like edges or textures in an image, specific words or phrases in a text, or particular frequency patterns in audio. The final layer produces the output of the model, which could be a class label in classification tasks, a continuous value in regression, or a complex pattern in generative models. Deep learning has been behind many of the recent advancements in AI, including speech recognition, image recognition, natural language processing, and autonomous driving.""" +query_rerank_2 = """Deep learning is a powerful tool in the field of artificial intelligence, but it's important to recognize what it is not. Deep learning is not a solution to all types of data processing or decision-making problems. While deep learning models excel at tasks involving large amounts of data and complex patterns, they are not as effective for tasks that require reasoning, logic, or understanding of abstract concepts, which are better handled by other types of AI algorithms. Deep learning is also not a synonym for all of machine learning. 
Traditional machine learning encompasses a broader range of techniques that include not only neural networks but also methods like decision trees, support vector machines, and linear regression. These traditional models often require less data and computational power and can be more interpretable than deep learning models. They are particularly useful in scenarios where the underlying relationships in the data are more straightforward or where transparency in decision-making is critical. Additionally, deep learning is not inherently unbiased or fair. The models can perpetuate or even amplify biases present in the training data, leading to unfair outcomes in applications like hiring, lending, and law enforcement.""" + +query_1k = """ +### You are a helpful, respectful and honest assistant to help the user with questions. \ +Please refer to the search results obtained from the local knowledge base. \ +But be careful to not incorporate the information that you think is not relevant to the question. \ +If you don't know the answer to a question, please don't share false information. \ +### Search results: In a world where technology has advanced beyond our wildest dreams, humanity stands on the brink of a new era. The year is 2050, and artificial intelligence has become an integral part of everyday life. Autonomous vehicles zip through the streets, drones deliver packages with pinpoint accuracy, and smart homes anticipate every need of their inhabitants. But with these advancements come new challenges and ethical dilemmas. As society grapples with the implications of AI, questions about privacy, security, and the nature of consciousness itself come to the forefront. Amidst this backdrop, a new breakthrough in quantum computing promises to revolutionize the field even further. Scientists have developed a quantum processor capable of performing calculations at speeds previously thought impossible. This leap in technology opens the door to solving problems that have long stumped researchers, from predicting climate change patterns with unprecedented accuracy to unraveling the mysteries of the human genome. However, the power of this new technology also raises concerns about its potential misuse. Governments and corporations race to secure their own quantum capabilities, sparking a new kind of arms race. Meanwhile, a group of rogue programmers, known as the Shadow Collective, seeks to exploit the technology for their own ends. As tensions rise, a young scientist named Dr. Evelyn Zhang finds herself at the center of this unfolding drama. She has discovered a way to harness quantum computing to create a true artificial general intelligence (AGI), a machine capable of independent thought and reasoning. Dr. Zhang's creation, named Athena, possesses the potential to either save humanity from its own worst impulses or to become the ultimate instrument of control. As she navigates the treacherous waters of corporate espionage, government intrigue, and ethical quandaries, Dr. Zhang must decide the fate of her creation and, with it, the future of humanity. Will Athena be a benevolent guardian or a malevolent dictator? The answer lies in the choices made by those who wield its power. The world watches with bated breath as the next chapter in the saga of human and machine unfolds. In the midst of these global tensions, everyday life continues. Children attend schools where AI tutors provide personalized learning experiences. 
Hospitals use advanced algorithms to diagnose and treat patients with greater accuracy than ever before. The entertainment industry is transformed by virtual reality experiences that are indistinguishable from real life. Yet, for all the benefits, there are those who feel left behind by this technological revolution. Communities that once thrived on traditional industries find themselves struggling to adapt. The digital divide grows wider, creating new forms of inequality. Dr. Zhang's journey is not just a scientific quest but a deeply personal one. Her motivations are shaped by a desire to honor her late father's legacy, a pioneer in the field of AI who envisioned a future where technology would serve humanity's highest ideals. As she delves deeper into her research, she encounters allies and adversaries from unexpected quarters. A former colleague, Dr. Marcus Holt, now working for a rival tech giant, becomes both a rival and a potential ally as they navigate their complex relationship. In a hidden lab, far from prying eyes, Dr. Zhang and her team work tirelessly to refine Athena. They face numerous setbacks and breakthroughs, each step bringing them closer to their goal. The ethical implications of their work weigh heavily on them. Can a machine truly understand human emotions? Is it possible to program empathy and compassion? These questions haunt Dr. Zhang as she watches Athena's capabilities grow. As word of Athena's development leaks, the world reacts with a mixture of hope and fear. Protests erupt in major cities, with demonstrators demanding transparency and ethical oversight. Governments convene emergency sessions to discuss the potential impact of AGI on national security and global stability. Amid the chaos, the Shadow Collective launches a cyber-attack on Dr. Zhang's lab, attempting to steal her research. The attack is thwarted, but it serves as a stark reminder of the dangers they face. The final phase of Athena's development involves a series of tests to evaluate her decision-making abilities. This is the whole story.\n +### Question: Summarize the story above into three sentences.\n +### Answer: +""" + +# tokens of query is 1k +query_llm = "In a world where technology has advanced beyond our wildest dreams, humanity stands on the brink of a new era. The year is 2050, and artificial intelligence has become an integral part of everyday life. Autonomous vehicles zip through the streets, drones deliver packages with pinpoint accuracy, and smart homes anticipate every need of their inhabitants. But with these advancements come new challenges and ethical dilemmas. As society grapples with the implications of AI, questions about privacy, security, and the nature of consciousness itself come to the forefront. Amidst this backdrop, a new breakthrough in quantum computing promises to revolutionize the field even further. Scientists have developed a quantum processor capable of performing calculations at speeds previously thought impossible. This leap in technology opens the door to solving problems that have long stumped researchers, from predicting climate change patterns with unprecedented accuracy to unraveling the mysteries of the human genome. However, the power of this new technology also raises concerns about its potential misuse. Governments and corporations race to secure their own quantum capabilities, sparking a new kind of arms race. Meanwhile, a group of rogue programmers, known as the Shadow Collective, seeks to exploit the technology for their own ends. 
As tensions rise, a young scientist named Dr. Evelyn Zhang finds herself at the center of this unfolding drama. She has discovered a way to harness quantum computing to create a true artificial general intelligence (AGI), a machine capable of independent thought and reasoning. Dr. Zhang's creation, named Athena, possesses the potential to either save humanity from its own worst impulses or to become the ultimate instrument of control. As she navigates the treacherous waters of corporate espionage, government intrigue, and ethical quandaries, Dr. Zhang must decide the fate of her creation and, with it, the future of humanity. Will Athena be a benevolent guardian or a malevolent dictator? The answer lies in the choices made by those who wield its power. The world watches with bated breath as the next chapter in the saga of human and machine unfolds. In the midst of these global tensions, everyday life continues. Children attend schools where AI tutors provide personalized learning experiences. Hospitals use advanced algorithms to diagnose and treat patients with greater accuracy than ever before. The entertainment industry is transformed by virtual reality experiences that are indistinguishable from real life. Yet, for all the benefits, there are those who feel left behind by this technological revolution. Communities that once thrived on traditional industries find themselves struggling to adapt. The digital divide grows wider, creating new forms of inequality. Dr. Zhang's journey is not just a scientific quest but a deeply personal one. Her motivations are shaped by a desire to honor her late father's legacy, a pioneer in the field of AI who envisioned a future where technology would serve humanity's highest ideals. As she delves deeper into her research, she encounters allies and adversaries from unexpected quarters. A former colleague, Dr. Marcus Holt, now working for a rival tech giant, becomes both a rival and a potential ally as they navigate their complex relationship. In a hidden lab, far from prying eyes, Dr. Zhang and her team work tirelessly to refine Athena. They face numerous setbacks and breakthroughs, each step bringing them closer to their goal. The ethical implications of their work weigh heavily on them. Can a machine truly understand human emotions? Is it possible to program empathy and compassion? These questions haunt Dr. Zhang as she watches Athena's capabilities grow. As word of Athena's development leaks, the world reacts with a mixture of hope and fear. Protests erupt in major cities, with demonstrators demanding transparency and ethical oversight. Governments convene emergency sessions to discuss the potential impact of AGI on national security and global stability. Amid the chaos, the Shadow Collective launches a cyber-attack on Dr. Zhang's lab, attempting to steal her research. The attack is thwarted, but it serves as a stark reminder of the dangers they face. The final phase of Athena's development involves a series of tests to evaluate her decision-making abilities. Dr. Zhang designs scenarios that challenge Athena to balance competing interests and make ethical choices. In one test, Athena must decide whether to divert a runaway trolley to save a group of people at the expense of one individual. In another, she is tasked with allocating limited medical resources during a pandemic. Each test pushes the boundaries of machine ethics and highlights the complexities of programming morality. Summarize the story above." 
+# length of my_embedding is 768 +my_embedding = [ + 0.00030903306, + -0.06356524, + 0.0025720573, + -0.012404448, + 0.050649878, + 0.023426073, + 0.022131812, + 0.000759529, + -0.00021144224, + -0.03351229, + -0.024963351, + 0.0064628883, + -0.007054883, + 0.066674456, + 0.0013026494, + 0.046839874, + 0.06272031, + -0.021033816, + 0.011214508, + 0.043999936, + -0.050784662, + -0.06221004, + -0.04018244, + 0.017779319, + -0.0013301502, + 0.0022156204, + -0.043744676, + 0.012752031, + -0.023972677, + 0.011199989, + 0.028703978, + -0.0089899, + 0.03712499, + -0.027488017, + 0.016138831, + 0.041751742, + -0.03958115, + -0.03528769, + -0.022453403, + -0.019844962, + -0.018594252, + -0.042406067, + -0.0120475935, + 0.049004447, + -0.08094748, + 0.017947419, + -0.12090019, + 0.0023762283, + -0.022721844, + -0.0122670885, + -0.07537693, + 0.051195897, + 0.032084838, + -0.0191422, + 0.042885557, + 0.0152152525, + 0.0042946604, + -0.08067345, + 0.010296512, + -0.05629215, + 0.051881734, + 0.037080515, + -0.018511552, + -0.027629064, + -0.0010543121, + -0.02618493, + 0.024228664, + 0.042858265, + -0.02330382, + -0.0034123377, + -0.028686361, + 0.029237133, + -0.020652898, + -0.005005634, + -0.052511718, + -0.011031183, + 0.012807135, + 0.0143450685, + 0.08218706, + -0.008386834, + 0.0036734014, + 0.06236072, + 0.04255367, + 0.03158083, + 0.004631116, + 0.0007993413, + -0.019410692, + -0.004640353, + -0.044894144, + 0.022581149, + 0.010380893, + -0.053084206, + 0.060135297, + 0.051447738, + 0.014172936, + 0.0076013976, + 0.01375325, + -0.035371594, + -0.011681993, + -0.014776056, + -0.023268431, + -0.0590664, + -0.016947128, + -0.0146322865, + -0.048343826, + 0.026675656, + 0.052418776, + -0.013986488, + 0.014608619, + -0.019658033, + -0.0014043319, + -0.008499042, + -0.0025460746, + -0.04858996, + -0.04293979, + -0.00791175, + -0.01644228, + 0.0038053868, + -0.025010196, + -0.04599194, + 0.03430527, + 0.0382939, + 0.0019500003, + 0.021234535, + -0.03411336, + 0.015422987, + 0.0040041124, + 0.018236278, + 0.004566607, + -0.02694257, + 0.020603696, + 0.0168677, + -0.007864176, + 0.02186715, + -0.014774427, + 0.00078197615, + -0.020355146, + 0.006654448, + 0.025772778, + 0.009957317, + -0.0025282202, + -0.0579994, + 0.030099394, + -0.03549671, + 0.05439607, + -0.015254235, + -0.007988717, + -0.004305188, + -0.018912116, + 0.0027841094, + -0.044504374, + 0.05556499, + -0.018894102, + -0.049442377, + 0.008305442, + 0.039805025, + -0.00042916916, + 0.0059957127, + 0.034555893, + 0.02306613, + 0.05890197, + -0.019604865, + -0.05472663, + -0.009928875, + -0.02455136, + -0.054289207, + 0.055403363, + 0.024503028, + -0.019979116, + 0.025056925, + -0.0020133695, + -0.011331945, + 0.020181546, + -0.012020893, + 0.011718686, + 0.047295712, + 0.028600235, + 0.034037635, + 0.043115, + 0.051445063, + -0.065478735, + 0.046462707, + -0.00893844, + -0.0063705654, + -0.044797033, + -0.03157799, + 0.04950285, + -0.010792562, + 0.03688506, + 0.014347515, + -0.063743494, + -0.036214367, + -0.03380074, + -0.03769261, + 0.033050846, + -0.016999796, + -0.015086913, + 0.082186624, + -0.011051229, + 0.04645044, + 0.054343436, + -0.05152064, + 0.015258479, + -0.016340451, + -0.027205588, + 0.029828794, + 0.01575663, + -0.04375617, + -0.003217223, + 0.0033928305, + 0.0076283724, + -0.049442016, + -0.0053870296, + 0.001464261, + 0.043246116, + 0.030448606, + -0.007991404, + -0.00472732, + 0.0065691406, + -0.018045014, + 0.0050486918, + -0.042211313, + 0.024785575, + 0.002973673, + 0.008309046, + 0.08794761, + 0.041150656, + 
+    -0.051644977,
+    0.03518446,
+    -0.037274398,
+    # ... (several hundred additional embedding values elided for readability) ...
+    -0.029969815,
+    -0.0049176104,
+]
+
+DATA = {
+    "tei_embedding": {"inputs": query_128},
+    "mosec_embedding": {"input": query_128, "model": "/root/bge-large-zh-v1.5"},
+    "embedding": {"text": query_128},
+    "neuralspeed_embedding": {"query": query_128},
+    "guardrail": {"text": "How do you buy a tiger in the US?"},
+    "retrieval": {"text": my_query, "embedding": my_embedding},
+    "tei_rerank": {"query": my_query, "texts": [query_rerank_1, query_rerank_2]},
+    "mosec_rerank": {"query": my_query, "texts": [query_rerank_1, query_rerank_2]},
+    "reranking": {"initial_query": my_query, "retrieved_docs": [{"text": query_rerank_1}, {"text": query_rerank_2}]},
+    "tgi": {"inputs": query_llm, "parameters": {"max_new_tokens": 128}},
+    "llm": {"query": query_llm, "max_new_tokens": 128},
+    "rag": {"messages": query_128, "max_tokens": 128},
+}
+
+
+def send_single_request(task, idx, queries, concurrency, url):
+    """Send the idx-th, (idx+concurrency)-th, ... queries and record per-request timings."""
+    res = []
+    headers = {"Content-Type": "application/json"}
+    data = DATA[task]
+    while idx < len(queries):
+        start_time = time.time()
+        response = requests.post(url, json=data, headers=headers)
+        end_time = time.time()
+        res.append({"idx": idx, "start": start_time, "end": end_time})
+        idx += concurrency
+        print(response.content)
+    return res
+
+
+def send_concurrency_requests(task, request_url, num_queries):
+    # Use one worker per five queries (minimum of one worker).
+    if num_queries <= 5:
+        concurrency = 1
+    else:
+        concurrency = num_queries // 5
+    responses = []
+    stock_queries = [query for _ in range(num_queries)]
+    test_start_time = time.time()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
+        futures = []
+        for i in range(concurrency):
+            futures.append(
+                executor.submit(
+                    send_single_request,
+                    task=task,
+                    idx=i,
+                    queries=stock_queries,
+                    concurrency=concurrency,
+                    url=request_url,
+                )
+            )
+        for future in concurrent.futures.as_completed(futures):
+            responses = responses + future.result()
+    test_end_time = time.time()
+
+    print("=======================")
+    for r in responses:
+        r["total_time"] = r["end"] - r["start"]
+        print("query:", r["idx"], " time taken:", r["total_time"])
+
+    print("=======================")
+    print(f"Total Concurrency: {concurrency}")
+    print(f"Total Requests: {len(stock_queries)}")
+    print(f"Total Test time: {test_end_time - test_start_time}")
+
+    response_times = [r["total_time"] for r in responses]
+
+    # Calculate the P50 (median)
+    p50 = numpy.percentile(response_times, 50)
+    print("P50 total latency is ", p50, "s")
+
+    # Calculate the P99
+    p99 = numpy.percentile(response_times, 99)
+    print("P99 total latency is ", p99, "s")
+
+    return p50, p99
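+
+
+# A note on the fan-out scheme above (illustrative numbers): worker i starts at
+# index i and advances by `concurrency`, so the workers partition the query list
+# without overlap. For example, with num_queries=10 and concurrency=2, worker 0
+# sends queries 0, 2, 4, 6, 8 and worker 1 sends queries 1, 3, 5, 7, 9.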
+
+
+def send_single_request_zh(task, idx, queries, concurrency, url, data_zh=None):
+    res = []
+    headers = {"Content-Type": "application/json"}
+    query = random.choice(data_zh)
+    # Build the task-specific payload; "rag" is the default shape.
+    data = {"messages": query, "max_tokens": 128}
+    if task == "embedding":
+        data = {"text": query}
+    elif task == "llm":
+        data = {"query": query, "max_new_tokens": 128}
+    print(data)
+    while idx < len(queries):
+        start_time = time.time()
+        response = requests.post(url, json=data, headers=headers)
+        end_time = time.time()
+        res.append({"idx": idx, "start": start_time, "end": end_time})
+        idx += concurrency
+        print(response.content)
+    return res
+
+
+def send_single_request_v2(task, idx, queries, concurrency, url, data_zh=None):
+    res = []
+    headers = {"Content-Type": "application/json"}
+    data = DATA[task]
+    if task == "neuralspeed_embedding":
+        # NeuralSpeed expects a msgpack-encoded body instead of JSON.
+        data = msgspec.msgpack.encode(data)
+    print(data)
+    while idx < len(queries):
+        start_time = time.time()
+        if task in ["llm", "tgi", "rag"]:
+            response = requests.post(url, json=data, headers=headers, stream=True)
+            token_idx = 0
+            is_first = True
+            first_token_finished_time = start_time  # fallback if the stream returns no chunks
+            for chunk in response.iter_content(chunk_size=1024):
+                if chunk:
+                    if is_first:
+                        first_token_finished_time = time.time()
+                        is_first = False
+                    token_idx += 1
+            end_time = time.time()  # end time equals the last_token_finished_time
+            res.append(
+                {
+                    "idx": idx,
+                    "start": start_time,
+                    "end": end_time,
+                    "first_token_finished_time": first_token_finished_time,
+                    "token_num": token_idx - 1,
+                }
+            )
+            idx += concurrency
+        else:
+            if task == "neuralspeed_embedding":
+                response = requests.post(url, data=data, headers=headers)
+            else:
+                response = requests.post(url, json=data, headers=headers)
+            end_time = time.time()
+            res.append({"idx": idx, "start": start_time, "end": end_time})
+            idx += concurrency
+            print(response.content)
+    return res
r["total_time"] = r["end"] - r["start"] + print("query:", r["idx"], " time taken:", r["total_time"]) + + print("=======================") + print(f"Total Concurrency: {concurrency}") + print(f"Total Requests: {len(stock_queries)}") + print(f"Total Test time: {test_end_time - test_start_time}") + + response_times = [r["total_time"] for r in responses] + print("responses===================", responses) + if task in ["llm", "tgi", "rag"]: + first_token_times = [r["first_token_time"] for r in responses] + avg_token_time = [r["avg_token_time"] for r in responses] + + # Calculate the P50 (median) + p50_total = numpy.percentile(response_times, 50) + avg_total = numpy.mean(response_times) + print("P50 total latency is ", p50_total, "s") + if task in ["llm", "tgi", "rag"]: + p50_first = numpy.percentile(first_token_times, 50) + p50_avg = numpy.percentile(avg_token_time, 50) + print("P50 first token latency is ", p50_first, "s") + print("P50 average token latency is ", p50_avg, "s") + + p90_total = numpy.percentile(response_times, 90) + print("P90 total latency is ", p90_total, "s") + if task in ["llm", "tgi", "rag"]: + p90_first = numpy.percentile(first_token_times, 90) + p90_avg = numpy.percentile(avg_token_time, 90) + print("P90 first token latency is ", p90_first, "s") + print("P90 average token latency is ", p90_avg, "s") + + # Calculate the P99 + p99_total = numpy.percentile(response_times, 99) + print("P99 total latency is ", p99_total, "s") + if task in ["llm", "tgi", "rag"]: + p99_first = numpy.percentile(first_token_times, 99) + p99_avg = numpy.percentile(avg_token_time, 99) + print("P99 first token latency is ", p99_first, "s") + print("P99 average token latency is ", p99_avg, "s") + + if task in ["llm", "tgi", "rag"]: + return avg_total, p50_total, p90_total, p99_total, p50_first, p90_first, p99_first, p50_avg, p90_avg, p99_avg + else: + return avg_total, p50_total, p90_total, p99_total + + +def send_concurrency_requests_zh(task, request_url, num_queries): + if num_queries <= 4: + concurrency = 1 + else: + concurrency = num_queries // 4 + + data_zh = [] + file_path = "./stress_benchmark/data_zh.txt" + with open(file_path, "r") as file: + for line in file: + data_zh.append(line.strip()) + + responses = [] + stock_queries = [query for _ in range(num_queries)] + test_start_time = time.time() + with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor: + futures = [] + for i in range(concurrency): + futures.append( + executor.submit( + send_single_request_zh, + task=task, + idx=i, + queries=stock_queries, + concurrency=concurrency, + url=request_url, + data_zh=data_zh, + ) + ) + for future in concurrent.futures.as_completed(futures): + responses = responses + future.result() + test_end_time = time.time() + + print("=======================") + for r in responses: + r["total_time"] = r["end"] - r["start"] + print("query:", r["idx"], " time taken:", r["total_time"]) + + print("=======================") + print(f"Total Concurrency: {concurrency}") + print(f"Total Requests: {len(stock_queries)}") + print(f"Total Test time: {test_end_time - test_start_time}") + + response_times = [r["total_time"] for r in responses] + + # Calculate the P50 (median) + p50 = numpy.percentile(response_times, 50) + print("P50 total latency is ", p50, "s") + + # Calculate the P99 + p99 = numpy.percentile(response_times, 99) + print("P99 total latency is ", p99, "s") + + return p50, p99 + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Concurrent client to send POST 
requests") + parser.add_argument("--task", type=str, default="llm", help="Task to perform") + parser.add_argument("--url", type=str, default="http://localhost:8080", help="Service URL") + parser.add_argument("--num-queries", type=int, default=192, help="Number of queries to be sent") + parser.add_argument("--zh", help="data_zh", action="store_true") + args = parser.parse_args() + + if args.zh: + send_concurrency_requests_zh(args.task, args.url, args.num_queries) + else: + send_concurrency_requests(args.task, args.url, args.num_queries) diff --git a/evals/auto_tuning/chatqna_neuralchat_rerank_latest.yaml b/evals/auto_tuning/chatqna_neuralchat_rerank_latest.yaml new file mode 100644 index 00000000..0571e359 --- /dev/null +++ b/evals/auto_tuning/chatqna_neuralchat_rerank_latest.yaml @@ -0,0 +1,67 @@ + +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +opea_micro_services: + embedding: + opea/embedding-tei: + tag: latest + type: cpu + dependency: + ghcr.io/huggingface/text-embeddings-inference: + tag: cpu-1.5 + type: cpu + requirements: + model_id: "BAAI/bge-base-en-v1.5" + + llm: + opea/llm-tgi: + tag: latest + type: cpu + dependency: + ghcr.io/huggingface/tgi-gaudi: + tag: 2.0.4 + type: hpu + requirements: + model_id: "Intel/neural-chat-7b-v3-3" + ghcr.io/huggingface/text-generation-inference: + tag: 1.4 + type: cpu + requirements: + model_id: "Intel/neural-chat-7b-v3-3" + + data_prep: + opea/dataprep-redis: + tag: latest + type: cpu + dependency: + redis/redis-stack: + tag: 7.2.0-v9 + type: cpu + + + reranking: + opea/reranking-tei: + tag: latest + type: cpu + dependency: + ghcr.io/huggingface/text-embeddings-inference: + tag: cpu-1.5 + type: cpu + requirements: + model_id: "BAAI/bge-reranker-base" + opea/tei-gaudi: + tag: latest + type: hpu + requirements: + model_id: "BAAI/bge-reranker-base" + + retrieval: + opea/retriever-redis: + tag: latest + type: cpu + +opea_mega_service: + opea/chatqna: + tag: latest + type: cpu diff --git a/evals/auto_tuning/hardware_info_gaudi.json b/evals/auto_tuning/hardware_info_gaudi.json new file mode 100644 index 00000000..ba16b690 --- /dev/null +++ b/evals/auto_tuning/hardware_info_gaudi.json @@ -0,0 +1,9 @@ +{ + "device_0": { + "ip": ["100.83.111.232"], + "type": "hpu", + "sockets": 2, + "cores_per_socket": 80, + "num_cards": 8 + } +} diff --git a/evals/auto_tuning/kubernetes/__init__.py b/evals/auto_tuning/kubernetes/__init__.py new file mode 100644 index 00000000..c495d189 --- /dev/null +++ b/evals/auto_tuning/kubernetes/__init__.py @@ -0,0 +1,6 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +# diff --git a/evals/auto_tuning/kubernetes/prepare_k8s_pods.sh b/evals/auto_tuning/kubernetes/prepare_k8s_pods.sh new file mode 100755 index 00000000..fefeaf2a --- /dev/null +++ b/evals/auto_tuning/kubernetes/prepare_k8s_pods.sh @@ -0,0 +1,72 @@ +#!/bin/bash + +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +# PATH_PREFIX="./manifest" + +function apply_yamls() { + JSON_FILE="$1" + PATH_PREFIX="$2" + + kubectl apply -f "$PATH_PREFIX/chatqna_config_map.yaml" + echo "chatqna_config_map.yaml is applied." + + # Build and run k8s pods for each service + sudo jq -c '. 
| to_entries[]' $JSON_FILE | while read -r service; do + SERVICE_NAME=$(echo $service | jq -r '.key') + echo "Applying manifest from ${SERVICE_NAME}_run.yaml" + kubectl apply -f "$PATH_PREFIX/${SERVICE_NAME}_run.yaml" + done + + echo "All services have been applied." +} + + +function delete_yamls() { + JSON_FILE="$1" + PATH_PREFIX="$2" + + # Build and run k8s pods for each service + sudo jq -c '. | to_entries[]' $JSON_FILE | while read -r service; do + SERVICE_NAME=$(echo $service | jq -r '.key') + echo "Deleting manifest from ${SERVICE_NAME}_run.yaml" + kubectl delete -f "$PATH_PREFIX/${SERVICE_NAME}_run.yaml" + done + + echo "All services have been deleted." +} + + +function main() { + + # Check if a task is provided as an argument + if [ "$#" -ne 3 ]; then + echo "Please pass a task argument." + exit 1 + fi + + local TASK="$1" + local JSON_FILE="$2" + local PATH_PREFIX="$3" + + + case "$TASK" in + *apply*) + apply_yamls $JSON_FILE $PATH_PREFIX + echo "[ apply ] Succeed" + ;; + *delete*) + delete_yamls $JSON_FILE $PATH_PREFIX + echo "[ delete ] Succeed" + ;; + *) + echo "Task $TASK is not supported" + exit 1 + ;; + esac + +} + +main "$@" diff --git a/evals/auto_tuning/kubernetes/prepare_manifest.py b/evals/auto_tuning/kubernetes/prepare_manifest.py new file mode 100644 index 00000000..010ff3c6 --- /dev/null +++ b/evals/auto_tuning/kubernetes/prepare_manifest.py @@ -0,0 +1,132 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import json +import logging +import os +import subprocess +import time +from pathlib import Path + +import yaml + +log_level = os.getenv("LOGLEVEL", "INFO") +logging.basicConfig(level=log_level.upper(), format="%(asctime)s - %(levelname)s - %(message)s") + + +def update_model_id(service_name, chatqna_config_map, service_info): + if "embed" in service_name: + key = "EMBEDDING_MODEL_ID" + elif "rerank" in service_name: + key = "RERANK_MODEL_ID" + elif "llm" in service_name: + key = "LLM_MODEL_ID" + elif "guard" in service_name: + key = "GUARDRAIL_LLM_MODEL_ID" + else: + raise Exception(f"Service {service_name} does not support model_id now.") + # service_info may not include the model_id. 
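+
+# Example (illustrative): apply all manifests generated from strategy_0.json:
+#   bash kubernetes/prepare_k8s_pods.sh apply strategy_0.json ./manifest/general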
+ if "model_id" in service_info: + chatqna_config_map["data"][key] = service_info["model_id"] + + +def update_hpu_env(manifest_content, service_info, service_name, chatqna_config_map): + + env_list = [ + {"name": "OMPI_MCA_btl_vader_single_copy_mechanism", "value": "none"}, + {"name": "PT_HPU_ENABLE_LAZY_COLLECTIVES", "value": "true"}, + {"name": "runtime", "value": "habana"}, + {"name": "HABANA_VISIBLE_DEVICES", "value": "all"}, + {"name": "HF_TOKEN", "value": chatqna_config_map["data"]["HUGGINGFACEHUB_API_TOKEN"]}, + ] + + if service_name == "reranking-dependency": + env_list.append({"name": "MAX_WARMUP_SEQUENCE_LENGTH", "value": "512"}) + + manifest_content["spec"]["template"]["spec"]["containers"][0]["env"] = env_list + + if "mosec" in manifest_content["metadata"]["name"]: + if service_info.get("cards") and service_info["cards"] > 1: + manifest_content["spec"]["template"]["spec"]["containers"][0]["args"] = [] + manifest_content["spec"]["template"]["spec"]["containers"][0]["args"].extend( + ["--sharded", "true", "--num-shard", str(service_info["cards"])] + ) + else: + if ( + service_info.get("cards") + and "--sharded" not in manifest_content["spec"]["template"]["spec"]["containers"][0]["args"] + and service_info["cards"] > 1 + ): + manifest_content["spec"]["template"]["spec"]["containers"][0]["args"].extend( + ["--sharded", "true", "--num-shard", str(service_info["cards"])] + ) + + +def update_deployment_resources(manifest_content, service_info): + manifest_content["spec"]["replicas"] = service_info["replica"] + manifest_content["spec"]["template"]["spec"]["containers"][0]["image"] = service_info["image"] + + resources = manifest_content["spec"]["template"]["spec"]["containers"][0].get("resources", {}) + limits = resources.get("limits", {}) + requests = resources.get("requests", {}) + + if service_info.get("cores"): + limits["cpu"] = service_info["cores"] + requests["cpu"] = service_info["cores"] + if service_info.get("memory"): + limits["memory"] = service_info["memory"] + requests["memory"] = service_info["memory"] + if service_info.get("cards"): + limits["habana.ai/gaudi"] = service_info["cards"] + + if limits != {}: + resources["limits"] = limits + if requests != {}: + resources["requests"] = requests + if resources != {}: + manifest_content["spec"]["template"]["spec"]["containers"][0]["resources"] = resources + + +def update_k8s_yaml(json_file, manifest_directory="./manifest/general"): + + # read json file + with open(json_file, "r") as file: + services = json.load(file) + + # 01. Updating the chatqna_config_yaml + config_filepath = Path(manifest_directory) / "chatqna_config_map.yaml" + with open(config_filepath, "r") as file: + chatqna_config_map = yaml.safe_load(file) + + for service_name, service_info in services.items(): + + # update model_id in config_map.yaml + if service_name in ["embedding-dependency", "reranking-dependency", "llm-dependency", "guardrails-dependency"]: + update_model_id(service_name, chatqna_config_map, service_info) + + with open(config_filepath, "w") as file: + yaml.dump(chatqna_config_map, file, default_flow_style=False, sort_keys=False) + logging.info(f"YAML file for {config_filepath} has been updated successfully.") + + # 02. 
diff --git a/evals/auto_tuning/replica_tuning_config.json b/evals/auto_tuning/replica_tuning_config.json
new file mode 100644
index 00000000..55495b00
--- /dev/null
+++ b/evals/auto_tuning/replica_tuning_config.json
@@ -0,0 +1,9 @@
+{
+    "embedding_replicas_granularity": 1,
+    "embedding_replicas_min": 1,
+    "embedding_replicas_max": 2,
+
+    "microservice_replicas_granularity": 1,
+    "microservice_replicas_min": 1,
+    "microservice_replicas_max": 4
+}
diff --git a/evals/auto_tuning/tuning.py b/evals/auto_tuning/tuning.py
new file mode 100644
index 00000000..9458bf2f
--- /dev/null
+++ b/evals/auto_tuning/tuning.py
@@ -0,0 +1,560 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import argparse
+import copy
+import json
+import logging
+import os
+import shutil
+import subprocess
+import time
+
+import tuning_utils
+from benchmark import send_concurrency_requests
+from kubernetes.prepare_manifest import update_k8s_yaml
+
+log_level = os.getenv("LOGLEVEL", "INFO")
+logging.basicConfig(level=log_level.upper(), format="%(asctime)s - %(levelname)s - %(message)s")
+
+
+def generate_base_config(megaservice_info, hardware_info, **kwargs):
+
+    hpu_exist = tuning_utils.check_hpu_device(hardware_info)
+
+    json_content = {}
+
+    def process_microservice(service_name, microservice_info):
+        microservice_name = list(microservice_info.keys())[0]
+        microservice_details = microservice_info[microservice_name]
+        image_name = f"{microservice_name}:{microservice_details['tag']}"
+        service_type = microservice_details["type"]
+
+        def add_dependency(service_key, json_key):
+            if "dependency" not in microservice_details:
+                return
+            dependency = microservice_details["dependency"]
+
+            # Prefer an HPU dependency when Gaudi cards are available.
+            for key, value in dependency.items():
+                if hpu_exist and value["type"] == "hpu":
+                    image_name = f"{key}:{value['tag']}"
+                    json_content[json_key] = {"type": "hpu", "image": image_name}
+                    if "requirements" in value:
+                        json_content[json_key]["model_id"] = value["requirements"]["model_id"]
+                    return
+
+            # Otherwise fall back to the first CPU dependency.
+            for key, value in dependency.items():
+                if value["type"] == "cpu":
+                    image_name = f"{key}:{value['tag']}"
+                    json_content[json_key] = {"type": "cpu", "image": image_name}
+                    if "requirements" in value:
+                        json_content[json_key]["model_id"] = value["requirements"]["model_id"]
+                    break
+
+        if service_name == "data_prep":
+            json_content["dataprep-microservice"] = {"type": service_type, "image": image_name}
+            add_dependency(service_name, "vector-db")
+        else:
+            json_key = f"{service_name}-microservice"
+            json_content[json_key] = {"type": service_type, "image": image_name}
+            add_dependency(service_name, f"{service_name}-dependency")
+
+    for service_name, service_info in megaservice_info.get("opea_micro_services", {}).items():
+        process_microservice(service_name, service_info)
+
+    for service_name, service_info in megaservice_info.get("opea_mega_service", {}).items():
+        image_name = f"{service_name}:{service_info['tag']}"
+        json_content["chatqna_mega_service"] = {"image": image_name}
+        if "type" in service_info:
+            json_content["chatqna_mega_service"]["type"] = service_info["type"]
+
+    return json_content
+
+
+class ReplicaTuning:
+    llm_microservice_name = {"llm-microservice"}
+    llm_dependency_name = {"llm-dependency"}
+
+    guardrails_microservice_name = {"guardrails-microservice"}
+    guardrails_dependency_name = {"guardrails-dependency"}
+
+    reranking_microservice_name = {"reranking-microservice"}
+    reranking_dependency_name = {"reranking-dependency"}
+
+    embedding_dependency_name = {"embedding-dependency"}
+    embedding_microservice_name = {"embedding-microservice"}
+
+    tei_microservice_name = {"embedding-microservice"}
+    tei_dependency_name = {"embedding-dependency"}
+
+    vector_db_name = {"vector-db"}
+    dataprep_microservice_name = {"dataprep-microservice"}
+
+    retrieval_microservice_name = {"retrieval-microservice"}
+    chatqna_mega_service_name = {"chatqna_mega_service"}
+
+    def _load_tuning_config(self, file_path):
+        try:
+            with open(file_path, "r") as f:
+                data = json.load(f)
+        except FileNotFoundError:
+            data = {}
+
+        replicas_config = {
+            "embedding_replicas_granularity": 1,
+            "embedding_replicas_min": 1,
+            "embedding_replicas_max": 1,
+            "reranking_replicas_granularity": 1,
+            "reranking_replicas_min": 1,
+            "reranking_replicas_max": 1,
+            "num_microservice_replica_by_default": 1,
+            "microservice_replicas_granularity": 1,
+            "microservice_replicas_min": 1,
+            "microservice_replicas_max": 1,
+        }
+
+        param_defaults = {}
+        param_defaults.update(replicas_config)
+
+        # Update the data dictionary with default values if the key is missing
+        for param, default_value in param_defaults.items():
+            data[param] = data.get(param, default_value)
+
+        return data
+
+    def __init__(self, config, hardware_info, tuning_config_path, platform="k8s"):
+        self.config = config
+        self.hardware_info = hardware_info
+        self.platform = platform
+
+        self.heterogeneous = self._is_heterogeneous(hardware_info)
+        if self.heterogeneous:
+            self.num_cards = self._get_hpu_num_cards(hardware_info)
+
+        self.reranking_exists = self._check_reranking_exists(config)
+        self.reranking_on_hpu = self._check_reranking_on_gaudi(config)
+        logging.info(f"Deployed reranking on hpu: {self.reranking_on_hpu}")
+        if self.reranking_exists:
+            self.tei_dependency_name.add(list(self.reranking_dependency_name)[0])
+            self.tei_microservice_name.add(list(self.reranking_microservice_name)[0])
+
+        self.guardrails_exists = self._check_guardrails_exists(config)
+        self._load_hardware_info()
+        self.tuning_config_data = self._load_tuning_config(tuning_config_path)
+        self._load_tuning_parameters(self.tuning_config_data)
+
+        self.reserved_svc_cores = -1
+
+        self.strategy_version = 1
+
+        self.embedding_replicas_list = list(
+            range(self.embedding_replicas_min, self.embedding_replicas_max + 1, self.embedding_replicas_granularity)
+        )
+
+        self.microservice_replicas_list = list(
+            range(
+                self.microservice_replicas_min,
+                self.microservice_replicas_max + 1,
+                self.microservice_replicas_granularity,
+            )
+        )
+
+        logging.info(f"embedding_replicas_list: {self.embedding_replicas_list}")
+        logging.info(f"microservice_replicas_list: {self.microservice_replicas_list}")
logging.info(f"microservice_replicas_list: {self.microservice_replicas_list}") + + def _load_hardware_info(self): + self.reserved_cores_by_default = 4 + self.total_cores, self.physcial_cores, self.max_cores_per_socket, self.total_sockets = self._get_cores_info( + self.hardware_info + ) + + def _load_tuning_parameters(self, tuning_config_data): + + param_names = list(tuning_config_data.keys()) + + for param in param_names: + setattr(self, param, tuning_config_data.get(param)) + + # Log all parameters + log_info = {param: getattr(self, param) for param in param_names} + logging.info(f"Tuning Config Parameters: {log_info}") + + def _get_cores_info(self, hardware_info): + total_cores = 0 + physcial_cores = 0 + max_cores_per_socket = 0 + total_sockets = 0 + for device_key, device_info in hardware_info.items(): + num_devices = len(device_info["ip"]) + cores = device_info["cores_per_socket"] + sockets = device_info["sockets"] + + total_sockets += sockets * num_devices + total_cores += cores * sockets * num_devices + physcial_cores += cores * num_devices + if cores > max_cores_per_socket: + max_cores_per_socket = cores + + logging.info( + f"The total cores: {total_cores}, physcial_cores: {physcial_cores}, max_cores_per_socket: {max_cores_per_socket}, total_sockets: {total_sockets}" + ) + + return total_cores, physcial_cores, max_cores_per_socket, total_sockets + + def _get_hpu_num_cards(self, hardware_info): + num_cards = 0 + for device_key, device_info in hardware_info.items(): + if device_info["type"] == "hpu": + num_devices = len(device_info["ip"]) + num_cards = device_info["num_cards"] * num_devices + + return num_cards + + def _check_reranking_exists(self, config): + reranking_exists = False + for service_name, service_config in config.items(): + if service_name in self.reranking_dependency_name: + reranking_exists = True + break + + return reranking_exists + + def _check_reranking_on_gaudi(self, config): + on_gaudi = False + for service_name, service_config in config.items(): + if service_name in self.reranking_dependency_name: + if self.heterogeneous and service_config["type"] == "hpu": + on_gaudi = True + break + + return on_gaudi + + def _check_guardrails_exists(self, config): + exist = False + for service_name, _ in config.items(): + if service_name in self.guardrails_dependency_name: + exist = True + return exist + + def _is_heterogeneous(self, hardware_info): + hpu_exist = False + for _, device_info in hardware_info.items(): + if device_info["type"] == "hpu": + hpu_exist = True + break + + return hpu_exist + + def apply_strategy(self): + if self.strategy_version == 1: + results = [] + for num_replica in self.microservice_replicas_list: + self._microservice_replicas_allocation_v1(self.config, num_replica) + if self.platform == "k8s": + result = self.k8s_strategy(self.config) + + results.append(result) + + return results + + def _microservice_replicas_allocation_v1(self, config, num_replica=1): + for service_name, service_config in config.items(): + if service_name in self.chatqna_mega_service_name: + service_config["replica"] = num_replica + + for service_name, service_config in config.items(): + if service_name in self.vector_db_name or service_name in self.dataprep_microservice_name: + service_config["replica"] = self.num_microservice_replca_by_default + + elif "microservice" in service_name: + if service_name in self.retrieval_microservice_name: + service_config["replica"] = num_replica + continue + else: + # the rest of microservice + service_config["replica"] = num_replica + + def 
+
+    def _replicas_allocation_on_heterogeneous(self, config):
+        output = []
+        if self.num_cards == 0:
+            logging.error("Please check the number of Gaudi cards in the hardware info config.")
+
+        num_cards = self.num_cards
+        # Reserve one card each for guardrails and (when on HPU) reranking.
+        for service_name, service_config in config.items():
+            if service_name in self.guardrails_dependency_name:
+                service_config["replica"] = 1
+                service_config["cards"] = 1
+                num_cards -= 1
+
+        for service_name, service_config in config.items():
+            if self.reranking_on_hpu and service_name in self.reranking_dependency_name:
+                service_config["replica"] = 1
+                service_config["cards"] = 1
+                num_cards -= 1
+
+        # Give the remaining cards to the LLM dependency, one card per replica.
+        for service_name, service_config in config.items():
+            if service_name in self.llm_dependency_name:
+                if service_config["type"] == "cpu":
+                    # TODO: tune CPU-only LLM replica counts.
+                    service_config["replica"] = 1
+                    continue
+                else:
+                    service_config["replica"] = num_cards
+                    service_config["cards"] = max(num_cards // service_config["replica"], 1)
+
+        for num_rag_replica in self.embedding_replicas_list:
+            for service_name, service_config in config.items():
+                if service_name in self.embedding_dependency_name:
+                    service_config["replica"] = num_rag_replica
+
+                if service_name in self.reranking_dependency_name and not self.reranking_on_hpu:
+                    service_config["replica"] = num_rag_replica
+
+            if num_rag_replica <= 0:
+                tuning_utils.print_strategy_config(config, "deprecated", platform=self.platform)
+                continue
+
+            output.append(copy.deepcopy(config))
+
+        return output
+
+    def k8s_strategy(self, config):
+        output_config = []
+        if self.heterogeneous:
+            if self.strategy_version == 1:
+                tmp_config = self._replicas_allocation_on_heterogeneous(config)
+                output_config.extend(tmp_config)
+
+        return output_config
+
+
+def generate_strategy_files(config, strategy_executor, output_folder):
+    os.makedirs(output_folder, exist_ok=True)
+    strategy_files_dict = {}
+    strategy_dict = {}
+
+    all_strategies = strategy_executor.apply_strategy()
+
+    if not all_strategies or len(all_strategies[0]) == 0:
+        logging.info(f"{len(strategy_files_dict.keys())} Strategy files have been created.\n")
+        return strategy_files_dict, strategy_dict
+
+    all_files_created = True
+    index = 0
+    for sub_config_list in all_strategies:
+        for strategy_config in sub_config_list:
+            output_file = os.path.join(output_folder, f"strategy_{index}.json")
+            success = tuning_utils.write_json(strategy_config, output_file)
+            if success:
+                strategy_files_dict[index] = output_file
+                strategy_dict[index] = strategy_config
+            else:
+                all_files_created = False
+            index += 1
+
+    if all_files_created:
+        logging.info(f"{len(strategy_files_dict.keys())} Strategy files have been created successfully.\n")
+
+    return strategy_files_dict, strategy_dict
+
+
+def update_and_apply_kubernetes_manifest(strategy_file, manifest_dir, timeout=200):
+    update_k8s_yaml(strategy_file, manifest_dir)
+    bash_script = "kubernetes/prepare_k8s_pods.sh"
+
+    # Ensure script is executable
+    subprocess.run(["chmod", "+x", bash_script], check=True)
+
+    # Delete previous deployment
+    subprocess.run(
+        ["bash", bash_script, "delete", strategy_file, manifest_dir], check=True, text=True, capture_output=False
+    )
+
+    time.sleep(100)
+    # Poll until the previous pods are fully gone.
+    while True:
+        result = subprocess.run(["kubectl", "get", "pods"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+        if "No resources found in default namespace." in result.stderr:
+            print("No resources found in default namespace.")
+            break
+        else:
+            print("Still have Pods in the default namespace. Deleting...")
+            time.sleep(20)
+
+    # Apply updated deployment
+    result = subprocess.run(
+        ["bash", bash_script, "apply", strategy_file, manifest_dir], check=True, text=True, capture_output=False
+    )
+
+    tuning_utils.print_strategy_config(strategy_file)
+    # Sleep to allow deployment to stabilize
+    time.sleep(timeout)
Deleting...") + time.sleep(20) + + # Apply updated deployment + result = subprocess.run( + ["bash", bash_script, "apply", strategy_file, manifest_dir], check=True, text=True, capture_output=False + ) + + tuning_utils.print_strategy_config(strategy_file) + # Sleep to allow deployment to stabilize + time.sleep(timeout) + + +def find_best_strategy(perf_data): + best_strategy = None + best_p50 = float("inf") + best_p99 = float("inf") + + for strategy, metrics in perf_data.items(): + if (metrics["p50"] < best_p50) or (metrics["p50"] == best_p50 and metrics["p99"] < best_p99): + best_strategy = strategy + best_p50 = metrics["p50"] + best_p99 = metrics["p99"] + + return best_p50, best_p99, best_strategy + + +def config_only_print(output_folder, strategy_files_dict, mode="k8s", remove_dir=False): + log_file = output_folder + "/all_results.txt" + for _, strategy_file in strategy_files_dict.items(): + tuning_utils.print_strategy_config(strategy_file, platform=mode) + tuning_utils.print_strategy_config(strategy_file, log_file=log_file) + + if remove_dir: + if os.path.exists(output_folder): + shutil.rmtree(output_folder) + + return + + +def main(): + parser = argparse.ArgumentParser(description="Read and parse JSON/YAML files and output JSON file") + parser.add_argument("--hardware_info", help="Path to input JSON file", default="./hardware_info_gaudi.json") + parser.add_argument( + "--service_info", help="Path to input YAML file", default="./chatqna_neuralchat_rerank_latest.yaml" + ) + parser.add_argument( + "--tuning_config", help="Path to input tuning config file", default="./replica_tuning_config.json" + ) + parser.add_argument("--output_file", help="Path to output JSON file", default="./strategy.json") + parser.add_argument("--config_only", help="Generate all strategies", action="store_true") + + parser.add_argument("--benchmark", help="Benchmark", action="store_true") + parser.add_argument("--task", type=str, default="rag", help="Task to perform") + parser.add_argument("--mode", help="Deployment mode", default="k8s") + parser.add_argument( + "--request_url", type=str, default="http://100.83.111.232:30888/v1/chatqna", help="ChatQnA Service URL" + ) + parser.add_argument("--num_queries", type=int, default=640, help="Number of queries to be sent") + + parser.add_argument("--strategy_file", help="Given the strategy file") + parser.add_argument("--manifest_dir", help="Manifest output directory.", default="./baseline") + + args = parser.parse_args() + + if args.mode not in ["k8s"]: + raise ValueError(f"Unsupported platform: {args.mode}") + + if args.benchmark: + request_url = tuning_utils.get_chatqna_url() + logging.info(f"request_url: {request_url}") + p50, p99 = send_concurrency_requests(task=args.task, request_url=request_url, num_queries=args.num_queries) + return + + # loading info + hardware_info = tuning_utils.load_hardware_info(args.hardware_info) + service_info = tuning_utils.load_service_info(args.service_info) + config = generate_base_config(service_info, hardware_info) + + # create output folder + local_time = time.localtime(time.time()) + output_folder = "result_" + time.strftime("%Y_%m_%d_%H_%M_%S", local_time) + result_file = os.path.join(output_folder, "all_results.txt") + os.makedirs(output_folder, exist_ok=True) + with open(result_file, "a") as file: + file.write( + f"Benchmark num_queries: {args.num_queries}, mode: {args.mode}, \n\nhardware_info: {hardware_info} \n\n" + ) + + strategy_executor = ReplicaTuning(copy.deepcopy(config), hardware_info, args.tuning_config) + + perf_data = 
+
+
+def config_only_print(output_folder, strategy_files_dict, mode="k8s", remove_dir=False):
+    log_file = output_folder + "/all_results.txt"
+    for _, strategy_file in strategy_files_dict.items():
+        tuning_utils.print_strategy_config(strategy_file, platform=mode)
+        tuning_utils.print_strategy_config(strategy_file, log_file=log_file)
+
+    if remove_dir and os.path.exists(output_folder):
+        shutil.rmtree(output_folder)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Read and parse JSON/YAML files and output JSON file")
+    parser.add_argument("--hardware_info", help="Path to input JSON file", default="./hardware_info_gaudi.json")
+    parser.add_argument(
+        "--service_info", help="Path to input YAML file", default="./chatqna_neuralchat_rerank_latest.yaml"
+    )
+    parser.add_argument(
+        "--tuning_config", help="Path to input tuning config file", default="./replica_tuning_config.json"
+    )
+    parser.add_argument("--output_file", help="Path to output JSON file", default="./strategy.json")
+    parser.add_argument("--config_only", help="Output deployment configs only", action="store_true")
+
+    parser.add_argument("--benchmark", help="Benchmark the deployed service only", action="store_true")
+    parser.add_argument("--task", type=str, default="rag", help="Task to perform")
+    parser.add_argument("--mode", help="Deployment mode", default="k8s")
+    parser.add_argument(
+        "--request_url", type=str, default="http://100.83.111.232:30888/v1/chatqna", help="ChatQnA Service URL"
+    )
+    parser.add_argument("--num_queries", type=int, default=640, help="Number of queries to be sent")
+
+    parser.add_argument("--strategy_file", help="Given the strategy file")
+    parser.add_argument("--manifest_dir", help="Manifest output directory.", default="./baseline")
+
+    args = parser.parse_args()
+
+    if args.mode not in ["k8s"]:
+        raise ValueError(f"Unsupported platform: {args.mode}")
+
+    if args.benchmark:
+        request_url = tuning_utils.get_chatqna_url()
+        logging.info(f"request_url: {request_url}")
+        p50, p99 = send_concurrency_requests(task=args.task, request_url=request_url, num_queries=args.num_queries)
+        return
+
+    # loading info
+    hardware_info = tuning_utils.load_hardware_info(args.hardware_info)
+    service_info = tuning_utils.load_service_info(args.service_info)
+    config = generate_base_config(service_info, hardware_info)
+
+    # create output folder
+    local_time = time.localtime(time.time())
+    output_folder = "result_" + time.strftime("%Y_%m_%d_%H_%M_%S", local_time)
+    result_file = os.path.join(output_folder, "all_results.txt")
+    os.makedirs(output_folder, exist_ok=True)
+    with open(result_file, "a") as file:
+        file.write(
+            f"Benchmark num_queries: {args.num_queries}, mode: {args.mode}, \n\nhardware_info: {hardware_info} \n\n"
+        )
+
+    strategy_executor = ReplicaTuning(copy.deepcopy(config), hardware_info, args.tuning_config)
+
+    perf_data = {}
+    strategy_files_dict, _ = generate_strategy_files(config, strategy_executor, output_folder)
+
+    if args.config_only:
+        config_only_print(output_folder, strategy_files_dict, mode=args.mode, remove_dir=True)
+        return
+
+    # collect the perf info
+    for _, strategy_file in strategy_files_dict.items():
+        # start services with different deployment modes
+        if args.mode == "k8s":
+            update_and_apply_kubernetes_manifest(strategy_file, args.manifest_dir, timeout=200)
+
+        logging.info(f"{strategy_file} benchmarking...")
+        num_queries = args.num_queries
+        request_url = tuning_utils.get_chatqna_url()
+
+        # Warmup step
+        logging.info("Performing warmup with 2 + 8 queries...")
+        try:
+            send_concurrency_requests(task="rag", request_url=request_url, num_queries=2)
+            send_concurrency_requests(task="rag", request_url=request_url, num_queries=8)
+        except Exception as e:
+            with open(result_file, "a") as file:
+                file.write(f"Recording the {strategy_file} tuning result: \n")
+                file.write(f"Warmup Error: {e}\n")
+            logging.info(f"Warmup Error: {e}")
+            continue
+        logging.info("Warmup completed.")
+
+        # Benchmark at half, full, and double the requested load.
+        try:
+            p50_0, p99_0 = send_concurrency_requests(
+                task="rag",
+                request_url=request_url,
+                num_queries=args.num_queries // 2,
+            )
+            p50, p99 = send_concurrency_requests(
+                task="rag",
+                request_url=request_url,
+                num_queries=args.num_queries,
+            )
+            p50_2, p99_2 = send_concurrency_requests(
+                task="rag",
+                request_url=request_url,
+                num_queries=args.num_queries * 2,
+            )
+        except Exception as e:
+            with open(result_file, "a") as file:
+                file.write(f"Recording the {strategy_file} tuning result: \n")
+                file.write(f"Exception Error: {e}\n")
+            continue
+
+        logging.info(f"num_queries: {num_queries}, request_url: {request_url}, {strategy_file} benchmarking result: ")
+        tuning_utils.print_strategy_config(strategy_file)
+        perf_data[strategy_file] = {"p50": p50, "p99": p99}
+
+        with open(result_file, "a") as file:
+            file.write(f"Recording the {strategy_file} tuning result: \n")
+            file.write(f"p50_0 = {p50_0}, p99_0 = {p99_0}\n")
+            file.write(json.dumps(perf_data[strategy_file]) + "\n")
+            file.write(f"p50_2 = {p50_2}, p99_2 = {p99_2}\n")
+        tuning_utils.print_strategy_config(strategy_file, log_file=result_file)
+
+        logging.info(f"Please check the {result_file} in the local directory.")
+
+    # find the best strategy
+    best_p50, best_p99, best_strategy = find_best_strategy(perf_data)
+    if best_strategy is not None:
+        best_strategy_data = {"best_strategy": best_strategy, "p50": best_p50, "p99": best_p99}
+        logging.info(f"Best strategy: {best_strategy_data}")
+        # Regenerate the manifests from the best strategy, not the last one benchmarked.
+        update_k8s_yaml(json_file=best_strategy, manifest_directory=args.manifest_dir)
+        with open(result_file, "a") as file:
+            file.write(f"The best strategy file: {best_strategy} \n")
+            file.write(json.dumps(best_strategy_data["best_strategy"]) + "\n")
+        print("Updated the best manifest Done.")
+    else:
+        logging.info("The best strategy is None.")
+
+    logging.info("Tuning Done.")
+
+
+if __name__ == "__main__":
+    main()
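+
+
+# Example (illustrative): benchmark an already-deployed ChatQnA service without
+# re-tuning; the URL is resolved from the chatqna-backend-server-svc cluster IP:
+#   python3 tuning.py --benchmark --task rag --num_queries 640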
diff --git a/evals/auto_tuning/tuning_utils.py b/evals/auto_tuning/tuning_utils.py
new file mode 100644
index 00000000..f0212041
--- /dev/null
+++ b/evals/auto_tuning/tuning_utils.py
@@ -0,0 +1,327 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import json
+import logging
+import subprocess
+import time
+
+import yaml
+from benchmark import send_concurrency_requests
+
+
+def load_hardware_info(file_path):
+    with open(file_path, "r") as f:
+        data = json.load(f)
+    return data
+
+
+def load_service_info(file_path):
+    with open(file_path, "r") as f:
+        data = yaml.safe_load(f)
+    return data
+
+
+def write_json(data, filename):
+    try:
+        with open(filename, "w") as f:
+            json.dump(data, f, indent=4)
+        return True
+    except Exception as e:
+        logging.error(f"Failed to write {filename}: {e}")
+        return False
+
+
+def print_strategy_config(config, tag=None, log_file=None, platform=None):
+    llm_microservice_name = {"llm-microservice"}
+    llm_dependency_name = {"llm-dependency"}
+
+    guardrails_microservice_name = {"guardrails-microservice"}
+    guardrails_dependency_name = {"guardrails-dependency"}
+
+    reranking_microservice_name = {"reranking-microservice"}
+    reranking_dependency_name = {"reranking-dependency"}
+
+    embedding_dependency_name = {"embedding-dependency"}
+    embedding_microservice_name = {"embedding-microservice"}
+
+    tei_microservice_name = {"embedding-microservice"}
+    tei_dependency_name = {"embedding-dependency"}
+
+    vector_db_name = {"vector-db"}
+    dataprep_microservice_name = {"dataprep-microservice"}
+
+    retrieval_microservice_name = {"retrieval-microservice"}
+    chatqna_mega_service_name = {"chatqna_mega_service"}
+
+    if isinstance(config, str):  # Check if config is a file path
+        with open(config, "r") as f:
+            config = json.load(f)
+
+    def get_service_config(service_name_set, config):
+        service_name = list(service_name_set)[0]
+        service_config = config.get(service_name, {})
+        cores = service_config.get("cores", "N/A")
+        replica = service_config.get("replica", "N/A")
+        memory = service_config.get("memory", "N/A")
+        return cores, replica, memory
+
+    def get_hpu_service_config(service_name_set, config):
+        service_name = list(service_name_set)[0]
+        service_config = config.get(service_name, {})
+        cores = service_config.get("cores", "N/A")
+        cards = service_config.get("cards", "N/A")
+        replica = service_config.get("replica", "N/A")
+        memory = service_config.get("memory", "N/A")
+        return cores, cards, replica, memory
+
+    llm_cores, llm_cards, num_llm_replica, llm_memory = get_hpu_service_config(llm_dependency_name, config)
+    llm_svc_cores, llm_svc_replica, llm_svc_memory = get_service_config(llm_microservice_name, config)
+
+    embedding_cores, embedding_replica, embedding_memory = get_service_config(embedding_dependency_name, config)
+    embedding_svc_cores, embedding_msvc_replica, embedding_svc_memory = get_service_config(
+        embedding_microservice_name, config
+    )
+
+    reranking_cores, reranking_cards, reranking_replica, reranking_memory = get_hpu_service_config(
+        reranking_dependency_name, config
+    )
+    reranking_svc_cores, reranking_svc_replica, reranking_svc_memory = get_service_config(
+        reranking_microservice_name, config
+    )
+
+    guardrails_cores, guardrails_cards, guardrails_replica, guardrails_memory = get_hpu_service_config(
+        guardrails_dependency_name, config
+    )
+    guardrails_svc_cores, guardrails_svc_replica, guardrails_svc_memory = get_service_config(
+        guardrails_microservice_name, config
+    )
+
+    vector_db_cores, vector_db_replica, vector_db_memory = get_service_config(vector_db_name, config)
+
+    dataprep_svc_cores, dataprep_svc_replica, dataprep_svc_memory = get_service_config(
+        dataprep_microservice_name, config
+    )
+    retrieval_cores, retrieval_replica, retrieval_memory = get_service_config(retrieval_microservice_name, config)
+    chatqna_cores, chatqna_replica, chatqna_memory = get_service_config(chatqna_mega_service_name, config)
+
+    services = {
+        "llm": {"cores": llm_cores, "cards": llm_cards, "replica": num_llm_replica, "memory": llm_memory},
+        "llm_svc": {"cores": llm_svc_cores, "replica": llm_svc_replica, "memory": llm_svc_memory},
+        "guardrails": {
+            "cores": guardrails_cores,
+            "cards": guardrails_cards,
+            "replica": guardrails_replica,
+            "memory": guardrails_memory,
+        },
+        "guardrails-svc": {
+            "cores": guardrails_svc_cores,
+            "replica": guardrails_svc_replica,
+            "memory": guardrails_svc_memory,
+        },
+        "embedding": {"cores": embedding_cores, "replica": embedding_replica, "memory": embedding_memory},
+        "embedding-svc": {
+            "cores": embedding_svc_cores,
+            "replica": embedding_msvc_replica,
+            "memory": embedding_svc_memory,
+        },
+        "reranking": {
+            "cores": reranking_cores,
+            "cards": reranking_cards,
+            "replica": reranking_replica,
+            "memory": reranking_memory,
+        },
+        "reranking-svc": {
+            "cores": reranking_svc_cores,
+            "replica": reranking_svc_replica,
+            "memory": reranking_svc_memory,
+        },
+        "vector_db": {"cores": vector_db_cores, "replica": vector_db_replica, "memory": vector_db_memory},
+        "dataprep_svc": {"cores": dataprep_svc_cores, "replica": dataprep_svc_replica, "memory": dataprep_svc_memory},
+        "retrieval": {"cores": retrieval_cores, "replica": retrieval_replica, "memory": retrieval_memory},
+        "chatqna": {"cores": chatqna_cores, "replica": chatqna_replica, "memory": chatqna_memory},
+    }
+
+    if log_file:
+        with open(log_file, "a") as f:
+            if tag == "deprecated":
+                f.write(
+                    f"Removed llm cores: {llm_cores:2}, llm cards: {llm_cards:2}, replica: {num_llm_replica}, "
+                    f"embedding cores: {embedding_cores:2}, replica: {embedding_replica:2}, "
+                    f"reranking cores: {reranking_cores:2}, replica: {reranking_replica:2}\n"
+                )
+            else:
+                count = 0
+                for service_name, service_info in services.items():
+                    if "cards" in service_info:
+                        f.write(
+                            f"{service_name} cores: {service_info['cores']:2}, cards: {service_info['cards']:2}, replica: {service_info['replica']}, memory: {service_info['memory']} "
+                        )
+                    else:
+                        f.write(
+                            f"{service_name} cores: {service_info['cores']:2}, replica: {service_info['replica']}, memory: {service_info['memory']} "
+                        )
+
+                    count += 1
+                    if count % 2 == 0:
+                        f.write("\n")
+                f.write("\n\n")
+    else:
+        if platform == "k8s":
+            total_cores = 0
+            for _, service_info in config.items():
+                if "cores" in service_info:
+                    total_cores += service_info["cores"] * service_info["replica"]
+            logging.debug(f"total allocated cores: {total_cores:2}")
+        if tag == "deprecated":
+            logging.debug(
+                f"Removed llm cores: {llm_cores:2}, llm cards: {llm_cards:2}, replica: {num_llm_replica}, "
+                f"embedding cores: {embedding_cores:2}, replica: {embedding_replica:2}, "
+                f"reranking cores: {reranking_cores:2}, replica: {reranking_replica:2}\n"
+            )
+        else:
+            count = 0
+            log_message = []
+
+            for service_name, service_info in services.items():
+                if "cards" in service_info:
+                    log_message.append(
+                        f"{service_name:>15} replica: {service_info['replica']:<4} cards: {service_info['cards']:<5} "
+                    )
+                else:
+                    log_message.append(f"{service_name:>15} replica: {service_info['replica']:<17} ")
+
+                count += 1
+                if count % 2 == 0:
+                    logging.info(" | ".join(log_message))
+                    log_message = []  # Reset log_message for the next batch of services
+
+            if log_message:
+                logging.info(" | ".join(log_message))
+
+    print("")
+
+
+def check_hpu_device(hardware_info):
+    hpu_exist = False
+    for device_key, device_info in hardware_info.items():
+        # A 'cpu' device must not declare 'num_cards'.
+        if device_info["type"] == "cpu" and "num_cards" in device_info:
+            raise ValueError(f"Error in {device_key}: 'type' is 'cpu' and 'num_cards' is present in the configuration.")
+
+        if device_info["type"] == "hpu":
+            hpu_exist = True
+            break
+
+    return hpu_exist
+
+
+def get_svc_info(strategy_json_file, service_name):
+    strategy = load_hardware_info(strategy_json_file)  # plain JSON loader, reused for strategy files
+    if isinstance(service_name, str):
+        service_info = strategy.get(service_name, {})
+    else:
+        service_info = strategy.get(list(service_name)[0], {})
+    result = {
+        "replica": service_info.get("replica", None),
+        "cards": service_info.get("cards", None),
+        "cores": service_info.get("cores", None),
+    }
+    return result
+
+
+def get_service_cluster_ip(service_name):
+    try:
+        # Run the kubectl command to get the services
+        result = subprocess.run(["kubectl", "get", "svc"], capture_output=True, text=True, check=True)
+
+        # Parse the output
+        lines = result.stdout.splitlines()
+        headers = lines[0].split()
+
+        # Find the indices for the columns we are interested in
+        name_idx = headers.index("NAME")
+        cluster_ip_idx = headers.index("CLUSTER-IP")
+        port_idx = headers.index("PORT(S)")
+
+        for line in lines[1:]:
+            columns = line.split()
+            if columns[name_idx] == service_name:
+                cluster_ip = columns[cluster_ip_idx]
+                ports = columns[port_idx]
+
+                # Keep the service port, e.g. "8888:30888/TCP" -> "8888".
+                main_part = ports.split("/")[0]
+                port = main_part.split(":")[0]
+                return cluster_ip, port
+
+        raise ValueError(f"Service {service_name} not found.")
+
+    except subprocess.CalledProcessError as e:
+        print(f"Error running kubectl command: {e}")
+        return None, None
+
+
+def test_embedding_svc_perf(num_queries_list):
+    # Try the TEI embedding service first, then fall back to the Mosec one.
+    service_names = ["embedding-svc", "embedding-mosec-svc"]
+    for service_name in service_names:
+        try:
+            results = test_service_performance(
+                service_name=service_name,
+                endpoint="/v1/embeddings",
+                task="embedding",
+                num_queries_list=num_queries_list,
+            )
+            print(f"Successfully tested service: {service_name}")
+            return results[0][0], results[0][1]
+        except Exception as e:
+            print(f"Failed to test service: {service_name} with error: {e}")
+
+    raise Exception("Both services failed to be tested.")
+
+
+def test_reranking_svc_perf(num_queries_list):
+
+    results = test_service_performance(
+        service_name="reranking-svc", endpoint="/v1/reranking", task="reranking", num_queries_list=num_queries_list
+    )
+
+    return results[0][0], results[0][1]
+
+
+def test_llm_svc_perf(num_queries_list):
+
+    results = test_service_performance(
+        service_name="llm-svc", endpoint="/v1/chat/completions", task="llm", num_queries_list=num_queries_list
+    )
+
+    return results[0][0], results[0][1]
+
+
+def get_chatqna_url():
+    svc_ip, port = get_service_cluster_ip("chatqna-backend-server-svc")
+    url = f"http://{svc_ip}:{port}/v1/chatqna"
+    return url
+
+
+def test_service_performance(service_name, endpoint, task, num_queries_list):
+    svc_ip, port = get_service_cluster_ip(service_name)
+    url = f"http://{svc_ip}:{port}{endpoint}"
+    print(f"url = {url}, task = {task}, svc_ip = {svc_ip}, port = {port}")
+
+    # Warmup step
+    print("Performing warmup with 8 queries...")
+    send_concurrency_requests(task=task, request_url=url, num_queries=8)
+    print("Warmup completed.")
+
+    results = []
+    for num_queries in num_queries_list:
+        p50, p99 = send_concurrency_requests(task=task, request_url=url, num_queries=num_queries)
+        results.append((p50, p99))
+        print(f"task = {task}, num_queries = {num_queries}, p50 = {p50}, p99 = {p99}")
+
+    print(f"task = {task} Finished! Bye!")
+    return results
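+
+
+# Example (illustrative): probe the deployed LLM microservice at increasing load;
+# assumes `kubectl get svc` lists a service named "llm-svc":
+#   p50, p99 = test_llm_svc_perf(num_queries_list=[64, 128, 256])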