Move HPA instructions to its own document
Signed-off-by: Eero Tamminen <[email protected]>
eero-t committed Sep 4, 2024
1 parent 3b451ba commit d096521
Showing 2 changed files with 112 additions and 108 deletions.
111 changes: 111 additions & 0 deletions helm-charts/HPA.md
@@ -0,0 +1,111 @@
# HorizontalPodAutoscaler (HPA) support

## Table of Contents

- [Introduction](#introduction)
- [Pre-conditions](#pre-conditions)
- [Gotchas](#gotchas)
- [Verify](#verify)

## Introduction

The `horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).

### Pre-conditions

HPA-controlled pods SHOULD have appropriate resource requests or affinity rules (enabled in their
subcharts and tested to work) so that the k8s scheduler does not schedule too many of them on the same
node(s). Otherwise they may never reach the ready state.

Requests that are too large are not a problem as long as the pods still fit on the available nodes, but
requests that are too small are an issue:

- Multiple inferencing instances interfere / slow down each other, especially if there are no
[NRI policies](https://github.com/opea-project/GenAIEval/tree/main/doc/platform-optimization)
that provide further isolation
- Containers can become non-functional when their actual resource usage crosses the specified limits

If the cluster does not run [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus)
yet, it SHOULD be installed before enabling HPA, e.g. by using:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

Enabling HPA in the top-level Helm chart (e.g. `chatqna`) overwrites the cluster's current _PrometheusAdapter_
configuration with the relevant custom metric queries. If the current configuration has queries that should
be retained, the relevant queries need to be added to the existing _PrometheusAdapter_ configuration
_manually_ from the custom metrics Helm template (in the top-level Helm chart).

Names of the _Prometheus-operator_ related objects depend on where it is installed from.
Default ones are:

- "kube-prometheus" upstream manifests:
  - Namespace: `monitoring`
  - Metrics service: `prometheus-k8s`
  - Adapter configMap: `adapter-config`
- Helm chart for "kube-prometheus" (linked above):
  - Namespace: `monitoring`
  - Metrics service: `prom-kube-prometheus-stack-prometheus`
  - Adapter configMap: `prom-adapter-prometheus-adapter`

Make sure the correct "configMap" name is used in the top-level (e.g. `chatqna`) Helm chart `values.yaml`,
and in the commands below!
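
As a minimal sketch, the default names listed above can be kept in a small shell helper so that later commands do not hard-code the wrong one. The function names and the `manifests` / `helm` labels are hypothetical, chosen only for this example; the object names themselves are the defaults from the list above and may differ in a customized installation.

```shell
# Print the default PrometheusAdapter configMap name for an install type.
# "manifests" = kube-prometheus upstream manifests, "helm" = Helm chart.
# (Both labels are hypothetical names used only in this sketch.)
adapter_configmap() {
  case "$1" in
    manifests) echo "adapter-config" ;;
    helm)      echo "prom-adapter-prometheus-adapter" ;;
    *)         echo "unknown install type: $1" >&2; return 1 ;;
  esac
}

# Print the default Prometheus metrics service name for an install type.
metrics_service() {
  case "$1" in
    manifests) echo "prometheus-k8s" ;;
    helm)      echo "prom-kube-prometheus-stack-prometheus" ;;
    *)         echo "unknown install type: $1" >&2; return 1 ;;
  esac
}
```

For example, `cm=$(adapter_configmap helm)` followed by `kubectl -n monitoring get "cm/$cm" -o yaml` would show the adapter configuration of a Helm-based installation that uses the default names.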

### Gotchas

Why HPA is opt-in:

- Enabling the (top-level) chart `horizontalPodAutoscaler` option will _overwrite_ the cluster's current
  `PrometheusAdapter` configuration with its own custom metrics configuration.
  Take a copy of the existing `configMap` before install, if that matters:
  ```console
  kubectl -n monitoring get cm/prom-adapter-prometheus-adapter -o yaml > adapter-config.yaml
  ```
- `PrometheusAdapter` needs to be restarted after install, for it to read the new configuration:
  ```console
  ns=monitoring;
  kubectl -n $ns delete $(kubectl -n $ns get pod --selector app.kubernetes.io/name=prometheus-adapter -o name)
  ```
- By default Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
  for accessing metrics from the `default`, `kube-system` and `monitoring` namespaces. If Helm is
  asked to install OPEA services to some other namespace, those rules need to be updated accordingly
- Unless pod resource requests, affinity rules and/or cluster NRI policies are used to better isolate
  service inferencing pods from each other, scaled-up instances may never get to ready state
- Current HPA rules are examples for Xeon. For efficient scaling, they need to be fine-tuned for the
  performance of the given setup (underlying HW, used models and data types, OPEA version etc.)

### Verify

To verify that the `horizontalPodAutoscaler` options work, check both that the inferencing services'
metrics are available, and that the HPA rules using the custom metrics generated from them work.

Use k8s object names matching your Prometheus installation:

```console
prom_svc=prom-kube-prometheus-stack-prometheus # Metrics service
prom_ns=monitoring; # Prometheus namespace
```

Verify that Prometheus found the OPEA services' metric endpoints, i.e. that the last number on the `curl` output line is non-zero:

```console
chart=chatqna; # OPEA services prefix
prom_url=http://$(kubectl -n $prom_ns get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/$prom_svc);
curl --no-progress-meter "$prom_url/metrics" | grep "scrape_pool_targets.*$chart"
```
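
The non-zero check can also be scripted instead of read by eye. The following is only a sketch: `has_scrape_targets` is a hypothetical helper name, and it assumes that the matching metric lines end with the target count as their last whitespace-separated field, as the `curl` check above implies.

```shell
# Read Prometheus /metrics output on stdin and succeed only if at least one
# "scrape_pool_targets" line matching the given chart prefix has a non-zero
# value in its last field.
has_scrape_targets() {
  awk -v chart="$1" '
    $0 ~ ("scrape_pool_targets.*" chart) && $NF + 0 > 0 { found = 1 }
    END { exit !found }
  '
}
```

Usage, with the variables set as above: `curl --no-progress-meter "$prom_url/metrics" | has_scrape_targets "$chart" && echo "targets found"`.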

**NOTE**: TGI and TEI inferencing services provide metrics endpoint only after they've processed their first request!

Verify that PrometheusAdapter lists TEI and/or TGI custom metrics (`te_*` / `tgi_*`):

```console
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
```

HPA rules list valid (not `<unknown>`) TARGET values for service deployments:

```console
ns=default; # OPEA namespace
kubectl -n $ns get hpa
```
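
For scripted checks, the `<unknown>` test can be automated. This is a sketch only: `hpa_targets_valid` is a hypothetical helper name, and it assumes the default `kubectl get hpa` table output with one header row.

```shell
# Read `kubectl get hpa` table output on stdin. Fail if there are no HPA
# rows after the header, or if any row still contains "<unknown>".
hpa_targets_valid() {
  awk '
    NR > 1 { rows++ }
    NR > 1 && /<unknown>/ { bad = 1 }
    END { exit (rows == 0 || bad) }
  '
}
```

Usage: `kubectl -n $ns get hpa | hpa_targets_valid || echo "HPA metrics not ready yet"`. Since the inferencing services expose metrics only after their first request, a failure right after install does not necessarily indicate a misconfiguration.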
109 changes: 1 addition & 108 deletions helm-charts/README.md
@@ -9,10 +9,6 @@ This directory contains helm charts for [GenAIComps](https://github.com/opea-pro
- [Components](#components)
- [How to deploy with helm charts](#deploy-with-helm-charts)
- [Helm Charts Options](#helm-charts-options)
- [HorizontalPodAutoscaler (HPA) support](#horizontalpodautoscaler-hpa-support)
- [Pre-conditions](#pre-conditions)
- [Gotchas](#gotchas)
- [Verify HPA metrics](#verify-hpa-metrics)
- [Using Persistent Volume](#using-persistent-volume)
- [Using Private Docker Hub](#using-private-docker-hub)
- [Helm Charts repository](#helm-chart-repository)
@@ -66,112 +62,9 @@ There are global options(which should be shared across all components of a workl
| global | http_proxy https_proxy no_proxy | Proxy settings. If you are running the workloads behind the proxy, you'll have to add your proxy settings here. |
| global | modelUsePVC | The PersistentVolumeClaim you want to use as huggingface hub cache. Default "" means not using PVC. Only one of modelUsePVC/modelUseHostPath can be set. |
| global | modelUseHostPath | If you don't have Persistent Volume in your k8s cluster and want to use local directory as huggingface hub cache, set modelUseHostPath to your local directory name. Note that this can't share across nodes. Default "". Only one of modelUsePVC/modelUseHostPath can be set. |
| chatqna | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See [Pre-conditions](#pre-conditions) and [Gotchas](#gotchas) before enabling! |
| chatqna | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See [Pre-conditions](HPA.md#pre-conditions) and [Gotchas](HPA.md#gotchas) before enabling! |
| tgi | LLM_MODEL_ID | The model id you want to use for tgi server. Default "Intel/neural-chat-7b-v3-3". |

## Using Persistent Volume

It's common to use a Persistent Volume (PV) for model caches (huggingface hub cache) in a production k8s cluster. Passing a PersistentVolumeClaim (PVC) to the containers is supported, but it's the user's responsibility to create the PVC, depending on the k8s cluster's capability.