Merge pull request kubernetes#462 from epam/performance_tests
Kueue performance tests
k8s-ci-robot authored Dec 19, 2022
2 parents bdbf260 + 6c0532e commit ed57453
Showing 9 changed files with 443 additions and 0 deletions.
19 changes: 19 additions & 0 deletions test/performance/.env.example
@@ -0,0 +1,19 @@
# Clusterloader home directory (check out https://github.com/kubernetes/perf-tests)
export CL2_HOME_DIR="/Users/johny/perf-tests/clusterloader2"

# Run the performance test with Kueue (this requires Kueue to be pre-deployed to the cluster)
# or without Kueue
export USE_KUEUE=false

# Test iterations:
# number-of-small-jobs number-of-medium-jobs number-of-large-jobs job-replica-running-time test-timeout cluster-queue-CPU-quota cluster-queue-memory-quota
export EXPERIMENTS=(
"10 2 0 2s 3m 100 100Gi"
"20 2 0 2s 5m 100 100Gi"
)

# Kubeconfig file location
export KUBECONFIG="$HOME/.kube/config"

# Kubernetes provider (tested on gke only)
export PROVIDER="gke"
4 changes: 4 additions & 0 deletions test/performance/.gitignore
@@ -0,0 +1,4 @@
*report*/
prerequisites/cluster-queue.yaml
tmp_manifests/
.env
67 changes: 67 additions & 0 deletions test/performance/README.md
@@ -0,0 +1,67 @@
# Kueue Performance Testing

## Measurements

### Job startup latency

How fast do jobs transition from the `created` state to the `started` state?
This is the time elapsed between `job.CreationTimestamp.Time` and `job.Status.StartTime.Time`.

High job startup latency is expected with Kueue when the total quota is not enough to admit all jobs immediately, because the remaining jobs have to wait in the queue.

### Job startup throughput

The peak workload admission rate (workloads per second), computed over 1-minute windows.
The rate is evaluated every 5 seconds (see the subquery section of the [PromQL examples](https://prometheus.io/docs/prometheus/latest/querying/examples/#subquery)):

`max_over_time(sum(rate(kueue_admitted_workloads_total{cluster_queue="{{$clusterQueue}}"}[1m]))[{{$testTimeout}}:5s])`

This measurement is not accurate if the cluster queue quota is large enough to admit all workloads of the test immediately: Kueue admits them all at once and `kueue_admitted_workloads_total` never increases afterwards. In this case, the PromQL query returns 0.

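
As a rough illustration of what the subquery computes, the sketch below reads counter samples (one `unix_seconds counter_value` pair per line), evaluates the per-second increase over a trailing window at each sample, and prints the maximum. The function name `max_admission_rate` and the input format are assumptions made for this example. It ignores the extrapolation and counter-reset handling that the Prometheus `rate()` function performs, so it is not a replacement for the query above.

```shell
# max_admission_rate [WINDOW]: hypothetical helper, not part of the test scripts.
# Reads "unix_seconds counter_value" lines (sorted by time) on stdin and prints
# the highest per-second increase over any trailing window of WINDOW seconds.
max_admission_rate() {
  awk -v window="${1:-60}" '
    { t[NR] = $1; v[NR] = $2 }
    END {
      max = 0
      lo = 1
      for (i = 2; i <= NR; i++) {
        # Move the window start forward until it spans at most `window` seconds.
        while (t[i] - t[lo] > window) lo++
        if (i > lo && t[i] > t[lo]) {
          r = (v[i] - v[lo]) / (t[i] - t[lo])
          if (r > max) max = r
        }
      }
      printf "%.2f\n", max
    }'
}
```

A counter that has already reached its final value in the first sample yields `0.00`, which mirrors the inaccurate case described above.
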
## How to run the test?

### Prerequisites

1. Deploy [Kueue](https://github.com/kubernetes-sigs/kueue/blob/main/docs/setup/install.md)
2. Make sure you have `kubectl`, [jq](https://stedolan.github.io/jq/download/), the [Go implementation](https://github.com/mikefarah/yq) of `yq`, and `go` installed
3. Check out the `Clusterloader2` framework (https://github.com/kubernetes/perf-tests) and build the `clusterloader` binary:

* change to `clusterloader2` directory
* run `go build -o clusterloader './cmd/'`
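
A quick way to confirm that the tools from step 2 are on `PATH` before starting is sketched below. `missing_tools` is a hypothetical helper written for this README and is not part of the test scripts.

```shell
# missing_tools NAME...: prints "missing: NAME" for every command that is not
# found on PATH and returns non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example for the tools listed above:
# missing_tools kubectl jq yq go
```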

### Run the test

1. Copy the example environment file to `.env`:

* `cp .env.example .env`

2. Edit the environment variables

| Variable | Description |
| ----------- | ----------- |
| CL2_HOME_DIR | Clusterloader home directory (check out https://github.com/kubernetes/perf-tests) |
| USE_KUEUE | Run the performance test with Kueue (this requires Kueue to be pre-deployed to the cluster) or without Kueue |
| EXPERIMENTS | Configuration of test iterations (see the configuration example in the file) |
| KUBECONFIG | Kubeconfig file location |
| PROVIDER | Kubernetes provider (tested on `gke` only) |

3. Run the `run-test.sh` file
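
Each `EXPERIMENTS` entry is a space-separated tuple. The sketch below splits one entry into named fields. The helper `describe_experiment` is hypothetical, and the field order (small, medium, and large job counts, then running time, timeout, CPU quota, and memory quota) is an assumption inferred from the sample entries and the `CL2_*` parameters in `config.yaml`; `run-test.sh` remains the authoritative source for how entries are parsed.

```shell
# describe_experiment "ENTRY": prints the fields of one EXPERIMENTS entry by
# name. Field order is assumed, see the note above.
describe_experiment() {
  # Intentional word splitting: the entry is a space-separated tuple.
  # shellcheck disable=SC2086
  set -- $1
  if [ "$#" -ne 7 ]; then
    echo "expected 7 fields, got $#" >&2
    return 1
  fi
  echo "small_jobs=$1 medium_jobs=$2 large_jobs=$3 running_time=$4 timeout=$5 cpu_quota=$6 memory_quota=$7"
}

# Example with the first entry from .env.example:
# describe_experiment "10 2 0 2s 3m 100 100Gi"
```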

### Test results

Every test execution creates a `report_<timestamp>` directory inside `TEST_CONFIG_DIR`. It contains a `summary.csv` file with the following metrics:

* P50 Job Create to start latency (ms)
* P90 Job Create to start latency (ms)
* P50 Job Start to complete latency (ms)
* P90 Job Start to complete latency (ms)
* Max Job Throughput (max jobs/s)
* Total Jobs
* Total Pods
* Duration (s)

The following metrics are also included in the results, for reference only. Kueue doesn't influence them directly.

* Avg Pod Waiting time (s)
* P90 Pod Waiting time (s)
* Avg Pod Completion time (s)
* P90 Pod Completion time (s)
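
For readers unfamiliar with the P50/P90 notation, the sketch below computes a nearest-rank percentile over a list of latencies. It is only an illustration: the helper name `percentile` is made up for this example, it expects an integer percentile, and the measurement code that produces `summary.csv` may use a different percentile method.

```shell
# percentile P: reads one number per line on stdin and prints the
# nearest-rank P-th percentile (P is an integer from 1 to 100).
percentile() {
  sort -n | awk -v p="$1" '
    { a[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print a[rank]
    }'
}

# Example: P90 of ten latencies in milliseconds
# printf '%s\n' 100 200 300 400 500 600 700 800 900 1000 | percentile 90
```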
181 changes: 181 additions & 0 deletions test/performance/config.yaml
@@ -0,0 +1,181 @@
{{$MODE := DefaultParam .MODE "Indexed"}}
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}

{{$smallJobs := DefaultParam .CL2_SMALL_JOBS 10}}
{{$mediumJobs := DefaultParam .CL2_MEDIUM_JOBS 2}}
{{$largeJobs := DefaultParam .CL2_LARGE_JOBS 0}}

{{$namespaces := DefaultParam .CL2_NAMESPACES 1}}

{{$smallJobsPerNamespace := DivideInt $smallJobs $namespaces}}
{{$mediumJobsPerNamespace := DivideInt $mediumJobs $namespaces}}
{{$largeJobsPerNamespace := DivideInt $largeJobs $namespaces}}

{{$smallJobSize := 5}}
{{$mediumJobSize := 20}}
{{$largeJobSize := 100}}

{{$jobRunningTime := DefaultParam .CL2_JOB_RUNNING_TIME "30s"}}

{{$clusterQueue := "default-cluster-queue"}}
{{$localQueue := "local-queue"}}

{{$testTimeout := DefaultParam .CL2_TEST_TIMEOUT "5m"}}

{{$namespacePrefix := "queue-test"}}

{{$useKueue := DefaultParam .CL2_USE_KUEUE false}}

name: batch

namespace:
  number: {{$namespaces}}
  prefix: {{$namespacePrefix}}

tuningSets:
- name: UniformQPS
  qpsLoad:
    qps: {{$LOAD_TEST_THROUGHPUT}}

steps:
- name: Start measurements
  measurements:
  - Identifier: Timer
    Method: Timer
    Params:
      action: start
      label: job_performance
  - Identifier: WaitForFinishedJobs
    Method: WaitForFinishedJobs
    Params:
      action: start
      labelSelector: group = test-job
  - Identifier: JobLifecycleLatency
    Method: JobLifecycleLatency
    Params:
      action: start
      labelSelector: group = test-job
  - Identifier: GenericPrometheusQuery
    Method: GenericPrometheusQuery
    Params:
      action: start
      metricName: Job (Kueue) API performance
      metricVersion: v1
      unit: s
      queries:
      - name: total_jobs_scheduled
        query: count(kube_job_info{namespace=~"{{$namespacePrefix}}.*"})
      - name: total_pods_scheduled
        query: count(kube_pod_info{namespace=~"{{$namespacePrefix}}.*"})
      - name: avg_pod_running_time
        query: (avg(kube_pod_completion_time{namespace=~"{{$namespacePrefix}}.*"} - kube_pod_start_time{namespace=~"{{$namespacePrefix}}.*"}))
      - name: perc_90_pod_completion_time
        query: quantile(0.90, kube_pod_completion_time{namespace=~"{{$namespacePrefix}}.*"} - kube_pod_start_time{namespace=~"{{$namespacePrefix}}.*"})
      - name: avg_pod_waiting_time
        query: (avg(kube_pod_start_time{namespace=~"{{$namespacePrefix}}.*"} - kube_pod_created{namespace=~"{{$namespacePrefix}}.*"}))
      - name: perc_90_pod_waiting_time
        query: quantile(0.90, kube_pod_start_time{namespace=~"{{$namespacePrefix}}.*"} - kube_pod_created{namespace=~"{{$namespacePrefix}}.*"})
      - name: max_job_throughput
        query: max_over_time(sum(rate(kueue_admitted_workloads_total{cluster_queue="{{$clusterQueue}}"}[1m]))[{{$testTimeout}}:5s])
- name: Sleep
  measurements:
  - Identifier: sleep
    Method: Sleep
    Params:
      duration: 10s
{{if $useKueue}}
- name: Create local queue
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 1
    tuningSet: UniformQPS
    objectBundle:
    - basename: {{$localQueue}}
      objectTemplatePath: "local-queue.yaml"
      templateFillMap:
        ClusterQueue: {{$clusterQueue}}
{{end}}
- name: Create {{$MODE}} jobs
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$smallJobsPerNamespace}}
    tuningSet: UniformQPS
    objectBundle:
    - basename: small
      objectTemplatePath: "job.yaml"
      templateFillMap:
        UseKueue: {{$useKueue}}
        Replicas: {{$smallJobSize}}
        Mode: {{$MODE}}
        Sleep: {{$jobRunningTime}}
        LocalQueue: "{{$localQueue}}-0"
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$mediumJobsPerNamespace}}
    tuningSet: UniformQPS
    objectBundle:
    - basename: medium
      objectTemplatePath: "job.yaml"
      templateFillMap:
        UseKueue: {{$useKueue}}
        Replicas: {{$mediumJobSize}}
        Mode: {{$MODE}}
        Sleep: {{$jobRunningTime}}
        LocalQueue: "{{$localQueue}}-0"
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$largeJobsPerNamespace}}
    tuningSet: UniformQPS
    objectBundle:
    - basename: large
      objectTemplatePath: "job.yaml"
      templateFillMap:
        UseKueue: {{$useKueue}}
        Replicas: {{$largeJobSize}}
        Mode: {{$MODE}}
        Sleep: {{$jobRunningTime}}
        LocalQueue: "{{$localQueue}}-0"
- name: Wait for {{$MODE}} jobs to finish
  measurements:
  - Identifier: JobLifecycleLatency
    Method: JobLifecycleLatency
    Params:
      action: gather
      timeout: {{$testTimeout}}
  - Identifier: WaitForFinishedJobs
    Method: WaitForFinishedJobs
    Params:
      action: gather
      timeout: {{$testTimeout}}
- name: Stop Timer
  measurements:
  - Identifier: Timer
    Method: Timer
    Params:
      action: stop
      label: job_performance
- name: Gather Timer
  measurements:
  - Identifier: Timer
    Method: Timer
    Params:
      action: gather
- name: Sleep
  measurements:
  - Identifier: sleep
    Method: Sleep
    Params:
      duration: 30s
- name: Gather Prometheus measurements
  measurements:
  - Identifier: GenericPrometheusQuery
    Method: GenericPrometheusQuery
    Params:
      action: gather
      enableViolations: true
30 changes: 30 additions & 0 deletions test/performance/job.yaml
@@ -0,0 +1,30 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: {{.Name}}
  labels:
    group: test-job
{{if .UseKueue}}
  annotations:
    kueue.x-k8s.io/queue-name: {{.LocalQueue}}
{{end}}
spec:
  suspend: {{.UseKueue}}
  parallelism: {{.Replicas}}
  completions: {{.Replicas}}
  completionMode: {{.Mode}}
  template:
    metadata:
      labels:
        group: test-pod
    spec:
      containers:
      - name: {{.Name}}
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args:
          - {{.Sleep}}
        resources:
          requests:
            cpu: "200m"
            memory: "100Mi"
      restartPolicy: Never
6 changes: 6 additions & 0 deletions test/performance/local-queue.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
apiVersion: kueue.x-k8s.io/v1alpha2
kind: LocalQueue
metadata:
  name: {{.Name}}
spec:
  clusterQueue: {{.ClusterQueue}}
17 changes: 17 additions & 0 deletions test/performance/prerequisites/cluster-queue.template
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: default-cluster-queue
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: default
      quota:
        min: 100
  - name: "memory"
    flavors:
    - name: default
      quota:
        min: 50Gi
4 changes: 4 additions & 0 deletions test/performance/prerequisites/resource-flavor.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: default