docs: Validate the LogPipeline OTel Setup and test the performance of the log agent #1705

Open · wants to merge 7 commits into base: main
6 changes: 3 additions & 3 deletions docs/contributor/benchmarks/load-test-logs.md
@@ -34,9 +34,9 @@ The tests are executed for 20 minutes, so that each test case has a stabilized o
<div class="table-wrapper" markdown="block">

| config | logs received l/s | logs exported l/s | logs queued | cpu | memory MB | no. restarts of gateway | no. restarts of generator |
- | --- | --- | --- | --- | --- | --- | ---|
- | single | 7193 | 7195 | 16824 | 2.5 | 826 | 0 | 1 |
- | batch | 16428 | 16427 | 0 | 3 | 265 | 0 | 1 |
+ | ------ | ----------------- | ----------------- | ----------- | --- | --------- | ----------------------- | ------------------------- |
+ | single | 7193 | 7195 | 16824 | 2.5 | 826 | 0 | 1 |
+ | batch | 16428 | 16427 | 0 | 3 | 265 | 0 | 1 |
</div>

## Interpretation
306 changes: 306 additions & 0 deletions docs/contributor/benchmarks/otlp-logs-validation.md
@@ -0,0 +1,306 @@
# OTel LogPipeline Setup Validation

This file documents the process of validating the whole LogPipeline flow with OTLP output. It starts by defining the setup, which consists of the manually deployed log agent, the already implemented log gateway, and log generators based on flog.

The goal is to performance-test the agent and observe the resulting values in terms of throughput, resource consumption, reaction to backpressure, and so on.



## 1. Setup Configuration Steps

### With Helm

``` bash
# "k" is an alias for kubectl
k apply -f telemetry-manager/config/samples/operator_v1alpha1_telemetry.yaml

# Execute knowledge-hub/scripts/create_cls_log_pipeline.sh with the corresponding environment variables

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

helm install -n kyma-system logging open-telemetry/opentelemetry-collector -f telemetry-manager/docs/contributor/pocs/assets/otel-log-agent-values.yaml
```
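
The referenced values file is not part of this PR. For orientation, here is a minimal sketch of what it presumably configures, assuming the upstream `opentelemetry-collector` chart presets and an assumed gateway endpoint (`telemetry-otlp-logs.kyma-system:4317`); the actual file in `telemetry-manager/docs/contributor/pocs/assets` is authoritative:

``` yaml
# Hypothetical excerpt of otel-log-agent-values.yaml (illustration only)
mode: daemonset

presets:
  logsCollection:
    enabled: true            # adds a filelog receiver for /var/log/pods
    includeCollectorLogs: false
    storeCheckpoints: true   # activates the file_storage extension (see checkpointing below)

config:
  exporters:
    otlp:
      endpoint: telemetry-otlp-logs.kyma-system:4317  # assumed log gateway endpoint
      tls:
        insecure: true
  service:
    pipelines:
      logs:
        exporters: [otlp]
```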

### Manual

``` bash
k apply -f telemetry-manager/config/samples/operator_v1alpha1_telemetry.yaml

# Execute knowledge-hub/scripts/create_cls_log_pipeline.sh with the corresponding environment variables

k apply -f ./otlp-logs-validation.yaml
```



## 2. Resulting Resources

### Agent ConfigMap (OTel Config)

See [OTLP Logs Validation YAML](./otlp-logs-validation.yaml)

#### Things to take into consideration (at implementation time)
- Dynamic inclusion/exclusion of namespaces, based on the LogPipeline spec attributes
- Exclude the FluentBit container in the OTel configuration and the OTel container in the FluentBit configuration
- `receivers/filelog/operators`: The copy of the body to `attributes.original` must be avoided if the `dropLogRawBody` flag is enabled (see the sketch below)
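
A minimal sketch of how these points could look in the filelog receiver configuration; the globs, the namespace handling, and the `dropLogRawBody` behavior shown here are illustrative assumptions, not the final implementation:

``` yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log                          # namespaces included per LogPipeline spec
    exclude:
      - /var/log/pods/kyma-system_*/*fluent-bit*/*.log   # skip the FluentBit container
    storage: file_storage
    operators:
      # Add this copy only when dropLogRawBody is NOT enabled:
      - type: copy
        from: body
        to: attributes.original
```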

### Agent DaemonSet

See [OTLP Logs Validation YAML](./otlp-logs-validation.yaml)

### How checkpointing works

- Enabling the `storeCheckpoint` preset (Helm) activates the `file_storage` extension
- The `file_storage` extension uses the path `/var/lib/otelcol`
- This path is mounted as a `hostPath` volume in the DaemonSet spec
- It is also set in the `storage` property of the filelog receiver (see the sketch below)

> `storage` = The ID of a storage extension to be used to store file offsets. File offsets allow the receiver to pick up where it left off in the case of a collector restart. If no storage extension is used, the receiver will manage offsets in memory only.
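
Putting the pieces together, a minimal sketch of the wiring (volume name assumed, paths as listed above):

``` yaml
# Collector configuration excerpt
extensions:
  file_storage:
    directory: /var/lib/otelcol
receivers:
  filelog:
    storage: file_storage    # file offsets survive collector restarts
service:
  extensions: [file_storage]
---
# DaemonSet spec excerpt
spec:
  template:
    spec:
      containers:
        - name: collector
          volumeMounts:
            - name: varlibotelcol
              mountPath: /var/lib/otelcol
      volumes:
        - name: varlibotelcol
          hostPath:
            path: /var/lib/otelcol
            type: DirectoryOrCreate
```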



## 3. Benchmarking and Performance Test Results

Setup Configuration:
``` bash
k create ns prometheus
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install -n "prometheus" "prometheus" prometheus-community/kube-prometheus-stack -f hack/load-tests/values.yaml --set grafana.adminPassword=myPwd

k apply -f telemetry-manager/hack/load-tests/log-agent-test-setup.yaml
```

To execute the load tests, the generated logs must be isolated. Therefore, replace the following line in the ConfigMap of the log agent:

``` yaml
receivers:
filelog:
# ...
include:
- /var/log/pods/*/*/*.log # replace with "/var/log/pods/log-load-test*/*flog*/*.log"
```
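
Some of the sessions below run the agent with batch processing (1024 or 2048). A minimal sketch of that variant, assuming the standard OTel Collector batch processor:

``` yaml
processors:
  batch:
    send_batch_size: 1024      # 2048 in the second variant
    send_batch_max_size: 1024
    timeout: 1s                # assumed flush interval
service:
  pipelines:
    logs:
      processors: [batch]      # between the filelog receiver and the OTLP exporter
```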

For the 🏋️‍♀️ Backpressure Scenario, additionally apply:
``` bash
k apply -f telemetry-manager/hack/load-tests/log-backpressure-config.yaml
```

PromQL Queries:
``` promql
# RECEIVED
round(sum(rate(otelcol_receiver_accepted_log_records{service="telemetry-log-agent-metrics"}[20m])))

# EXPORTED
round(sum(rate(otelcol_exporter_sent_log_records{service="telemetry-log-agent-metrics"}[20m])))

# QUEUE
avg(sum(otelcol_exporter_queue_size{service="telemetry-log-agent-metrics"}))

# MEMORY
round(sum(avg_over_time(container_memory_working_set_bytes{namespace="kyma-system", container="collector"}[20m]) * on(namespace,pod) group_left(workload) avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{namespace="kyma-system", workload="telemetry-log-agent"}[20m])) by (pod) / 1024 / 1024)

# CPU
round(sum(avg_over_time(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="kyma-system"}[20m]) * on(namespace,pod) group_left(workload) avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{namespace="kyma-system", workload="telemetry-log-agent"}[20m])) by (pod), 0.1)
```
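
In the session notes below, "no queue" means that the agent's OTLP exporter runs with its sending queue disabled, so export failures surface immediately instead of piling up locally. A minimal sketch, assuming the standard OTLP exporter options and the gateway endpoint assumed earlier:

``` yaml
exporters:
  otlp:
    endpoint: telemetry-otlp-logs.kyma-system:4317  # assumed gateway endpoint
    tls:
      insecure: true
    sending_queue:
      enabled: false      # "no queue" in the sessions below
    retry_on_failure:
      enabled: true
```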

### ⭐️ Best Results (Scenario: Single Pipeline)
| Batching | RECEIVED l/s | EXPORTED l/s | QUEUE | MEMORY MB | CPU |
| :------: | :----------: | :----------: | :---: | :-------: | :---: |
| ❌ | max. 8.9K | max. 8.9K | 0 | ~63 | ~0.5 |
| ✅ | 8.6K | 8.6K | 0 | ~73 | ~0.6 |

### ⭐️🏋️‍♀️ Best Results (Scenario: Single Pipeline with Backpressure)
| Batching | RECEIVED l/s | EXPORTED l/s | QUEUE | MEMORY MB | CPU |
| :------: | :----------: | :----------: | :---: | :-------: | :---: |
| ❌ | 6.8K | 6.8K | ~328 | ~66 | ~0.5 |
| ✅ | - | - | - | - | - |

### 📊 Benchmarking Sessions

| Icon | Meaning |
| ---- | ---------------------------------------------------- |
| ⏳ | Full test involving the whole setup, usually 20 min |
| 🪲 | Debugging session, usually shorter and less reliable |
| 🏋️‍♀️ | Backpressure Scenario |
| ⭐️ | Best results observed (in a given scenario) |

#### ⏳ 18 Dec 2024, 13:45 - 14:05 (20 min)
- **Generator:** 10 replicas x 10 MB
- **Agent:** no CPU limit, no queue
- **Results:**
- Agent RECEIVED/EXPORTED: 6.06K
- Agent Memory:
- Pod1: 70
- Pod2: 70
- Agent CPU:
- Pod1: 0.5
- Pod2: 0.4
- Gateway RECEIVED/EXPORTED: 6.09K
- Gateway QUEUE: 0

#### ⏳ 18 Dec 2024, 14:08 - 14:28 (20 min)
- **Generator:** 20 replicas x 10 MB
- **Agent:** no CPU limit, no queue
- **Results:**
- Agent RECEIVED/EXPORTED: 4.93K
- Agent Memory:
- Pod1: 71
- Pod2: 72
- Agent CPU:
- Pod1: 0.5
- Pod2: 0.4
- Gateway RECEIVED/EXPORTED: 4.93K
- Gateway QUEUE: 0 (max. 6 at some point)

#### ⏳ 18 Dec 2024, 14:50 - 15:10 (20 min)
- **Generator:** 10 replicas x 20 MB
- **Agent:** no CPU limit, no queue
- **Results:**
- Agent RECEIVED/EXPORTED: 5.94K
- Agent Memory:
- Pod1: 76
- Pod2: 81
- Agent CPU:
- Pod1: 0.5
- Pod2: 0.5
- Gateway RECEIVED/EXPORTED: 5.94K
- Gateway QUEUE: 0

#### ⏳⭐️ 18 Dec 2024, 15:24 - 15:34 (10 min)
- **Generator:** 10 replicas x 10 MB
- **Agent:** with CPU limit (1), no queue
- **Results:**
- Agent RECEIVED/EXPORTED: 8.9K
- Agent Memory: 64/62
- Agent CPU: 0.5/0.5
- Gateway RECEIVED/EXPORTED: 8.9K
- Gateway QUEUE: 0

#### 🏋️‍♀️⭐️ 18 Dec 2024, 15:36 - 15:56 (20 min) (backpressure scenario)
- **Generator:** 10 replicas x 10 MB
- **Agent:** with CPU limit (1), no queue
- **Results:**
- Agent RECEIVED/EXPORTED: 6.8K
- Agent Memory:
- Pod1: 66
- Pod2: 67
- Agent CPU:
- Pod1: 0.6
- Pod2: 0.5
- Gateway RECEIVED: 6.8K
- Gateway EXPORTED: 256
- Gateway QUEUE: 328
- **Remarks:**
- Agent does not stop when gateway refuses logs (because backpressure does not backpropagate)
  - It slows down or stops in other scenarios (see below) => SUCCESS

#### 🪲 19 Dec 2024, Agent exports logs to a debug endpoint (5 min)
- no networking involved
- 12/14 log generators x 10 MB
- 19.5K => ~20K
- MEM: 43/47
- CPU: 0.7/0.8

#### 🪲 19 Dec 2024, Agent exports logs directly to mock backend (5 min)
- networking, but avoiding gateway
- 10 log generators x 10 MB
- 5.3K
- MEM: 58/59
- CPU: 0.4/0.5
- 12 log generators x 10 MB
- not increasing

#### 🪲 19 Dec 2024, Agent exports logs directly to mock backend with batching processor (5 min)
- networking, but with batching mechanism in-place
- 10 log generators x 10 MB, batch size: 1024
- 8.3K
- MEM: 68/73
- CPU: 0.5/0.6
- 12 log generators x 10 MB, batch size: 1024
- starts decreasing (~7.5K)
- 10 log generators x 10 MB, batch size: 2048
- ~9K
- MEM: 74/79
- CPU: 0.6/0.7

#### ⏳ 19 Dec 2024, 13:46 - 14:06 (20 min)
- **Generator:** 10 replicas x 10 MB
- **Agent:** with CPU limit (1), no queue, with batch processing (1024)
- **Results:**
- Agent RECEIVED/EXPORTED: 8.46K
- Gateway RECEIVED/EXPORTED: 8.46K
- Agent Memory: 69/76
- Agent CPU: 0.5/0.7
- Gateway QUEUE: 0 (max 191)

#### ⏳ 19 Dec 2024, ??:?? - ??:?? (20 min)
- **Generator:** 10 replicas x 10 MB
- **Agent:** with CPU limit (1), no queue, with batch processing (2048)
- **Results:**
  - lower throughput than in the 1024 scenario

#### ⏳⭐️ 19 Dec 2024, 15:55 - 16:15 (20 min)
- **Agent:** with CPU limit (1), no queue, with batch processing (1024)
- **Mock Backend:** memory limit x2 (2048Mi)
- **Generator:** 10 replicas x 10 MB
- **Results:**
- Agent RECEIVED/EXPORTED: 8.18K
- Gateway RECEIVED/EXPORTED: 8.18K
- Agent Memory: 70/71
- Agent CPU: 0.6/0.6
- Gateway QUEUE: 0
- **Generator:** 12 replicas x 10 MB (16:18 - 16:35)
- **Results:**
  - Agent RECEIVED/EXPORTED: 8.6K
  - Gateway RECEIVED/EXPORTED: 8.6K
- Agent Memory: 73/74
- Agent CPU: 0.7/0.6
- Gateway QUEUE: 0
- **Generator:** 14 replicas x 10 MB (16:35 - 16:40)
- **Results:**
- Agent RECEIVED/EXPORTED: 7.54K
- Gateway RECEIVED/EXPORTED: 7.54K
  - throughput lower than with 12 replicas

#### ⏳ 19 Dec 2024, 16:50 - 17:10 (20 min)
- **Generator:** 12 replicas x 10 MB
- **Agent:** with CPU limit (1), no queue, with batch processing (2048)
- **Mock Backend:** memory limit x2 (2048Mi)
- **Results:**
- Agent RECEIVED/EXPORTED: 8.1K
- Gateway RECEIVED/EXPORTED: 8.11K
- Agent Memory: 74/81
- Agent CPU: 0.6/0.5
- Gateway QUEUE: 0 (max 2)

#### 🪲 20 Dec 2024, Multiple agents loading the gateway (5 min)
- **Setup:** 10 nodes, 10 agents, 1 generator / node (DaemonSet)
- **Results (WITH BATCHING):**
- Agent RECEIVED/EXPORTED: 61.5K => 6.1K / agent instance
- Gateway RECEIVED/EXPORTED: 61.5K/29.5K => 30K/14.7K / gateway instance
- Agent Memory: 61-68/agent
- Agent CPU: 0.4-0.8/agent
- Gateway QUEUE: 510 (max 512, full)
- ~10% exporter failed enqueue logs
- 0% receiver refused logs
- 0% exporter send failed logs
- **Results (WITHOUT BATCHING):**
- Agent RECEIVED/EXPORTED: 31.4K => 3.1K / agent instance
- Gateway RECEIVED/EXPORTED: 31.4K => 11.4K / gateway instance
- Agent Memory: 61-68/agent
- Agent CPU: 0.4-0.5/agent
- Gateway QUEUE: 0 (max 6)
- 0% exporter failed enqueue logs
- 0% receiver refused logs
- 0% exporter send failed logs


## 4. Conclusions

- Lower performance can be expected compared to the FluentBit-based counterpart setup.
- Backpressure is currently not propagated back from the gateway to the agent, so logs are queued or lost on the gateway side: the agent has no way of knowing when to stop and keeps exporting continuously. (This is a known issue that the OTel community is expected to solve within the next half year.)
- The agent slows down if the load is increased (that is, more generators, more logs, or more data).
- The network communication between the agent and the gateway, and/or the gateway itself, is a bottleneck in this setup: higher throughput was observed when exporting to a plain debug endpoint.
- CPU and memory consumption are surprisingly low, and removing the limits did not improve this (quite the opposite was observed, with the CPU throttling more often and the throughput decreasing).
- Enabling the batch processor increased throughput, but at the cost of losing logs in some scenarios.
- More/other methods of improving the throughput might still be worth investigating.