# 10. Log Flow Health Status API

Date: 2024-03-28

## Status

Proposed

## Context: Key Events in the Fluent Bit Log Flow

[ADR 003: Integrate Prometheus With Telemetry Manager Using Alerting](003-integrate-prometheus-with-telemetry-manager-using-alerting.md) describes a concept for self-monitoring, and [ADR 008: Telemetry Flow Health Status API](008-telemetry-flow-healthiness-status-api.md) defines the related pipeline conditions derived from the OpenTelemetry Collector metrics. This ADR focuses on events in the Fluent Bit log flow.

![Fluent Bit Data Flow](../assets/fluent-bit-data-flow.drawio.svg "Fluent Bit Data Flow")

### Log Rotation

* Container logs are rotated and eventually removed by the kubelet. Logs that Fluent Bit has not read before rotation are lost.
* When logs are lost this way, there is little to no indication of it (for example, in metrics).

### High Buffer Usage

* After reading logs from the host file-system, the tail input plugin writes them to a persistent buffer.
* The buffer has a limited capacity and can fill up if logs are read faster than they can be sent to the backend.
* If the buffer is full and the tail input plugin keeps reading, the oldest logs are dropped.

### Backend Throttling

* Each logging backend has an ingestion rate limit.
* The backend's maximum ingestion rate propagates back to Fluent Bit's output plugins: either the backend responds slowly and blocks all output threads, or it returns errors that force the output plugin to retry.
* Rising utilization of the file-system buffer is therefore an indicator of backend throttling.

## Decision

For the pipeline health condition type, the **reason** field shows the value that is most relevant for the user. We suggest the following values, ordered from most to least critical:

```
AllTelemetryDataDropped > SomeTelemetryDataDropped > NoLogsDelivered > BufferFillingUp > Healthy
```
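
As an illustration of this ordering, here is a minimal Go sketch of selecting the most critical reason. The `reasonPriority` slice and the `firing` set are assumptions for illustration, not the actual telemetry-manager implementation:

```go
package health

// Candidate reasons, ordered from most to least critical
// (illustrative; mirrors the ordering above).
var reasonPriority = []string{
	"AllTelemetryDataDropped",
	"SomeTelemetryDataDropped",
	"NoLogsDelivered",
	"BufferFillingUp",
	"Healthy",
}

// mostCriticalReason returns the first reason in priority order that is
// currently firing, falling back to "Healthy" if none of them are.
func mostCriticalReason(firing map[string]bool) string {
	for _, reason := range reasonPriority {
		if firing[reason] {
			return reason
		}
	}
	return "Healthy"
}
```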

The reasons are based on the following alert rules:

| Alert Rule | Expression |
| --- | --- |
| AgentExporterSendsLogs | `sum(rate(fluentbit_output_bytes_total{...}[5m])) > 0` |
| AgentReceiverReadsLogs | `sum(rate(fluentbit_input_bytes_total{...}[5m])) > 0` |
| AgentExporterDroppedLogs | `sum(rate(fluentbit_output_retries_failed_total{...}[5m])) > 0` |
| AgentBufferInUse | `telemetry_fsbuffer_usage_bytes{...} > 300000000` |
| AgentBufferFull | `telemetry_fsbuffer_usage_bytes{...} > 900000000` |

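As a rough sketch of how the state of these alerts could be fetched, the following uses the Prometheus Go client's HTTP API to list currently firing alerts. The `address` parameter and the assumption that the self-monitoring Prometheus is queried this way are illustrative:

```go
package health

import (
	"context"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// firingAlerts queries a Prometheus instance and returns the names of all
// alerts that are currently in the firing state.
func firingAlerts(ctx context.Context, address string) (map[string]bool, error) {
	client, err := api.NewClient(api.Config{Address: address})
	if err != nil {
		return nil, err
	}

	result, err := promv1.NewAPI(client).Alerts(ctx)
	if err != nil {
		return nil, err
	}

	firing := make(map[string]bool)
	for _, alert := range result.Alerts {
		if alert.State == promv1.AlertStateFiring {
			firing[string(alert.Labels["alertname"])] = true
		}
	}
	return firing, nil
}
```
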
Then, we map the alert rules to the reasons as follows:

| Reason | Alert Rules |
| --- | --- |
| AllTelemetryDataDropped | **not** AgentExporterSendsLogs **and** AgentBufferFull |
| SomeTelemetryDataDropped | AgentExporterSendsLogs **and** AgentBufferFull |
| NoLogsDelivered | **not** AgentExporterSendsLogs **and** AgentReceiverReadsLogs |
| BufferFillingUp | AgentBufferInUse |
| Healthy | **not** (AgentBufferInUse **or** AgentBufferFull) **and** (**not** AgentReceiverReadsLogs **or** AgentExporterSendsLogs) |

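A minimal Go sketch of this mapping, assuming the set of firing alerts has already been fetched (for example, with a helper like `firingAlerts` above); the case order also reflects the criticality ordering:

```go
package health

// logPipelineReason derives the condition reason from the firing alerts,
// following the mapping table above. Earlier cases take precedence, so the
// most critical matching reason wins.
func logPipelineReason(firing map[string]bool) string {
	exporterSends := firing["AgentExporterSendsLogs"]
	receiverReads := firing["AgentReceiverReadsLogs"]
	bufferInUse := firing["AgentBufferInUse"]
	bufferFull := firing["AgentBufferFull"]

	switch {
	case !exporterSends && bufferFull:
		return "AllTelemetryDataDropped"
	case exporterSends && bufferFull:
		return "SomeTelemetryDataDropped"
	case !exporterSends && receiverReads:
		return "NoLogsDelivered"
	case bufferInUse:
		return "BufferFillingUp"
	default:
		// Reached only when no buffer alert fires and either the exporter
		// sends logs or the receiver reads none: the Healthy row above.
		return "Healthy"
	}
}
```
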
> **NOTE:** `BufferFillingUp` should not result in a negative condition status. This reason would be aggregated as a warning in the Telemetry module status.

The metrics related to the file-system buffer cannot be mapped to a particular LogPipeline. Thus, if the metrics indicate file-system buffer usage, telemetry-manager has to set the condition on all pipelines.
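
A rough controller-runtime sketch of that fan-out, under the assumption that LogPipeline status holds standard `metav1.Condition`s; the import path, field names, and the `TelemetryFlowHealthy` condition type are assumptions based on the telemetry-manager APIs, not verified against a particular version:

```go
package health

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the LogPipeline API types.
	telemetryv1alpha1 "github.com/kyma-project/telemetry-manager/apis/telemetry/v1alpha1"
)

// setConditionOnAllPipelines applies the same condition to every LogPipeline,
// because the file-system buffer metrics cannot be attributed to a single one.
func setConditionOnAllPipelines(ctx context.Context, c client.Client, condition metav1.Condition) error {
	var pipelines telemetryv1alpha1.LogPipelineList
	if err := c.List(ctx, &pipelines); err != nil {
		return err
	}

	for i := range pipelines.Items {
		pipeline := &pipelines.Items[i]
		meta.SetStatusCondition(&pipeline.Status.Conditions, condition)
		if err := c.Status().Update(ctx, pipeline); err != nil {
			return err
		}
	}
	return nil
}
```

For example, a `BufferFillingUp` condition would be set with `Status: metav1.ConditionTrue`, in line with the note above that this reason must not flip the condition to negative.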