
Commit

Update the style and syntax used in Agent Monitoring topics (grafana#…
…5929)

* Minor updates and tweaks

* Update docs/sources/flow/monitoring/debugging.md

Co-authored-by: Jack Baldry <[email protected]>

---------

Co-authored-by: Jack Baldry <[email protected]>
2 people authored and BarunKGP committed Feb 20, 2024
1 parent fa0fe84 commit db458f5
Showing 3 changed files with 56 additions and 96 deletions.
39 changes: 15 additions & 24 deletions docs/sources/flow/monitoring/component_metrics.md
@@ -13,30 +13,21 @@ weight: 200

# Component metrics

{{< param "PRODUCT_NAME" >}} [components][] may optionally expose Prometheus metrics
which can be used to investigate the behavior of that component. These
component-specific metrics are only generated when an instance of that
component is running.

> Component-specific metrics are different than any metrics being processed by
> the component. Component-specific metrics are used to expose the state of a
> component for observability, alerting, and debugging.

Component-specific metrics are exposed at the `/metrics` HTTP endpoint of the
{{< param "PRODUCT_NAME" >}} HTTP server, which defaults to listening on
`http://localhost:12345`.

> The documentation for the [`grafana-agent run`][grafana-agent run] command describes how to
> modify the address {{< param "PRODUCT_NAME" >}} listens on for HTTP traffic.

Component-specific metrics will have a `component_id` label matching the
component ID generating those metrics. For example, component-specific metrics
for a `prometheus.remote_write` component labeled `production` will have a
`component_id` label with the value `prometheus.remote_write.production`.

The [reference documentation][] for each component will describe the list of
component-specific metrics that component exposes. Not all components will
expose metrics.
{{< param "PRODUCT_NAME" >}} [components][] may optionally expose Prometheus metrics which can be used to investigate the behavior of that component.
These component-specific metrics are only generated when an instance of that component is running.

> Component-specific metrics are different than any metrics being processed by the component.
> Component-specific metrics are used to expose the state of a component for observability, alerting, and debugging.

Component-specific metrics are exposed at the `/metrics` HTTP endpoint of the {{< param "PRODUCT_NAME" >}} HTTP server, which defaults to listening on `http://localhost:12345`.

> The documentation for the [`grafana-agent run`][grafana-agent run] command describes how to modify the address {{< param "PRODUCT_NAME" >}} listens on for HTTP traffic.

Component-specific metrics have a `component_id` label matching the component ID generating those metrics.
For example, component-specific metrics for a `prometheus.remote_write` component labeled `production` will have a `component_id` label with the value `prometheus.remote_write.production`.

The [reference documentation][] for each component describes the list of component-specific metrics that the component exposes.
Not all components expose metrics.
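
As a minimal sketch of how the `component_id` label can be used, the following Python snippet filters Prometheus text-format output down to the samples that belong to one component. The metric name `example_sent_total` and the sample values are hypothetical and only illustrate the label matching; the helper name `samples_for_component` is likewise an assumption, not part of any Grafana tooling.

```python
import re

# Matches one label pair inside the braces of a Prometheus text-format sample.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def samples_for_component(exposition_text, component_id):
    """Return the sample lines whose component_id label equals component_id."""
    matches = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        start, end = line.find("{"), line.rfind("}")
        if start == -1 or end == -1:
            continue  # Sample without labels, so it has no component_id.
        labels = dict(LABEL_RE.findall(line[start + 1:end]))
        if labels.get("component_id") == component_id:
            matches.append(line)
    return matches


# Hypothetical /metrics output, for illustration only.
SAMPLE = """\
# HELP example_sent_total Hypothetical metric, for illustration only.
# TYPE example_sent_total counter
example_sent_total{component_id="prometheus.remote_write.production"} 42
example_sent_total{component_id="prometheus.remote_write.staging"} 7
example_build_info 1
"""

result = samples_for_component(SAMPLE, "prometheus.remote_write.production")
```

To try this against a running instance, the text passed to `samples_for_component` could instead be read from the `/metrics` endpoint, for example with `urllib.request.urlopen("http://localhost:12345/metrics").read().decode()`, assuming the default listen address is in use.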

{{% docs/reference %}}
[components]: "/docs/agent/ -> /docs/agent/<AGENT_VERSION>/flow/concepts/components.md"
31 changes: 10 additions & 21 deletions docs/sources/flow/monitoring/controller_metrics.md
@@ -13,32 +13,21 @@ weight: 100

# Controller metrics

The {{< param "PRODUCT_NAME" >}} [component controller][] exposes Prometheus metrics
which can be used to investigate the controller state.
The {{< param "PRODUCT_NAME" >}} [component controller][] exposes Prometheus metrics which you can use to investigate the controller state.

Metrics for the controller are exposed at the `/metrics` HTTP endpoint of the
{{< param "PRODUCT_NAME" >}} HTTP server, which defaults to listening on
`http://localhost:12345`.
Metrics for the controller are exposed at the `/metrics` HTTP endpoint of the {{< param "PRODUCT_NAME" >}} HTTP server, which defaults to listening on `http://localhost:12345`.

> The documentation for the [`grafana-agent run`][grafana-agent run] command
> describes how to modify the address {{< param "PRODUCT_NAME" >}} listens on for HTTP
> traffic.
> The documentation for the [`grafana-agent run`][grafana-agent run] command describes how to modify the address {{< param "PRODUCT_NAME" >}} listens on for HTTP traffic.

The controller exposes the following metrics:

* `agent_component_controller_evaluating` (Gauge): Set to `1` whenever the
component controller is currently evaluating components. Note that this value
may be misrepresented depending on how fast evaluations complete or how often
evaluations occur.
* `agent_component_controller_running_components` (Gauge): The current
number of running components by health. The health is represented in the
`health_type` label.
* `agent_component_evaluation_seconds` (Histogram): The time it takes to
evaluate components after one of their dependencies is updated.
* `agent_component_dependencies_wait_seconds` (Histogram): Time spent by
components waiting to be evaluated after one of their dependencies is updated.
* `agent_component_evaluation_queue_size` (Gauge): The current number of
component evaluations waiting to be performed.
* `agent_component_controller_evaluating` (Gauge): Set to `1` whenever the component controller is currently evaluating components.
This value may be misrepresented depending on how fast evaluations complete or how often evaluations occur.
* `agent_component_controller_running_components` (Gauge): The current number of running components by health.
The health is represented in the `health_type` label.
* `agent_component_evaluation_seconds` (Histogram): The time it takes to evaluate components after one of their dependencies is updated.
* `agent_component_dependencies_wait_seconds` (Histogram): Time spent by components waiting to be evaluated after one of their dependencies is updated.
* `agent_component_evaluation_queue_size` (Gauge): The current number of component evaluations waiting to be performed.
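
The `health_type` label on `agent_component_controller_running_components` can be summarized with a short script. The following Python sketch parses Prometheus text-format output and returns the number of running components per health value. The sample values and the specific `health_type` values shown (`healthy`, `unhealthy`) are assumptions for illustration, and the helper name is hypothetical.

```python
import re

# Matches samples of the running-components gauge and captures labels and value.
GAUGE_RE = re.compile(
    r'^agent_component_controller_running_components\{([^}]*)\}\s+(\S+)'
)
HEALTH_RE = re.compile(r'health_type="([^"]*)"')


def running_components_by_health(exposition_text):
    """Map each health_type label value to its reported component count."""
    counts = {}
    for line in exposition_text.splitlines():
        sample = GAUGE_RE.match(line.strip())
        if not sample:
            continue
        health = HEALTH_RE.search(sample.group(1))
        if health:
            counts[health.group(1)] = float(sample.group(2))
    return counts


# Hypothetical /metrics output, for illustration only.
SAMPLE = """\
agent_component_controller_evaluating 0
agent_component_controller_running_components{health_type="healthy"} 5
agent_component_controller_running_components{health_type="unhealthy"} 1
agent_component_evaluation_queue_size 0
"""

counts = running_components_by_health(SAMPLE)
```

A non-zero count for an unhealthy state is a reasonable cue to open the UI and inspect the affected components.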

{{% docs/reference %}}
[component controller]: "/docs/agent/ -> /docs/agent/<AGENT_VERSION>/flow/concepts/component_controller.md"
82 changes: 31 additions & 51 deletions docs/sources/flow/monitoring/debugging.md
@@ -15,46 +15,37 @@ weight: 300
Follow these steps to debug issues with {{< param "PRODUCT_NAME" >}}:

1. Use the {{< param "PRODUCT_NAME" >}} UI to debug issues.
2. If the {{< param "PRODUCT_NAME" >}} UI doesn't help with debugging an issue, logs can be examined
instead.
1. If the {{< param "PRODUCT_NAME" >}} UI doesn't help with debugging an issue, logs can be examined instead.

## {{< param "PRODUCT_NAME" >}} UI

{{< param "PRODUCT_NAME" >}} includes an embedded UI viewable from the {{< param "PRODUCT_ROOT_NAME" >}} HTTP
server, which defaults to listening at `http://localhost:12345`.
{{< param "PRODUCT_NAME" >}} includes an embedded UI viewable from the {{< param "PRODUCT_ROOT_NAME" >}} HTTP server, which defaults to listening at `http://localhost:12345`.

> **NOTE**: For security reasons, installations of {{< param "PRODUCT_NAME" >}} on
> non-containerized platforms default to listening on `localhost`. This default
> prevents other machines on the network from being able to view the UI.
> **NOTE**: For security reasons, installations of {{< param "PRODUCT_NAME" >}} on non-containerized platforms default to listening on `localhost`.
> This default prevents other machines on the network from being able to view the UI.
>
> To expose the UI to other machines on the network on non-containerized
> platforms, refer to the documentation for how you [installed][install]
> {{< param "PRODUCT_NAME" >}}.
> To expose the UI to other machines on the network on non-containerized platforms, refer to the documentation for how you [installed][install] {{< param "PRODUCT_NAME" >}}.
>
> If you are running a custom installation of {{< param "PRODUCT_NAME" >}}, refer to the
> documentation for [the `grafana-agent run` command][grafana-agent run] to
> learn how to change the HTTP listen address, and pass the appropriate flag
> when running {{< param "PRODUCT_NAME" >}}.
> If you are running a custom installation of {{< param "PRODUCT_NAME" >}},
> refer to the documentation for [the `grafana-agent run` command][grafana-agent run] to learn how to change the HTTP listen address,
> and pass the appropriate flag when running {{< param "PRODUCT_NAME" >}}.
### Home page

![](../../../assets/ui_home_page.png)

The home page shows a table of components defined in the configuration file along with
their health.
The home page shows a table of components defined in the configuration file and their health.

Click **View** on a row in the table to navigate to the [Component detail page](#component-detail-page)
for that component.
Click **View** on a row in the table to navigate to the [Component detail page](#component-detail-page) for that component.

Click the {{< param "PRODUCT_ROOT_NAME" >}} logo to navigate back to the home page.

### Graph page

![](../../../assets/ui_graph_page.png)

The **Graph** page shows a graph view of components defined in the configuration file
along with their health. Clicking a component in the graph navigates to the
[Component detail page](#component-detail-page) for that component.
The **Graph** page shows a graph view of components defined in the configuration file and their health.
Clicking a component in the graph navigates to the [Component detail page](#component-detail-page) for that component.

### Component detail page

@@ -67,8 +58,7 @@ The component detail page shows the following information for each component:
* The current exports for the component.
* The current debug info for the component (if the component has debug info).

> Values marked as a [secret][] are obfuscated and will display as the text
> `(secret)`.
> Values marked as a [secret][] are obfuscated and display as the text `(secret)`.
### Clustering page

@@ -86,45 +76,35 @@ The clustering page shows the following information for each cluster node:
To debug using the UI:

* Ensure that no component is reported as unhealthy.
* Ensure that the arguments and exports for misbehaving components appear
correct.
* Ensure that the arguments and exports for misbehaving components appear correct.

## Examining logs

Logs may also help debug issues with {{< param "PRODUCT_NAME" >}}.

To reduce logging noise, many components hide debugging info behind debug-level
log lines. It is recommended that you configure the [`logging` block][logging]
to show debug-level log lines when debugging issues with {{< param "PRODUCT_NAME" >}}.
To reduce logging noise, many components hide debugging info behind debug-level log lines.
It is recommended that you configure the [`logging` block][logging] to show debug-level log lines when debugging issues with {{< param "PRODUCT_NAME" >}}.

The location of {{< param "PRODUCT_NAME" >}} logs is different based on how it is deployed.
Refer to the [`logging` block][logging] page to see how to find logs for your
system.
The location of {{< param "PRODUCT_NAME" >}} logs is different based on how it's deployed.
Refer to the [`logging` block][logging] page to see how to find logs for your system.

## Debugging clustering issues

To debug issues when using [clustering][], check for the following symptoms.

- **Cluster not converging**: The cluster peers are not converging on the same
view of their peers' status. This is most likely due to network connectivity
issues between the cluster nodes. Use the Flow UI of each running peer to
understand which nodes are not being picked up correctly.
- **Cluster split brain**: The cluster peers are not aware of one another,
thinking they’re the only node present. Again, check for network connectivity
issues. Check that the addresses or DNS names given to the node to join are
correctly formatted and reachable.
- **Configuration drift**: Clustering assumes that all nodes are running with
the same configuration file at roughly the same time. Check the logs for
issues with the reloaded configuration file as well as the graph page to verify
changes have been applied.
- **Node name conflicts**: Clustering assumes all nodes have unique names;
nodes with conflicting names are rejected and will not join the cluster. Look
at the clustering UI page for the list of current peers with their names, and
check the logs for any reported name conflict events.
- **Node stuck in terminating state**: The node attempted to gracefully shut
down and set its state to Terminating, but it has not completely gone away. Check
the clustering page to view the state of the peers and verify that the
terminating {{< param "PRODUCT_ROOT_NAME" >}} has been shut down.
- **Cluster not converging**: The cluster peers aren't converging on the same view of their peers' status.
This is most likely due to network connectivity issues between the cluster nodes.
Use the {{< param "PRODUCT_NAME" >}} UI of each running peer to understand which nodes aren't being picked up correctly.
- **Cluster split brain**: The cluster peers aren't aware of one another, thinking they’re the only node present.
Again, check for network connectivity issues.
Check that the addresses or DNS names given to the node to join are correctly formatted and reachable.
- **Configuration drift**: Clustering assumes that all nodes are running with the same configuration file at roughly the same time.
Check the logs for issues with the reloaded configuration file as well as the graph page to verify changes have been applied.
- **Node name conflicts**: Clustering assumes all nodes have unique names.
Nodes with conflicting names are rejected and won't join the cluster.
Look at the clustering UI page for the list of current peers with their names, and check the logs for any reported name conflict events.
- **Node stuck in terminating state**: The node attempted to gracefully shut down and set its state to Terminating, but it has not completely gone away.
Check the clustering page to view the state of the peers and verify that the terminating {{< param "PRODUCT_ROOT_NAME" >}} has been shut down.
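
For the split brain case, the formatting and reachability of a join address can be checked from the affected node. The following Python sketch validates a `host:port` string and attempts a TCP connection to it. The function names are hypothetical, the sketch does not handle bracketed IPv6 literals, and a successful TCP connection only shows that something is listening at that address, not that clustering is healthy.

```python
import socket


def split_host_port(address):
    """Split 'host:port' into (host, port), raising ValueError if malformed."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to address succeeds within timeout."""
    host, port = split_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

For example, `is_reachable("agent-1.example.internal:12345")` would report whether a peer at that hypothetical address accepts connections from the machine running the script.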

{{% docs/reference %}}
[logging]: "/docs/agent/ -> /docs/agent/<AGENT_VERSION>/flow/reference/config-blocks/logging.md"