Linux Agent gets unhealthy on adding Linux integration. #6155

amolnater-qasource · 2024-11-27T10:16:17Z

Kibana Build details:

VERSION: 8.17.0 BC1
BUILD: 80364
COMMIT: e3c75d19d796c366aedc5788960b2c6cc868014f

Artifact Link: https://staging.elastic.co/8.17.0-8031025a/downloads/beats/elastic-agent/elastic-agent-8.17.0-linux-x86_64.tar.gz

Host OS:
SLES15

Preconditions:

8.17.0-BC1 Kibana cloud environment should be available.

Steps to reproduce:

Install the Linux agent.
Add the Linux integration to this agent.
Observe that the agent becomes unhealthy with errors in the Linux integration:

Degraded
Error fetching data for metricset system.raid: failed to parse sysfs: no matches from path /sys/block

Degraded
Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory

Expected Result:
Linux Agent should remain healthy on adding Linux integration.

Screenshot:

Agent Logs:
elastic-agent-diagnostics-2024-11-27T08-39-33Z-00.zip


elasticmachine · 2024-11-27T10:16:20Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine · 2024-11-27T21:30:38Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

mauri870 · 2024-11-28T10:57:59Z

@amolnater-qasource Can you confirm you have the conntrack module loaded on this system?

lsmod | grep conntrack

Also, was this system upgraded without a restart? This can cause failures sometimes.
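
For anyone who wants to script this check, here is a minimal Go sketch. It assumes the host exposes loaded modules in /proc/modules, which is the file lsmod formats. The helper name moduleListed is hypothetical and not part of Beats. Unlike grep, it matches the module name exactly in the first column, and it will not see modules compiled directly into the kernel.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// moduleListed reports whether name is the first field of any line in text
// formatted like /proc/modules. Hypothetical helper, for illustration only.
func moduleListed(modules, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(modules))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[0] == name {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/modules")
	if err != nil {
		fmt.Println("cannot read /proc/modules:", err)
		return
	}
	fmt.Println("nf_conntrack loaded:", moduleListed(string(data), "nf_conntrack"))
}
```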

mauri870 · 2024-11-28T13:53:32Z

I tried deploying a SUSE 15 instance in Azure to debug this. I'm running version 8.17.0 in the staging environment and deployed the Beta Linux Metrics integration, but couldn't find any errors from the agent:

$ uname -a
Linux mauri-suse 5.14.21-150500.33.66-azure #1 SMP PREEMPT_DYNAMIC Wed Sep 4 05:47:04 UTC 2024 (4885a53) x86_64 x86_64 x86_64 GNU/Linux
$ sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running
azureuser@mauri-suse:~>

mauri870 · 2024-11-28T13:54:47Z

Aha, after enabling Collect system metrics from Linux instances > Linux host raid metrics the agent went to a degraded state:

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '3031'
      └─ system/metrics-default-system/metrics-system-7afde2f3-7310-4e56-8554-4847fa1c1567
         └─ status: (DEGRADED) Error fetching data for metricset system.raid: failed to parse sysfs: no matches from path /sys/block

mauri870 · 2024-11-28T13:57:39Z

I just managed to reproduce the same issue on Arch Linux as well, so this is definitely not a SUSE-specific issue.

mauri870 · 2024-11-28T14:12:18Z

Could it be that we expect to find devices in raid mode, and if we find zero then we return an error? We should probably return nil in this case or treat the error accordingly.

https://github.com/elastic/beats/blob/42e25f7216862b6779c2e8a87a82c1ae30d9a6e1/metricbeat/module/system/raid/blockinfo/getdev.go#L46-L48
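
To make the suggestion concrete, here is a minimal sketch of the "zero devices is not an error" behavior. It is not the Beats code linked above: the function name listMDDevices and the md* glob pattern are assumptions for illustration (Linux software RAID arrays normally appear as md* under /sys/block).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// listMDDevices returns the md* entries under sysBlock. Hypothetical helper:
// an empty result means "no RAID arrays on this host" and is not an error.
func listMDDevices(sysBlock string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(sysBlock, "md*"))
	if err != nil {
		// Glob only fails on a malformed pattern.
		return nil, fmt.Errorf("failed to parse sysfs: %w", err)
	}
	return matches, nil
}

func main() {
	devs, err := listMDDevices("/sys/block")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(devs) == 0 {
		fmt.Println("no RAID devices found, nothing to report")
		return
	}
	fmt.Println("RAID devices:", devs)
}
```

With this shape, the caller decides what an empty list means instead of the lookup forcing an error path.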

pierrehilbert · 2024-11-28T14:23:52Z

Wouldn't we have faced this issue before if this is the root cause?
From what I can see, we didn't change anything there for a while.

Could it be a change in newer Linux kernels that disables nf_conntrack by default?

mauri870 · 2024-11-28T14:25:15Z

That does not seem to be the case. I confirmed that conntrack is loaded on both SUSE and Arch. SUSE is on the 5.14 kernel and Arch on 6.12, so I doubt this is related to the kernel itself.

cc @fearful-symmetry since you worked on the raid metrics.

mauri870 · 2024-11-28T18:33:03Z

I have pushed a fix for the metricbeat system module. To my understanding, there is no point in a customer enabling RAID metrics on a system that does not have a RAID configuration. But if they do, the agent should not go into a degraded state because of it; it should simply report no metrics at all. That is basically the solution I'm going with.

Please feel free to comment on other ways to handle this.

mauri870 · 2024-12-03T14:20:05Z

From my testing, this is partially fixed. The error is not causing the agent to go into a degraded state anymore, and it is properly shown in the logs:

In the pull request we decided to use PartialMetricsError so that the error is reported in the output of elastic-agent status, but it is not showing up for me. I'll investigate and see why that is the case.
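
As a rough illustration of the idea, the sketch below classifies a fetch error as "partial" and keeps the unit healthy while still carrying a message. Everything in it is an assumption for illustration: partialMetricsError is a local stand-in rather than the Beats type, and the status strings are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// partialMetricsError is a local stand-in for the PartialMetricsError idea:
// metrics are only partially available, but the input keeps running.
type partialMetricsError struct{ Err error }

func (e partialMetricsError) Error() string { return e.Err.Error() }
func (e partialMetricsError) Unwrap() error { return e.Err }

// Sample errors used below; both are hypothetical.
var (
	errNoRAID = partialMetricsError{Err: errors.New("no RAID devices found")}
	errIO     = errors.New("read failed")
)

// statusFor maps a fetch error to an illustrative status string.
func statusFor(err error) string {
	var partial partialMetricsError
	switch {
	case err == nil:
		return "HEALTHY"
	case errors.As(err, &partial):
		return "HEALTHY (with message): " + partial.Error()
	default:
		return "DEGRADED: " + err.Error()
	}
}

func main() {
	fmt.Println(statusFor(nil))
	fmt.Println(statusFor(fmt.Errorf("system.raid: %w", errNoRAID)))
	fmt.Println(statusFor(errIO))
}
```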

VihasMakwana · 2024-12-03T14:28:25Z

@mauri870 can you do elastic-agent status --output full and see?

mauri870 · 2024-12-03T14:36:07Z

Thanks, here is the full output. From my understanding, it should be reporting this message, right?

elastic-agent status --output=full

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 78e62940-b597-4bc0-afa4-91000d164ccb
   │  ├─ version: 8.17.0
   │  └─ commit: 8a91d5c2306860fa88a1bae9bb7b37b7eabeddf5
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96884'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96864'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96913'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ linux/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96838'
   │  ├─ linux/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ linux/metrics-default-linux/metrics-system-d192f191-c94f-4c99-9363-ff4e8cfb68a5
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '96786'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-fa497a42-ebd4-4117-8c4b-dde7ce717735
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '96812'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ system/metrics-default-system/metrics-system-d192f191-c94f-4c99-9363-ff4e8cfb68a5
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ system/metrics-default-system/metrics-system-fa497a42-ebd4-4117-8c4b-dde7ce717735
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

VihasMakwana · 2024-12-03T14:39:47Z

It should...

Wondering why it doesn't report the error? 🤔

mauri870 · 2024-12-03T14:45:09Z

That is quite intriguing. I'm fairly certain the error I see in the logs originates from the logp line below, suggesting that we have updated the agent's status, but for some reason, it is not being displayed.

// mark module as running if metrics are partially available and display the error message
msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err)

mauri870 · 2024-12-03T15:38:56Z

I spoke with Vihas on Slack and opened elastic/beats#41867 to track this bug. I will keep this issue closed, as the reported bug with a degraded agent state is now fixed.
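
The linked issue is titled "Metricbeat module with multiple metricsets overwrites the error status set by a previous metricset". The toy model below illustrates that kind of last-write-wins behavior next to a per-metricset alternative. It is only a sketch: the types singleSlot and perMetricset are hypothetical and do not reflect how Beats actually stores status.

```go
package main

import (
	"fmt"
	"sort"
)

type state int

const (
	healthy state = iota
	degraded
)

type report struct {
	st  state
	msg string
}

// singleSlot keeps only the most recent report, so a later healthy
// metricset hides an earlier problem.
type singleSlot struct{ last report }

func (s *singleSlot) update(_ string, r report) { s.last = r }
func (s *singleSlot) status() report            { return s.last }

// perMetricset keeps one report per metricset and surfaces the worst one.
type perMetricset struct{ reports map[string]report }

func (p *perMetricset) update(name string, r report) {
	if p.reports == nil {
		p.reports = map[string]report{}
	}
	p.reports[name] = r
}

func (p *perMetricset) status() report {
	names := make([]string, 0, len(p.reports))
	for n := range p.reports {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic pick when several are degraded
	worst := report{st: healthy}
	for _, n := range names {
		if r := p.reports[n]; r.st > worst.st {
			worst = r
		}
	}
	return worst
}

func main() {
	a, b := &singleSlot{}, &perMetricset{}
	a.update("system.raid", report{degraded, "no RAID devices"})
	b.update("system.raid", report{degraded, "no RAID devices"})
	a.update("system.cpu", report{healthy, ""})
	b.update("system.cpu", report{healthy, ""})
	fmt.Printf("single slot: %q, per metricset: %q\n", a.status().msg, b.status().msg)
}
```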

amolnater-qasource · 2024-12-06T04:43:01Z

Hi Team,
We have revalidated this issue on the latest 8.17.0 BC5 Kibana cloud environment and found it still reproducible.

Observations:

Linux agent still gets unhealthy with error: Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory.
No output is observed on running lsmod | grep conntrack.
(DEGRADED) Error fetching data for metricset linux.conntrack: error fetching conntrack stats: open /proc/net/stat/nf_conntrack: no such file or directory is observed on running sudo elastic-agent status --output full.
We have observed data for linux.conntrack dataset under Data Streams tab.

Screenshots:

Logs:
elastic-agent-diagnostics-2024-12-06T04-47-57Z-00.zip

Build details:
VERSION: 8.17.0 BC5
BUILD: 80495
COMMIT: 5c78fb5e4e9b5063bd83ae9bd1e5b32c63f5cc34
Artifact Link: https://staging.elastic.co/8.17.0-a18e6540/downloads/beats/elastic-agent/elastic-agent-8.17.0-linux-x86_64.tar.gz

Please let us know if this is expected.

For now we are reopening this issue until further clarity.

Thanks!

VihasMakwana · 2024-12-06T10:14:24Z

@mauri870 @cmacknz Isn't this expected? The user is trying to use conntrack module without loading the appropriate kernel module.
We can suppress this error, but as a long-term solution, I would rather have the error returned from the metricset's New(...) method.

mauri870 · 2024-12-06T11:58:48Z

Looks like this issue covers two different errors, the RAID metrics and conntrack metrics. My fix was only for the RAID metrics as I couldn't reproduce the conntrack one. It makes sense that the conntrack failure is due to the module not being loaded.

I think this can be a partial metrics error as well, and perhaps a more descriptive one: the error for the missing /proc/net/stat/nf_conntrack could be appended with "conntrack module not loaded/found".
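
A minimal sketch of that more descriptive error, assuming the stats are read from a file path. The names readConntrackStats, errModuleMissing, and isModuleMissing are hypothetical and not taken from the Beats code.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// errModuleMissing is a hypothetical sentinel that callers can test for.
var errModuleMissing = errors.New("nf_conntrack kernel module not loaded/found")

// readConntrackStats reads the stats file and adds a hint when it is absent.
func readConntrackStats(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, fmt.Errorf("error fetching conntrack stats: %w: %v", errModuleMissing, err)
	}
	if err != nil {
		return nil, fmt.Errorf("error fetching conntrack stats: %w", err)
	}
	return data, nil
}

// isModuleMissing reports whether err carries the missing-module hint.
func isModuleMissing(err error) bool { return errors.Is(err, errModuleMissing) }

func main() {
	_, err := readConntrackStats("/proc/net/stat/nf_conntrack")
	switch {
	case err == nil:
		fmt.Println("conntrack stats available")
	case isModuleMissing(err):
		fmt.Println("hint for the user:", err)
	default:
		fmt.Println("unexpected error:", err)
	}
}
```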

mauri870 · 2024-12-06T12:36:43Z

I have opened a PR to fix this. I agree with @VihasMakwana that we should probably check this in the New call. I'm not that familiar with the metricset initialization, but what happens if the New method from a metricset fails?

cmacknz · 2024-12-06T15:08:22Z

Big +1 to us displaying an error when we are told to collect data from a source and collecting data from that source is impossible without modifying the host system in some way. Showing the user this is the point of the feature that does this.

> but what happens if the New method from a metricset fails?

Try it and find out :) What will likely happen is the input in the UI shows as failed with an error that it couldn't reload the configuration because the module couldn't be created.

I would expect the error to pop out here in the Beats code.

mauri870 · 2024-12-09T17:57:47Z

I have filed elastic/beats#41963 to look into handling these in the New call instead of during metric fetching. We should probably look into the other system metricsets to see if they fall into the same category.
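
For illustration, a constructor-time requirement check could look roughly like the sketch below. The metricSet type and newMetricSet function are hypothetical stand-ins and do not use the real Metricbeat New signature.

```go
package main

import (
	"fmt"
	"os"
)

// metricSet is a hypothetical, stripped-down metricset.
type metricSet struct{ statsPath string }

// newMetricSet fails fast when the file the metricset depends on is absent,
// instead of discovering the problem on every fetch.
func newMetricSet(statsPath string) (*metricSet, error) {
	if _, err := os.Stat(statsPath); err != nil {
		return nil, fmt.Errorf("requirement check failed for %s: %w", statsPath, err)
	}
	return &metricSet{statsPath: statsPath}, nil
}

func main() {
	if _, err := newMetricSet("/proc/net/stat/nf_conntrack"); err != nil {
		fmt.Println("metricset not created:", err)
		return
	}
	fmt.Println("metricset created")
}
```

The trade-off is that a failing constructor prevents the metricset from starting at all, so a requirement that becomes available later (for example, a module loaded after startup) would only be picked up on the next configuration reload.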

amolnater-qasource · 2024-12-11T07:58:05Z

Hi Team,
We have revalidated this issue on the latest 8.17.0 BC6 Kibana cloud environment and found it fixed now.

Observations:

Linux Agent remains healthy on adding Linux integration.

Screenshots:

Logs:
elastic-agent-diagnostics-2024-12-11T07-55-02Z-00.zip

Build details:
VERSION: 8.17.0 BC6
BUILD: 80521
COMMIT: e8a820624a03a412433584d3e3df951838e4c63c
Artifact Link: https://staging.elastic.co/8.17.0-6b31e673/downloads/beats/elastic-agent/elastic-agent-8.17.0-amd64.deb

Hence we are marking this issue as QA:Validated.

Thanks!

amolnater-qasource added the bug, impact:medium, and Team:Elastic-Agent-Control-Plane labels Nov 27, 2024

ycombinator added the Team:Elastic-Agent-Data-Plane label and removed the Team:Elastic-Agent-Control-Plane label Nov 27, 2024

mauri870 mentioned this issue Nov 28, 2024

metricbeat: suppress error when RAID metrics are enabled on non-RAID system elastic/beats#41825

Merged

6 tasks

mauri870 self-assigned this Nov 28, 2024

mauri870 closed this as completed in elastic/beats#41825 Dec 2, 2024

This was referenced Dec 2, 2024

[8.x](backport #41825) metricbeat: suppress error when RAID metrics are enabled on non-RAID system elastic/beats#41855

Merged

[8.17](backport #41825) metricbeat: suppress error when RAID metrics are enabled on non-RAID system elastic/beats#41856

Merged

amolnater-qasource added the QA:Ready For Testing label Dec 3, 2024

mauri870 mentioned this issue Dec 3, 2024

Metricbeat module with multiple metricsets overwrites the error status set by a previous metricset elastic/beats#41867

Open

amolnater-qasource reopened this Dec 6, 2024

mauri870 mentioned this issue Dec 6, 2024

metricbeat: handle nf_conntrack module not loaded in linux integration elastic/beats#41930

Merged

6 tasks

mauri870 closed this as completed in elastic/beats#41930 Dec 9, 2024

This was referenced Dec 9, 2024

[8.x](backport #41930) metricbeat: handle nf_conntrack module not loaded in linux integration elastic/beats#41961

Merged

[8.17](backport #41930) metricbeat: handle nf_conntrack module not loaded in linux integration elastic/beats#41962

Merged

mauri870 mentioned this issue Dec 9, 2024

metricbeat/system: check requirements for metricsets in the constructor elastic/beats#41963

Open

amolnater-qasource added the QA:Validated label and removed the QA:Ready For Testing label Dec 11, 2024
