Linux Agent gets unhealthy on adding Linux integration. #6155
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
@amolnater-qasource Can you confirm you have the conntrack module loaded on this system?
Also, was this system upgraded without a restart? This can cause failures sometimes. |
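One way to answer the conntrack question from the host itself is to look for `nf_conntrack` in `/proc/modules`. The sketch below is only an illustration and not part of the agent: `moduleListed` is a hypothetical helper, and a kernel with conntrack built in (rather than loaded as a module) would not appear in that file, so a negative result is not conclusive.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// moduleListed reports whether name is the first field of any line in
// procModules, which is expected to hold the contents of /proc/modules.
// Hypothetical helper for illustration only.
func moduleListed(procModules, name string) bool {
	for _, line := range strings.Split(procModules, "\n") {
		fields := strings.Fields(line)
		// Compare the whole first field so that nf_conntrack_netlink
		// is not mistaken for nf_conntrack.
		if len(fields) > 0 && fields[0] == name {
			return true
		}
	}
	return false
}

func main() {
	contents, err := os.ReadFile("/proc/modules")
	if err != nil {
		fmt.Println("cannot read /proc/modules:", err)
		return
	}
	fmt.Println("nf_conntrack listed in /proc/modules:",
		moduleListed(string(contents), "nf_conntrack"))
}
```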
I tried deploying a SUSE 15 instance in Azure to debug this. I'm running the 8.17.0 version in the staging environment; I deployed the Beta Linux Metrics integration and couldn't find any errors from the agent:
|
Aha, after enabling
|
I just managed to reproduce the same issue on Arch Linux as well, so it is definitely not a SUSE issue. |
Could it be that we expect to find devices in RAID mode, and if we find zero then we return an error? We should probably return nil in this case or treat the error accordingly. |
Wouldn't we have faced this issue before if this were the root cause? Could it be a change in newer Linux kernels that disables nf_conntrack by default? |
That does not seem to be the case. I confirmed that conntrack is loaded on both SUSE and Arch. SUSE is on the 5.14 kernel and Arch on 6.12, so I doubt this is related to the kernel itself. cc @fearful-symmetry since you worked on the RAID metrics. |
I have pushed a fix for the metricbeat system module. As I understand it, there is no point in a customer enabling RAID metrics on a system that does not have a RAID configuration. But if they do, the agent should not go into a degraded state because of it; it should simply report no metrics at all. That is basically the solution I'm going with. Please feel free to comment on other ways to handle this. |
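The behaviour described here (no RAID devices means no metrics, not an error) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual metricbeat change: `fetchRAID`, `errNoRAIDDevices`, and `errDemoFailure` are hypothetical names, and the device lister is passed in as a function so the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoRAIDDevices is a hypothetical sentinel meaning the host simply has
// no RAID configuration. It is not an identifier from the Beats codebase.
var errNoRAIDDevices = errors.New("no RAID devices found")

// errDemoFailure is a hypothetical stand-in for a genuine failure.
var errDemoFailure = errors.New("permission denied")

// fetchRAID returns the devices to report. An empty result, or the
// "no devices" sentinel, is treated as "nothing to report" rather than
// as a failure. Any other error is still returned to the caller.
func fetchRAID(list func() ([]string, error)) ([]string, error) {
	devices, err := list()
	if errors.Is(err, errNoRAIDDevices) || (err == nil && len(devices) == 0) {
		return nil, nil
	}
	if err != nil {
		return nil, fmt.Errorf("listing RAID devices: %w", err)
	}
	return devices, nil
}

func main() {
	// A host without RAID: no devices and no error.
	devices, err := fetchRAID(func() ([]string, error) {
		return nil, errNoRAIDDevices
	})
	fmt.Println(devices, err)

	// A real failure is still surfaced.
	_, err = fetchRAID(func() ([]string, error) {
		return nil, errDemoFailure
	})
	fmt.Println(err)
}
```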
From my testing, this is partially fixed. The error is not causing the agent to go into a degraded state anymore, and it is properly shown in the logs: In the pull request we decided to use PartialMetricsError, to make the error reported in the output of |
@mauri870 can you do |
Thanks, here is the full output. From my understanding, it should be reporting this message, right?
elastic-agent status --output=full
|
It should... Wondering why it doesn't report the error? 🤔 |
That is quite intriguing. I'm fairly certain the error I see in the logs originates from the logp line below, suggesting that we have updated the agent's status, but for some reason, it is not being displayed.
// mark module as running if metrics are partially available and display the error message
msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err) |
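For readers following along, the idea behind the snippet above (a partial-metrics error keeps the module running but still attaches a message) can be sketched in isolation. Everything below is a hypothetical stand-in: `partialMetricsError`, `state`, and `statusFor` are illustrative names and are not the Beats types, and the choice to map every other error to a degraded state is an assumption based on the behaviour described in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

type state string

const (
	running  state = "RUNNING"
	degraded state = "DEGRADED"
)

// errNoDevices is a hypothetical underlying error used in the demo.
var errNoDevices = errors.New("no RAID devices found")

// partialMetricsError is a hypothetical wrapper marking an error as
// "some metrics are unavailable, but the module itself is fine".
type partialMetricsError struct{ err error }

func (e partialMetricsError) Error() string { return e.err.Error() }
func (e partialMetricsError) Unwrap() error { return e.err }

// statusFor picks the state and message to report after a fetch.
// Partial errors keep the module running but still carry a message;
// any other error marks it degraded (an assumption of this sketch).
func statusFor(metricset string, err error) (state, string) {
	if err == nil {
		return running, ""
	}
	msg := fmt.Sprintf("Error fetching data for metricset %s: %v", metricset, err)
	var partial partialMetricsError
	if errors.As(err, &partial) {
		return running, msg
	}
	return degraded, msg
}

func main() {
	st, msg := statusFor("system.raid", partialMetricsError{err: errNoDevices})
	fmt.Println(st, msg)
}
```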
I spoke with Vihas on Slack and I have opened elastic/beats#41867 to track this bug. I will keep this issue closed, as the reported bug with a degraded agent state is now fixed. |
Hi Team, Observations:
Logs: Build details: Please let us know if this is expected. For now we are reopening this issue until further clarity. Thanks! |
Looks like this issue covers two different errors, the RAID metrics and conntrack metrics. My fix was only for the RAID metrics as I couldn't reproduce the conntrack one. It makes sense that the conntrack failure is due to the module not being loaded. I think this can be a partial metrics error as well and perhaps more descriptive, the error for |
I have opened a PR to fix this. I agree with @VihasMakwana that we should probably check this in the |
Big +1 to us displaying an error when we are told to collect data from a source and collecting data from that source is impossible without modifying the host system in some way. Showing this to the user is the whole point of the feature.
Try it and find out :) What will likely happen is the input in the UI shows as failed with an error that it couldn't reload the configuration because the module couldn't be created. I would expect the error to pop out here in the Beats code. |
I have filed elastic/beats#41963 to look into handling these in the |
Hi Team, Observations:
Logs: Build details: Hence we are marking this issue as QA:Validated. Thanks! |
Kibana Build details:
Artifact Link: https://staging.elastic.co/8.17.0-8031025a/downloads/beats/elastic-agent/elastic-agent-8.17.0-linux-x86_64.tar.gz
Host OS:
SLES15
Preconditions:
Steps to reproduce:
Expected Result:
Linux Agent should remain healthy on adding Linux integration.
Screenshot:
Agent Logs:
elastic-agent-diagnostics-2024-11-27T08-39-33Z-00.zip