Flaky: policy is never assigned to agent #144
Comments
Pinging @elastic/agent (Team:Agent) |
@jlind23 It would be good to get some eyes on this from the team, as in the past this kind of failure has indicated bugs. |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
... and another one: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/979/pipeline/ internal logs in |
Another one today: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/PR-1905/2/pipeline It seems to be correlated with long-running tests (many data streams, different cases). Is there any possibility that a data rollover is performed in the meantime? The fun fact is that we can reproduce it in CI without any issues. |
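To make the rollover question directly checkable, here is a minimal sketch (an editorial addition, not part of the original report): it assumes the default local elastic-package stack (elastic:changeme, plain HTTP on localhost:9200) and uses an example data stream name; jq is only for readability.

```bash
# Sketch only: check whether a rollover happened while the tests were running.
# Data stream name, credentials and scheme are assumptions -- adjust for your stack.
DS="logs-docker.container_logs-default"

# Backing indices of the data stream; more than one usually means a rollover occurred.
curl -s -u elastic:changeme "http://localhost:9200/_data_stream/${DS}" \
  | jq '.data_streams[0].indices'

# ILM view of the same data stream: current phase, action, and age of each backing index.
curl -s -u elastic:changeme "http://localhost:9200/${DS}/_ilm/explain" \
  | jq '.indices[] | {phase, action, age}'
```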
@lykkin Do you have any update on this issue? |
@mtojek i've spent some time adding some hooks to the kibana api endpoints and trying to get a reproduction locally, but the repro hasn't been fruitful sadly. do you have the steps used to reproduce this on the CI pipeline written up anywhere? |
So it happened again yesterday: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1039/pipeline In this case it was the "network_traffic". All you have to do is to fetch the integrations repo and:

```bash
cd packages/network_traffic      # enter the network traffic directory
elastic-package build            # build the package
elastic-package stack up -v -d   # boot up the local Elastic stack
elastic-package test system -v   # execute all system tests, will take some time
```

To reproduce it on the exact Beats CI host, I need to boot a GCP instance. |
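As a side note on these steps: the policy assignment can be watched from the outside while the tests run. A rough sketch, assuming the default elastic-package Kibana endpoint and credentials; the response field names (`list`, `policy_id`, `policy_revision`) are assumptions for 7.x and vary between stack versions.

```bash
# Poll the Fleet API every few seconds and print each agent's assigned policy.
# Endpoint, credentials and response field names are assumptions for a 7.x stack.
while true; do
  curl -s -u elastic:changeme "http://localhost:5601/api/fleet/agents" \
    | jq -r '.list[] | [.id, .policy_id, (.policy_revision // "-"), .status] | @tsv'
  sleep 5
done
```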
@mtojek thanks, i'll take a crack at this tomorrow to see if i can get to the bottom of it |
hey @mtojek, just an update: i have the jenkins environment set up locally and i'm running the tests, but can't seem to get the failures to pop up. i have been running them in the background for a few days with no failures. i'll try it on a GCP instance next. |
@lykkin Did you get anything useful out of the logs from our Jenkins servers? One suspicion I always have is that it might be related to restricted resources. Potentially your local setup has too many resources available. So if you spin up something in GCP, pick a small box. Please also ping me when needed so we can hop on a call to brainstorm together and go through some logs. |
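One way to act on the restricted-resources suspicion locally (a suggestion, not a documented elastic-package feature; the container names below are assumptions) is to shrink the already-running stack containers to something closer to a small CI box:

```bash
# List the containers started by elastic-package to get their exact names first.
docker ps --format '{{.Names}}'

# Example: cap Elasticsearch and the Elastic Agent container at 2 CPUs / 2 GiB each.
docker update --cpus 2 --memory 2g --memory-swap 2g elastic-package-stack_elasticsearch_1
docker update --cpus 2 --memory 2g --memory-swap 2g elastic-package-stack_elastic-agent_1
```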
Thanks for your engagement, @lykkin! Do you think that you can add more debug/info messages to Fleet's, Agent's, or Fleet Server's source code, so we can narrow it down in the CI? |
from what i can tell it accurately updates the policy through kibana, then fleet server seems to ignore the update. what @ruflin said sounds plausible, as it might stall out the fleet server from picking up the policy change. adding more logs sounds like a great idea, and i'll rope @ruflin in when i circle back around to trying to repro this. |
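To pin down where the update gets lost, the agent document can be snapshotted via the same Fleet API right before and right after the reassign call and diffed. A sketch, assuming the default local stack; the agent id is the example one from the Kibana logs quoted later in this thread, and the `item.*` field names are assumptions for 7.x.

```bash
AGENT_ID="d8e9da95-a991-4daa-9231-94a4f11507a6"   # example id taken from the Kibana logs below
KIBANA="http://localhost:5601"

# Run once before and once after PUT /api/fleet/agents/<id>/reassign and compare the output.
curl -s -u elastic:changeme "${KIBANA}/api/fleet/agents/${AGENT_ID}" \
  | jq '{policy_id: .item.policy_id, policy_revision: .item.policy_revision, last_checkin: .item.last_checkin}'
```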
Hi @lykkin and the Team, the issue is striking back again: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1317/pipeline Let me know if you need more reference data. |
Based on the Jenkins logs this seems to be 7.16.0-SNAPSHOT. I was hoping that a few fixes we did for 7.16 related to heavy load on ES or fleet-server overloading could fix this. There is one fix that only landed in 7.16.1, and based on the container that is started I'm not sure which version was tested. At the same time, the error we are seeing is not a connection error to Elasticsearch. With 7.16 quite a few improvements went in around logging, and we now also have diagnostics in place. @lykkin Can you dig into the logs and see if there is something obvious? Ideally we would even run diagnostics on the Elastic Agent that struggles to see its exact state, but this will be trickier as it is inside the container. @kpollich You have recently looked into issues around duplicated fleet-server inputs, so I wonder if this could be related in any way? I have found nothing to prove this, but was wondering if it might ring a bell on your end. @jlind23 We should get to the bottom of this one. |
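For the diagnostics part, running them against the agent inside the container should still be possible from the host. A sketch, assuming the stack's agent container name and default install path; the name of the archive the command produces varies by version.

```bash
AGENT_CONTAINER="elastic-package-stack_elastic-agent_1"   # assumed name, verify with `docker ps`

# Collect a diagnostics archive inside the container, then locate it and copy it out.
docker exec "${AGENT_CONTAINER}" elastic-agent diagnostics collect
docker exec "${AGENT_CONTAINER}" sh -c 'ls -1 /usr/share/elastic-agent/*.zip'
# docker cp "${AGENT_CONTAINER}:/usr/share/elastic-agent/<archive-listed-above>.zip" .
```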
Yes this part in particular sounds very much like the duplicate Fleet Server input issue:
When we upgraded Fleet Server policies from older versions (prior to the Fleet Server package having a |
If it is the duplication issue, I assume we should see some errors in the fleet-server logs. |
Hi folks, thanks for investigating the problem. Just wanted to remind you that I spotted the issue today, so I assume the CI used the latest Docker images. |
heya, sorry about the lapse in comms, picking this back up for some more investigation. will update when i find something interesting. |
digging into the logs didn't turn up anything conclusive. the fleet-server logs are pretty sparse, and don't record any changes around the time the reassign request happens. tracing the script logs, we see the reassign request being made around 15:29:43, which the kibana logs show was successful:

```json
{"type":"response","@timestamp":"2021-12-20T15:39:43+00:00","tags":["access:fleet-all"],"pid":1223,"method":"put","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign","method":"put","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-length":"55","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":1871,"contentLength":2},"message":"PUT /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign 200 1871ms - 2.0B"}
{"type":"response","@timestamp":"2021-12-20T15:39:45+00:00","tags":["access:fleet-read"],"pid":1223,"method":"get","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6","method":"get","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":9,"contentLength":1282},"message":"GET /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6 200 9ms - 1.3KB"}
```

it appears that the corresponding agent update is missing. there's nothing suggestive in the ES or fleet-server logs that would explain the drop in communication from kibana to the agent. it might be that running fleet-server with higher logging verbosity would catch something next time this issue crops up. |
@lykkin Can you check the First one should be with |
@blakerouse good idea! i was under the impression the ES instance these tests ran against was ephemeral. is there a way to get at the ES container it used? unfortunately, i have had no luck reproducing this locally. |
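In case it does reproduce locally, the documents fleet-server keeps for agents and policies live in Elasticsearch system indices and can be inspected directly. A sketch, assuming the default local stack and that the `.fleet-agents` / `.fleet-policies` layout matches 7.x; the agent id is the example from the logs above.

```bash
AGENT_ID="d8e9da95-a991-4daa-9231-94a4f11507a6"   # example id from the Kibana logs above

# Agent document as fleet-server sees it (assigned policy, revision, last checkin).
curl -s -u elastic:changeme "http://localhost:9200/.fleet-agents/_doc/${AGENT_ID}" | jq '._source'

# Policy revisions fleet-server has recorded.
curl -s -u elastic:changeme "http://localhost:9200/.fleet-policies/_search?size=100" \
  | jq '.hits.hits[]._source | {policy_id, revision_idx}'
```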
Any update there? |
Not relevant anymore, hence closing.
Hi Team,
it happens from time to time and I suppose we have already fixed a similar issue: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/913/pipeline/ (docker integration)
Logs and metrics are in the "Artifacts" tab.
If you need internal filebeat and metricbeat logs of Elastic Agent, let me know and I will tell you how to access them (stored in beats-ci-temp bucket).
It affects the master branch; spotted on 7.14.0-SNAPSHOT.
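For completeness, fetching the internal Filebeat/Metricbeat logs mentioned above would look roughly like this, assuming read access to the bucket; the object prefix is a placeholder, not a real path.

```bash
# Browse the bucket and copy a build's logs locally (the prefix is a placeholder).
gsutil ls gs://beats-ci-temp/
gsutil cp -r "gs://beats-ci-temp/<build-prefix>/" ./ci-logs/
```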