Flaky: policy is never assigned to agent #144

Closed
mtojek opened this issue Oct 1, 2021 · 25 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@mtojek
Contributor

mtojek commented Oct 1, 2021

Hi Team,

It happens from time to time, and I suppose we have already fixed a similar issue before: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/913/pipeline/ (docker integration)

Logs and metrics are in the "Artifacts" tab.

If you need internal filebeat and metricbeat logs of Elastic Agent, let me know and I will tell you how to access them (stored in beats-ci-temp bucket).

It affects the master branch; spotted with 7.14.0-SNAPSHOT.

@mtojek mtojek added the Team:Elastic-Agent Label for the Agent team label Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/agent (Team:Agent)

@ruflin
Contributor

ruflin commented Oct 11, 2021

@jlind23 It would be good to get some eyes on this from the team, as in the past this has indicated bugs.

@mtojek
Contributor Author

mtojek commented Oct 11, 2021

@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 11, 2021
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@mtojek
Contributor Author

mtojek commented Oct 11, 2021

... and another one: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/979/pipeline/

Internal logs are in beats-ci-temp-internal.

@lykkin lykkin self-assigned this Oct 11, 2021
@mtojek
Contributor Author

mtojek commented Oct 12, 2021

Another one today: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/PR-1905/2/pipeline

It seems to be correlated with long-running tests (many data streams, different cases). Is there any possibility that a data rollover is performed in the meantime? The fun fact is that we can reproduce it in CI without any issues.
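
One way to check whether a rollover actually happened around the failure window would be something like this (untested sketch; the host and elastic:changeme credentials are assumptions based on the default local elastic-package stack, and <data_stream> is a placeholder for the affected data stream):

curl -s -u elastic:changeme "http://127.0.0.1:9200/_data_stream/<data_stream>?pretty"   # lists the backing indices; more than one .ds-* entry means a rollover occurred
curl -s -u elastic:changeme "http://127.0.0.1:9200/<data_stream>/_ilm/explain?pretty"   # shows which ILM step each backing index is in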

@mtojek
Contributor Author

mtojek commented Oct 18, 2021

@lykkin Do you have any update on this issue?

@lykkin
Contributor

lykkin commented Oct 18, 2021

@mtojek I've spent some time adding hooks to the Kibana API endpoints and trying to get a reproduction locally, but sadly the repro hasn't been fruitful. Do you have the steps used to reproduce this on the CI pipeline written up anywhere?

@mtojek
Contributor Author

mtojek commented Oct 20, 2021

So it happened again yesterday: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1039/pipeline

In this case it was "network_traffic". All you have to do is fetch the integrations repo and run:

cd packages/network_traffic # enter the network traffic directory
elastic-package build # build the package
elastic-package stack up -v -d # boot up the local Elastic stack
elastic-package test system -v # execute all system tests, will take some time

To reproduce it on the exact Beats CI host, I need to boot a GCP instance.

@lykkin
Contributor

lykkin commented Oct 21, 2021

@mtojek Thanks, I'll take a crack at this tomorrow to see if I can get to the bottom of it.

@lykkin
Contributor

lykkin commented Oct 28, 2021

Hey @mtojek, just an update: I have the Jenkins environment set up locally and I'm running the tests, but I can't seem to get the failures to pop up. I have been running them in the background for a few days with no failures. I'll try it on a GCP instance next.

@ruflin
Contributor

ruflin commented Oct 29, 2021

@lykkin Did you get anything useful out of the logs from our Jenkins servers? One suspicion I always have is that it might be related to restricted resources; potentially your local setup has more resources available than CI. So if you spin up something in GCP, pick a small box.

Please also ping me when needed so we can hop on a call to brainstorm together and go through some logs.

@mtojek
Contributor Author

mtojek commented Oct 29, 2021

Thanks for your engagement, @lykkin!

Do you think you could add more debug/info messages to Fleet's, the Agent's, or Fleet Server's source code, so we can narrow it down in CI?

@lykkin
Contributor

lykkin commented Nov 1, 2021

From what I can tell, the policy is correctly updated through Kibana, then Fleet Server seems to ignore the update. What @ruflin said sounds plausible, as resource constraints might stall Fleet Server and keep it from picking up the policy change. Adding more logs sounds like a great idea, and I'll rope @ruflin in when I circle back around to trying to repro this.
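
In the meantime, a cheap check right after the reassign is to ask Kibana's Fleet API what the agent is currently on and watch whether its policy revision ever advances. A rough, untested sketch (the Kibana address and elastic:changeme credentials are the assumed defaults of the local elastic-package stack, <agent_id> is a placeholder, and the response field names are from memory of the 7.x Fleet API):

curl -s -u elastic:changeme "http://127.0.0.1:5601/api/fleet/agents/<agent_id>" \
  | jq '.item | {status, policy_id, policy_revision, last_checkin}'   # a policy_revision that never catches up points at fleet-server rather than Kibana

If the revision reported here stays behind the one Kibana assigned, the drop is somewhere between Kibana writing the policy and fleet-server delivering it to the agent.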

@mtojek
Contributor Author

mtojek commented Dec 14, 2021

Hi @lykkin and the Team,

the issue has struck again: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1317/pipeline

Let me know if you need more reference data.

@ruflin ruflin added the bug Something isn't working label Dec 14, 2021
@ruflin
Contributor

ruflin commented Dec 14, 2021

Based on the Jenkins logs this seems to be 7.16.0-SNAPSHOT. I was hoping that a few fixes we made for 7.16 related to heavy load on ES or fleet-server overload would fix this. There is one fix that only landed in 7.16.1, and based on the container that is started I'm not sure which version was tested. At the same time, the error we are seeing is not a connection error to Elasticsearch.

With 7.16 quite a few improvements went in around logging, and we now also have diagnostics in place. @lykkin Can you dig into the logs to see if there is something obvious? Ideally we would even run diagnostics on the Elastic Agent that struggles, to see its exact state, but this will be trickier as it is inside the container.

@kpollich You have recently looked into issues around duplicated fleet-server inputs, so I wonder if this could be related in any way? I haven't found anything to prove this, but was wondering if it might ring a bell on your end.

@jlind23 We should get to the bottom of this one.

@kpollich
Member

kpollich commented Dec 14, 2021

@kpollich You have recently looked into issues around duplicated fleet-server inputs, so I wonder if this could be related in any way? I haven't found anything to prove this, but was wondering if it might ring a bell on your end.

Yes, this part in particular sounds very much like the duplicate Fleet Server input issue:

From what I can tell, the policy is correctly updated through Kibana, then Fleet Server seems to ignore the update.

When we upgraded Fleet Server policies from older versions (prior to the Fleet Server package having a policy_template defined), we'd incur the duplicated input. This should be fixed in 7.16.1 by elastic/kibana#119925.

@ruflin
Contributor

ruflin commented Dec 14, 2021

If it is the duplication issue, I assume we should see some errors in the fleet-server logs.

@mtojek
Contributor Author

mtojek commented Dec 14, 2021

Hi folks, thanks for investigating the problem. Just wanted to remind you that I spotted the issue today, so I assume the CI used the latest Docker images.

@lykkin
Contributor

lykkin commented Dec 20, 2021

Heya, sorry about the lapse in comms; picking this back up for some more investigation. Will update when I find something interesting.

@lykkin
Contributor

lykkin commented Dec 21, 2021

Digging into the logs didn't turn up anything conclusive. The fleet-server logs are pretty sparse, and don't track any changes around the time the reassign request happens.

Tracing the script logs, we see the reassign request being made around 15:39:43:

[2021-12-20T15:39:43.337Z] 2021/12/20 15:39:43 DEBUG reassigning original policy back to agent...
[2021-12-20T15:39:43.337Z] 2021/12/20 15:39:43 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign
[2021-12-20T15:39:45.246Z] 2021/12/20 15:39:45 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6

which the Kibana logs show were successful:

{"type":"response","@timestamp":"2021-12-20T15:39:43+00:00","tags":["access:fleet-all"],"pid":1223,"method":"put","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign","method":"put","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-length":"55","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":1871,"contentLength":2},"message":"PUT /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign 200 1871ms - 2.0B"}
{"type":"response","@timestamp":"2021-12-20T15:39:45+00:00","tags":["access:fleet-read"],"pid":1223,"method":"get","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6","method":"get","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":9,"contentLength":1282},"message":"GET /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6 200 9ms - 1.3KB"}

It appears that the corresponding agent update is missing.

There's nothing suggestive in the ES or fleet-server logs that would explain the drop in communication from Kibana to the agent. Running fleet-server with higher logging verbosity might catch something the next time this issue crops up.
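
Another angle for the next occurrence: compare what the agent has acknowledged against what Kibana assigned by reading the agent document straight from ES. A rough sketch (host and elastic:changeme credentials assume the default elastic-package stack, <agent_id> is a placeholder, and I'm assuming fleet-server keys the .fleet-agents documents by agent ID):

curl -s -u elastic:changeme "http://127.0.0.1:9200/.fleet-agents/_doc/<agent_id>?_source_includes=policy_*,last_checkin*&pretty"   # the policy_* fields show the revision the agent last acked

If the acked revision there never reaches the one Kibana wrote, the problem is in fleet-server's dispatch/check-in path rather than in Kibana.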

@blakerouse
Contributor

@lykkin Can you check the .fleet-policies index to see if the policy was placed into the index? You should see the policy appear twice in the index.

The first one should have revision_idx: ${revision_num} and coordinator_idx: 0 (that one was inserted by Kibana); then you should see a follow-up inserted by the Fleet Server Coordinator with revision_idx: ${revision_num} and coordinator_idx: 1.
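
A query along these lines should show both documents (untested sketch; host and elastic:changeme credentials assume the default elastic-package stack, and <policy_id> is a placeholder):

curl -s -u elastic:changeme -H 'Content-Type: application/json' \
  "http://127.0.0.1:9200/.fleet-policies/_search?pretty" -d '
{
  "query": { "term": { "policy_id": "<policy_id>" } },
  "sort": [ { "revision_idx": "desc" }, { "coordinator_idx": "desc" } ],
  "_source": [ "policy_id", "revision_idx", "coordinator_idx" ],
  "size": 10
}'

If the latest revision_idx only ever shows up with coordinator_idx: 0, Kibana wrote the policy but the Coordinator never followed up, which would explain the agent never receiving it.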

@lykkin
Contributor

lykkin commented Dec 22, 2021

@blakerouse Good idea! I was under the impression the ES instance these tests ran against was ephemeral. Is there a way to get at the ES container it used?

Unfortunately, I have had no luck reproducing this locally.

@jlind23
Contributor

jlind23 commented Jan 18, 2022

Any update there?

@jlind23 jlind23 transferred this issue from elastic/beats Mar 7, 2022
@jlind23
Contributor

jlind23 commented May 14, 2024

Not relevant anymore, hence closing.
cc @ycombinator

@jlind23 jlind23 closed this as not planned May 14, 2024