Flaky: policy is never assigned to agent #144

Closed
mtojek opened this issue Oct 1, 2021 · 25 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@mtojek
Contributor

mtojek commented Oct 1, 2021

Hi Team,

It happens from time to time, and I suppose we have already fixed a similar issue before: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/913/pipeline/ (docker integration)

Logs and metrics are in the "Artifacts" tab.

If you need internal filebeat and metricbeat logs of Elastic Agent, let me know and I will tell you how to access them (stored in beats-ci-temp bucket).

It affects the master branch; spotted with 7.14.0-SNAPSHOT.

@mtojek mtojek added the Team:Elastic-Agent Label for the Agent team label Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/agent (Team:Agent)

@ruflin
Contributor

ruflin commented Oct 11, 2021

@jlind23 It would be good to get some eyes on this from the team, as in the past this has indicated bugs.

@mtojek
Contributor Author

mtojek commented Oct 11, 2021

@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 11, 2021
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@mtojek
Contributor Author

mtojek commented Oct 11, 2021

... and another one: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/979/pipeline/

Internal logs are in beats-ci-temp-internal.

@lykkin lykkin self-assigned this Oct 11, 2021
@mtojek
Contributor Author

mtojek commented Oct 12, 2021

Another one today: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/PR-1905/2/pipeline

It seems to be correlated with long-running tests (many data streams, different cases). Is there any possibility that a data rollover is performed in the meantime? The fun fact is that we can reproduce it in CI without any issues.
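
One way to check whether a rollover actually happened around the failure window would be something like this (untested sketch; the host and elastic:changeme credentials are assumptions based on the default local elastic-package stack, and <data_stream> is a placeholder for the affected data stream):

curl -s -u elastic:changeme "http://127.0.0.1:9200/_data_stream/<data_stream>?pretty"   # lists the backing indices; more than one .ds-* entry means a rollover occurred
curl -s -u elastic:changeme "http://127.0.0.1:9200/<data_stream>/_ilm/explain?pretty"   # shows which ILM step each backing index is in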

@mtojek
Contributor Author

mtojek commented Oct 18, 2021

@lykkin Do you have any update on this issue?

@lykkin
Contributor

lykkin commented Oct 18, 2021

@mtojek I've spent some time adding hooks to the Kibana API endpoints and trying to get a reproduction locally, but sadly the repro hasn't been fruitful. Do you have the steps used to reproduce this on the CI pipeline written up anywhere?

@mtojek
Contributor Author

mtojek commented Oct 20, 2021

So it happened again yesterday: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1039/pipeline

In this case it was "network_traffic". All you have to do is fetch the integrations repo and run:

cd packages/network_traffic # enter the network traffic directory
elastic-package build # build the package
elastic-package stack up -v -d # boot up the local Elastic stack
elastic-package test system -v # execute all system tests, will take some time

To reproduce it on the exact Beats CI host, I need to boot a GCP instance.

@lykkin
Contributor

lykkin commented Oct 21, 2021

@mtojek Thanks, I'll take a crack at this tomorrow to see if I can get to the bottom of it.

@lykkin
Contributor

lykkin commented Oct 28, 2021

Hey @mtojek, just an update: I have the Jenkins environment set up locally and I'm running the tests, but I can't seem to get the failures to pop up. I have been running them in the background for a few days with no failures. I'll try it on a GCP instance next.

@ruflin
Contributor

ruflin commented Oct 29, 2021

@lykkin Did you get anything useful out of the logs from our Jenkins servers? One suspicion I always have is that it might be related to restricted resources; potentially your local setup has more resources available than CI. So if you spin up something in GCP, pick a small box.

Please also ping me when needed so we can hop on a call to brainstorm together and go through some logs.

@mtojek
Contributor Author

mtojek commented Oct 29, 2021

Thanks for your engagement, @lykkin!

Do you think you could add more debug/info messages to Fleet's, the Agent's, or Fleet Server's source code, so we can narrow it down in CI?

@lykkin
Contributor

lykkin commented Nov 1, 2021

From what I can tell, the policy is correctly updated through Kibana, then Fleet Server seems to ignore the update. What @ruflin said sounds plausible, as resource constraints might stall Fleet Server and keep it from picking up the policy change. Adding more logs sounds like a great idea, and I'll rope @ruflin in when I circle back around to trying to repro this.
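
In the meantime, a cheap check right after the reassign is to ask Kibana's Fleet API what the agent is currently on and watch whether its policy revision ever advances. A rough, untested sketch (the Kibana address and elastic:changeme credentials are the assumed defaults of the local elastic-package stack, <agent_id> is a placeholder, and the response field names are from memory of the 7.x Fleet API):

curl -s -u elastic:changeme "http://127.0.0.1:5601/api/fleet/agents/<agent_id>" \
  | jq '.item | {status, policy_id, policy_revision, last_checkin}'   # a policy_revision that never catches up points at fleet-server rather than Kibana

If the revision reported here stays behind the one Kibana assigned, the drop is somewhere between Kibana writing the policy and fleet-server delivering it to the agent.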

@mtojek
Contributor Author

mtojek commented Dec 14, 2021

Hi @lykkin and the Team,

the issue has struck again: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/1317/pipeline

Let me know if you need more reference data.

@ruflin ruflin added the bug Something isn't working label Dec 14, 2021
@ruflin
Contributor

ruflin commented Dec 14, 2021

Based on the Jenkins logs this seems to be 7.16.0-SNAPSHOT. I was hoping that a few fixes we made for 7.16 related to heavy load on ES or fleet-server overload would fix this. There is one fix that only landed in 7.16.1, and based on the container that is started I'm not sure which version was tested. At the same time, the error we are seeing is not a connection error to Elasticsearch.

With 7.16 quite a few improvements went in around logging, and we now also have diagnostics in place. @lykkin Can you dig into the logs to see if there is something obvious? Ideally we would even run diagnostics on the Elastic Agent that struggles, to see its exact state, but this will be trickier as it is inside the container.

@kpollich You have recently looked into issues around duplicated fleet-server inputs, so I wonder if this could be related in any way? I haven't found anything to prove this, but was wondering if it might ring a bell on your end.

@jlind23 We should get to the bottom of this one.

@kpollich
Member

kpollich commented Dec 14, 2021

@kpollich You have recently looked into issues around duplicated fleet-server inputs, so I wonder if this could be related in any way? I haven't found anything to prove this, but was wondering if it might ring a bell on your end.

Yes, this part in particular sounds very much like the duplicate Fleet Server input issue:

From what I can tell, the policy is correctly updated through Kibana, then Fleet Server seems to ignore the update.

When we upgraded Fleet Server policies from older versions (prior to the Fleet Server package having a policy_template defined), we'd incur the duplicated input. This should be fixed in 7.16.1 by elastic/kibana#119925.

@ruflin
Contributor

ruflin commented Dec 14, 2021

If it is the duplication issue, I assume we should see some errors in the fleet-server logs.

@mtojek
Contributor Author

mtojek commented Dec 14, 2021

Hi folks, thanks for investigating the problem. Just wanted to remind you that I spotted the issue today, so I assume the CI used the latest Docker images.

@lykkin
Contributor

lykkin commented Dec 20, 2021

Heya, sorry about the lapse in comms; picking this back up for some more investigation. Will update when I find something interesting.

@lykkin
Contributor

lykkin commented Dec 21, 2021

Digging into the logs didn't turn up anything conclusive. The fleet-server logs are pretty sparse, and don't track any changes around the time the reassign request happens.

Tracing the script logs, we see the reassign request being made around 15:39:43:

[2021-12-20T15:39:43.337Z] 2021/12/20 15:39:43 DEBUG reassigning original policy back to agent...
[2021-12-20T15:39:43.337Z] 2021/12/20 15:39:43 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign
[2021-12-20T15:39:45.246Z] 2021/12/20 15:39:45 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6

which the Kibana logs show were successful:

{"type":"response","@timestamp":"2021-12-20T15:39:43+00:00","tags":["access:fleet-all"],"pid":1223,"method":"put","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign","method":"put","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-length":"55","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":1871,"contentLength":2},"message":"PUT /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6/reassign 200 1871ms - 2.0B"}
{"type":"response","@timestamp":"2021-12-20T15:39:45+00:00","tags":["access:fleet-read"],"pid":1223,"method":"get","statusCode":200,"req":{"url":"/api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6","method":"get","headers":{"host":"127.0.0.1:5601","user-agent":"Go-http-client/1.1","content-type":"application/json","kbn-xsrf":"7.16.0","accept-encoding":"gzip"},"remoteAddress":"172.18.0.1","userAgent":"Go-http-client/1.1"},"res":{"statusCode":200,"responseTime":9,"contentLength":1282},"message":"GET /api/fleet/agents/d8e9da95-a991-4daa-9231-94a4f11507a6 200 9ms - 1.3KB"}

It appears that the corresponding agent update is missing.

There's nothing suggestive in the ES or fleet-server logs that would explain the drop in communication from Kibana to the agent. Running fleet-server with higher logging verbosity might catch something the next time this issue crops up.
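
Another angle for the next occurrence: compare what the agent has acknowledged against what Kibana assigned by reading the agent document straight from ES. A rough sketch (host and elastic:changeme credentials assume the default elastic-package stack, <agent_id> is a placeholder, and I'm assuming fleet-server keys the .fleet-agents documents by agent ID):

curl -s -u elastic:changeme "http://127.0.0.1:9200/.fleet-agents/_doc/<agent_id>?_source_includes=policy_*,last_checkin*&pretty"   # the policy_* fields show the revision the agent last acked

If the acked revision there never reaches the one Kibana wrote, the problem is in fleet-server's dispatch/check-in path rather than in Kibana.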

@blakerouse
Contributor

@lykkin Can you check the .fleet-policies index to see if the policy was placed into the index? You should see the policy appear twice in the index.

The first one should have revision_idx: ${revision_num} and coordinator_idx: 0 (that one was inserted by Kibana); then you should see a follow-up inserted by the Fleet Server Coordinator with revision_idx: ${revision_num} and coordinator_idx: 1.
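
A query along these lines should show both documents (untested sketch; host and elastic:changeme credentials assume the default elastic-package stack, and <policy_id> is a placeholder):

curl -s -u elastic:changeme -H 'Content-Type: application/json' \
  "http://127.0.0.1:9200/.fleet-policies/_search?pretty" -d '
{
  "query": { "term": { "policy_id": "<policy_id>" } },
  "sort": [ { "revision_idx": "desc" }, { "coordinator_idx": "desc" } ],
  "_source": [ "policy_id", "revision_idx", "coordinator_idx" ],
  "size": 10
}'

If the latest revision_idx only ever shows up with coordinator_idx: 0, Kibana wrote the policy but the Coordinator never followed up, which would explain the agent never receiving it.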

@lykkin
Contributor

lykkin commented Dec 22, 2021

@blakerouse Good idea! I was under the impression the ES instance these tests ran against was ephemeral. Is there a way to get at the ES container it used?

Unfortunately, I have had no luck reproducing this locally.

@jlind23
Contributor

jlind23 commented Jan 18, 2022

Any update there?

@jlind23 jlind23 transferred this issue from elastic/beats Mar 7, 2022
@jlind23
Contributor

jlind23 commented May 14, 2024

Not relevant anymore, hence closing.
cc @ycombinator

@jlind23 jlind23 closed this as not planned May 14, 2024