Network failures (?) cause *excessive* logging #2628

Closed · 2 tasks
honzakral opened this issue May 8, 2023 · 5 comments · Fixed by elastic/elastic-agent-client#70

Labels: 8.9-candidate · bug (Something isn't working) · Team:Elastic-Agent (Label for the Agent team)

Comments


honzakral commented May 8, 2023

Issue

I noticed I wasn't getting any data from my elastic-agent deployment in my k8s cluster (monitoring GKE, GCP infrastructure, billing, and logs from GCP Pub/Sub). When I looked at the logs, they were full of messages like:

{"log.level":"error","@timestamp":"2023-05-08T19:45:28.241Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":350,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-08T19:45:28.273Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:47298->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":350,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

I manually restarted the deployment and it all went back to normal. Unfortunately I cannot provide more details because I had to act quickly: the agent was spamming more than 17k (!!) messages like these into the log every second, and no other logs survived because of it.
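For context on how a network failure can turn into this log volume: if the client redials the agent's gRPC endpoint in a tight loop and logs every failure, a persistent handshake timeout becomes thousands of log lines per second. Below is a minimal Go sketch of the standard mitigation, exponential backoff between reconnect attempts; `dialAgent` is a hypothetical stand-in for the real checkin dial, and this is an illustration, not the actual change made in elastic/elastic-agent-client#70.

```go
package main

import (
	"context"
	"log"
	"time"
)

// dialAgent is a hypothetical stand-in for the real checkin dial;
// here it simulates a persistently failing TLS handshake.
func dialAgent(ctx context.Context) error {
	return context.DeadlineExceeded
}

func main() {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for {
		if err := dialAgent(context.Background()); err != nil {
			// Logging and then sleeping bounds the error volume; without
			// the sleep, every failed dial logs immediately, which is how
			// a stuck gRPC server can yield ~17k log lines per second.
			log.Printf("elastic-agent-client error: %v (retrying in %s)", err, backoff)
			time.Sleep(backoff)
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		backoff = 100 * time.Millisecond // reset after a successful connection
	}
}
```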

For confirmed bugs, please report:

  • Version: Binary: 8.7.0 (build: fc4a15b8a56c2ac2d7b878a706937c60e93816f9 at 2023-03-28 04:02:11 +0000 UTC)
  • Operating System: Kubernetes
  • Steps to Reproduce: unknown

Definition of done

  • Root cause has been identified and fixed (what is the origin of the deadlock and how do we avoid it?)
  • Tests are in place to ensure that it can't happen in the future.
honzakral added the bug (Something isn't working) label on May 8, 2023

cmacknz commented May 8, 2023

This is the problem described in elastic/elastic-agent-client#66

We have seen this happen rarely but haven't isolated what causes it (besides knowing what causes the excessive logging).


cmacknz commented May 8, 2023

The logs here indicate that the Beat subprocesses are failing their check-in (health check) back to the Elastic Agent parent process because of a timeout dialing the agent's gRPC server on port 6789.
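For anyone trying to reproduce the error shape: "transport: authentication handshake failed: context deadline exceeded" is what grpc-go reports when a dial's TLS handshake cannot complete before the context deadline. A minimal sketch that provokes it against a port where nothing completes the handshake; the address and TLS config are placeholders, not the agent's actual client setup.

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// A short deadline plus a server that never completes the TLS
	// handshake yields: rpc error: code = Unavailable desc = connection
	// error: desc = "transport: authentication handshake failed:
	// context deadline exceeded"
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	creds := credentials.NewTLS(&tls.Config{InsecureSkipVerify: true}) // placeholder TLS config
	conn, err := grpc.DialContext(ctx, "127.0.0.1:6789",
		grpc.WithTransportCredentials(creds),
		grpc.WithBlock(),                 // fail the dial instead of connecting lazily
		grpc.WithReturnConnectionError(), // surface the handshake error in the dial result
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```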


vitalyrychkov commented May 9, 2023

I upgraded Elasticsearch, Kibana, and elastic-agent from 8.7.0 to 8.7.1 and got the same errors in the elastic-agent logs:

{"log.level":"error","@timestamp":"2023-05-09T10:27:56.141Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-09T10:27:56.142Z","message":"elastic-agent-client error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:37210->127.0.0.1:6789: i/o timeout\"","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":369,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}

The agent is also failing to send any messages to Elasticsearch.
Reverting the agent to the 8.7.0 image restores the connection.
Reproduced in two different Kubernetes clusters.


cmacknz commented May 9, 2023

Restarting the agent fixed this in one of the other reports of this problem we've had.

cmacknz added the Team:Elastic-Agent (Label for the Agent team) and 8.9-candidate labels on May 9, 2023

cmacknz commented May 10, 2023

We now have an internal support case for this problem where we've been able to obtain a stack trace of the agent while this is happening. So far it looks like a deadlock in the agent; we are still trying to determine exactly where.
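For anyone else debugging this: a goroutine dump is the artifact that locates a deadlock like this. Any Go process prints all goroutine stacks when it receives SIGQUIT (`kill -QUIT <pid>`), and the same dump can be produced programmatically with nothing but the standard library. A minimal sketch, not an agent API:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stack of every goroutine to stderr.
// In a dump like this, a deadlock shows up as goroutines parked
// indefinitely in states such as "semacquire" or "chan receive"
// that all point at the same locks or channels.
func dumpGoroutines() {
	// debug=2 prints one full stack trace per goroutine, including
	// its wait state, which is what reveals where things are stuck.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```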
