This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

k8s-autodiscover for elastic-agent fails #1992

Closed
mdelapenya opened this issue Jan 11, 2022 · 12 comments · Fixed by #2141
Labels: bug, Team:Elastic-Agent

Comments

@mdelapenya
Contributor

mdelapenya commented Jan 11, 2022

Steps to reproduce

Run this command from the root dir of the test framework:

TAGS="elastic-agent" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true ELASTIC_APM_ACTIVE=false PROVIDER=docker make -C e2e/_suites/kubernetes-autodiscover functional-test

Expected behaviour: the elastic-agent container is up and running
Current behaviour: the elastic-agent container cannot be found (see logs)

TRAC[2022-01-14T08:17:01+01:00] Binary is present                             binary=kubectl path=/usr/local/bin/kubectl
TRAC[2022-01-14T08:17:01+01:00] Executing command                             args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-1862958236/kubeconfig --namespace test-3e9e36c8-6ad3-4af8-bea9-81af9fe9afa9 cp --no-preserve test-3e9e36c8-6ad3-4af8-bea9-81af9fe9afa9/elastic-agent-j78bv:/tmp/beats-events-20220114.ndjson /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-686180914/events]" command=kubectl env="map[]"
ERRO[2022-01-14T08:17:01+01:00] Error executing command                       args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-1862958236/kubeconfig --namespace test-3e9e36c8-6ad3-4af8-bea9-81af9fe9afa9 cp --no-preserve test-3e9e36c8-6ad3-4af8-bea9-81af9fe9afa9/elastic-agent-j78bv:/tmp/beats-events-20220114.ndjson /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-686180914/events]" baseDir=. command=kubectl env="map[]" error="exit status 1" stderr="error: unable to upgrade connection: container not found (\"elastic-agent\")\n"
DEBU[2022-01-14T08:17:01+01:00] Failed to copy events from test-3e9e36c8-6ad3-4af8-bea9-81af9fe9afa9/elastic-agent-j78bv:/tmp/beats-events to /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-686180914/events:

CI log

[2022-01-11T06:16:59.877Z] TRAC[2022-01-11T06:16:59Z] Binary is present                             binary=kubectl path=/usr/local/bin/kubectl
[2022-01-11T06:16:59.877Z] TRAC[2022-01-11T06:16:59Z] Executing command                             args="[--kubeconfig /tmp/test-3897598673/kubeconfig --namespace test-5601550b-715d-441b-a10c-90cb80e2634d cp --no-preserve test-5601550b-715d-441b-a10c-90cb80e2634d/elastic-agent-b4hxq:/tmp/beats-events-20220111.ndjson /tmp/test-1437408640/events]" command=kubectl env="map[]"
[2022-01-11T06:16:59.878Z] ERRO[2022-01-11T06:16:59Z] Error executing command                       args="[--kubeconfig /tmp/test-3897598673/kubeconfig --namespace test-5601550b-715d-441b-a10c-90cb80e2634d cp --no-preserve test-5601550b-715d-441b-a10c-90cb80e2634d/elastic-agent-b4hxq:/tmp/beats-events-20220111.ndjson /tmp/test-1437408640/events]" baseDir=. command=kubectl env="map[]" error="exit status 1" stderr="error: unable to upgrade connection: container not found (\"elastic-agent\")\n"
[2022-01-11T06:16:59.878Z] DEBU[2022-01-11T06:16:59Z] Failed to copy events from test-5601550b-715d-441b-a10c-90cb80e2634d/elastic-agent-b4hxq:/tmp/beats-events to /tmp/test-1437408640/events:

First error build: https://beats-ci.elastic.co/job/e2e-tests/job/e2e-testing-k8s-autodiscovery-daily-mbp/job/main/10/ (6 days ago)
Last successful build: https://beats-ci.elastic.co/job/e2e-tests/job/e2e-testing-k8s-autodiscovery-daily-mbp/job/main/9/ (6 days ago)

It seems the elastic-agent container is not there

@mdelapenya added the bug and Team:Elastic-Agent labels on Jan 11, 2022
@ChrsMark
Member

@mdelapenya do we have a way to get logs from the failing Pod? Without them it's almost impossible to know why the Pod is failing: the Agent may be unable to enroll, it may be panicking, or it may be something else. The only way to troubleshoot this is to reproduce it locally by running the suite, which is quite time-consuming.

@jsoriano removed their assignment on Jan 11, 2022
@mdelapenya
Contributor Author

@ChrsMark we are going to work on re-enabling SSH access to the machines in #1997

@ChrsMark
Member

Thank you for the update, @mdelapenya. That will help a lot with the debugging efforts.

@mdelapenya
Contributor Author

BTW, it's still possible to reproduce this locally. I'm updating the steps to reproduce in the description.

@ChrsMark
Member

Running the suite locally with TAGS="elastic-agent" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true ELASTIC_APM_ACTIVE=false PROVIDER=docker make -C e2e/_suites/kubernetes-autodiscover functional-test

I see the following in the failing Agent's Pod:

Requesting service_token from Kibana.
Error: request to get security token from Kibana failed: fail to execute the HTTP POST request: Post "http://kibana:5601/api/fleet/service-tokens": dial tcp 0.0.0.0:5601: connect: connection refused
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.1/fleet-troubleshooting.html

This may be related to elastic/beats#29811 or its related issues.
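A "connection refused" error only says that nothing accepted the TCP connection at that moment. To tell "not up yet" apart from "never reachable", a small probe like the one below can help. This is a minimal sketch, not something the test framework is known to do: the function name `wait_for_port` and its parameters are hypothetical, and an open port does not by itself mean that Kibana is ready to serve Fleet requests.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for diagnosis only; it checks TCP reachability,
    not whether the HTTP API behind the port is healthy.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection resolves the host name and attempts a TCP connect.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Covers connection refused, DNS failures, and connect timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the error above, the call would be `wait_for_port("kibana", 5601)`, run from somewhere that shares the Pod's network and DNS view. If it never returns True, the service is probably missing or misnamed rather than merely slow to start.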

@mdelapenya
Contributor Author

Let me try to run the tests after the latest changes.

@mdelapenya
Contributor Author

mdelapenya commented Feb 10, 2022

Test logs

TRAC[2022-02-10T16:37:56+01:00] Executing command                             args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-2896719577/kubeconfig --namespace test-44d546a2-f7d9-4626-8252-38b38e6634ec cp --no-preserve test-44d546a2-f7d9-4626-8252-38b38e6634ec/elastic-agent-db97r:/tmp/beats-events-20220210.ndjson /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-4100491230/events]" command=kubectl env="map[]"
ERRO[2022-02-10T16:37:56+01:00] Error executing command                       args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-2896719577/kubeconfig --namespace test-44d546a2-f7d9-4626-8252-38b38e6634ec cp --no-preserve test-44d546a2-f7d9-4626-8252-38b38e6634ec/elastic-agent-db97r:/tmp/beats-events-20220210.ndjson /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-4100491230/events]" baseDir=. command=kubectl env="map[]" error="exit status 1" stderr="error: unable to upgrade connection: container not found (\"elastic-agent\")\n"
DEBU[2022-02-10T16:37:56+01:00] Failed to copy events from test-44d546a2-f7d9-4626-8252-38b38e6634ec/elastic-agent-db97r:/tmp/beats-events to /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-4100491230/events: 

State of the elastic-agent pod:

$ k --kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-2896719577/kubeconfig --namespace test-44d546a2-f7d9-4626-8252-38b38e6634ec get pods -l k8s-app=elastic-agent 
NAME                  READY   STATUS             RESTARTS   AGE
elastic-agent-db97r   0/1     CrashLoopBackOff   5          6m2s
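The table above is easy to scan by eye, but when the check is repeated across runs it can be scripted. The sketch below parses the default `kubectl get pods` table and flags pods that are not fully ready or not in the `Running` state. It assumes the four leading columns shown above (NAME, READY, STATUS, RESTARTS); the names `PodRow`, `parse_get_pods`, and `unhealthy` are hypothetical.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    name: str
    ready: str      # e.g. "0/1"
    status: str     # e.g. "CrashLoopBackOff"
    restarts: int


def parse_get_pods(table: str) -> List[PodRow]:
    """Parse the default `kubectl get pods` table into rows."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header line
        fields = line.split()
        # Only the first four columns are used; AGE and extras are ignored.
        rows.append(PodRow(fields[0], fields[1], fields[2], int(fields[3])))
    return rows


def unhealthy(rows: List[PodRow]) -> List[PodRow]:
    """Return pods that are not fully ready or not in the Running state."""
    bad = []
    for row in rows:
        ready_count, total = row.ready.split("/")
        if ready_count != total or row.status != "Running":
            bad.append(row)
    return bad
```

Note that this simple rule also flags pods that finished normally (status `Completed`), so it suits long-running workloads such as the agent DaemonSet better than one-off jobs.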

Pod logs

$ k --kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-2896719577/kubeconfig --namespace test-44d546a2-f7d9-4626-8252-38b38e6634ec logs -l k8s-app=elastic-agent 
github.com/elastic/beats/v7/libbeat/common/transport.DialerFunc.Dial(0x100000000000060, {0x55d22f37783d, 0x118}, {0xc000148cc0, 0x7fcc9f6ae108})
        github.com/elastic/beats/v7/libbeat/common/transport/transport.go:38 +0x33
net/http.(*Transport).customDialTLS(0xc00098fce0, {0x55d22fd126b8, 0xc00098fce0}, {0x55d22f37783d, 0x9d543741dbf85c55}, {0xc000148cc0, 0xfe31e01143f4f1b9})
        net/http/transport.go:1316 +0x6b
net/http.(*Transport).dialConn(0xc0003d3b80, {0x55d22fd126b8, 0xc00098fce0}, {{}, 0x0, {0xc00017a780, 0x5}, {0xc000148cc0, 0x1c}, 0x0})
        net/http/transport.go:1580 +0x3ff
net/http.(*Transport).dialConnFor(0xd, 0xc0006e2e70)
        net/http/transport.go:1446 +0xb0
created by net/http.(*Transport).queueForDial
        net/http/transport.go:1415 +0x3d7

@ChrsMark
Member

Thank you @mdelapenya, it looks like the Agent panics when it tries to open a connection to ES. Do you have more extensive output of the error, so that we can see the full stack trace?

@mdelapenya
Contributor Author

Unfortunately no. That is the entire output of the elastic-agent pod.
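One possible reason the trace stops short: when `kubectl logs` is given a label selector (`-l`), it shows only the last 10 lines by default, and the output above is exactly 10 lines long. Passing `--tail=-1` requests all lines, and `--previous` reads the logs of the previous, crashed container instance instead of the one currently restarting. The sketch below only builds the command line; the helper name `kubectl_logs_argv` is hypothetical, while the flags are standard kubectl flags.

```python
from typing import List, Optional


def kubectl_logs_argv(kubeconfig: str, namespace: str,
                      selector: Optional[str] = None,
                      pod: Optional[str] = None,
                      previous: bool = False,
                      tail: int = -1) -> List[str]:
    """Build a `kubectl logs` command line that asks for the full log.

    Exactly one of `selector` (label selector) or `pod` (pod name) is required.
    """
    if (selector is None) == (pod is None):
        raise ValueError("pass exactly one of selector or pod")
    argv = ["kubectl", "--kubeconfig", kubeconfig,
            "--namespace", namespace, "logs"]
    argv += ["-l", selector] if selector is not None else [pod]
    # --tail=-1 overrides the 10-line default that applies with a selector.
    argv.append(f"--tail={tail}")
    if previous:
        # Logs of the previous container instance, useful in CrashLoopBackOff.
        argv.append("--previous")
    return argv
```

For the pod in this thread, `kubectl_logs_argv(kubeconfig, namespace, pod="elastic-agent-db97r", previous=True)` yields a list that can be passed to `subprocess.run`. Whether the full log contains the start of the panic is not known from the thread.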

Pods

$ k --kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-2896719577/kubeconfig --namespace test-44d546a2-f7d9-4626-8252-38b38e6634ec get pods
NAME                             READY   STATUS    RESTARTS   AGE
a-pod                            1/1     Running   0          21m
elastic-agent-db97r              0/1     Error     9          22m
elasticsearch-575cb49b69-hbglx   1/1     Running   0          22m

This situation is easy to reproduce:

# sync code
git pull upstream main
# run tests
TAGS="elastic-agent" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true ELASTIC_APM_ACTIVE=false PROVIDER=docker make -C e2e/_suites/kubernetes-autodiscover functional-test

When you see a lot of retries, hit Ctrl+C to abort the execution. You can then access the kind cluster and its pods: read the test logs to find the kubeconfig file name, the namespace, the pod name, etc.
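Finding those values by eye in a long TRACE log is tedious, so here is a minimal sketch that pulls them out of an "Executing command" line. It assumes the `kubectl cp` argument layout seen in the logs in this thread (`--kubeconfig … --namespace … cp --no-preserve <namespace>/<pod>:<path>`); the function name `extract_kubectl_context` is hypothetical, and other log lines simply return `None`.

```python
import re
from typing import Dict, Optional

# Matches the args of the `kubectl cp` invocation as logged by the test run.
_ARGS_RE = re.compile(
    r"--kubeconfig (?P<kubeconfig>\S+) "
    r"--namespace (?P<namespace>\S+) "
    r"cp --no-preserve (?P=namespace)/(?P<pod>[^:\s]+):"
)


def extract_kubectl_context(log_line: str) -> Optional[Dict[str, str]]:
    """Return kubeconfig path, namespace, and pod name from a log line, if present."""
    match = _ARGS_RE.search(log_line)
    return match.groupdict() if match else None
```

The returned dictionary has the keys `kubeconfig`, `namespace`, and `pod`, which are the three values needed for the `get pods` and `logs` commands shown earlier in the thread.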

@ChrsMark
Member

This is weird, because now I'm only seeing the following:

(23) Failed writing body
/etc/writer.sh: 12: jq: not found
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13884  100 13884    0     0  1129k      0 --:--:-- --:--:-- --:--:-- 1129k
curl: (23) Failed writing body (0 != 13884)

It seems that the new images do not have jq installed, and when I exec into the container I cannot install it with apt-get either. We need to find an alternative here.
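The two errors above are probably one failure: curl exit code 23 is a write error, which is what curl reports when the program it is writing to (presumably jq, which was not found) is not there to read the body. A preflight check for required binaries makes this kind of failure explicit instead of surfacing as a confusing curl error. The sketch below is an assumption about how such a check could look, not what #2141 does; `missing_binaries` is a hypothetical name, and the check has to run in the same environment as the script that needs the tools.

```python
import shutil
from typing import Iterable, List


def missing_binaries(names: Iterable[str]) -> List[str]:
    """Return the subset of `names` that cannot be found on PATH."""
    # shutil.which returns None when no executable with that name is on PATH.
    return [name for name in names if shutil.which(name) is None]
```

A caller could run `missing_binaries(["curl", "jq"])` at startup and fail fast with a clear message when the result is not empty.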

@ChrsMark
Member

ChrsMark commented Feb 11, 2022

After solving the jq issue, the suite passed for me locally:


5 scenarios (5 passed)
17 steps (17 passed)
7m5.476183756s
testing: warning: no tests to run
PASS
ok  	github.com/elastic/e2e-testing/e2e/_suites/kubernetes-autodiscover	425.797s

Pushed a fix in #2141; let's see if that solves the issue.

@ChrsMark
Member

The fix went green, @mdelapenya.

3 participants