Request 400 error during enrollment #2096
Adding the logs of the elastic-agent right after the install command failed because of a 400 error during enrollment
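For context, this is roughly the shape of the enrollment step these tests run; a sketch only, with host, port, and token as placeholders rather than values captured from this failure:
# hypothetical enrollment command; URL and token are placeholders
sudo elastic-agent install -f --insecure \
  --url=http://FLEET_SERVER_HOST:8220 \
  --enrollment-token=ENROLLMENT_TOKEN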
@mdelapenya @adam-stokes could you also provide the metricbeat logs, as I found this:
I've SSH'ed into the stack machine and checked the fleet-server logs:
The only error I can see there is:
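For anyone retracing this, those logs can be tailed directly on the stack VM; the container name below matches the docker ps output shown later in this thread:
# tail the last 100 lines of the fleet-server container logs
docker logs --tail 100 fleet_fleet-server_1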
@jlind23 here is the metricbeat log from the agent VM (CentOS 8):
I can see this:
which points to metricbeat not receiving the right elasticsearch host configuration. But I'd have expected to see all data sent through the fleet-server instead. Am I right?
One thing I see browsing Fleet's UI is that fleet-server is offline. Even though the fleet-server log says the agent enrolled (see #2096 (comment)):
the fleet-server eventually fails with a 400 error:
I'm going to bootstrap the fleet-server manually.
@adam-stokes said that it is failing one time out of three. Is there anything different on your end?
I've just verified that the fleet-server container needs the FLEET_SERVER_POLICY_ID env var set to the default policy for fleet-server. I was able to see the fleet-server as enrolled after manually running:
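The exact command was not captured here, but a sketch of passing that variable to the container could look like this; image tag, service token, and policy id are placeholders:
# hypothetical: pin the fleet-server policy via the container environment
docker run -d --name fleet-server -p 8220:8220 \
  -e FLEET_SERVER_ENABLE=1 \
  -e FLEET_SERVER_ELASTICSEARCH_HOST=http://elasticsearch:9200 \
  -e FLEET_SERVER_SERVICE_TOKEN=SERVICE_TOKEN \
  -e FLEET_SERVER_POLICY_ID=POLICY_ID \
  docker.elastic.co/beats/elastic-agent:8.1.0-SNAPSHOT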
I've submitted this commit to pass the policy to the fleet-server: be2e305
BTW, the variable is a little bit hidden in the agent command: https://github.com/elastic/beats/blob/237937085a5a7337ba06f1268cfc55cd4b869e31/x-pack/elastic-agent/pkg/agent/cmd/container.go#L98 I wish we had docs for these env vars. I had to translate the inline help shown in this screen into the env vars above:
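One way to surface that inline help without reading the source, assuming the image lets you override its default command (an untested sketch):
# print the container subcommand help, which lists the supported env vars
docker run --rm docker.elastic.co/beats/elastic-agent:8.1.0-SNAPSHOT \
  elastic-agent container --help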
With this commit (be2e305) I'm seeing fleet-server online!
Oh no! Fleet-server moved to offline after a while! Logs:
@jlind23 I'm concerned that the check-in API is not responsive after a while:
Can anybody from the team help us here?
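As a quick responsiveness probe, fleet-server exposes a status endpoint that can be polled from the stack VM; scheme and port here are assumed from the fleet.yml pasted later in this thread:
# poll fleet-server's status endpoint
curl -s http://localhost:8220/api/status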
@michel-laterman would you be able to chime in?
I'd say that all, or almost all, of the current errors in #2064 are caused by this 400 error:
In which scenarios could fleet-server end up in the offline status?
can we see the fleet.yml?
I've just added your public SSH key from GitHub to the VMs, so you can now SSH into the machines. I'll paste the info here:
The build URL for this set of machines is https://beats-ci.elastic.co/blue/organizations/jenkins/e2e-tests%2Fe2e-testing-mbp/detail/PR-2064/43/pipeline/563/

Accessing the stack VM
This VM contains a docker compose setup starting elasticsearch, kibana and fleet-server.
# ssh into the stack machine
ssh -i ~/.ssh/YOUR_SSH_PRIVATE_KEY -vvvv [email protected]
# change to use root user
sudo su -
# check running containers
root@ip-172-31-38-15:~# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
46fd81d5f9c5 docker.elastic.co/beats/elastic-agent:8.1.0-aa69d697-SNAPSHOT "/usr/bin/tini -- /u…" 35 minutes ago Up 35 minutes 0.0.0.0:8220->8220/tcp, :::8220->8220/tcp fleet_fleet-server_1
ea95dba0d881 docker.elastic.co/kibana/kibana:8.1.0-aa69d697-SNAPSHOT "/bin/tini -- /usr/l…" 36 minutes ago Up 36 minutes (healthy) 0.0.0.0:5601->5601/tcp, :::5601->5601/tcp fleet_kibana_1
6f105bf7a08e docker.elastic.co/elasticsearch/elasticsearch:8.1.0-aa69d697-SNAPSHOT "/bin/tini -- /usr/l…" 37 minutes ago Up 36 minutes (healthy) 0.0.0.0:9200->9200/tcp, :::9200->9200/tcp, 9300/tcp fleet_elasticsearch_1

Accessing any of the nodes
# ssh into the Debian machine
ssh -i ~/.ssh/YOUR_SSH_PRIVATE_KEY -vvvv [email protected]
# change to use root user
sudo su -

Fleet Server logs
http://3.15.172.147:5601/app/fleet/agents/f75095a8-11ff-4e58-a06a-2e2527b8f005/logs

fleet.yml for Fleet Server
SSH into the stack machine, as described above:
ssh -i ~/.ssh/YOUR_SSH_PRIVATE_KEY -vvvv [email protected]
Enter the fleet-server container:
docker exec -ti fleet_fleet-server_1 bash
See the fleet.yml contents:
cat /usr/share/elastic-agent/state/fleet.yml
agent:
id: f75095a8-11ff-4e58-a06a-2e2527b8f005
monitoring.http:
enabled: false
host: ""
port: 6791
fleet:
enabled: true
access_api_key: LVV4NjAzNEJSSTdIc0RZaEtzaEo6NEs3WXdhSWFUUi1Zc2FKdnRGd05Jdw==
protocol: http
host: 0.0.0.0:8220
ssl:
verification_mode: none
renegotiation: never
timeout: 10m0s
proxy_disable: true
reporting:
threshold: 10000
check_frequency_sec: 30
agent:
id: ""
server:
policy:
id: 499b5aa7-d214-5b5d-838b-3cd76469844e
output:
elasticsearch:
protocol: http
hosts:
- elasticsearch:9200
service_token: 8/LoAxYzOEJQcTJYZFJWLW5TWHRoVG1WWUJBAAAAAAAAAAAA
proxy_disable: false
proxy_headers: {}
host: 0.0.0.0
port: 8220
internal_port: 8221
I've looked in the fleet-server logs for the instance and I can immediately see that the service_token is being rejected (as expired):
Through the dev console:
For some reason the service_token that fleet-server uses expired (or was marked as expired).
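For reference, one way to reproduce that dev-console check with curl is the service accounts API; the credentials below are placeholders:
# list the service tokens known for the elastic/fleet-server service account
curl -u elastic:PASSWORD \
  "http://localhost:9200/_security/service/elastic/fleet-server/credential"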
That makes sense, Michel. As the scenarios are slow (simulating software flows usually is), it could be that the default 1200-second expiration timeout is being reached. Is there a way to create the tokens with a longer expiration time?
@adam-stokes found: https://www.elastic.co/guide/en/elasticsearch/reference/current/service-accounts.html which states:
And so far my searching of the fleet-server repo does not show any way it can delete a token. |
https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-get-token.html

And it's possible to set the timeout up to its 60-minute maximum: https://www.elastic.co/guide/en/elasticsearch/reference/current/security-settings.html#token-service-settings

We have pushed a commit to #2064 (fc29242).
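A sketch of what applying that setting could look like inside the stock elasticsearch docker image; the config path is assumed, the node needs a restart, and the longer timeout only affects tokens created afterwards:
# raise the token service timeout to its 60-minute maximum
echo 'xpack.security.authc.token.timeout: 60m' \
  >> /usr/share/elasticsearch/config/elasticsearch.yml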
On the other hand, where did you find these logs? I was not able to find any occurrence in the Logs UI or in the container log.
@michel-laterman thanks for your superpowers finding the expiration issue; we were able to move forward and merged #2064: it fixed the 400 errors and almost all the tests. There are only 1-2 failures left that need our investigation. Will close this issue as solved. Thanks again!
Thanks @michel-laterman and @mdelapenya. I've seen issues on master, but I think they should be resolved with a new 8.2.0 snapshot build.
The first 2 enrollments succeed; however, once a third and subsequent enrollments happen, a client-side 400 error is thrown: