7.17.0 fleet-server looping error #5323

Closed · samimb opened this issue Feb 3, 2022 · 5 comments
Labels: >bug Something isn't working

samimb commented Feb 3, 2022

Bug Report

What did you do?

Started a new fleet-managed instance as described in the documentation:
https://www.elastic.co/guide/en/cloud-on-k8s/1.9/k8s-elastic-agent-fleet-quickstart.html

The quickstart points to this YAML:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-quickstart
  namespace: default
spec:
  version: 7.17.0
  kibanaRef:
    name: kibana-quickstart
  elasticsearchRefs:
  - name: elasticsearch-quickstart
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-quickstart
  namespace: default
spec:
  version: 7.17.0
  kibanaRef:
    name: kibana-quickstart
  fleetServerRef:
    name: fleet-server-quickstart
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-quickstart
  namespace: default
spec:
  version: 7.17.0
  count: 1
  elasticsearchRef:
    name: elasticsearch-quickstart
  config:
    xpack.fleet.agents.elasticsearch.host: "https://elasticsearch-quickstart-es-http.default.svc:9200"
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-quickstart-agent-http.default.svc:8220"]
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-quickstart
  namespace: default
spec:
  version: 7.17.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: default
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
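
For reference, applying and watching the rollout (assuming the manifest above is saved as fleet-quickstart.yaml; the file name is illustrative):

# Apply the quickstart manifest and watch the pods come up
kubectl apply -f fleet-quickstart.yaml
kubectl get pods -w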

What did you expect to see?

Running instances of Elasticsearch, Kibana, fleet-agent-server and fleet-agents.

What did you see instead? Under which circumstances?

Running instances of Elasticsearch and Kibana.
The fleet-server agent crashes at boot, and the elastic-agent pods do not start because the server never comes up.

fleet-server-agent-6bfb5cc49d-6gfft   0/1     Error               0          2s
elastic-agent-agent-22jkv             0/1     Error               0          2s
fleet-server-agent-74cf959954-99wxt   0/1     Error               0          2s
elastic-agent-agent-z6dqw             0/1     Error               1          2s
fleet-server-agent-6bfb5cc49d-6gfft   0/1     Error               1          4s
elastic-agent-agent-22jkv             0/1     Error               1          3s
fleet-server-agent-74cf959954-99wxt   0/1     Error               1          3s
elastic-agent-agent-z6dqw             0/1     CrashLoopBackOff    1          3s
fleet-server-agent-6bfb5cc49d-6gfft   0/1     CrashLoopBackOff    1          5s
fleet-server-agent-74cf959954-99wxt   0/1     CrashLoopBackOff    1          4s
elastic-agent-agent-22jkv             0/1     CrashLoopBackOff    1          5s
fleet-server-agent-74cf959954-99wxt   0/1     Error               2          15s
elastic-agent-agent-z6dqw             0/1     Error               2          16s
fleet-server-agent-6bfb5cc49d-6gfft   0/1     Error               2          17s
elastic-agent-agent-22jkv             0/1     Error               2          16s
elastic-agent-agent-w7w2b             0/1     Error               0          23s
elastic-agent-agent-w7w2b             0/1     Error               1          24s
elastic-agent-agent-w7w2b             0/1     CrashLoopBackOff    1          25s
elastic-agent-agent-qfcwb             0/1     Error               0          27s
elastic-agent-agent-qfcwb             0/1     Error               1          28s

The fleet-server agent logs point to a non-existent directory for the CA:

$ kubectl logs -n eck fleet-server-agent-6bfb5cc49d-6gfft
cp: cannot create regular file '/etc/pki/ca-trust/source/anchors/': No such file or directory
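
A quick way to confirm the directory layout in the 7.17.0 agent image is a throwaway debug pod (the pod name agent-ca-check is arbitrary):

# The old CentOS path should be missing and the Ubuntu path present
kubectl run agent-ca-check --rm -it --restart=Never \
  --image=docker.elastic.co/beats/elastic-agent:7.17.0 \
  --command -- ls -d /etc/pki/ca-trust/source/anchors /usr/local/share/ca-certificates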

Environment

Test environment of k3s nodes

$ kubectl get nodes
NAME         STATUS   ROLES                  AGE    VERSION
k3s-node-1   Ready    worker                 323d   v1.20.4+k3s1
k3s-master   Ready    control-plane,master   323d   v1.20.4+k3s1
k3s-node-2   Ready    worker                 323d   v1.20.4+k3s1
k3s-node-3   Ready    worker                 323d   v1.20.4+k3s1

ECK version: 1.9.1

thbkrkr commented Feb 3, 2022

This is a fairly recent bug (#5250) that also occurs with version 8.x and is fixed for the upcoming 2.1 release (#5268).

A workaround is to override the entrypoint of the agent containers to fix how the CA store is updated for the entire container: /etc/pki/ca-trust/source/anchors/ became /usr/local/share/ca-certificates and update-ca-trust became update-ca-certificates with the base image change from CentOS to Ubuntu introduced in version 7.17.

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
...
spec:
  ...
  deployment:
    ...
    podTemplate:
      spec:
        ...
        containers:
        - name: agent
          command:
          - bash
          - -c 
          - |
            #!/usr/bin/env bash
            set -e
            if [[ -f /mnt/elastic-internal/elasticsearch-association/<agent-ns>/<es-name>/certs/ca.crt ]]; then
              cp /mnt/elastic-internal/elasticsearch-association/<agent-ns>/<es-name>/certs/ca.crt /usr/local/share/ca-certificates
              update-ca-certificates
            fi
            /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
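
Here <agent-ns> and <es-name> are placeholders for the Agent's namespace and the referenced Elasticsearch cluster name (default and elasticsearch-quickstart in this quickstart, as shown in the full YAML below).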

Full YAML:
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-quickstart
  namespace: default
spec:
  version: 7.17.0
  kibanaRef:
    name: kibana-quickstart
  elasticsearchRefs:
  - name: elasticsearch-quickstart
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
        containers:
        - name: agent
          command:
          - bash
          - -c 
          - |
            #!/usr/bin/env bash
            set -e
            if [[ -f /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt ]]; then
              cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt /usr/local/share/ca-certificates
              update-ca-certificates
            fi
            /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-quickstart
  namespace: default
spec:
  version: 7.17.0
  kibanaRef:
    name: kibana-quickstart
  fleetServerRef:
    name: fleet-server-quickstart
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
        containers:
        - name: agent
          command:
          - bash
          - -c 
          - |
            #!/usr/bin/env bash
            set -e
            if [[ -f /mnt/elastic-internal/elasticsearch-association/agent-ns/elasticsearch/certs/ca.crt ]]; then
              cp /mnt/elastic-internal/elasticsearch-association/agent-ns/elasticsearch/certs/ca.crt /usr/local/share/ca-certificates
              update-ca-certificates
            fi
            /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-quickstart
  namespace: default
spec:
  version: 7.17.0
  count: 1
  elasticsearchRef:
    name: elasticsearch-quickstart
  config:
    xpack.fleet.agents.elasticsearch.host: "https://elasticsearch-quickstart-es-http.default.svc:9200"
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-quickstart-agent-http.default.svc:8220"]
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-quickstart
  namespace: default
spec:
  version: 7.17.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: default
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
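
After applying the patched manifest, the agent pods should stop crash-looping. A quick check (the label selector is assumed from ECK's common labels):

kubectl get pods -l common.k8s.elastic.co/type=agent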

I'm sorry for the inconvenience.

thbkrkr closed this as completed Feb 3, 2022

samimb commented Feb 3, 2022

This is a fairly recent bug (#5250) which is fixed for the upcoming 2.1 release (#5268).

A workaround is to override the entrypoint of the agent containers to copy the CA file in the right directory.


Doesn't this conditional only check whether the CA is there? It seems that in my image, the directory that should hold the actual CA is not present. It is easy enough to take what you provided and add a mkdir there as well; I'm just wondering if it really is the same problem?

Also, thank you for the response :)


thbkrkr commented Feb 3, 2022

Doesn't this conditional only check whether the CA is there?

No. If a ca.crt is present (it is optional, e.g. when the ES cert is signed by a well-known certificate authority), it is copied to /usr/local/share/ca-certificates and update-ca-certificates is executed.

Starting with 7.17.0, the base image changed from CentOS to Ubuntu, so the path /etc/pki/ca-trust/source/anchors/ used by the operator is no longer correct; it is now /usr/local/share/ca-certificates, and the command to update the CA store for the entire container is update-ca-certificates instead of update-ca-trust. So the fix is not to create the missing directory but to copy the CA into the right directory and run the right command.
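
For illustration, the two flows side by side (a sketch of the paths and commands described above, not the operator's actual code):

# CentOS-based agent images (before 7.17):
cp ca.crt /etc/pki/ca-trust/source/anchors/ && update-ca-trust

# Ubuntu-based agent images (7.17 and later):
cp ca.crt /usr/local/share/ca-certificates/ && update-ca-certificates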

markniemeijer commented

Hi, we upgraded the operator to 2.0.0 but we have exactly the same error. It seems the bug is not fixed in 2.0.0?


thbkrkr commented Mar 7, 2022

Hi, we upgraded the operator to 2.0.0 but we have exactly the same error. It seems the bug is not fixed in 2.0.0?

Yes, you are right. This is unfortunately a known issue in ECK 2.0 (see https://www.elastic.co/guide/en/cloud-on-k8s/master/release-highlights-2.0.0.html#k8s-200-known-issues). The fix did not make it into the release by mistake, so you still need to use the workaround.

I've updated the comment above to say it is fixed for the upcoming 2.1 release.
