Unable to disable Kubernetes Watchers and Leader Election for Fleet Managed Agents #5558

Closed
btrieger opened this issue Sep 18, 2024 · 25 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@btrieger

For confirmed bugs, please report:

  • Version: 8.15.1
  • Operating System: Linux on Kubernetes
  • Steps to Reproduce:

I am attempting to deploy Elastic Agent on Kubernetes to run the threat intel integration or other API integrations, and I am receiving errors about missing Kubernetes permissions. Since I am not running the Kubernetes integration or monitoring Kubernetes itself, I shouldn't need access to the Kubernetes API server to watch nodes, namespaces, and pods. I also should not need to create a lease without the Kubernetes integration.

I am attempting to run fleet managed agents on kubernetes. I have deployed the following yamls:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-agent-k8s-test
  namespace: elastic
  labels:
    app: elastic-agent-k8s-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elastic-agent-k8s-test
  template:
    metadata:
      labels:
        app: elastic-agent-k8s-test
    spec:
      serviceAccountName: elastic-agent
      hostNetwork: false
      dnsPolicy: ClusterFirst
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: elastic-agent
          image: docker.elastic.co/beats/elastic-agent:8.15.1
          args: ["-c", "/etc/elastic-agent/agent.yml"]
          env:
            # Set to 1 for enrollment into Fleet server. If not set, Elastic Agent is run in standalone mode
            - name: FLEET_ENROLL
              value: "1"
            # Set to true to communicate with Fleet with either insecure HTTP or unverified HTTPS
            - name: FLEET_INSECURE
              value: "false"
            # Fleet Server URL to enroll the Elastic Agent into
            # FLEET_URL can be found in Kibana, go to Management > Fleet > Settings
            - name: FLEET_URL
              value: "https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443"
            # Elasticsearch API key used to enroll Elastic Agents in Fleet (https://www.elastic.co/guide/en/fleet/current/fleet-enrollment-tokens.html#fleet-enrollment-tokens)
            # If FLEET_ENROLLMENT_TOKEN is empty then KIBANA_HOST, KIBANA_FLEET_USERNAME, KIBANA_FLEET_PASSWORD are needed
            - name: FLEET_ENROLLMENT_TOKEN
              value: "CHANGEME"
            - name: FLEET_SERVER_POLICY_ID
              value: "8c813cf6-a816-4722-be51-7341a192ba2e"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: STATE_PATH
              value: "/usr/share/elastic-agent/state"
            # The following ELASTIC_NETINFO:false variable will disable the netinfo.enabled option of add-host-metadata processor. This will remove fields host.ip and host.mac.
            # For more info: https://www.elastic.co/guide/en/beats/metricbeat/current/add-host-metadata.html
            - name: ELASTIC_NETINFO
              value: "false"
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
          resources:
            limits:
              memory: 700Mi
            requests:
              cpu: 100m
              memory: 400Mi
          volumeMounts:
            - name: agent-data
              mountPath: /usr/share/elastic-agent/state
            - name: datastreams
              mountPath: /etc/elastic-agent/agent.yml
              subPath: agent.yml
      volumes:
      - name: datastreams
        configMap:
          defaultMode: 0640
          name: agent-node-datastreams
  volumeClaimTemplates:
  - metadata:
      name: agent-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: elastic
  labels:
    k8s-app: elastic-agent
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
    fleet.enabled: true
---

I have also tried with the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.enabled: false
    fleet.enabled: true

and:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes:
      add_resource_metadata:
        node.enabled: false
        namespace.enabled: false
    fleet.enabled: true

The first time the pod starts it fails with the following error:

Policy selected for enrollment:  8c813cf6-a816-4722-be51-7341a192ba2e
{"log.level":"info","@timestamp":"2024-09-18T15:56:40.779Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-09-18T15:56:41.857Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":524},"message":"1st enrollment attempt failed, retrying enrolling to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/ with exponential backoff (init 1s, max 10s)","ecs.version":"1.6.0"}
Error: fail to enroll: failed to store agent config: could not save enrollment information: could not backup /etc/elastic-agent/agent.yml: rename /etc/elastic-agent/agent.yml /etc/elastic-agent/agent.yml.2024-09-18T15-56-41.857.bak: permission denied

After it restarts the pod runs without lease election but repeatedly throws the below errors:

{"log.level":"error","@timestamp":"2024-09-18T15:58:13.293Z","message":"W0918 15:58:13.293281      85 reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-18T15:58:13.293Z","message":"E0918 15:58:13.293334      85 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0"}

I would expect the pod to disable leader election without first throwing an error and restarting, and I would expect to be able to disable these watchers.
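For readers trying to interpret the two log lines above: the watcher is asking for cluster-scoped read access that the elastic-agent service account does not have. The sketch below shows the kind of RBAC objects that would grant it, purely to illustrate what the error refers to. It is not what this report is asking for, since the goal here is to avoid needing these permissions at all. The name elastic-agent-metadata-read is hypothetical, and the pods and namespaces resources are an assumption based on the watchers mentioned in this thread.

```yaml
# Illustration only: the permissions the "cannot list resource nodes" error refers to.
# The object name below is hypothetical and does not come from the original manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent-metadata-read
rules:
  - apiGroups: [""]              # "" is the core API group named in the error
    resources: ["nodes", "namespaces", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent-metadata-read
subjects:
  - kind: ServiceAccount
    name: elastic-agent          # the service account from the manifests above
    namespace: elastic
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-agent-metadata-read
```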

@btrieger btrieger added the bug Something isn't working label Sep 18, 2024
@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 19, 2024

so I was able to reproduce the issue of the configmap on my end as well.

  1. In your case @btrieger, the first issue is that you run elastic-agent as 1000:1000 but the ConfigMap is mounted as 0:0 with mode 0640, so other groups (e.g. the one elastic-agent runs under) have no access to it at all. This is why you see the permission denied error.
  2. Even if the mode were 0644, during Fleet enrollment we first try to rotate the config file to a .bak by renaming it. However, renaming a mount point is not possible, and we would get a device or resource busy error.
  3. Even if neither of the above applied, the replaceWith invoked by ReplaceOnSuccessStore here has a value of application.DefaultAgentFleetConfig, so I don't see how a custom supplied config merges with the Fleet one. But maybe I am missing something?!

Thus to support this feature we should:

  1. Copy any config supplied from outside the agent into the state folder, since we have permission to write there, and work from that copy
  2. If one doesn't exist already, define a merge policy for combining the statically defined config with the one that comes from Fleet?!

I am also going to follow up on the list Nodes permissions 🙂

@btrieger
Author

Ah my apologies on the 0640. Does fsGroup not mount it as 0:1000 instead of 0:0 so that I would have access to it? I appear to be able to read it since it is disabling the leader election after the 1st restart.

In any case, I figure the solution is to find a way to add configs that can be passed down to an agent from Fleet, or to merge the ConfigMap with what Fleet provides.

@cmacknz
Member

cmacknz commented Sep 19, 2024

> Even if the above wasn't like that, the replaceWith invoked by ReplaceOnSuccessStore here is having a value of application.DefaultAgentFleetConfig thus I don't see how a custom supplied config merges with the fleet one, but maybe I am missing something?!

In #4166 a change was made so that, if the new config already contains the content of the default fleet config, we skip the replacement by rotation. This is not at all obvious unless you know this PR exists. The default fleet config only contains fleet.enabled: true, so having that should have been enough to get past that. Quoting the PR:

> Skipping replacing the current agent configuration with default fleet configuration upon enrollment, in case the current configuration already contains the configuration from the default fleet configuration.

@btrieger
Author

I added a volume mount so that the /etc/elastic-agent folder is owned by 0:1000 and I can write to it, and I can confirm that device or resource busy is the result. I also updated the ConfigMap mount mode to 0660 and confirmed elastic-agent is in the group, so it has read and write access.

@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 19, 2024

oh I see @cmacknz, it is the other way around from what I understood the diff to be! I validated that by adding fleet.enabled: true in my static config I get no rotation and thus I get no error. However, all my testing is with elastic-agent:8.16.0-SNAPSHOT, which does "extra things" in the agent-state path. @btrieger I can see that you already have fleet.enabled: true in your config, so could you send me the exact error you are seeing and maybe, in parallel, give 8.16.0-SNAPSHOT a go? 🙂

@btrieger
Author

Yeah, I can share the error. I read through the code and then updated my config to be:

fleet:
  enabled: true

instead of

fleet.enabled: true

and that made it skip the replace. Both are valid YAML, but the dotted form causes the diff to not work.

@btrieger
Author

Here is the error on 8.15.0:

Policy selected for enrollment:  8c813cf6-a816-4722-be51-7341a192ba2e
{"log.level":"info","@timestamp":"2024-09-19T17:20:28.581Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-09-19T17:20:29.591Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":524},"message":"1st enrollment attempt failed, retrying enrolling to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/ with exponential backoff (init 1s, max 10s)","ecs.version":"1.6.0"}
Error: fail to enroll: failed to store agent config: could not save enrollment information: could not backup /etc/elastic-agent/agent.yml: rename /etc/elastic-agent/agent.yml /etc/elastic-agent/agent.yml.2024-09-19T17-20-29.5911.bak: device or resource busy

and here is the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet.enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---

When I did:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet:
      enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---

It did not throw the error

@btrieger
Author

I can't currently test 8.16.0 as my Elastic Cloud cluster is 8.15.1 which is latest available.

@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 19, 2024

hmmm yep I think that by default gopkg.in/yaml.v3 (package used to calculate the diff) does not split keys by dots when unmarshalling YAML into a map[string]interface{}. So it makes sense

> I can't currently test 8.16.0 as my Elastic Cloud cluster is 8.15.1 which is latest available.

No need to since the solution to this issue is the one you already figured out 🙂
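To make the dotted-key point above concrete, here is a minimal sketch of the two spellings as a plain YAML parser sees them. Nothing below is specific to Elastic Agent. The comments describe standard YAML mapping semantics, and the conclusion about the enrollment diff is only what this thread reports.

```yaml
# Document 1: a single top-level key whose name is the literal string "fleet.enabled".
# A plain YAML parser produces {"fleet.enabled": true}; the dot is not a path separator.
fleet.enabled: true
---
# Document 2: a nested mapping. A plain YAML parser produces {"fleet": {"enabled": true}}.
# Per this thread, this is the form that matched the default fleet config and
# skipped the rename of agent.yml during enrollment.
fleet:
  enabled: true
```

The two documents are equivalent only to a loader that expands dotted keys itself, which is why a comparison done on the raw parsed maps can treat them as different.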

@btrieger
Author

I still unfortunately haven't been able to get the node watcher turned off either. Tried a bunch of different settings to disable it.

@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 19, 2024

@btrieger after looking at the code I found a really puzzling (to me) piece of validation code here which seems to always create watchers for pods and nodes by default. Now I understand that this might indeed be the desired default behaviour, but there should be a top-level enabled key in the kubernetes provider so that, when the user sets it to false, nothing gets enabled. @cmacknz thoughts?

Now @btrieger since you mentioned only the nodes permissions and not the pod ones 😄 the following config will help make the nodes list error go away, but we can't disable both that and the pods watcher at the same time

      providers:
        kubernetes_leaderelection:
          enabled: false
        kubernetes:
          resources:
            pod:
              enabled: true
            node:
              enabled: false

@btrieger
Author

btrieger commented Sep 20, 2024 via email

@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 20, 2024

> Ah so there is a bug where I can't disable both only one? It appeared it also needed get watch and list access for namespaces.

then let's try to go even more aggressive on disabling 🙂

      providers:
        kubernetes_leaderelection:
          enabled: false
        kubernetes:
          add_resource_metadata:
            node:
              enabled: false
            namespace:
              enabled: false
            deployment: false
            cronjob: false
          resources:
            pod:
              enabled: true
            node:
              enabled: false
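For anyone wanting to try the settings above in the reporter's setup, the sketch below places them in the ConfigMap from earlier in this thread, using the nested fleet key that was found to avoid the enrollment rename. It only combines fragments already posted here. As reported further down the thread, the node list errors still appeared afterwards because they were coming from the Beats processes, so treat this as a record of what was tried and not as a fix.

```yaml
# Sketch: the provider settings suggested above, placed in the reporter's ConfigMap.
# All names and values come from earlier comments in this thread.
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    # Nested form, so the enrollment diff matches the default fleet config
    fleet:
      enabled: true
    providers:
      kubernetes_leaderelection:
        enabled: false
      kubernetes:
        add_resource_metadata:
          node:
            enabled: false
          namespace:
            enabled: false
          deployment: false
          cronjob: false
        resources:
          pod:
            enabled: true
          node:
            enabled: false
```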

@btrieger
Author

btrieger commented Sep 20, 2024 via email

@pkoutsovasilis
Contributor

> I will try it again when I am back at my computer. So if I want to disable pods, namespaces, and nodes? Is that doable? Or is there a bug? Essentially want to just disable the kubernetes provider.

From what I am seeing currently no, it's not doable. But hey I missed the fleet.enabled: true above so 🤞 I missed something here as well!?!

However, I am thinking that even if we manage to disable them at the agent level, the add_kubernetes_metadata processor is enabled by default in filebeat and metricbeat when invoked by elastic-agent, and I am afraid that the same permission errors will surface from there. But I am getting ahead of myself 🙂 There is definitely something to discuss with the team here. Thank you for your understanding and patience.

@btrieger
Author

I got a chance to test it. It looks like I am still getting the errors, but it could be the add_kubernetes_metadata processor; I'm not sure:

{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"W0920 13:12:38.011880      53 reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"E0920 13:12:38.011927      53 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.021Z","message":"W0920 13:12:38.021120      28 reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"httpjson-default","type":"httpjson"},"log":{"source":"httpjson-default"},"ecs.version":"1.6.0"}

@pkoutsovasilis
Contributor

yep I think this is coming from metricbeat now ..."component":{"binary":"metricbeat"...

@cmacknz
Member

cmacknz commented Sep 25, 2024

Yes that is the add_kubernetes_metadata processor which currently can't be turned off. This requires #4670.

@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 26, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Sep 26, 2024

Since #4670 is complex, a faster way around it would be to expose the parts we need to configure via an env var. Something similar was done with the add_cloud_metadata processor.

@strawgate
Contributor

strawgate commented Sep 27, 2024

We have proposed that the customer set automountServiceAccountToken: false in the Kubernetes manifest. This appears to prevent the K8s metadata and providers from starting and is ideal when the customer does not want to monitor K8s with their agent pods (for example when they are running an S3/SQS workload).

See the last line in this partial snippet for the location of the addition:

---
# For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
    app: elastic-agent
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      # Tolerations are needed to run Elastic Agent on Kubernetes control-plane nodes.
      # Agents running on control-plane nodes collect metrics from the control plane components (scheduler, controller manager) of Kubernetes
      automountServiceAccountToken: false

@strawgate
Contributor

The above proposed solution solved the customer's problem, as removing the service account token prevents the provider from starting in the first place.
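The snippet above is for the DaemonSet from the reference manifest. For the StatefulSet in the original report, the same field would go in the pod template spec, roughly as in the partial sketch below. This is an untested adaptation of the reporter's manifest, trimmed to the relevant fields; the env, resources, volumes, and volumeClaimTemplates sections are assumed to stay as originally posted.

```yaml
# Partial sketch: the reporter's StatefulSet with the service account token mount disabled.
# Only the fields needed to show the placement are included.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-agent-k8s-test
  namespace: elastic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elastic-agent-k8s-test
  template:
    metadata:
      labels:
        app: elastic-agent-k8s-test
    spec:
      serviceAccountName: elastic-agent
      # No API token is mounted into the pod, so the Kubernetes providers and
      # metadata enrichment have nothing to authenticate with (per the comments above).
      automountServiceAccountToken: false
      containers:
        - name: elastic-agent
          image: docker.elastic.co/beats/elastic-agent:8.15.1
```

Kubernetes also accepts automountServiceAccountToken on the ServiceAccount object itself, which applies the setting to every pod using that account unless the pod spec overrides it.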

@ycombinator
Contributor

@pkoutsovasilis Will this issue be completely resolved with #5912 and #5593 or are you expecting to do some more work?

@pkoutsovasilis
Contributor

pkoutsovasilis commented Nov 12, 2024

hey @ycombinator 👋 So the solution to disable everything related to the kubernetes providers is, as @strawgate provided here, to set automountServiceAccountToken: false. Now these PRs #5912, #5593, and #5939 are the necessary Helm chart knobs so a user can achieve disabling all the kubernetes providers and leader election. With that said, I believe that this issue is indeed satisfied 🙂

@ycombinator
Contributor

Awesome, thanks @pkoutsovasilis! With those three PRs merged, I am now closing this issue. Should it arise again, we have an answer for customers via the Helm chart improvements you've made. Thanks!
