
Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes) #1909

Closed
mac-abdon opened this issue Feb 18, 2022 · 30 comments

Comments

@mac-abdon

Describe the bug

We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments and we seem to have hit a new runtime error. We're now seeing:

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
* Running falco-driver-loader with: driver=module, compile=yes, download=yes
* Unloading falco module, if present
* Looking for a falco module locally (kernel 5.4.149-73.259.amzn2.x86_64)
* Trying to download a prebuilt falco module from https://download.falco.org/driver/319368f1ad778691164d33d59945e00c5752cd27/falco_amazonlinux2_5.4.149-73.259.amzn2.x86_64_1.ko
* Download succeeded
* Success: falco module found and inserted
Rules match ignored syscall: warning (ignored-evttype):
         loaded rules match the following events: access,brk,close,cpu_hotplug,drop,epoll_wait,eventfd,fcntl,fstat,fstat64,futex,getcwd,getdents,getdents64,getegid,geteuid,getgid,getpeername,getresgid,getresuid,getrlimit,getsockname,getsockopt,getuid,infra,k8s,llseek,lseek,lstat,lstat64,mesos,mmap,mmap2,mprotect,munmap,nanosleep,notification,page_fault,poll,ppoll,pread,preadv,procinfo,pwrite,pwritev,read,readv,recv,recvmmsg,select,semctl,semget,semop,send,sendfile,sendmmsg,setrlimit,shutdown,signaldeliver,splice,stat,stat64,switch,sysdigevent,timerfd_create,write,writev;
         but these events are not returned unless running falco with -A
2022-02-17T22:44:13+0000: Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

We downgraded to falco:0.30.0 which does not have the runtime error.

How to reproduce it

Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.

Expected behaviour

No runtime error

Screenshots

Environment

  • Falco version: 0.31.0
  • System info:
  • Cloud provider or hardware configuration: EKS v1.21.2 / ec2 instance size - r5dn.4xlarge
  • OS: Amazon Linux 2
  • Kernel: 5.4.149-73.259.amzn2.x86_64
  • Installation method: Kubernetes

Additional context

@Diliz

Diliz commented May 4, 2022

I actually got the same issue even with fewer nodes (25/30):
Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

Environment:

Falco version: 0.31.1
Openshift: 4.8.35

@vnandha

vnandha commented May 12, 2022

I am seeing a similar issue:

2022-05-12T01:16:36+0000: Runtime error: SSL Socket handler (k8s_namespace_handler_state): Connection closed.. Exiting.
  • Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
  • Running falco-driver-loader with: driver=bpf, compile=yes, download=yes

@jasondellaluce
Contributor

Did you try running this with Falco's --k8s-node option?

@Diliz

Diliz commented Jun 14, 2022

Did you try running this with Falco's --k8s-node option?

Yep, I already tried with and without the --k8s-node option. Usually the falco service crashes on the first event fetch: it launches, waits 1 minute, then crashes.

EDIT: Still happening in 0.32.0

@jimbobby5

I too am experiencing this.

@Diliz

Diliz commented Jul 22, 2022

Hello there! This was due to the falco operator, which does not set the node value correctly: the --k8s-node option is set, but the nodes were not fetched correctly by the operator...

EDIT: Since posting this message, I have switched to the official helm chart (it can be found here: https://github.com/falcosecurity/charts )

@IanRobertson-wpe
Contributor

Just landed here after seeing this in my environment as well, with 0.32.1.

@epcim

epcim commented Jul 26, 2022

Are there any instructions on how to debug this?

version 0.32.1, custom-crafted manifests

  • OK, on physical hw, custom k8s deployment
  • FAILS on Google cloud with BPF enabled as below
  "system_info": {
    "machine": "x86_64",
    "nodename": "falco-7xnlh",
    "release": "5.4.170+",
    "sysname": "Linux",
    "version": "#1 SMP Sat Apr 2 10:06:05 PDT 2022"
  },
  "version": "0.32.1"
❯ k logs -n monitoring falco-x58wb -f -c falco --tail 300 -p
* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* Skipping compilation, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* eBPF probe located in /root/.falco/falco_cos_5.4.170+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Mon Jul 25 15:49:17 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.
    spec:
      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token
        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk
        env:
        - name: FALCO_BPF_PROBE
          value: ""
        - name: FALCO_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

When changed to

        - -k
        - http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT)

Then it works without problems. I do have a custom image. Obviously there is some issue with /var/run/secrets/kubernetes.io/serviceaccount/token.

  -K, --k8s-api-cert (<bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>])
                                Use the provided files names to authenticate user and (optionally) verify the K8S API server identity. Each 
                                entry must specify full (absolute, or relative to the current directory) path to the respective file. 
                                Private key password is optional (needed only if key is password protected). CA certificate is optional. 
                                For all files, only PEM file format is supported. Specifying CA certificate only is obsoleted - when single 
                                entry is provided for this option, it will be interpreted as the name of a file containing bearer token. 
                                Note that the format of this command-line option prohibits use of files whose names contain ':' or '#' 
                                characters in the file name.

Is there any way to get more debug info from the K8s auth?

@jasondellaluce
Contributor

Hi @epcim, since Falco 0.32.1 you can get more debug info by adding the following args to Falco:

-o libs_logger.enabled=true -o libs_logger.severity=trace
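
On a DaemonSet-based install, these two options can simply be appended to the existing container args. A minimal sketch using a JSON patch follows (the DaemonSet name falco and the monitoring namespace are assumptions based on the manifests above):

# Append the two -o options to the Falco container args (container index 0 assumed)
kubectl -n monitoring patch daemonset falco --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-o"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "libs_logger.enabled=true"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-o"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "libs_logger.severity=trace"}
]'

The pods restart with trace-level libs logging, which is what produces the [libs]: lines shown in the comments below.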

@epcim

epcim commented Jul 26, 2022

@jasondellaluce ok, so we know it has k8s connectivity and it reads some resources

FYI:

  • I realised that with 0.32.1 the falco pod memory limit needs to be increased from 512MB to 1Gi (due to an OOM kill)
  • The cluster I run Falco on has 25 nodes and 850 pods (some nodes have only around 30 pods), so it is not small.
  • It reads and adds ([libs]: K8s [ADDED, ...) the kinds Pod, Namespace, Service, ReplicaSet
  • Then it fails on daemonsets?

Output from container:

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* Skipping compilation, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* eBPF probe located in /root/.falco/falco_cos_5.4.170+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Tue Jul 26 10:52:14 2022: [libs]: starting live capture
Tue Jul 26 10:52:15 2022: [libs]: cri: CRI runtime: containerd 1.4.8
Tue Jul 26 10:52:15 2022: [libs]: docker_async: Creating docker async source
Tue Jul 26 10:52:15 2022: [libs]: docker_async (6a20bc616b5a): No existing container info
Tue Jul 26 10:52:15 2022: [libs]: docker_async (6a20bc616b5a): Looking up info for container via socket /var/run/docker.sock
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): Fetching url
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): http_code=200
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): returning RESP_OK
...
...
...
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-786f859c79, c17e3f24-d389-4ddc-8c24-7531f9cb2682]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-7c46688b79, ca481853-3f88-4179-a90c-96f84256f5cb]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-854cc8b9cf, 9d75ddc1-faff-41c0-925e-0f0fbd1d9919]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-87d55bd54, 73149029-2d60-460d-af34-b5d7b6e47a37]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-f78486b56, 0d185513-64ef-40fe-baca-ef13ab283536]
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data(), checking connection to https://10.127.0.1
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data(), connected to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) check_enabled() enabling socket in collector
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data() [https://10.127.0.1], requesting data from /apis/apps/v1/daemonsets?pretty=false... m_blocking_socket=1, m_watching=0
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) sending request to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) socket=169, m_ssl_connection=61592496
Tue Jul 26 10:42:49 2022: [libs]: GET /apis/apps/v1/daemonsets?pretty=false HTTP/1.1
User-Agent: falcosecurity-libs
Host: 10.127.0.1:443
Accept: */*
Authorization: Bearer eyJhbGciOiJSUzI1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) Retrieving all data in blocking mode ...
Tue Jul 26 10:42:49 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.
Tue Jul 26 10:42:49 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.
Tue Jul 26 10:42:49 2022: [libs]: docker_async: Source destructor
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_deployment_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/deployments?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_replicaset_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/replicasets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_service_handler_state) closing connection to https://10.127.0.1/api/v1/services?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_replicationcontroller_handler_state) closing connection to https://10.127.0.1/api/v1/replicationcontrollers?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_pod_handler_state) closing connection to https://10.127.0.1/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded,spec.nodeName=gke-gc01-int-ves-io-gc01-int-ves-io-p-cf25a10e-w3qi&pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_namespace_handler_state) closing connection to https://10.127.0.1/api/v1/namespaces?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_node_handler_state) closing connection to https://10.127.0.1/api/v1/nodes?pretty=false

It should not be due to permissions, but for clarity, here is the ClusterRole:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: {{ default "falco" .falco_app_name }}-read
  namespace: {{ if .falco_namespace }}{{ .falco_namespace }}{{ else }}monitoring{{ end }}
  labels:
    app: falco
    component: falco
    role: security
rules:
  - apiGroups:
      - extensions
      - ""
    resources:
      - nodes
      - namespaces
      - pods
      - replicationcontrollers
      - replicasets
      - services
      - daemonsets
      - deployments
      - events
      - configmaps
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /healthz
      - /healthz/*
    verbs:
      - get
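
A quick way to double-check that it is not an RBAC problem (a sketch; the ServiceAccount name falco in the monitoring namespace is an assumption):

# Ask the API server whether the Falco ServiceAccount may list the resources Falco queries
kubectl auth can-i list daemonsets.apps --as=system:serviceaccount:monitoring:falco
kubectl auth can-i list replicasets.apps --as=system:serviceaccount:monitoring:falco
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:falco

All three should print "yes" with the ClusterRole above bound to that ServiceAccount.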


Here is the order in which things happen:

❯ k logs -n monitoring $POD -c falco -p | egrep '(ready: |Error )'
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_node_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_node_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_namespace_handler_state) dependency (k8s_node_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_namespace_handler_state) dependency (k8s_node_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_pod_handler_state) dependency (k8s_namespace_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_pod_handler_state) dependency (k8s_namespace_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_replicationcontroller_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_replicationcontroller_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_service_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:12 2022: [libs]: k8s_handler (k8s_service_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:12 2022: [libs]: k8s_handler (k8s_replicaset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:48 2022: [libs]: k8s_handler (k8s_replicaset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:49 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.

@epcim

epcim commented Jul 26, 2022

Well, the query to the k8s daemonsets endpoint works, so IMO there must be some issue with the processing in falco.

Steps to reproduce

k exec -ti -n monitoring falco-zzfmk -c falco -- sh

TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"; 
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT  https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/apps/v1/daemonsets?pretty=false | head


{"kind":"DaemonSetList","apiVersion":"apps/v1","metadata":{"resourceVersion":"427824185"},"items":[{"metadata":{"name":"gke-metadata-server","namespace":"kube-system","uid":"452584fa-614e-40bd-8477-3b0781ce9dfc","resourceVersion":"417631995","generation":10,"creationTimestamp":"2020-12-16T10:33:18Z","labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"gke-metadata-server"},
...
...

btw, I claimed it works with http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT), but that was not true, as it must always be https. The only difference is that falco was not crashing when http:// was used (which I would either call a bug or do not fully understand - the k8s annotation must be failing entirely, yet nothing is written to the logs at the "info" level).

KUBERNETES_SERVICE_HOST=10.127.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443

as it basically freezes here on the http query:

<docker collector works here..>
...
...
Tue Jul 26 11:57:45 2022: [libs]: k8s_handler (k8s_api_handler_state)::collect_data() [http://10.127.0.1:443], requesting data from /api?pretty=false... m_blocking_socket=1, m_watching=0
Tue Jul 26 11:57:45 2022: [libs]: k8s_handler (k8s_api_handler_state) sending request to http://10.127.0.1:443/api?pretty=false
Tue Jul 26 11:57:45 2022: [libs]: Socket handler (k8s_api_handler_state) socket=153, m_ssl_connection=0
Tue Jul 26 11:57:45 2022: [libs]: GET /api?pretty=false HTTP/1.1
User-Agent: falcosecurity-libs
Host: 10.127.0.1:443
Accept: */*

Tue Jul 26 11:57:45 2022: [libs]: Socket handler (k8s_api_handler_state) Retrieving all data in blocking mode ...

Additionally, I thought that with this in place falco would only query metadata for its own node:

        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)

but instead it reads everything everywhere, i.e. /apis/apps/v1/daemonsets?pretty=false
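
For comparison, here is a sketch (reusing the TOKEN and CACERT variables from the reproduction steps above): only the pods query is restricted to the node via a fieldSelector, mirroring what the logs show, while resources like replicasets and daemonsets are fetched cluster-wide.

# Node-scoped query (what --k8s-node influences); FALCO_K8S_NODE_NAME comes from the pod env
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST/api/v1/pods?fieldSelector=spec.nodeName=$FALCO_K8S_NODE_NAME&pretty=false" | head
# Cluster-wide query (not node-filtered)
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST/apis/apps/v1/replicasets?pretty=false" | head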

@jefimm

jefimm commented Aug 13, 2022

I was able to solve this issue by cleaning up the old replicasets (had about 5k of these)
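
If you want to try the same cleanup, something along these lines should work as a sketch: it lists ReplicaSets whose desired and current replica counts are both zero (what old Deployment rollouts typically leave behind) and deletes them. Review the list before actually deleting anything.

# Columns of "kubectl get rs --all-namespaces --no-headers": NAMESPACE NAME DESIRED CURRENT READY AGE
kubectl get rs --all-namespaces --no-headers \
  | awk '$3 == 0 && $4 == 0 {print $1, $2}' \
  | while read ns name; do kubectl -n "$ns" delete rs "$name"; done

Lowering revisionHistoryLimit on the Deployments should keep old ReplicaSets from accumulating again.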

@jjettenCamunda

We have the same issue with the replicasets failing; we currently have >6k replicasets, and I am not sure deleting them is practical for us. By installing different versions of Falco I have narrowed it down to the following:

Works: Falco 0.32.0 Helm chart version 1.19.4
Fails: Falco 0.32.1 Helm chart version 2.0.0

@ranjithmr

ranjithmr commented Aug 24, 2022

We have the same issue with 0.32.1
Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting

@yyvess

yyvess commented Aug 26, 2022

Same issue with 0.32.2 on an Azure cluster with 20 nodes.

@epcim

epcim commented Sep 6, 2022

Same here: as of 5.9.2022 I have the same experience with both :latest and 0.32.2 - GCP cluster, 25 nodes.

The workaround worked: the number of deleted old ReplicaSets was 2600, and then all was fine. Deleting them manually in other environments is not an option!

I am using the --k8s-node filter option, but I suspect falco does not honor it when reading these replicasets; see:

# parsing these lines.. [libs]: K8s [ADDED, ReplicaSet, ...........
❯ k logs -n monitoring falco-bk4t5 -c falco -p |grep ReplicaSet | wc -l
2071

Some other pods managed to read only 506 ReplicaSets before failing.

The only error it throws, and does not recover from, is:

Mon Sep  5 12:57:15 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.
Mon Sep  5 12:57:15 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.

Again the setup:

    spec:
      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token
        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk
        - -o
        - libs_logger.enabled=true
        - -o
        - libs_logger.severity=info
        env:
        - name: FALCO_BPF_PROBE
        - name: FALCO_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

@jasondellaluce could this get some attention? This is a real blocker. Even if it works now on older, even small, clusters, this will break any falco deployment.

@mac-abdon could you please remove "400+ nodes" from the title and mention it somewhere else?

@jasondellaluce
Contributor

@jasondellaluce could this get some attention? This is a real blocker. Even if it works now on older, even small, clusters, this will break any falco deployment.

@epcim this is high priority in the project's roadmap. We're still in the process of figuring out the optimal way to mitigate this.

@IanRobertson-wpe
Contributor

I discussed this issue on the Falco Community Call today, so I'm sharing some of the information from that call for others who may be impacted.

As a workaround, you can consider removing the "-k " command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I'm unsure). Without the lowercase "-k" switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/):
k8s.rc.*
k8s.svc.*
k8s.rs.*
k8s.deployment.*

Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed.
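
A quick check could look like this sketch (the rule file paths are the defaults that appear later in this thread and may differ in your deployment):

# List rule lines that reference fields only available through the cluster API
grep -rnE 'k8s\.(rc|svc|rs|deployment)\.' /etc/falco/falco_rules.yaml /etc/falco/rules.d/ 2>/dev/null

If this prints nothing, your rules don't rely on cluster-API metadata and dropping -k should be safe.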

@leogr
Member

leogr commented Sep 8, 2022

Hey @epcim

The workaround worked. The number of delete old ReplicaSets was 2600. Then all fine. Deleting them manually on other environments is not an option!

What was the exact status of those ReplicaSets (e.g., availableReplicas, fullyLabeledReplicas, readyReplicas, replicas, etc.)?

I guess the metadata of those ReplicaSets was not useful for Falco, so I'm trying to figure out whether we can use a fieldSelector in the query to filter out unneeded resources.

Btw,

I am using --k8s-node filter option, but I suspect falco does not reflect that when reading these replicasets.. see

The --k8s-node filter option works only for Pods, since other resources are not bound to a node. So it can't help for ReplicaSets or DaemonSets.

@EigoOda

EigoOda commented Sep 8, 2022

As a workaround, you can consider removing the "-k " command-line option.

This workaround worked for us. Thanks!

falco: v0.32.2
Kubernetes(EKS): v1.21.14

@leogr
Member

leogr commented Sep 8, 2022

Hey @IanRobertson-wpe

Just a note: -K (uppercase) is used only if -k (lowercase) is present, so you can remove both -k and -K.

Btw, as a clarification: the Kubelet metadata is annotated on the container labels; Falco fetches the metadata directly from the container runtime, and no connection to the Kubelet is needed.

@poiana
Contributor

poiana commented Dec 22, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Contributor

poiana commented Jan 21, 2023

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@poiana
Contributor

poiana commented Feb 20, 2023

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana
Contributor

poiana commented Feb 20, 2023

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@VF-mbrauer

Has this issue been solved?
On huge, heavily loaded nodes I am still getting the error:

Defaulted container "falco" out of: falco, falcoctl-artifact-follow, falco-driver-loader (init), falcoctl-artifact-install (init)
Thu Mar 23 20:09:35 2023: Falco version: 0.34.1 (x86_64)
Thu Mar 23 20:09:35 2023: Falco initialized with configuration file: /etc/falco/falco.yaml
Thu Mar 23 20:09:35 2023: Loading rules from file /etc/falco/falco_rules.yaml
Thu Mar 23 20:09:36 2023: Loading rules from file /etc/falco/rules.d/falco-custom.yaml
Thu Mar 23 20:09:36 2023: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Thu Mar 23 20:09:36 2023: Starting health webserver with threadiness 16, listening on port 8765
Thu Mar 23 20:09:36 2023: Enabled event sources: syscall
Thu Mar 23 20:09:36 2023: Opening capture with Kernel module
k8s_handler (k8s_replicaset_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
k8s_handler (k8s_deployment_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_deployment_handler_state): Connection closed.
k8s_handler (k8s_replicaset_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, K8s k8s_handler::receive_response(): invalid call (request not sent).
k8s_handler (k8s_deployment_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, K8s k8s_handler::receive_response(): invalid call (request not sent).

@jasondellaluce , @leogr : FYI.

@jasondellaluce
Contributor

@VF-mbrauer I see. Does it also cause Falco to terminate?
cc @alacuku

@alacuku
Member

alacuku commented Mar 24, 2023

@VF-mbrauer, does it recover at some point, or does it keep erroring? At startup all the falco instances connect to the api-server, and they may be throttled by the api-server; that's why you are seeing that error.

Anyway, we are working on a new k8s client for falco that should solve the problems we have with the current implementation; please see #2973

@VF-mbrauer

@jasondellaluce, yes: it first runs into an OOM and then restarts; after some time it gets stuck in "CrashLoopBackOff".

falco-7sgbh                     2/2     Running            0               88m
falco-fd9kz                     1/2     CrashLoopBackOff   8 (19s ago)     43m
falco-hwv6s                     1/2     CrashLoopBackOff   8 (88s ago)     56m
falco-jj5vj                     1/2     CrashLoopBackOff   9 (2m16s ago)   49m
falco-nj6mn                     1/2     CrashLoopBackOff   6 (2m17s ago)   53m
falco-q4247                     2/2     Running            0               88m
falco-q6hwl                     1/2     CrashLoopBackOff   8 (4s ago)      52m
falco-qmgmh                     1/2     CrashLoopBackOff   10 (50s ago)    57m
falco-s4v9n                     1/2     CrashLoopBackOff   8 (3m42s ago)   54m
falco-shn6m                     2/2     Running            6 (3m51s ago)   45m
falco-tbs94                     1/2     CrashLoopBackOff   8 (75s ago)     47m
falco-vvd49                     1/2     CrashLoopBackOff   7 (4m11s ago)   51m
falco-w5tc4                     1/2     CrashLoopBackOff   4 (25s ago)     34m

So to work around that I increased the memory a bit.
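
For reference, raising the falco container's memory limit on a DaemonSet install could look like this sketch (the DaemonSet name falco and the falco namespace are assumptions; with the Helm chart the same is usually done through the chart's resources values):

# Strategic merge patch: matches the container by name and bumps only the memory limit
kubectl -n falco patch daemonset falco -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"falco","resources":{"limits":{"memory":"1Gi"}}}]}}}}'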

@s7an-it

s7an-it commented Apr 2, 2023

There is some disconnect between the code and the config - for me the metadata_download mb setting does nothing, and reading the code it makes sense that it doesn't.
