Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes) #1909
Comments
I actually got the same issue even with fewer nodes (25/30). Environment: Falco version 0.31.1 |
I am seeing a similar issue. |
Did you try running this with Falco's |
Yep, already tried with and without. EDIT: Still happening in 0.32.0 |
I too am experiencing this. |
Hello there! This was due to the Falco operator, which did not set the node value correctly: the --k8s-node option is set, but the node name was not fetched correctly by the operator... EDIT: I have switched to the official Helm chart since this message was posted (it can be found here: https://github.com/falcosecurity/charts) |
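For reference, a minimal install from that chart repository looks roughly like the following; the release name and namespace are my own choices, not something stated in this thread:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Installs Falco as a DaemonSet; the chart handles the node-name wiring mentioned above
helm install falco falcosecurity/falco --namespace falco --create-namespace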
Just landed here after seeing this in my environment as well, with 0.32.1. |
Are there any instructions on how to debug? Version 0.32.1, custom-crafted manifests:
"system_info": {
  "machine": "x86_64",
  "nodename": "falco-7xnlh",
  "release": "5.4.170+",
  "sysname": "Linux",
  "version": "#1 SMP Sat Apr 2 10:06:05 PDT 2022"
},
"version": "0.32.1"
spec:
  containers:
  - args:
    - /usr/bin/falco
    - --cri
    - /run/containerd/containerd.sock
    - --cri
    - /run/crio/crio.sock
    - -K
    - /var/run/secrets/kubernetes.io/serviceaccount/token
    - -k
    - https://$(KUBERNETES_SERVICE_HOST)
    - --k8s-node
    - $(FALCO_K8S_NODE_NAME)
    - -pk
    env:
    - name: FALCO_BPF_PROBE
      value: ""
    - name: FALCO_K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
When the -k argument is changed to
- -k
- http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT)
then it works without problem. I do have a custom image. Obviously there is some issue with
Is there any way to get more debug info from the K8s auth? |
Hi @epcim, since Falco 0.32.1 you can have more debug info by adding the following args to Falco:
|
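The exact options were dropped from the comment above; a plausible reconstruction, based on the libs_logger settings available since Falco 0.32 (the keys and values are an assumption to check against your falco.yaml), is:

# In falco.yaml (or as -o overrides on the Falco command line,
# e.g. -o libs_logger.enabled=true -o libs_logger.severity=debug):
libs_logger:
  enabled: true
  severity: debug   # "trace" is even more verbose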
@jasondellaluce ok, so we know it has k8s connectivity and it reads some resources. FYI:
Output from the container:
It should not be due to permissions, but for clarity, the ClusterRole:
Here is the order of things that happen
|
Well, the query to k8s for daemonsets works, so there must be some issue with processing in Falco, IMO. Steps to reproduce:
k exec -ti -n monitoring falco-zzfmk -c falco -- sh
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)";
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/apps/v1/daemonsets?pretty=false | head
{"kind":"DaemonSetList","apiVersion":"apps/v1","metadata":{"resourceVersion":"427824185"},"items":[{"metadata":{"name":"gke-metadata-server","namespace":"kube-system","uid":"452584fa-614e-40bd-8477-3b0781ce9dfc","resourceVersion":"417631995","generation":10,"creationTimestamp":"2020-12-16T10:33:18Z","labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"gke-metadata-server"},
...
... btw, I claimed it works like
as it basically freezes here on the http query
Additionally, I thought that with this in place, Falco would only query metadata for its own node...
but instead it reads everything everywhere, i.e.: |
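For comparison, a node-scoped query against the API server would look like the sketch below; this is just the generic Kubernetes fieldSelector mechanism, shown to illustrate the expectation, and not necessarily what Falco does internally (replace <node-name> with the node this Falco pod runs on):

# List only the pods scheduled on a given node
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=spec.nodeName%3D<node-name>" \
  | head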
I was able to solve this issue by cleaning up the old replicasets (had about 5k of these) |
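A sketch of that cleanup, assuming the stale ReplicaSets are old Deployment revisions scaled to zero (requires kubectl and jq; review the output before deleting anything):

# Delete ReplicaSets with zero desired replicas, namespace by namespace
kubectl get rs --all-namespaces -o json \
  | jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns name; do
      kubectl delete rs -n "$ns" "$name"
    done

# Longer term, capping revision history on Deployments keeps old ReplicaSets from piling up again
kubectl patch deployment <name> -n <namespace> --type merge -p '{"spec":{"revisionHistoryLimit":3}}'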
We have the same issue with the replicasets failing; we currently have >6k replicasets, but I am not sure deleting them is practical for us. Installing different versions of Falco, I have narrowed it down to the following: Works: Falco 0.32.0, Helm chart version 1.19.4 |
We have the same issue with 0.32.1 |
Same issue with 0.32.2 on a Azure cluster with 20 nodes |
Likewise, I have the same experience as of 5.9.2022. The workaround worked: the number of deleted old ReplicaSets was 2600, and then all was fine. Deleting them manually on other environments is not an option! I am using
Some other pods managed to read only
The only error it throws and does not recover from is:
Again, the setup:
|
@epcim this is high priority on the project's roadmap. We're still in the process of figuring out the optimal way to mitigate this. |
I discussed this issue on the Falco Community Call today, so I'm sharing some of the information from that call for others who may be impacted. As a workaround, you can consider removing the "-k " command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I'm unsure). Without the lowercase "-k" switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/): Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed. |
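As a concrete illustration of that workaround, here is what the args from the DaemonSet shown earlier in this thread would look like with the cluster-API client disabled; exactly which flags can be dropped (for instance whether --k8s-node still serves any purpose without -k) is an assumption to verify for your Falco version:

- args:
  - /usr/bin/falco
  - --cri
  - /run/containerd/containerd.sock
  - --cri
  - /run/crio/crio.sock
  # -k / -K removed: no connection to the cluster API, so cluster-only k8s.* fields are unavailable
  - -pk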
Hey @epcim,
What was the exact status of those ReplicaSets (e.g.,
I guess the metadata of those ReplicaSets was not useful for Falco, so I'm trying to discover if we can use a
Btw, the |
This workaround worked for us. Thanks! falco: v0.32.2 |
Just a note
Btw, as a clarification: the Kubelet metadata is annotated on the container labels; Falco will fetch the metadata directly from the container runtime, and no connection to the Kubelet is needed. |
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Provide feedback via https://github.com/falcosecurity/community.
@poiana: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Has that issue been solved?
@jasondellaluce , @leogr : FYI. |
@VF-mbrauer I see. Does it also cause Falco to terminate? |
@VF-mbrauer, does it recover at some point, or does it keep erroring? At startup time all the
Anyway, we are working on a new |
@jasondellaluce, yes, it will first run into an OOM and then restart; after some time it gets stalled in "CrashLoopBackOff".
So to work around that I increased the memory a bit. |
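For anyone hitting the same OOM/restart loop, the bump amounts to raising the Falco container's memory request/limit in the DaemonSet; the numbers below are only illustrative, not the values used in this thread:

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1536Mi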
There is some disconnect between code and config: for me the metadata_download mb setting does nothing, and reading the code, it makes sense that it doesn't. |
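For context, the setting referred to here is presumably the metadata_download block in falco.yaml; the keys and defaults below are my recollection of the documented values, so treat them as an assumption:

# falco.yaml
metadata_download:
  max_mb: 100          # soft cap on the amount of K8s metadata downloaded
  chunk_wait_us: 1000  # wait between downloaded chunks
  watch_freq_sec: 1    # how often the K8s watch is polled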
Describe the bug
We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments and we seem to have hit a new runtime error. We're now seeing:
We downgraded to falco:0.30.0 which does not have the runtime error.
How to reproduce it
Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.
Expected behaviour
No runtime error
Screenshots
Environment
Additional context