Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes) #1909
Comments
I actually got the same issue even with fewer nodes (25/30). Environment: Falco version 0.31.1 |
I am seeing a similar issue. |
Did you try running this with Falco's |
Yep, already tried with and without. EDIT: Still happening in 0.32.0 |
I too am experiencing this. |
Hello there! This was due to the Falco operator, which did not set the node value correctly: the --k8s-node option is set, but the node name was not fetched correctly by the operator... EDIT: I have switched to the official Helm chart since this message was posted (it can be found here: https://github.com/falcosecurity/charts) |
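For reference, a minimal install from that chart repository looks roughly like the following; the release name and namespace are my own choices, not something stated in this thread:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Installs Falco as a DaemonSet; the chart handles the node-name wiring mentioned above
helm install falco falcosecurity/falco --namespace falco --create-namespace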
Just landed here after seeing this in my environment as well, with 0.32.1. |
Are there any instructions on how to debug? Version 0.32.1, custom-crafted manifests:
"system_info": {
  "machine": "x86_64",
  "nodename": "falco-7xnlh",
  "release": "5.4.170+",
  "sysname": "Linux",
  "version": "#1 SMP Sat Apr 2 10:06:05 PDT 2022"
},
"version": "0.32.1"
spec:
  containers:
  - args:
    - /usr/bin/falco
    - --cri
    - /run/containerd/containerd.sock
    - --cri
    - /run/crio/crio.sock
    - -K
    - /var/run/secrets/kubernetes.io/serviceaccount/token
    - -k
    - https://$(KUBERNETES_SERVICE_HOST)
    - --k8s-node
    - $(FALCO_K8S_NODE_NAME)
    - -pk
    env:
    - name: FALCO_BPF_PROBE
      value: ""
    - name: FALCO_K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
When the -k argument is changed to
- -k
- http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT)
then it works without problem. I do have a custom image. Obviously there is some issue with
Is there any way to get more debug info from the K8s auth? |
Hi @epcim, since Falco 0.32.1 you can have more debug info by adding the following args to Falco:
|
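The exact options were dropped from the comment above; a plausible reconstruction, based on the libs_logger settings available since Falco 0.32 (the keys and values are an assumption to check against your falco.yaml), is:

# In falco.yaml (or as -o overrides on the Falco command line,
# e.g. -o libs_logger.enabled=true -o libs_logger.severity=debug):
libs_logger:
  enabled: true
  severity: debug   # "trace" is even more verbose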
@jasondellaluce ok, so we know it has k8s connectivity and it reads some resources. FYI:
Output from the container:
It should not be due to permissions, but for clarity, the ClusterRole:
Here is the order of things that happen
|
Well, the query to k8s for daemonsets works, so there must be some issue with processing in Falco, IMO. Steps to reproduce:
k exec -ti -n monitoring falco-zzfmk -c falco -- sh
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)";
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/apps/v1/daemonsets?pretty=false | head
{"kind":"DaemonSetList","apiVersion":"apps/v1","metadata":{"resourceVersion":"427824185"},"items":[{"metadata":{"name":"gke-metadata-server","namespace":"kube-system","uid":"452584fa-614e-40bd-8477-3b0781ce9dfc","resourceVersion":"417631995","generation":10,"creationTimestamp":"2020-12-16T10:33:18Z","labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"gke-metadata-server"},
...
... btw, I claimed it works like
as it basically freezes here on the http query
Additionally, I thought that with this in place, Falco would only query metadata for its own node...
but instead it reads everything everywhere, i.e.: |
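For comparison, a node-scoped query against the API server would look like the sketch below; this is just the generic Kubernetes fieldSelector mechanism, shown to illustrate the expectation, and not necessarily what Falco does internally (replace <node-name> with the node this Falco pod runs on):

# List only the pods scheduled on a given node
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=spec.nodeName%3D<node-name>" \
  | head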
I was able to solve this issue by cleaning up the old replicasets (had about 5k of these) |
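A sketch of that cleanup, assuming the stale ReplicaSets are old Deployment revisions scaled to zero (requires kubectl and jq; review the output before deleting anything):

# Delete ReplicaSets with zero desired replicas, namespace by namespace
kubectl get rs --all-namespaces -o json \
  | jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns name; do
      kubectl delete rs -n "$ns" "$name"
    done

# Longer term, capping revision history on Deployments keeps old ReplicaSets from piling up again
kubectl patch deployment <name> -n <namespace> --type merge -p '{"spec":{"revisionHistoryLimit":3}}'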
We have the same issue with the replicasets failing; we currently have >6k replicasets, but I am not sure deleting them is practical for us. Installing different versions of Falco, I have narrowed it down to the following: Works: Falco 0.32.0, Helm chart version 1.19.4 |
We have the same issue with 0.32.1 |
Same issue with 0.32.2 on a Azure cluster with 20 nodes |
Likewise, I have the same experience as of 5.9.2022. The workaround worked: the number of deleted old ReplicaSets was 2600, and then all was fine. Deleting them manually on other environments is not an option! I am using
Some other pods managed to read only
The only error it throws and does not recover from is:
Again, the setup:
|
@epcim this is high priority on the project's roadmap. We're still in the process of figuring out the optimal way to mitigate this. |
I discussed this issue on the Falco Community Call today, so I'm sharing some of the information from that call for others who may be impacted. As a workaround, you can consider removing the "-k " command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I'm unsure). Without the lowercase "-k" switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/): Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed. |
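As a concrete illustration of that workaround, here is what the args from the DaemonSet shown earlier in this thread would look like with the cluster-API client disabled; exactly which flags can be dropped (for instance whether --k8s-node still serves any purpose without -k) is an assumption to verify for your Falco version:

- args:
  - /usr/bin/falco
  - --cri
  - /run/containerd/containerd.sock
  - --cri
  - /run/crio/crio.sock
  # -k / -K removed: no connection to the cluster API, so cluster-only k8s.* fields are unavailable
  - -pk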
Hey @epcim,
What was the exact status of those ReplicaSets (e.g.,
I guess the metadata of those ReplicaSets was not useful for Falco, so I'm trying to discover if we can use a
Btw, the |
This workaround worked for us. Thanks! falco: v0.32.2 |
Just a note
Btw, as a clarification: the Kubelet metadata is annotated on the container labels; Falco will fetch the metadata directly from the container runtime, and no connection to the Kubelet is needed. |
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Provide feedback via https://github.com/falcosecurity/community.
@poiana: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Has that issue been solved?
@jasondellaluce , @leogr : FYI. |
@VF-mbrauer I see. Does it also cause Falco to terminate? |
@VF-mbrauer, does it recover at some point, or does it keep erroring? At startup time all the
Anyway, we are working on a new |
@jasondellaluce, yes, it will first run into an OOM and then restart; after some time it gets stalled in "CrashLoopBackOff".
So to work around that I increased the memory a bit. |
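For anyone hitting the same OOM/restart loop, the bump amounts to raising the Falco container's memory request/limit in the DaemonSet; the numbers below are only illustrative, not the values used in this thread:

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1536Mi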
There is some disconnect between code and config: for me the metadata_download mb setting does nothing, and reading the code, it makes sense that it doesn't. |
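For context, the setting referred to here is presumably the metadata_download block in falco.yaml; the keys and defaults below are my recollection of the documented values, so treat them as an assumption:

# falco.yaml
metadata_download:
  max_mb: 100          # soft cap on the amount of K8s metadata downloaded
  chunk_wait_us: 1000  # wait between downloaded chunks
  watch_freq_sec: 1    # how often the K8s watch is polled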
Describe the bug
We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments and we seem to have hit a new runtime error. We're now seeing:
We downgraded to falco:0.30.0 which does not have the runtime error.
How to reproduce it
Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.
Expected behaviour
No runtime error
Screenshots
Environment
Additional context