mitigation for falco#1909 - fix(k8s-client): handle network related exceptions #610
Conversation
Updated the PR description with some info regarding the testing.
/milestone 0.9.0
Thank you for this, only a few people have the guts to stand up to our k8s client :)!
userspace/libsinsp/socket_handler.h
// When the connection is closed we reset the connection state of the handler.
// It could happen when the api-server throttles our requests.
m_connecting = false;
m_connected = false;
Maybe a stupid question: do we have to do the same thing also in get_all_data_secure, or is it enough to do it in get_all_data_unsecure?
That's a leftover from previous experiments, nice catch!
We do not need to reset the state of the socket_handler there. We know that we are able to contact the api-server through that socket, so we keep it alive, and the next time Falco tries to collect the k8s data it will switch to watch mode. It does not need to pre-fetch all the data: the api-server will send all the existing data as ADDED events, and our logic will be able to construct the k8s state from them.
Another solution could be to remove the socket_handler and let Falco recreate it, and so on, but that would be more expensive if the api-server continues to throttle the requests.
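To make the point about ADDED events concrete, here is a minimal, hypothetical sketch (not libsinsp code; every name below is invented for illustration). It relies only on the behaviour described above: when the pre-fetch is skipped, the api-server reports every existing object on the watch stream as an ADDED event, so replaying the stream into a map is enough to rebuild the local k8s state.

```cpp
// Hypothetical sketch: rebuilding local k8s state purely from watch events.
#include <iostream>
#include <map>
#include <string>

enum class k8s_event_type { ADDED, MODIFIED, DELETED };

struct k8s_event
{
	k8s_event_type type;
	std::string uid;      // object UID
	std::string metadata; // serialized object metadata
};

class k8s_state_cache
{
public:
	void handle_event(const k8s_event& evt)
	{
		switch(evt.type)
		{
		case k8s_event_type::ADDED:    // existing objects arrive as ADDED
		case k8s_event_type::MODIFIED:
			m_objects[evt.uid] = evt.metadata;
			break;
		case k8s_event_type::DELETED:
			m_objects.erase(evt.uid);
			break;
		}
	}

	std::size_t size() const { return m_objects.size(); }

private:
	std::map<std::string, std::string> m_objects;
};

int main()
{
	k8s_state_cache cache;
	// Two pre-existing pods delivered as ADDED when the watch starts.
	cache.handle_event({k8s_event_type::ADDED, "pod-a", "{...}"});
	cache.handle_event({k8s_event_type::ADDED, "pod-b", "{...}"});
	std::cout << "objects tracked: " << cache.size() << "\n"; // prints 2
}
```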
As discussed with @Andreagit97 and @FedeDP:
/milestone 0.10.0
It could happen that the k8s api-server throttles the initial requests made to retrieve the k8s metadata. When this happens we do not exit; instead we catch the exception generated by the temporary error and handle it. We reset the state of the socket_handler, meaning that the connection has to be initialized again before it is used, and we do the same for the k8s_handler associated with the socket_handler. The involved k8s_handler is put back in its initial state, where it needs to connect to the api-server before it can be used to retrieve the data. Another solution could be to destroy and recreate the involved handlers, but that would be too expensive since the throttling could persist for relatively long periods.
Signed-off-by: Aldo Lacuku <[email protected]>
Force-pushed from 28c5ea8 to ff33177.
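The commit message above describes resetting the handlers' state instead of destroying and recreating them. The following is a minimal, hypothetical sketch of that idea; the class and member names are invented, and only m_connecting / m_connected mirror the real diff shown earlier in the review.

```cpp
// Hypothetical sketch: on a temporary network error, keep the handler objects
// alive and just reset their connection state so the next metadata collection
// cycle re-initializes the connection.
#include <iostream>
#include <stdexcept>

class socket_handler_sketch
{
public:
	void reset_connection_state()
	{
		// Same spirit as the diff shown earlier in the review.
		m_connecting = false;
		m_connected = false;
	}

	bool connected() const { return m_connected; }

private:
	bool m_connecting = false;
	bool m_connected = true;
};

class k8s_handler_sketch
{
public:
	void collect(socket_handler_sketch& sock)
	{
		try
		{
			fetch_all(); // stand-in for the bulk metadata fetch
		}
		catch(const std::runtime_error& ex)
		{
			// Temporary error (e.g. throttling): do not exit and do not
			// destroy the handlers; reset their state and retry later.
			std::cerr << "k8s fetch failed: " << ex.what() << ", will retry\n";
			sock.reset_connection_state();
			m_needs_connect = true;
		}
	}

	bool needs_connect() const { return m_needs_connect; }

private:
	void fetch_all()
	{
		// Simulate the api-server throttling the request.
		throw std::runtime_error("429 Too Many Requests");
	}

	bool m_needs_connect = false;
};

int main()
{
	socket_handler_sketch sock;
	k8s_handler_sketch pods;
	pods.collect(sock); // logs the error; both handlers stay alive
	return (pods.needs_connect() && !sock.connected()) ? 0 : 1;
}
```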
Same as #591, I would give it an attempt; let's see if during the release process we are confident enough to merge it.
Great 👍
LGTM label has been added. Git tree hash: 4e7375b74cc9640b6475ddf6cd26c548c57d9654
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alacuku, leogr, LucaGuerra
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Aldo Lacuku [email protected]
What type of PR is this?
/kind bug
Any specific area of the project related to this PR?
/area libsinsp
Does this PR require a change in the driver versions?
What this PR does / why we need it:
When Falco is configured to fetch the metadata from a k8s cluster, it fetches all the metadata (pods, deployments, nodes, namespaces, replicasets, daemonsets, replicationcontrollers, and services) at start-up time and then watches for new events from the api-server.
For each k8s object type, there is a separate handler that opens a connection to the api-server. Sometimes the api-server throttles these requests, causing Falco to crash (falcosecurity/falco#1909).
This PR is an initial effort to handle the case where the api-server throttles the initial metadata fetching: the error is handled and the fetch is retried the next time Falco collects the k8s metadata. The proposed solution is lightweight, since it does not delete and recreate all the handlers (k8s_handler and socket_handler) but just resets their state so they can be reused (switching to watching mode) the next time Falco collects the k8s metadata (with the default options, the k8s metadata collection runs every second).
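As a rough illustration of the retry behaviour described above (all names below are hypothetical, not the actual libsinsp/Falco code): a handler that a throttled attempt left in the "needs connect" state simply tries again on a later periodic collection cycle, rather than being destroyed and recreated.

```cpp
// Hypothetical sketch of the periodic retry after a throttled attempt.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct handler_sketch
{
	int attempts = 0;
	bool needs_connect = true;

	bool try_connect()
	{
		// Stand-in for the real connection attempt: pretend the api-server
		// throttles the first two attempts and accepts the third.
		return ++attempts >= 3;
	}

	void collect()
	{
		if(needs_connect)
		{
			needs_connect = !try_connect();
			return; // nothing to process until we are connected
		}
		// ...once connected, process the watch events here...
	}
};

int main()
{
	// e.g. one handler per k8s object type (pods, nodes, services, ...)
	std::vector<handler_sketch> handlers(3);

	for(int cycle = 0; cycle < 5; ++cycle)
	{
		for(auto& h : handlers)
		{
			h.collect();
		}
		std::printf("cycle %d done\n", cycle);
		// The default configuration collects the k8s metadata about once per
		// second; the interval is shortened here to keep the example quick.
		std::this_thread::sleep_for(std::chrono::milliseconds(100));
	}
	return 0;
}
```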
Testing:

I was able to set up a k8s cluster that throttles Falco's requests when retrieving pod metadata:
We can see that the error is logged but Falco does not exit. It retries later and, after the initial metadata fetching, it switches to watching mode for the k8s_pod_handler and processes all the pods.

UPDATE tests:
I have also tested these changes in a k8s cluster where the api-server throttles the initial requests made by Falco. The following snippet shows how Falco behaves in case of error and that it is able to construct the k8s state. Furthermore, it shows that when triggering a rule that uses the k8s fields, everything works as it should.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
The branch is temporarily based on the 0.7.0 branch to easily test it with Falco 0.32.2 (a released version affected by the bug).
Does this PR introduce a user-facing change?: