
virt-handler: Restrict write access to nodes #246

Merged
merged 2 commits on May 10, 2024

Conversation

xpivarc
Member

@xpivarc xpivarc commented Oct 10, 2023

This PR addresses a security issue [0][1] that exists because virt-handler is a daemonset that can patch the node. The proposal presents possible approaches to reduce virt-handler's node-patching abilities and mitigate this issue.

[0] GHSA-cp96-jpmq-xrr2
[1] kubevirt/kubevirt#9109

Signed-off-by: Luboslav Pivarc <[email protected]>
@kubevirt-bot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Oct 10, 2023


# Design


One alternative design, proposed on the sig-scale call, that may solve the same problem is as follows:

  1. Introduce a new controller, let's call it token-distributor-controller.
  2. The token-distributor-controller has an RBAC Template CR that creates a unique ClusterRole, ClusterRoleBinding and ServiceAccount for each virt-handler pod. Example of the template, based on the kubevirt-handler ClusterRole:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  resourceNames:
  - <node-name>
  verbs:
  - patch
  3. The token-distributor-controller will then create a token for the pod on the node via the TokenRequest API [1][2] and deliver the token to that node (see the sketch below).
  4. The virt-handler pod will use a local volume mount to receive the token. An init container could be used to make sure the file is delivered by the token-distributor-controller.
  5. Virt-handler will start with restricted permissions that could be carefully modified so that it works on all the objects that are related to the node.

This could be a generic solution that would help all daemonsets that are deployed with privileged RBAC, and could be a separate project altogether.

Href:

  1. https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#manually-create-an-api-token-for-a-serviceaccount
  2. https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/
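For illustration, a minimal sketch of the TokenRequest the token-distributor-controller could submit against the per-node ServiceAccount's `token` subresource (all names here are hypothetical):

```yaml
# Sketch only: a TokenRequest posted to the per-node ServiceAccount's token
# subresource; the resulting token is bound to the virt-handler pod on that
# node and expires, so a leaked token has limited value.
apiVersion: authentication.k8s.io/v1
kind: TokenRequest
spec:
  audiences:
  - https://kubernetes.default.svc
  expirationSeconds: 3600
  boundObjectRef:
    kind: Pod
    apiVersion: v1
    name: virt-handler-abcde   # hypothetical virt-handler pod running on <node-name>
```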

Member

@rmohr rmohr Jan 26, 2024

I don't think that this is addressing the problem fully. There are two things which need to be fixed:

  1. being able to modify other nodes (that can be addressed like above)
  2. Not being able to modify non-kubevirt-owned fields on the node (this includes not being able to change the spec or any label/annotation which is not strictly kubevirt owned)

Without (2), one can still easily lure in privileged CP components and then just steal their credentials.

We could definitely create unique roles for each node and, in addition, deploy a node webhook which would ensure that a service account can only modify its own node (that is also what the apiserver does internally for each kubelet), but that has the severe disadvantage that we would now need a required webhook on a core k8s resource. If our node webhook is ever unavailable, no kubelet, or anything else, can actually continue to modify the node. That's a no-go.

There are other positive side-effects if we fix it with the "shadow nodes":

  • heartbeat can be moved to the shadow node, meaning we create much less load on all the controllers watching nodes.
  • we finally have a clear API to express our node-needs and don't need to cram our API into annotations and labels on nodes.

One may now think that virt-controller can still modify all nodes, and that is true. However, now we only have one component which can do that instead of n; furthermore, a token-distributor-controller which would create those RBAC roles would have to have the exact same permissions as well, otherwise it would not be able to create the roles in the first place ...


@rmohr thanks for summarizing the reasons.

@xpivarc if it is helpful, please add this to the alternative designs considered, for future us to refer back to.

Member

Can we include this in the goals section?

@RamLavi @xpivarc ?

Contributor

alaypatel07 thanks for bringing this up.
Added the alternative to the proposal body.
Added to the goals.

Member

@rmohr we have to address a third element

  3. Prevent the daemonset from modifying KubeVirt owned CRs related to other nodes

IOW 3. Restrict the daemonset to only modify KubeVirt owned CRs related to its own node

Member

Yes, that third element is true, but I would qualify it as a follow-up step, not needed to address the CVE. Now virt-handler cannot blindly modify the node topology anymore; it can only modify hardware capability availability. It cannot lure in privileged or sensitive workloads anymore by just applying labels which they are waiting for, or by making a node a control-plane node.

The follow-up would be either cert-based auth for virt-handler, or maybe we can still use tokens and get the IP in the webhooks for verification. Something like this ...

Let's add it to the doc as a follow-up.

Contributor

@fabiand AFAIK the only kubevirt CR virt-handler can update is virtualmachineinstances. Is this what you are referring to?

@rmohr I will add a Follow-up Steps section in the end of the doc and include:

  • (the one just suggested) Restrict the virt-handler daemonset to only modify KubeVirt owned CRs related to its own node
  • Restrict the virt-handler daemonset from being able to update/patch other nodes

Is that alright with you?

Member

@RamLavi yes - if this is the only one, then yes. And, generally speaking: handler should just be able to touch anything which a) it should be permitted to touch and b) only if it is on its own node

Contributor

ACK. DONE.

@xpivarc xpivarc marked this pull request as ready for review January 29, 2024 09:14
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2024
@alaypatel07 alaypatel07 left a comment

@xpivarc since it is clear from all the discussions that the ShadowNode API is the preferred design alternative, some nits/questions. PTAL, thanks


## Goals

Mitigate the advisory by default.


Can we please expand this to add explicit goals? I think @rmohr summarized the goals clearly in the comment.

Member

+1 please expand

Contributor

DONE

1. [run](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L195) - irrelevant as this is read access

2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - the node labeller patches the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of the cluster config.


I wonder if the consumers for each of the above node labels and annotations should be mentioned alongside?

For example, the heartbeat in point 2. is consumed by the NodeController in virt-controller-app to check if the node is healthy and raise an event.

Contributor

I think adding the consumers will not necessarily add clarity here. I did rephrase to make it more high level.


## Update/Rollback Compatibility

Virt-handler is going to continue writing to the Node object, therefore the update should be free of compatibility issues.


When the upgrade rolls out, virt-controllers are upgraded first, then the daemonsets, followed by virt-api. There will be a time period where new virt-controllers (with ShadowNode code) will work with old virt-handlers and an old apiserver without the ShadowNode API. Can we mention clear steps here for how:

  1. new virt-controllers will work with older virt-handlers and old virt-api
  2. new virt-controllers and newer virt-handlers will work with older virt-api

Contributor

Hey @alaypatel07, I'll be taking over addressing the issue this proposal is trying to mitigate. I plan to take the same course this proposal is choosing. I will answer your question in the next reply, but if you think I should resubmit the proposal with the answers addressed in it - please tell me and I will do so.

Contributor

@RamLavi RamLavi Feb 19, 2024

In regard to your question, I think the upgrade order is a bit different. Please keep me honest, but according to this - virt-operator upgrade order is like this:

CRDs --> RBACs --> Daemonsets (=virt-handler) --> Controller Deployments (=virt-controller)--> API Deployments (=virt-API)

Also note that:

  • As mentioned above, virt-handler will still keep trying to patch the node alongside the shadowNode, and at some point during the upgrade it would start being blocked due to the removal of the patch ability. In order to mitigate this, the node patch will start ignoring RBAC "forbidden" errors.
  • Normally, new CRDs are blocked until the upgrade rollout is over. The shadowNode CRD is going to be excluded in order to avoid upgrade issues that could cause heartbeat timeouts.

Having said the above, here is a specific mention of each upgrade scenario - during and after the upgrade itself:

During upgrade rollout:

  • New CRD; old RBAC; old virt-handler; old virt-controller; old virt-API:
    • Old RBAC: no upgrade issue
    • Old virt-handler: no upgrade issue
    • Old virt-controllers: no upgrade issue
    • Old Virt-API: no upgrade issue
  • New CRD, new RBAC; old virt-handler; old virt-controller; old virt-API:
    • Old virt-handler: during upgrade, the old RBACs are kept as backup until upgrade is done. Hence, patching node should not fail so there is no upgrade issue.
    • Old virt-controllers: no upgrade issue, virt controller RBAC is not changed due to this solution.
    • Old Virt-API: no upgrade issue.
  • New CRD; new RBAC; new virt-handler; old virt-controller; old virt-API:
    • New virt-handler: the shadowNode is written to but it is not synced to the node yet. However, virt-handler keeps patching the node directly, so there is no upgrade issue.
    • [same as before] Old controllers: no upgrade issue, virt-controller RBAC is not changed.
    • [same as before] Old Virt-API: no upgrade issue.
  • New CRD; new RBAC; new virt-handler; new virt-controller; old virt-API:
    • New virt-handler: no upgrade issue. virt-handler patches both node and shadow node.
    • New controllers: will start getting shadowNode requests once they are created. no upgrade issue.
    • [same as before] Old Virt-API: no upgrade issue.

After upgrade rollout:
virt-handler will keep trying (and failing) to patch the node since the backup RBACs are now removed. Virt-handler will ignore these cases by ignoring RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.

Member

My suggestion, if possible, would be the following:

  • Introduce a new status field in the KubeVirt CR
  • Let virt-handlers continue patching the node as well
  • After virt-handlers and virt-controller are rolled over, set in the kubevirt CR a status field indicating that shadow nodes are now the way to go
  • virt-handler would pick it up right away over our config logic
  • finally remove the RBAC permissions.

Member

There may be some negligible overlap between virt-handlers picking up the config change and the RBAC removal, but that should, worst case, just result in a few RBAC permission errors which we should then not mask.

Contributor

@RamLavi RamLavi Feb 28, 2024

@rmohr, I respectfully disagree. Using SelfSubjectAccessReview isn't just masking errors; it's a direct and intentional way of asking the RBAC server, "Can I patch the node?" rather than just hoping things are right. It's cheaper than patching the actual node and then failing, and it ensures actions are only taken when permitted, adding a layer of security.

I believe relying on the upgrade mechanism to signal a status field once ready could prove to be more costly in the long run, considering the potential corner scenarios mentioned earlier.

How about if I stop checking with the RBAC server after the first time I see that the node patch is forbidden (i.e. SelfSubjectAccessReview returns Allowed=false)? This way each virt-handler will stop re-sending SelfSubjectAccessReview requests soon after node-patching is disabled.
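For reference, a sketch of the SelfSubjectAccessReview virt-handler could post to ask that question (the node name is simply whatever node the handler runs on):

```yaml
# Sketch only: virt-handler asks the apiserver whether it may still patch its
# node; the answer comes back in .status.allowed without attempting the patch.
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    group: ""        # core API group
    resource: nodes
    name: node01     # hypothetical: the node this virt-handler runs on
    verb: patch
```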

Member

@rmohr, I respectfully disagree. Using SelfSubjectAccessReview isn't just masking errors; it's a direct and intentional way of asking the RBAC server, "Can I patch the node?" rather than just hoping things are right.

I am not sure. I tried to express a knowledge gap which we try to overcome by checking for the forbidden error or via the SelfSubjectAccessReview. In both cases we don't know if it is intentional and correct that we don't have the permission. With a clear config switch we know: now I should do both, and from now on I should do one. Having a very brief period of eventual consistency, where maybe one or two update requests would fail, is fine (just like a low number of conflicting REST calls is fine too).

The information gap is what I consider "masking". Therefore the reviews add up quite a big front of additional uncertainties without making anything better (actually imo much more complex).

It's cheaper than patching the actual node and then failing, and it ensures actions are only taken when permitted, adding a layer of security.

I am not sure what "cheap" means here, and giving additional permissions which may leak info which we did not have access to before opens security questions which I don't even want to have to think about.

Contributor

@RamLavi RamLavi Mar 3, 2024

The information gap is what I consider "masking". Therefore the reviews add up quite a big front of additional uncertainties without making anything better (actually imo much more complex).

ah, you and I interpret "masking errors" differently. I get what you mean now. "Masking errors" in my book is just the fact that we are ignoring a specific error, which is in general not good practice.

I am not sure what "cheap" means here, and giving additional permissions which may leak info which we did not have access to before opens security questions which I don't even want to have to think about.

I take back the "cheap" part - we are not sending the entire node object on the SET action, so the difference should be negligible.
I get your concern in adding a new permission that may leak info in the future.

@rmohr I am still against using the upgrade as a sign to know if we can stop patching the node. The upgrade process is complex and stateful, and relying on it may introduce corner cases. In a way, Lubo's "try and ignore if it fails on a specific error" is much more stateless and clean.
Can we agree to continue with his approach?

Contributor

@alaypatel07 I updated the upgrade scenarios on the proposal's body as you suggested.

Member

@rmohr I am still against using the upgrade as a sign to know if we can stop patching the node. The upgrade process is complex and stateful, and relying on it may introduce corner cases. In a way, Lubo's "try and ignore if it fails on a specific error" is much more stateless and clean.
Can we agree to continue with his approach?

It should basically be a check like this:

```
if freshInstall && supportsShadowNodes() {
  kubevirtCR.Status.UseShadowNode = True
}

if fullyInstalled && !switchedToShadowNode && supportsShadowNodes() {
  kubevirtCR.Status.UseShadowNode = True
} else {
  kubevirtCR.Status.UseShadowNode = False
}
```

where supportsShadowNodes() can basically just check if the install strategy of the current version includes the shadow node CR, to ensure that the version we are upgrading/downgrading to has it.
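A minimal sketch of what such a status flag on the KubeVirt CR could look like (the field name is hypothetical, not an existing API field):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
status:
  # Hypothetical flag set by virt-operator once virt-handler and virt-controller
  # have rolled over and the install strategy includes the ShadowNode CRD.
  useShadowNodes: true
```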

Contributor

@RamLavi RamLavi left a comment

Hey @xpivarc, I have a question.
If we create a shadow node that the virt-handler writes to, then what prevents a bad actor who took control of virt-handler from writing to the shadow nodes of other Nodes? Couldn't this bad actor still take control of the cluster by manipulating them?

@rmohr
Member

rmohr commented Jan 30, 2024

Hey @xpivarc, I have a question.
If we create a shadow node that the virt-handler writes to, then what prevents a bad actor who took control of virt-handler from writing to the shadow nodes of other Nodes? Couldn't this bad actor still take control of the cluster by manipulating them?

To a degree yes, but it can only influence kubevirt.io/ scheduling constraints, which solves the most important problem: that one can lure in CP workloads. VMs can still be tricked. If (and I am not sure if we get that info on an AdmissionReview) we also get the IP address in userinfo when eventually switching to cert-based auth (which would contain the client IP like kubelet certs), we can also enforce by webhook that it can only write to its own node in a next step. That would now be possible because the webhook would only block our shadow nodes and not a core k8s entity.

@xpivarc
Member Author

xpivarc commented Feb 4, 2024

@RamLavi
100% what Roman mentioned. We have control over what we propagate and we can also solve the problem of impersonation if we want because the scalability will not be an issue anymore.

@fabiand
Member

fabiand commented Feb 28, 2024

/hold

I think there is no clear agreement if this approach is the right one.

I'm putting a hold on it to signal that this requires more discussion and an agreed design.

Important as @RamLavi is working on an implementation.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2024
Member

@fabiand fabiand left a comment

Did we explore what other projects did in order to tackle this problem?

What was the feedback from K8s SIG API (I think this idea got discussed there)?

Not that I am proposing it, but I wonder if tools like OPA would be able to protect us from issues like this.


## Goals

Mitigate the advisory by default.
Member

+1 please expand



# Design
Member

Can we include this in the goals section?

@RamLavi @xpivarc ?

2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - the node labeller patches the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of the cluster config.

## Shadow node
Member

@rmohr @xpivarc @RamLavi do we know how other projects have solved this problem?

Contributor

@fabiand I was not present during the discussions on how other projects solved the issue. For obvious reasons, this is not something a project will advertise, so perhaps @xpivarc and @rmohr could shed more light on this. However, I did come across what I think is cilium's fix for the matter, and it looks like they went with the CRD approach, same as us.

Member

FWIW I'd expect any author of such a feature to do some field research. Thus, yes, maybe you joined later, but that should not stop you from doing your own research :)

2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - the node labeller patches the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of the cluster config.

## Shadow node
Member

@xpivarc IIRC this idea was taken to the kube API sig - was it? What was the outcome of their discussion?


# Design
(This should be brief and concise. We want just enough to get the point across)
Member

To me a high level design is missing here.

It could be something like: The security issue is being addressed by introducing ShadowNodes for every Kubernetes node. virt-handlers are only interacting with ShadowNodes - which allows us to drop any node access privileges from the handler's SA - a controller will then map these ShadowNode requests to the real kubernetes nodes, concentrating the required permissions on the controller.

Contributor

thanks! DONE.

@kubevirt-bot kubevirt-bot added size/L and removed size/M labels Mar 3, 2024
@RamLavi
Contributor

RamLavi commented Mar 3, 2024

Change: Addressing reviewers' comments.
@alaypatel07 @rmohr @xpivarc @fabiand PTAL

* [same as before] Old Virt-API: no upgrade issue.

### After upgrade rollout:
Virt-handler will keep trying (and failing) to patch the node since the backup RBACs are now removed. Virt-handler will ignore these cases by ignoring RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.
Member

This reads like you would never stop trying to patch. In case I did not misread this: let's suppose we have 100 nodes, we update the node every 10s (can't remember the default interval), and we may be able to remove this code path in 12 months; that means we would now patch and fail for 12 months with 10 calls/second. So that is definitely not an option. Just think about metrics collection, for instance, which will record the failed call on virt-handler as well as on the API server.

The only way would be to stop patching when you get the "forbidden" error, but then we have again the issue, that we can't differentiate a config error from a real desire to no longer patch it.

Contributor

@RamLavi RamLavi Mar 4, 2024

IIRC the heartbeat interval is 1m, and the nodeLabeller's is 3m.
Indeed, stopping after the first "forbidden" error is possible, but you are right that we can't differentiate a config error from a real desire to no longer patch it.

kind: ShadowNode
metadata:
  name: <Equivalent to existing name>
  # Labels and Annotations only
Member

I think this is a great opportunity to create a real spec for all that info. The reason we, for instance, use metadata for the heartbeat is that we don't have a place on the k8s node object to properly express that information. Now we can.

Contributor

Added to the pros list of the CRD approach.

Member

Yes, having a typed spec with a well defined list of things that we want on the node, looks to be an opportunity here.

Contributor

@RamLavi RamLavi Mar 27, 2024

@rmohr how about we make it a bit wider: instead of creating a CRD for this very specific node-sync purpose, define a virt-handler CRD that would hold this nodeInfo status field (such as the labels and annotations) and also future stuff that could be used in future virt-handler communication with other components...

Member

I am fine either way.

@fabiand
Member

fabiand commented Mar 12, 2024

@RamLavi will you be updating this spec according to Roman's comment about a real spec?

@RamLavi
Contributor

RamLavi commented Mar 12, 2024

@RamLavi will you be updating this spec according to Roman's comment about a real spec?

Hey, I wasn't online this past week (was sick, now better), so I'm not sure what spec update is needed. Will take a closer look later today

Virt-handler will keep read access to nodes but will lose write access. Virt-handler will instead write to the ShadowNodes object, and this will subsequently be selectively copied to the node by a new controller. This controller will be the only entity who writes to the node object.

Pros:
- Since the new controller runs virt-controller context which only runs on the control-place nodes, the risk of taking over the cluster is greatly mitigated.
Member

Suggested change
- Since the new controller runs virt-controller context which only runs on the control-place nodes, the risk of taking over the cluster is greatly mitigated.
- Since the new controller runs virt-controller context which only runs on the control-plane nodes, the risk of taking over the cluster is reduced.

Contributor

Changed.


Pros:
- Avoid introducing a new CRD to the project.
- Allows for future expansions.
Member

Do you have specific expansions in mind? This statement seems generic.

I would suggest to add here a security benefit: virt-handler does not need to access kube api at all, and cannot modify unrelated nodes or shadow nodes. virt-controller does all the risky work.

Contributor

Do you have specific expansions in mind? This statement seems generic.

it is. Will remove.

I would suggest to add here a security benefit: virt-handler does not need to access kube api at all, and cannot modify unrelated nodes or shadow nodes. virt-controller does all the risky work.

Added.

Member

@EdDev EdDev left a comment

The research done as part of this proposal is impressive.
Thank you for sharing all the options and details explored.
I learned a lot from reviewing it.


## Non Goals

Redesign/optimize existing virt-handler features to not update/patch the node in the first place.
Member

IMO, this is the proper solution for the long run, being the goal Kubevirt should strive for in order to be secure.

Several of the existing solutions that require patching the node should be reevaluated and optimized not to require access to nodes (or other cluster-level core components).

That said, I understand this may take more resources and time to reach, therefore the solutions provided in this proposal are still required. I just hope to see a continuation of this effort.

Contributor

@RamLavi RamLavi May 1, 2024

I agree that some of the solutions suggested here should happen anyway, in parallel, as part of the goal of keeping kubevirt working properly.
However, these suggestions deserve a separate proposal, and for this specific CVE they are not strictly required.

Contributor

I agree with both of you :) it should be explored but a bit out-of-scope.

One thing we should consider is whether the CPU information on the nodes is essential (I think the vast majority of our labels are dedicated to CPU info). After all, this information is fixed by nature and stays the same for each node, maybe except for very rare cases. Currently we use these labels as selectors, but we can save this information elsewhere (e.g. a configmap?), then virt-controller would target nodes by name (or the hostname label?) before creating the launcher pod.

But as you said it's out-of-scope for this PR, just something to keep in mind.

3. Virt-handler/node labeller [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
* This is triggered both by interval (3 min) and on changes of the cluster config. However, as these labels usually don't change, this interval-based reconciliation is usually a no-op.

## Solution option #1: Use Validating Admission Policy
Member

+1
This seems the most suitable solution out of the list presented here.

It will be great to see a PoC for it.

- (same set of parameters for annotations)
- [matchConditions](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#matching-requests-matchconditions) rules (only if all are "true" then the validation will be evaluated):
- `isPartitionedServiceAccount` == true
- [validation](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#creating-a-validatingadmissionpolicy) rules (written in pseudocode to keep things readable):
Member

Is it possible to express something like: "reject-all-changes except these"?
I mean, without the need to compare other labels/annotation we do not care about.

Contributor

@RamLavi RamLavi May 1, 2024

well, it would be nice if we only got the "changes" we made, but it doesn't work like that AFAIK: when a request is sent (this is true for both webhooks and validatingAdmissionPolicies) you can add validation rules over the entities it carries (such as object, oldObject, request - see more details here to learn how they look),
so it's logical that you first have to gather the kubevirt labels on both {object, oldObject} (or the non-kubevirt-owned ones, as shown), and then make sure that we "reject all changes except these".

Please review this CEL example that models the validations that the validatingAdmissionPolicy will have here.
You can play with it and if you find a better way to do it I'm all ears of course :)
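To make the shape of the policy concrete, a trimmed-down sketch of such a ValidatingAdmissionPolicy (expressions simplified for readability; the real policy would also cover annotations, use the proposal's `isPartitionedServiceAccount` match condition, and need a ValidatingAdmissionPolicyBinding, all omitted here):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: kubevirt-node-restriction        # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["UPDATE"]
      resources:   ["nodes"]
  matchConditions:
  # Only evaluate node updates coming from the virt-handler service account.
  - name: is-virt-handler
    expression: "request.userInfo.username == 'system:serviceaccount:kubevirt:kubevirt-handler'"
  validations:
  # Mirrors the rules from the proposal's table: spec and status must stay untouched.
  - expression: "object.spec == oldObject.spec"
    message: "virt-handler may not modify the node spec"
  - expression: "object.status == oldObject.status"
    message: "virt-handler may not modify the node status"
  # Simplified stand-in for the label comparison: labels outside kubevirt's
  # ownership may not be added, removed, or changed.
  - expression: >-
      oldObject.metadata.labels.all(k, k.contains('kubevirt.io') ||
        (k in object.metadata.labels && object.metadata.labels[k] == oldObject.metadata.labels[k])) &&
      object.metadata.labels.all(k, k.contains('kubevirt.io') || k in oldObject.metadata.labels)
    message: "virt-handler may only change kubevirt-owned labels"
```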

Comment on lines 77 to 78
validationAdmissionPolicy [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) on k8s 1.30, so it should be OK for deployment on Kubevirt main and next minor release branches.
However, since kubevirt needs to be supported up to 2 k8s versions back, then the deployment of these new objects needs to be conditioned by a new feature gate "node restriction". HCO main and v1.3 branches will enable this feature gate by default.
Member

Can we program the component that deploys Kubevirt (virt-operator?) to condition what and how to set up the cluster based on its version/capability?
Clusters that can support the new policy (e.g. in v1.30 or in v1.29 where the K8S FG is enabled), will have Kubevirt deployed in one way and in other clusters in the existing manner.
The condition will be dropped once we drop support for K8S 1.29.

Having a FG protecting this change (or any other) will drag this out for at least 2 versions (6 months), requiring users to act on making their cluster more secure. I hope we can take that burden on ourselves by just giving them the best possible option given the setup.

If this is not possible, could you please elaborate on why not?

Contributor

I'm not sure it's possible, will need to check.
But theoretically speaking, isn't it problematic to introduce such a dependency on a k8s version? It sounds unusual...

Contributor

After thinking it through, I agree with you.
A FG is not a good fit for this, since this is not something we want a user (=cluster-admin) to need to switch on, as this is not a user-facing feature.
We have 2 alternatives then:

  • add logic at runtime to check the k8s version and fail only if the version is v1.30.0 or above
  • always deploy and ignore on failure.

I prefer the former. Will fix the proposal accordingly.

```
!`requestHasNodeName` OR (object.Name == request.nodeName)
```
However, this field is [beta](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens) on k8s 1.30 and is currently protected by the [ServiceAccountTokenPodNodeInfo](https://github.com/kubernetes/kubernetes/blob/c4bce63d9886e5f1fc00f8c3b5a13ea0d2bdf772/pkg/features/kube_features.go#L753) k8s feature gate.
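For illustration, expressed as a validation rule that check could look roughly like this (a sketch; it assumes the node name is surfaced under the `authentication.kubernetes.io/node-name` key of the request's userInfo extra, which is what the feature gate above enables):

```yaml
# Sketch only: reject patches to any node other than the one named in the
# service account token bound to the requesting virt-handler pod.
- expression: >-
    !has(request.userInfo.extra) ||
    !('authentication.kubernetes.io/node-name' in request.userInfo.extra) ||
    object.metadata.name == request.userInfo.extra['authentication.kubernetes.io/node-name'][0]
  message: "virt-handler may only patch the node it runs on"
```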
Member

I guess this is considered a con. Can you add it as such?

Contributor

Added.

@RamLavi RamLavi force-pushed the nodeshadow branch 2 times, most recently from 049b850 to b2d541c on May 2, 2024 13:19
Contributor

@iholder101 iholder101 left a comment

Thank you @RamLavi and @xpivarc for this proposal! It's indeed impressive!

I tend to vote +1 to the Validating Admission Policy approach. I think it would be the most straightforward and elegant solution to this CVE.

However, I deeply agree with @EdDev that as a follow-up to this effort it would be great to see a discussion on how to reduce our dependency on the node object in general. As most of our node info is static anyhow, I think we can find better ways to store most of it (e.g. a config map). The heartbeat issue, as @RamLavi said before, can be solved otherwise with a readiness/liveness probe.


2. virt-handler/heartbeat:
* [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L96) the node with `kubevirt.io/schedulable=false` when the stop channel closes (i.e. virt-handler exits).
* [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
Contributor

nit: typo

Suggested change
* [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
* [labels and annotations](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.

Contributor

Thanks! DONE

| Validation rule | Effect | Notes |
|---|---|---|
| object.spec == oldObject.spec | virt-handler user cannot modify spec of the nodes | |
| object.status == oldObject.status | virt-handler user cannot modify status of the nodes | |
| object.(not labels/annotations/resourceVersion/managedFields) == oldObject.(not labels/annotations/resourceVersion/managedFields) | virt-handler user can only change allowed sub-metadata fields. | |
Contributor

is managedFields necessary?

Contributor

IIUC, managed fields is a legitimate way to configure the object, therefore it shouldn't be restricted..

Contributor

So virt-handler can change the fields managed by a node object? When would this be valuable?

Contributor

hmmm I'm not so sure anymore it is needed. @xpivarc - can you re-explain why this is needed?

Comment on lines 76 to 87
### Backwards compatibility:
validationAdmissionPolicy [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) on k8s 1.30, so it should be OK for deployment on Kubevirt main and next minor release branches.
However, since kubevirt needs to be supported up to 2 k8s versions back, then the deployment of these new objects needs to be conditioned by the k8s version deployed.
The virt-operator deployment logic should be:
```
if k8s-ver >= v1.30
then
deploy the objects
else
deploy, but only log on failure
```
Once k8s v1.29 and below is not supported on kubevirt, this logic should be removed.
Contributor

What if Kubevirt is being installed on k8s 1.29 (where validationAdmissionPolicy is beta), but the feature gate is enabled? I think it's reasonable to deploy the objects in such a scenario, and it might even be an acceptable workaround for Kubevirt users until they're able to upgrade their k8s version.

Perhaps we can take a different approach of using dry-run to determine if the object is known or not.
That is:

  • Create the objects with a dry-run option. Check if an error is returned.
    • If a nil error is returned - re-create without the dry-run option.
    • If an "object unknown" error is returned (this one?), log and continue.
    • If a different error is returned - fail.

We can obviously drop the second step once we drop support for k8s 1.29.

Contributor

@RamLavi RamLavi May 6, 2024

Hey, I'm not sure why we need to dry-run - if the object is not installed it'll not deploy anyway, right?

But we do want to distinguish two behaviors:
For 1.29-, we want to deploy (or at least try), and if we get an error, log it.
For 1.30+, we do want to fail on errors, and not only log.
Ignoring an "object not found" error in the general case is wrong imo.

Contributor

Ignoring an "object not found" error in the general case is wrong imo

I agree, but relying on the k8s version is (as someone also said here 😉) also "unusual" :) The question is what's worse.

But TBH I initially thought that for 1.29- we were going to not even deploy the objects, which sounded problematic. If you're going to simply try to create the objects on a best-effort basis and log on error, it sounds better to me.

Thanks!

```
!`requestHasNodeName` OR (object.Name == request.nodeName)
```
However, this field is [beta](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens) on k8s 1.30 and is currently protected by the [ServiceAccountTokenPodNodeInfo](https://github.com/kubernetes/kubernetes/blob/c4bce63d9886e5f1fc00f8c3b5a13ea0d2bdf772/pkg/features/kube_features.go#L753) k8s feature gate.
Contributor

As I suggest in one of the other comments, WDYT about supporting it on a best-effort basis? Admins that run older k8s versions might want to consider using the FG as a workaround.

Contributor

@RamLavi RamLavi May 6, 2024

I replied in that comment, but note that's a minor implementation detail; we can continue this important discussion on the future PR.

Comment on lines +106 to +101
## Solution option #2: ShadowNode CRD
The security issue is being addressed by introducing a new lightweight CRD: "ShadowNode", one for each Kubernetes node.
Virt-handler will keep read access to nodes but will lose write access. Virt-handler will instead write to the ShadowNode object, and this will subsequently be selectively copied to the node by a new controller. This controller will be the only entity that writes to the node object.
Member

@Barakmor1 Barakmor1 May 7, 2024

Actually, I'm considering whether this solution could be viable, particularly if it's used alongside the scheduling readiness feature of k8s, which is expected to be GAed in k8s 1.30.
The flow I've been thinking about goes like this:

  1. virt-handler is deployed and sets the ShadowNode Objects with the proper annotations and labels, just like you mentioned.
  2. Whenever a vmi is created, virt-api will mutate the virt-launcher pod with a dedicated SchedulingGate.
  3. virt-controller will detect virt-launcher pods with the SchedulingGate and do the following for each launcher pod:
    • Calculate suitable nodes for scheduling from the kubevirt perspective, let's mark it as <nodeList>.
    • Remove the SchedulingGate from the launcher pod and add node affinity to allow scheduling only to nodes in <nodeList>.

By leveraging scheduling readiness, we might be able to avoid node modifications entirely within KubeVirt, as suggested in this comment, and it would also allow us to remove all the boilerplate and implementation details of KubeVirt from the nodes. Additionally, it might address concerns about users with node access gaining visibility into hardware information.
Any thoughts?

FYI @vladikr
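For illustration, a sketch of what a gated virt-launcher pod could look like under this flow (gate name, node names and image are placeholders; note that, as pointed out further down in the thread, node affinity cannot be updated on an already-pending pod, which is one of the open questions here):

```yaml
# Sketch only: virt-api creates the pod with a scheduling gate; virt-controller
# later removes the gate once the KubeVirt-suitable node list is reflected in
# the node affinity below.
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-example
spec:
  schedulingGates:
  - name: kubevirt.io/node-placement        # hypothetical gate set by virt-api
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values: ["node01", "node02"]    # the computed <nodeList>
  containers:
  - name: compute
    image: busybox                          # placeholder image
    command: ["sleep", "3600"]
```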

Member

@Barakmor1 I think this is a very interesting idea! I'm not 100% sure we need a gate.
Virt-controller is the one that creates the pods, so it can calculate the list of suitable nodes before posting the Pod - and do the same during migration.
One possible issue is (as @xpivarc pointed out offline) that this list may become extensive.

@rmohr FYI

Member

@Barakmor1 Barakmor1 May 7, 2024

@Barakmor1 I think this is a very interesting idea! I'm not 100% sure we need a gate.
Virt-controller is the one that creates the pods, so it can calculate the list of suitable nodes before posting the Pod - and do the same during migration.

But if no node matches a launcher, it will remain pending forever, as the <nodeList> will remain empty. Alternatively, it may be possible to sync the affinity when a launcher is in a scheduling state.

One possible issue is (as @xpivarc pointed out offline) that this list may become extensive.

I think that the maximum number of nodes is 5000. I wonder if affinity can handle this. Also, we could set a maximum number of nodes for the affinity and choose random nodes when we reach the maximum.

Contributor

@RamLavi RamLavi May 7, 2024

My 2 cents: +1 to all efforts to use alternative ways to avoid the need to patch the node.
If we could find good alternatives to all the places virt-handler is patching the node - then we could remove the patch permission entirely.
But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.

Member

But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.

Can't we patch the shadowNode with heartbeat labels as well?

Member

@Barakmor1 Barakmor1 May 8, 2024

@vladikr @xpivarc I tested it and it seems like Kubernetes can handle a pod with affinity for 5000 nodes; this is how I generated the manifest:

```bash
#!/bin/bash

cat <<EOF > pod_with_node_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod-with-resources
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
EOF

for i in {1..5000}; do
  echo "          - matchFields:" >> pod_with_node_affinity.yaml
  echo "              - key: metadata.name" >> pod_with_node_affinity.yaml
  echo "                operator: In" >> pod_with_node_affinity.yaml
  echo "                values:" >> pod_with_node_affinity.yaml
  echo "                  - node${i}" >> pod_with_node_affinity.yaml
done

echo "  containers:" >> pod_with_node_affinity.yaml
echo "  - name: my-container" >> pod_with_node_affinity.yaml
echo "    image: busybox" >> pod_with_node_affinity.yaml
echo "    command: [\"sleep\", \"3600\"]" >> pod_with_node_affinity.yaml
echo "    resources:" >> pod_with_node_affinity.yaml
echo "      requests:" >> pod_with_node_affinity.yaml
echo "        memory: \"256Mi\"" >> pod_with_node_affinity.yaml
echo "        cpu: \"200m\"" >> pod_with_node_affinity.yaml

Edit: I found that nodeAffinity cannot be updated in Pending state, so it would probably be too complex to support as we might have to delete pending launchers if the nodes change :\

Contributor

But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.

Can't we patch the shadowNode with heartbeat labels as well?

We can, but I believe we can solve it in a simpler, more k8s-native approach. Will issue a proposal about it in the near future.

@RamLavi
Contributor

RamLavi commented May 8, 2024

Change:

  • updated the chosen solution: validationAdmissionPolicy, following the positive feedback and the POC done offline.
  • updated the backwards compatibility approach, and added an appendix to further discuss it.

Note: This proposal is now mature enough and is waiting for approval.

@fabiand
Member

fabiand commented May 8, 2024

Thanks for the update Ram.

@RamLavi please resolve all open conversations in this document in order to move forward.

@RamLavi
Contributor

RamLavi commented May 8, 2024

Thanks for the update Ram.

@RamLavi please resolve all open conversations in this document in order to move forward.

Technically speaking, I don't think I can resolve conversations I didn't open, but I made sure I answered them.

@iholder101
Contributor

Thanks a lot @RamLavi for this proposal!
As said above, I vote +1 to the Validating Admission Policy approach.

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2024
@vladikr
Member

vladikr commented May 9, 2024

In general, the ShadowNode approach with the variation that @Barakmor1 suggested, would be my preference.
My main fear is that a Validating Admission Policy for virt-handler will make debugging, when something goes wrong, challenging (to say the least).
Moreover, it seems like a missed opportunity that we are not taking care of all the labels virt-handler is setting on the node - that is, not isolating them on a separate object.

However, I'm not blocking this - especially if @iholder101 and @jean-edouard are happy to support this.
so
/lgtm
from me.

@iholder101
Contributor

Hey @vladikr!
Before merging this, let's make sure we discuss all the points you've raised.

In general, the ShadowNode approach with the variation that @Barakmor1 suggested, would be my preference.

@Barakmor1 and I were discussing this approach offline.
One of the problems with this approach is that pods are (mostly) immutable, and we can't change affinities nor .spec.nodeName after the pod is created.

The steps @Barakmor1 was mentioning above are:

  1. virt-handler is deployed and sets the ShadowNode Objects with the proper annotations and labels, just like you mentioned.
  2. Whenever a vmi is created, virt-api will mutate the virt-launcher pod with a dedicated SchedulingGate.
  3. virt-controller will detect virt-launcher pods with the SchedulingGate and do the following for each launcher pod:
    • Calculate suitable nodes for scheduling from the kubevirt perspective, let's mark it as <nodeList>.
    • Remove the SchedulingGate from the launcher pod and add node affinity to allow scheduling only to nodes in <nodeList>.

Unfortunately, the last bullet under (3) is not possible, since after a pod is created, we simply cannot set new affinities.

As you said rightfully, we can skip the step of adding a scheduling gate. But even then we're facing corner cases, such as:

  • A virt-launcher pod is created, targeting a single node N1 that fits.
  • Just before the pod is scheduled, N1 is being removed from the cluster.
  • Right after that, a new node N2 (that is fitting for the launcher pod) joins the cluster.
  • The pod remains scheduling forever.

In this (and similar) cases we would have to delete the pod and re-create it.

Bottom line is: to support such an approach we'd have to introduce a mechanism that monitors the scheduling pods and re-creates them when necessary.
Let's continue discussing this approach further if you think it's viable/valuable.

My main fear is that Validating Admission Policy for virt-handler will make debugging, when something goes wrong, challenging (to say the least)

That's a fair point.
@xpivarc Any thoughts?

Moreover, it seems like a missed opportunity that we are not taking care of all the labels on the node virt-handler is setting - that is not isolating it on a separate object.

I 100% agree. This would definitely be ideal. The sad truth is that the vast vast majority of our ~200 labels are completely static in nature, and it's very tempting to save them elsewhere (either the shadow node CRD or even a configmap).

What concerns me around this is that we essentially have two options AFAICT:

  1. label the node, create virt-launcher (i.e. what we're doing now), let the scheduler handle scheduling.
  2. as said above, create a mechanism to monitor and re-create the launcher pod whenever necessary. Essentially, we get into the business of scheduling.

@vladikr what are your thoughts regarding that? Again if you think it's a reasonable path forward, let's discuss it!

@RamLavi
Contributor

RamLavi commented May 9, 2024


I want to stress again that I am all for any change that will reduce patches to the node. IMO virt-handler should not patch the node at all. But the fact is it does patch it (even if we remove the node-labeller), and in order to solve the CVE, the most direct way is to add a VAP.
Can we agree to:

  1. Add the VAP as part of this proposal, solving the CVE.
  2. In parallel, propose ways to remove all the different scenarios where virt-handler patches the node (there is more than just the node-labeller). For example, I plan to PoC the removal of the heartbeat mechanism. Other ways to remove patches were already suggested on this PR (see solution 3); more suggestions are always welcome.
  3. Once all patches are gone, we can say goodbye to this VAP, and also to the RBAC permission to patch the node.

@vladikr
Member

vladikr commented May 9, 2024

/approve

@kubevirt-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2024
@xpivarc
Member Author

xpivarc commented May 10, 2024

/hold cancel
@fabiand Seems we have agreement, I hope I did not miss anyone. Thanks everyone for the discussion, hopefully we can do more than fix CVEs in the future ;)

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2024
@kubevirt-bot kubevirt-bot merged commit a98f4f0 into kubevirt:main May 10, 2024
1 of 2 checks passed
@fabiand
Member

fabiand commented May 23, 2024

Awesome

10x

🚀

@xpivarc we still need to follow up to limit a handler's powers to the workloads on its own node. Right? Can VAP help here as well?

@xpivarc
Member Author

xpivarc commented May 27, 2024

Awesome

10x

🚀

@xpivarc we still need to follow up to limit a handler's powers to the workloads on its own node. Right? Can VAP help here as well?

Yes, the VAP could be used here, but it is not a strict requirement as we still have webhooks for the resources. Therefore it is a matter of extending them to check that the right handler is making the request.
