Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MGMT-18684: Ensure that bootstrap CSR is authenticated against cluster hosts. #895

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

paul-maidment
Copy link
Contributor

Once the minimal control plane has been performed and consists of two nodes, we start the assisted-installer-controller pod and then reboot the bootstrap when this is ready. The bootstrap will attempt to join the control plane when it boots up. As part of this process, CSR's are submitted. Presently, we just assume that these are meant to be signed and blindly approve them.

This is problematic from a security perspective and could lead to an attack where, with careful timing, someone might be able to add a rogue node to the cluster. To prevent this, we should check that the hostname being presented in the CSR actually matches one of our known hosts.

To do this, requires two steps:

1: Prior to start of the assisted-installer-controller pod, we will save a list of known hosts in a config map called known-hosts, this will contain a single entry content which will be a newline separated list of hostnames
2: During the CSR validation, this will be read and used to verify that the CSR originates from a known host

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 29, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 29, 2024

@paul-maidment: This pull request references MGMT-18684 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.18.0" version, but no target version was set.

In response to this:

Once the minimal control plane has been performed and consists of two nodes, we start the assisted-installer-controller pod and then reboot the bootstrap when this is ready. The bootstrap will attempt to join the control plane when it boots up. As part of this process, CSR's are submitted. Presently, we just assume that these are meant to be signed and blindly approve them.

This is problematic from a security perspective and could lead to an attack where, with careful timing, someone might be able to add a rogue node to the cluster. To prevent this, we should check that the hostname being presented in the CSR actually matches one of our known hosts.

To do this, requires two steps:

1: Prior to start of the assisted-installer-controller pod, we will save a list of known hosts in a config map called known-hosts, this will contain a single entry content which will be a newline separated list of hostnames
2: During the CSR validation, this will be read and used to verify that the CSR originates from a known host

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@paul-maidment paul-maidment marked this pull request as draft August 29, 2024 16:13
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 29, 2024
@openshift-ci openshift-ci bot requested review from jhernand and omertuc August 29, 2024 16:14
Copy link

openshift-ci bot commented Aug 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2024
Copy link

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 55.45%. Comparing base (4635ddf) to head (a85a996).
Report is 28 commits behind head on master.

Current head a85a996 differs from pull request most recent head 843614d

Please upload reports for the commit 843614d to get more accurate results.

Files with missing lines Patch % Lines
src/installer/installer.go 50.00% 6 Missing and 3 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #895      +/-   ##
==========================================
- Coverage   55.70%   55.45%   -0.25%     
==========================================
  Files          15       15              
  Lines        3208     3226      +18     
==========================================
+ Hits         1787     1789       +2     
- Misses       1249     1262      +13     
- Partials      172      175       +3     
Files with missing lines Coverage Δ
src/installer/installer.go 67.35% <50.00%> (-1.79%) ⬇️

Copy link

openshift-ci bot commented Aug 29, 2024

@paul-maidment: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-unit-test a85a996 link true /test edge-unit-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@omertuc
Copy link
Contributor

omertuc commented Aug 30, 2024

/hold Since it's trivial to create CSRs with any hostname, this doesn't improve security

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2024
@paul-maidment
Copy link
Contributor Author

paul-maidment commented Sep 1, 2024

/hold Since it's trivial to create CSRs with any hostname, this doesn't improve security

Thank you for your response Omer. The PR is still in draft as I am contemplating how to truly solve this issue.

I accept the point about this doing little to nothing to advance security as I can see that you can't be sure that the subject of the CSR is exactly the host identified.

However, thinking carefully about the original request in MGMT-18684
I am not actually sure we even have a security issue in the first place or at least the scope of the problem is limited and a lot less critical than we might think.

"This means that there is a window during installation when a rogue node can join the cluster with no verification. "

is partially true but we need to consider how a CSR can be registered with the cluster

This means that they

  1. Need a Kubeconfig
  2. Need to have authority to create CSRs

Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?

As identified in https://issues.redhat.com/browse/MGMT-18684, it's possible that the user may register a CSR for a subject that is not a node and automatically receive approval for it. Ideally we should prevent this.

If so then we only really need to check in a similar fashion to how the cluster-machine-operator does already...
https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364

Granted we would probably be best to forgo the "pre reboot" information gathering phase that I do in this PR and instead rely on DNS records in the cluster post reboot.

@omertuc
Copy link
Contributor

omertuc commented Sep 1, 2024

This means that they

  1. Need a Kubeconfig
  2. Need to have authority to create CSRs

Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?

  1. "Need a kubconfig" - all you need is the network connectivity and the IP address of the API server. kubeconfig is "needed" only to make oc work, but users can just as well curl the API server or use whatever automation they want, it's just HTTP requests. Also MCS will give you this kubeconfig for free, unauthenticated when you query it for ignition inside the ignition

  2. Pretty sure in OCP anyone is authorized to create CSRs, this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR approvals are the only line of defense we have against rogue nodes fully joining the cluster as far as my understanding goes

@paul-maidment
Copy link
Contributor Author

paul-maidment commented Sep 1, 2024

This means that they

  1. Need a Kubeconfig
  2. Need to have authority to create CSRs

Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?

1. "Need a kubconfig" - all you need is the network connectivity and the IP address of the API server. `kubeconfig` is "needed" only to make `oc` work, but users can just as well `curl` the API server or use whatever automation they want, it's just HTTP requests. Also MCS will give you this kubeconfig for free, unauthenticated when you query it for ignition inside the ignition

2. Pretty sure in OCP anyone is authorized to create CSRs, this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR _approvals_ are the only line of defense we have against rogue nodes fully joining the cluster as far as my understanding goes

What are your thoughts on this implementation here?

https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364

The checks that we perform to ensure that the node with a specific name

  1. Does not already exist
  2. Is resolvable to an IP within the cluster DNS on the cluster side.
  3. Resolves to a known machine.

Are these sufficient to prevent a rogue node from joining or do they still fall short? Is cluster-machine-approver secure enough to be trusted or is it prone to the same issues you highlight?

Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but certainly a lot harder to pull off...

if an attack on cluster-machine-approver is possible, does this mean we have a broader issue?

 Also MCS will give you this kubeconfig for free, unauthenticated when you query it for ignition inside the ignition

This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...

@paul-maidment
Copy link
Contributor Author

paul-maidment commented Sep 1, 2024

"Pretty sure in OCP anyone is authorized to create CSRs, this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR approvals are the only line of defense we have against rogue nodes fully joining the cluster as far as my understanding goes"

I have tried various ways of using curl requests to create and list CSR's as an anonymous user, pretty sure that I am using the correct syntax (which oc will quite happily dump for you if you add a parameter like -v 1000000)

curl -v -k -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.16.0 (linux/amd64) kubernetes/e8fb3c0" 'https://api.testcluster.multinode.local:6443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500'

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"certificatesigningrequests.certificates.k8s.io is forbidden: User \"system:anonymous\" cannot list resource \"certificatesigningrequests\" in API group \"certificates.k8s.io\" at the cluster scope","reason":"Forbidden","details":{"group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":403}

I receive similar messages for creation attempts also. It seems that there is not an "anonymous" way to create a CSR over HTTPS, I could be mistaken but I think we should check this in detail. Do you know who might be able to verify this? One of the Node team perhaps?

These requests are being made against the "minimal" control plane that is created just before the bootstrap joins.

@omertuc
Copy link
Contributor

omertuc commented Sep 1, 2024

Are these sufficient to prevent a rogue node from joining or do they still fall short?

if an attack on cluster-machine-approver is possible, does this mean we have a broader issue?

What are your thoughts on this implementation here?

Not sure, seems rather insufficient. But it depends on the requirements for cluster-machine-approver and their threat model, I can't comment on the security of that component without having this context.

I can however comment on assisted-installer, which is supposed to be a user-friendly installer, not meant for power users, so the security requirements should be tight and fool-proof or come with a ton of UI warnings to help users protect themselves from our non-ideal behavior.

Of course your PR is not making anything worse, but if we're pretending to fix our security posture with regards to how we choose to approve CSRs, we should actually do that right and not use some weak measures like comparing easily fakeable fields on a CSR. I don't get the point of that.

Is resolvable to an IP within the cluster DNS on the cluster side

Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but certainly a lot harder to pull off...

What does "Resolveable to an IP" mean here? If you're referring to the IP from which the request to create the CSR came from, then that definitely makes life harder for an attacker. But I'm not sure that information is even available outside of audit logs. If you're referring to the IP addresses specified in the CSR itself, anyone can set those to anything arbitrarily as far as I'm aware, just like they can set the hostname.

This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...

It gives you a kubeconfig, not an admin kubeconfig. It's basically a YAML with a URL in it

I receive similar messages for creation attempts also. It seems that there is not an "anonymous" way to create a CSR over HTTPS, I could be mistaken but I think we should check this in detail.

There must be some way to do it, as this is what the kubelet on fresh nodes with 0 credentials does (all they have is RHCOS + the trivially publicly available ignition they got from MCS), so you're probably missing something.

@paul-maidment paul-maidment force-pushed the MGMT-18684 branch 2 times, most recently from 44e6955 to 2685d0d Compare September 4, 2024 19:46
Copy link
Contributor

@omertuc omertuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool

src/common/common.go Outdated Show resolved Hide resolved
src/common/common.go Outdated Show resolved Hide resolved
src/common/common.go Outdated Show resolved Hide resolved
src/common/common.go Outdated Show resolved Hide resolved
src/common/common.go Outdated Show resolved Hide resolved
src/ignition/ignition.go Outdated Show resolved Hide resolved
src/installer/installer.go Outdated Show resolved Hide resolved
src/installer/installer.go Outdated Show resolved Hide resolved
src/installer/installer.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2024
@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2024
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2025
@paul-maidment
Copy link
Contributor Author

/remove-lifecycle stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants