
WIP: enhancements/machine-config: securing MCS #626

Closed
wants to merge 2 commits into from

Conversation

@crawford (Contributor) commented Feb 4, 2021

We're getting heat for blocking arbitrary ports on the pod network (TCP 22623/22624). This is one potential path forward and is an alternative to #443.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign crawford after the PR has been reviewed.
You can assign the PR to them by writing /assign @crawford in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@crawford changed the title from "enhancements/machine-config: securing MCS" to "WIP: enhancements/machine-config: securing MCS" on Feb 4, 2021
@openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Feb 4, 2021

## Proposal

If it can be ensured that no one is able to access sensitive data through the Machine Config Server, the port restrictions can be safely removed. In order to achieve this, Machine Configs can be split into two groups: pre-node-approved configs and post-node-approved configs. The pre-node-approved configs are ones which do not contain sensitive information and contain just enough to configure new nodes to submit a certificate signing request (CSR) to the API server. The post-node-approved configs contain everything else, including the overwhelming majority of customer-provided Machine Configs. When a new machine is created, it fetches its Ignition config from the Machine Config Server, which only serves the pre-node-approved configs. At this point, the machine has enough information to submit a CSR, but nothing sensitive (note that for the purposes of this proposal, the CSR bootstrap token and pull secret are not considered sensitive). Once the CSR has been approved and a kubeconfig returned, the Machine Config Daemon is able to directly enumerate Machine Configs, including the post-node-approved ones, so it then applies any remaining configuration to the machine and reboots if necessary. At this point, the machine is fully configured.
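To make the split concrete, here is a minimal Go sketch of the partitioning the proposal describes. The annotation name and the simplified type are hypothetical, introduced only for illustration; the enhancement does not specify how the two groups would be marked.

```go
package main

import "fmt"

// MachineConfig is a simplified stand-in for the real CRD type; only the
// fields needed for this sketch are included.
type MachineConfig struct {
	Name        string
	Annotations map[string]string
}

// preNodeApprovedAnnotation is a hypothetical marker; the enhancement does
// not name one, so this is purely illustrative.
const preNodeApprovedAnnotation = "machineconfiguration.openshift.io/pre-node-approved"

// splitConfigs partitions Machine Configs into the two groups described in
// the proposal: pre-node-approved (safe to serve from the unauthenticated
// MCS) and post-node-approved (visible to the MCD only after CSR approval).
func splitConfigs(all []MachineConfig) (pre, post []MachineConfig) {
	for _, mc := range all {
		if mc.Annotations[preNodeApprovedAnnotation] == "true" {
			pre = append(pre, mc)
		} else {
			post = append(post, mc)
		}
	}
	return pre, post
}

func main() {
	configs := []MachineConfig{
		{Name: "00-worker-kubelet-bootstrap", Annotations: map[string]string{preNodeApprovedAnnotation: "true"}},
		{Name: "99-worker-customer-chrony", Annotations: map[string]string{}},
	}
	pre, post := splitConfigs(configs)
	fmt.Printf("MCS serves %d config(s); MCD applies %d more after CSR approval\n", len(pre), len(post))
}
```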
Member:

I think it would be helpful to pull out some of the assumptions and state them very clearly up front. It's included in here, but perhaps before this paragraph, a new paragraph could say:

Security assumptions:

  • The CSR bootstrap token is not considered sensitive, so it is acceptable for it to be accessible to all applications in the cluster.
  • The pull secret is not considered sensitive, so it is acceptable for it to be accessible to all applications in the cluster.
  • Contents of MachineConfig resources are considered (potentially) sensitive, so this proposal introduces a change to protect them from unauthorized access through the MCS

Do I have that right?

This proposal would have to depend on some other enhancement that would allow that first assumption to hold. I believe right now you could DoS a cluster with the ability to issue CSRs.


## Summary

The bootstrapping process for new cluster nodes is a tricky problem, primarily due to the fact that new nodes start from a position of almost zero knowledge. These nodes need to fetch their machine configuration and authenticate themselves with the cluster before they can access most resources and have workloads scheduled. At the same time, a malicious actor pretending to be a new node cannot be allowed to join the cluster. Otherwise, they would be able to get access to sensitive resources and could have workloads and their secrets scheduled to them. Today, we protect against this by preventing cluster workloads from accessing the Machine Config Server, the component of the Machine Config Operator which is responsible for serving Ignition configs, derived from a set of Machine Configs, to new machines. This approach presents a problem though: workloads on OpenShift are not allowed to make use of TCP ports 22623 and 22624, the ports used by the Machine Config Server.
Contributor:

linebreaks please! (makes it easier to leave review comments)


Contributor:

This approach presents a problem though: workloads on OpenShift are not allowed to make use of TCP ports 22623 and 22624, the ports used by the Machine Config Server.

(This is specifically because the blocking rules were done as a quick hack. In theory, we could block every pod from accessing only the MCS without blocking anything else, but that would require noticing when masters were added/moved and updating the rules for every pod when that happened. Since this was supposed to have been a temporary hack, we didn't bother, and instead just blocked access to ports 22623 and 22624 on all IPs.)


Contributor:

At this point, the machine has enough information to submit a CSR, but nothing sensitive (note that for the purposes of this proposal, the CSR bootstrap token and pull secret are not considered sensitive). Once the CSR has been approved

Wait... we're saying that

  • the MCS cannot distinguish a real proto-node from an attacker, but
  • the certificate approver can distinguish a CSR sent by a real proto-node that has received non-secret information from the MCS, from a CSR sent by an attacker that has received non-secret information from the MCS

That doesn't make sense; if the certificate approver can distinguish them, then the MCS ought to be able to distinguish them in the same way. Specifically, the MCS should only respond to connections from IPs that correspond to not-yet-approved Machine resources (or something like that). Right?

Member:

The logic there makes sense. If we're willing to automatically approve a CSR, that same logic should be sufficient for accepting a connection to MCS.

However, the MCS doesn't have all the same information that the CSR approver has. It uses the contents of the CSR itself as part of the validation, so we can't copy it easily, unless the MCS request includes more info than it does today.

For reference, see the README in https://github.com/openshift/cluster-machine-approver/; most of the important logic is in csr_check.go.
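As a rough illustration of the style of check described above (not the actual csr_check.go logic), here is a hedged Go sketch that compares the IP SANs in a node's CSR against the addresses recorded on its Machine:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// csrMatchesMachine is a rough sketch of the style of check discussed above:
// every IP SAN in the node's CSR must appear among the addresses recorded on
// its Machine. The real cluster-machine-approver (csr_check.go) performs
// considerably more validation than this.
func csrMatchesMachine(pemCSR []byte, machineAddrs []string) (bool, error) {
	block, _ := pem.Decode(pemCSR)
	if block == nil {
		return false, fmt.Errorf("no PEM block found in CSR")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return false, err
	}
	allowed := map[string]bool{}
	for _, a := range machineAddrs {
		allowed[a] = true
	}
	for _, ip := range csr.IPAddresses {
		if !allowed[ip.String()] {
			return false, nil // CSR claims an address the Machine does not have
		}
	}
	return len(csr.IPAddresses) > 0, nil
}

func main() {
	fmt.Println(csrMatchesMachine([]byte("not a real CSR"), []string{"10.0.0.7"}))
}
```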

Contributor:

However, the MCS doesn't have all the same information that the CSR approver has. It uses the contents of the CSR itself as part of the validation, so we can't copy it easily, unless the MCS request includes more info than it does today.

But that other information doesn't matter. If the MCS can see that a request is coming from an IP that isn't the IP of a legitimate "proto-node", then it knows that the requester cannot possibly construct an approvable CSR (that points to itself), and so there is no reason for the MCS to talk to it.

Member:

Is the source IP good enough? Let's say untrusted-pod-A on node1 is trying to reach the MCS on node2. Won't that request be SNAT'd on node1, so node2 is going to see node1's IP address and consider it legitimate?

Contributor Author:

That doesn't make sense; if the certificate approver can distinguish them, then the MCS ought to be able to distinguish them in the same way.

The client for the MCS is Ignition, which has significantly less functionality than the client for the CSR endpoint, the kubelet. Eventually, the approver could be enhanced (by us or the customer) to make use of AWS signed requests or GCP instance identity verification. Keeping the MCS out of the authentication path allows the customer to decide how much security they need for their deployment. If we did everything in the MCS, they wouldn't easily be able to swap out the implementation.
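For context, here is a hedged Go sketch of the node-side half of the GCP idea: fetching a signed instance identity token from the GCE metadata server, which an enhanced approver could later verify against Google's public keys. The audience string is illustrative, and the verification side is not shown.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// fetchGCPIdentityToken asks the GCE metadata server for a signed instance
// identity token (a JWT). An enhanced CSR approver could verify this token to
// confirm the request really comes from a cloud instance the cluster created.
func fetchGCPIdentityToken(audience string) (string, error) {
	endpoint := "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=" +
		url.QueryEscape(audience) + "&format=full"
	req, err := http.NewRequest("GET", endpoint, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google") // required by the metadata server
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// The audience value is an assumption for illustration only.
	token, err := fetchGCPIdentityToken("openshift-csr-approver")
	if err != nil {
		fmt.Println("not running on GCE, or metadata server unreachable:", err)
		return
	}
	fmt.Println("got identity token of", len(token), "bytes")
}
```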

Contributor:

Is the source IP good enough? Let's say untrusted-pod-A on node1 is trying to reach the MCS on node2. Won't that request be SNAT'd on node1, so node2 is going to see node1's IP address and consider it legitimate?

node1's address is not legitimate, because node1 is an actual node, not a "proto-node". An actual functioning node has no reason to talk to the MCS (as I understand it); an IP should only be contacting the MCS if there is a Machine that claims that IP but no Node that claims that IP; ie, if the cluster is expecting a node to come up with that IP, but it knows that the node hasn't actually come up yet.

(Even if node1 gets accidentally nuked and you want to recreate it, then (AIUI) you would have to delete the old node1 Node/CSR first, since otherwise the new node1's CSR would fail to match and the approver would consider it an imposter. So even for reinstalls, the rule that "only IPs that correspond to a Machine but not a Node should talk to the MCS" should work.)

Eventually, the approver could be enhanced (by us or the customer) to make use of AWS signed requests or GCP instance identity verification.

Do you mean adding that to the existing approval mechanism, or removing the existing approval mechanism entirely and switching it to be based entirely on cloud APIs? If you just mean adding, then yes, this would let the CSR approver distinguish "this is actually a real node VM" from "this is someone pretending to be a new node VM". But as long as we still have Machine objects, and the CSR approver still requires that Nodes come up with IPs that match their corresponding Machines, then the MCS can check those IPs itself too, and refuse to talk to any IP that isn't a valid proto-node IP.
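A minimal Go sketch of the gating rule this thread converges on: only serve MCS requests from addresses that belong to a Machine with no corresponding Node. The hard-coded address set, handler path, and middleware are illustrative only, and the SNAT concern raised above would still need to be addressed.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
)

// protoNodeIPs would, in a real implementation, be kept in sync with the
// Machine and Node APIs: the set of addresses claimed by a Machine that has
// no corresponding Node yet. Hard-coded here for illustration only.
var protoNodeIPs = map[string]bool{
	"10.0.32.15": true,
}

// onlyProtoNodes refuses MCS requests whose source IP is not an expected
// not-yet-approved node address. This is a sketch of the idea, not how the
// MCS behaves today.
func onlyProtoNodes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil || !protoNodeIPs[host] {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/config/worker", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{ "ignition": { "version": "3.2.0" } }`)
	})
	// 22623 is the port used by the MCS.
	log.Fatal(http.ListenAndServe(":22623", onlyProtoNodes(mux)))
}
```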

Contributor:

An actual functioning node has no reason to talk to the MCS (as I understand it).

Correct, unless an existing node is re-provisioning for some reason (e.g. disk failure), and the Node object still exists. (Is this a case we need to handle? Will the new CSR even be signed?)

Member:

...an IP should only be contacting the MCS if there is a Machine that claims that IP but no Node that claims that IP...

Folks can also provision their own instances, without going through the machine API. In that case, they're the hook to handle CSR signing for those machines. With the "MCS includes CSR-approver-style-verification" approach, those folks would need a hook so they could tell the MCS, "yes, really serve requests to my new proto-node I manually provisioned".

Member:

I think it's a no-brainer to include node IP validation in the MCS - that's what my original PR has in fact here: openshift/machine-config-operator#784

But that said, we also need to care about UPI on-premise installations where the network may be exposed to machines outside the cluster. To me, this is where even a non-rotating auth token is extremely helpful - it provides a really strong first layer of defense.

## Alternatives

- Do nothing and just remove the network policy. This would expose the contents of (what would be) post-node-approved Machine Configs to everything that has network access to the control plane.
- Add support for an authentication token to the Machine Config Server (https://github.com/openshift/enhancements/pull/443). The real downside to this approach is that it makes troubleshooting more difficult, since troubleshooting would likely need to happen in the initramfs of RHCOS. This is an extremely limited environment and requires console or serial access, which isn't available in a number of environments.
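For comparison, a hedged Go sketch of what the token alternative could look like on the server side: a query-parameter token (matching the example URLs discussed below) checked with a constant-time comparison. The environment variable, handler path, and middleware are illustrative, not the #443 design.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"log"
	"net/http"
	"os"
)

// expectedToken stands in for the shared secret that #443 would embed in the
// Ignition URL; how it is generated and rotated is exactly the open question
// raised in the discussion below.
var expectedToken = os.Getenv("MCS_BOOTSTRAP_TOKEN")

// requireToken rejects requests that do not carry the expected token in the
// URL query string.
func requireToken(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.URL.Query().Get("token")
		if subtle.ConstantTimeCompare([]byte(got), []byte(expectedToken)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	if expectedToken == "" {
		log.Fatal("MCS_BOOTSTRAP_TOKEN must be set")
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/config/master", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{ "ignition": { "version": "3.2.0" } }`)
	})
	// e.g. https://some-ip:22623/config/master?token=xxxxxxxxxxxxxx
	log.Fatal(http.ListenAndServe(":22623", requireToken(mux)))
}
```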
Member:

is the experience different here versus when a user provides an invalid pull-secret at install?

Contributor:

What is the difference between screwing up

https://some-ip/some/value/ignition/master

and

https://some-ip/some/value/ignition/master?token=xxxxxxxxxxxxxx

You either cut and paste the secret correctly or not. At some point we also have to verify the remote ip (if you ignore validation errors on some-ip you are going to get MITM'd anyway, so security is not your problem).

The person provisioning a box is ultimately responsible for establishing a verifiable, secure, and verified channel to the thing that will establish the basic trust foundation for the rest of that box.

Why is troubleshooting the problem? Is that the only argument against requiring a shared secret between provisioner and provisionee to establish mutual verification?

Contributor Author:

@derekwaynecarr if the user provides an invalid pull-secret at install, the bootstrap machine will spin, trying to pull the release image. Eventually, the install will time out and the installer will collect logs. From the perspective of a single compute node, the two scenarios are indistinguishable, but I would expect an admin to look at more than just that. For example, if they SSH into the bootstrap node and run the recommended journalctl invocation, it will be immediately obvious that the pull secret is incorrect.

Contributor Author:

@smarterclayton https://some-ip/some/value/ignition/master will never change. https://some-ip/some/value/ignition/master?token=xxxxxxxxxxxxxx will change regularly (specifically, the token). The concern around troubleshooting is the fact that it will invariably be required more often when the Ignition URL is being changed periodically. There is a deeper question in there as well though. How will this token be rotated? This is easy in deployments which make use of the Machine API Operator, but everywhere else, this will have to be a manual process. And as we all know, if there is a periodic, manual task required for security, we can just assume it's not going to happen. I truly believe that the auth token will hurt the security of our clusters, all things considered.

@openshift-ci bot commented Feb 11, 2021

@crawford: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | 8ce00b1 | link | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

- The pull secret is not considered sensitive and being accessible to all applications in the cluster is acceptable.
- The contents of Machine Configs are potentially sensitive, especially if the Machine Configs are customer-provided.

Member:

Today we carry effectively 1.63 implementations of Ignition - one in the upstream project that runs in the initramfs, and 63% of Ignition in the MCD that runs in the real root. If we do this, then we'd hit all the limitations of the 63% MCD-Ignition implementation upfront. For example, while this isn't important to OpenShift, today the MCO doesn't support adding/changing users.

We may hit edge cases around e.g. parent directory creation, file write ordering etc. Those are all bugs that should be fixed of course, but we also tend to not hit them today because use of "day 2" MCs is so limited.

If we go this route it may be worth a spike around how we can have the MCO actually reuse Ignition upstream code directly for this.
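A sketch of what such reuse might look like, assuming the upstream Ignition v2 module exposes a spec-3.2 Parse function at the import path shown; the import path, signature, and file location are assumptions that should be checked against the pinned Ignition release.

```go
package main

import (
	"fmt"
	"os"

	// Assumed import path for the upstream Ignition v2 spec-3.2 parser; the
	// exact package and Parse signature should be verified against the pinned
	// Ignition release before relying on this.
	ignv3_2 "github.com/coreos/ignition/v2/config/v3_2"
)

func main() {
	// Hypothetical on-disk location of the rendered config, for illustration.
	raw, err := os.ReadFile("/etc/ignition/rendered.ign")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Parse with the upstream code instead of the MCD's partial
	// reimplementation, so file/unit semantics match real Ignition.
	cfg, parseReport, err := ignv3_2.Parse(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, parseReport.String(), err)
		os.Exit(1)
	}
	fmt.Printf("parsed %d files and %d systemd units\n",
		len(cfg.Storage.Files), len(cfg.Systemd.Units))
}
```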



@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Jun 17, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Jul 17, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this on Aug 17, 2021
@openshift-ci bot commented Aug 17, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
