From 8a0707721df2de74bd9c20764f81247e4b34133c Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 21 Sep 2023 11:32:41 +0200 Subject: [PATCH 01/20] Pin and pre-load images This patch adds an enhancement that describes a mechanism to pin and pre-load container images. Related: https://issues.redhat.com/browse/RFE-4482 Related: https://issues.redhat.com/browse/OTA-1001 Related: https://issues.redhat.com/browse/OTA-997 Related: https://github.com/openshift/machine-config-operator/pull/3839 Related: https://github.com/openshift/enhancements/pull/1432 Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 232 ++++++++++++++++++ 1 file changed, 232 insertions(+) create mode 100644 enhancements/machine-config/pin-and-pre-load-images.md diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md new file mode 100644 index 0000000000..2a9f9e06f9 --- /dev/null +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -0,0 +1,232 @@ +--- +title: pin-and-pre-load-images +authors: +- "@jhernand" +reviewers: +- "@avishayt" +- "@danielerez" +- "@mrunalp" +- "@nmagnezi" +- "@oourfali" +approvers: +- "@sdodson" +- "@zaneb" +- "@LalatenduMohanty" +api-approvers: +- "@sdodson" +- "@zaneb" +- "@deads2k" +- "@JoelSpeed" +creation-date: 2023-09-21 +last-updated: 2023-09-21 +tracking-link: +- https://issues.redhat.com/browse/RFE-4482 +see-also: +- https://github.com/openshift/enhancements/pull/1432 +- https://github.com/openshift/machine-config-operator/pull/3839 +replaces: [] +superseded-by: [] +--- + +# Pin and pre-load images + +## Summary + +Provide an mechanism to pin and pre-load container images. + +## Motivation + +Slow and/or unreliable connections to the image registry servers interfere with +operations that require pulling images. For example, an upgrade may require +pulling more than one hundred images. Failures to pull those images cause +retries that interfere with the upgrade process and may eventually make it +fail. One way to improve that is to pull the images in advance, before they are +actually needed, and ensure that they aren't removed. + +### User Stories + +#### Pre-load and pin upgrade images + +As the administrator of a cluster that has a low bandwidth and/or unreliable +connection to an image registry server I want to pin and pre-load all the +images required for the upgrade in advance, so that when I decide to actually +perform the upgrade there will be no need to contact that slow and/or +unreliable registry server and the upgrade will successfully complete in a +predictable time. + +#### Pre-load and pin application images + +As the administrator of a cluster that has a low bandwidth and/or unreliable +connection to an image registry server I want to pin and pre-load the images +required by my application in advance, so that when I decide to actually deploy +it there will be no need to contact that slow and/or unreliable registry server +and my application will successfully deploy in a predictable time. + +### Goals + +Provide a mechanism that cluster administrators can use to pin and pre-load +container images. + +### Non-Goals + +None. + +## Proposal + +### Workflow Description + +1. The administrator of a cluster uses the `ContainerRuntimeConfig` object to +request that a set of container images are pinned and pre-loaded: + + ```yaml + apiVersion: machineconfiguration.openshift.io/v1 + kind: ContainerRuntimeConfig + metadata: + name: ... + spec: + containerRuntimeConfig: + pinnedImages: + - quay.io/openshift-release-dev/ocp-release@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + ... + ``` + +1. The machine config operators ensures that all the images are pinned and +pulled in all the nodes of the cluster. + +### API Extensions + +There are no new object kinds introduced by this enhancement, but new fields +will be added to existing `ContainerRuntimeConfig` objects. + +The new fields for the `ContainerRuntimeConfig` object are defined in detail in +https://github.com/openshift/machine-config-operator/pull/3839. + +### Implementation Details/Notes/Constraints + +Starting with version 4.14 of OpenShift CRI-O will have the capability to pin +certain images (see [this](https://github.com/cri-o/cri-o/pull/6862) pull +request for details). That capability will be used to pin all the images +required for the upgrade, so that they aren't garbage collected by kubelet and +CRI-O. + +The changes to pin the images will be done in a `/etc/crio/crio.conf.d/pin.conf` +file, something like this: + +```toml +pinned_images=[ + "quay.io/openshift-release-dev/ocp-release@sha256:...", + "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...", + "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...", + ... +] +``` + +The images need to be pre-loaded and the CRI-O service needs to be reloaded +when this configuration changes. To support that a new field will be added to +the `ContainerRuntimeConfig` object: + +```yaml +apiVersion: machineconfiguration.openshift.io/v1 +kind: ContainerRuntimeConfig +metadata: + name: ... +spec: + containerRuntimeConfig: + pinnedImages: + - quay.io/openshift-release-dev/ocp-release@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + ... +``` + +When the new `pinnedImages` field is added or changed the machine config +operator will need to pull those images (with the equivalent of `crictl pull`), +create or update the corresponding `/etc/crio/crio.conf.d/pin.conf` file and ask +CRI-O reload its configuration (with the equivalent of `systemctl reload +crio.service`). + +The machine config operator will then will use the gRPC API of CRI-O to run the +equivalent of `crictl pull` for each of the images. When that is completed the +machine config operator will update the new `status.pinnedImages` field of the +rendered machine config: + +```yaml +status: + pinnedImages: + - quay.io/openshift-release-dev/ocp-release@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + ... +``` + +### Risks and Mitigations + +None. + +### Drawbacks + +This approach requires non trivial changes to the machine config operator. + +## Design Details + +### Open Questions + +None. + +### Test Plan + +We add a CI test that verifies that images are correctly pinned and pre-loaded. + +### Graduation Criteria + +The feature will ideally be introduced as `Dev Preview` in OpenShift 4.X, +moved to `Tech Preview` in 4.X+1 and declared `GA` in 4.X+2. + +#### Dev Preview -> Tech Preview + +- Availability of the CI test. + +- Obtain positive feedback from at least one customer. + +#### Tech Preview -> GA + +- User facing documentation created in +[https://github.com/openshift/openshift-docs](openshift-docs). + +#### Removing a deprecated feature + +Not applicable, no feature will be removed. + +### Upgrade / Downgrade Strategy + +Not applicable. + +### Version Skew Strategy + +Not applicable. + +### Operational Aspects of API Extensions + +Not applicable, there are no API extensions. + +#### Failure Modes + +#### Support Procedures + +## Implementation History + +There is an initial prototype exploring some of the implementation details +described here in this [https://github.com/jhernand/upgrade-tool](repository). + +## Alternatives + +The alternative to this is to manually pull the images in all the nodes of the +cluster, manually create the `/etc/crio/crio.conf.d/pin.conf` file and manually +reload the CRI-O service. + +## Infrastructure Needed + +Infrastructure will be needed to run the CI test described in the test plan +above. From da7d638b2a445f19168c8e0468b1224a393db8aa Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 25 Sep 2023 18:31:20 +0200 Subject: [PATCH 02/20] Add reviewers comments This patch adds YAML comments explaining what is the reason to add specific reviewers. Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 2a9f9e06f9..ec1bb70ed4 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -3,11 +3,11 @@ title: pin-and-pre-load-images authors: - "@jhernand" reviewers: -- "@avishayt" -- "@danielerez" -- "@mrunalp" -- "@nmagnezi" -- "@oourfali" +- "@avishayt" # To ensure that this will be usable with the appliance. +- "@danielerez" # To ensure that this will be usable with the appliance. +- "@mrunalp" # To ensure that this can be implemented with CRI-O and MCO. +- "@nmagnezi" # To ensure that this will be usable with the appliance. +- "@oourfali" # To ensure that this will be usable with the appliance. approvers: - "@sdodson" - "@zaneb" From ee760a1ee6d6f6ecb8ec1765ea5606dccce58e17 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Tue, 26 Sep 2023 10:11:30 +0200 Subject: [PATCH 03/20] Orchestrating upgrades is a non-goal This patch adds a paragraph to the non-goals section explaining that orchestrating upgrades is not a goal of this enhancement. Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index ec1bb70ed4..56b8f688f5 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -69,7 +69,10 @@ container images. ### Non-Goals -None. +We wish to use the mechanism described in this enhancement to orchestrate +upgrades without a registry server. But that orchestration is not a goal +of this enhancement; it will be part of a separate enhancement based on +parts of https://github.com/openshift/enhancements/pull/1432. ## Proposal From 433c28ead50350aee45f0c3af4aecac9b842851a Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 28 Sep 2023 11:37:18 +0200 Subject: [PATCH 04/20] Add comment about upgrade time consistency Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 56b8f688f5..e7fc5b71f8 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -41,7 +41,10 @@ operations that require pulling images. For example, an upgrade may require pulling more than one hundred images. Failures to pull those images cause retries that interfere with the upgrade process and may eventually make it fail. One way to improve that is to pull the images in advance, before they are -actually needed, and ensure that they aren't removed. +actually needed, and ensure that they aren't removed. Doing that provides a +more consistent upgrade time in those environments. That is important when +scheduling upgrades into maintenance windows, even if the upgrade might not +otherwise fail. ### User Stories From 8df4279c940b2bad5c23b59edc60bbcf75729c61 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 28 Sep 2023 13:30:27 +0200 Subject: [PATCH 05/20] Add @sinnykumari and @mrunalp as approvers Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index e7fc5b71f8..63c17c7abc 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -9,9 +9,8 @@ reviewers: - "@nmagnezi" # To ensure that this will be usable with the appliance. - "@oourfali" # To ensure that this will be usable with the appliance. approvers: -- "@sdodson" -- "@zaneb" -- "@LalatenduMohanty" +- "@sinnykumari" +- "@mrunalp" api-approvers: - "@sdodson" - "@zaneb" From 7d2bf338f69ba3f2eaa971e898f577065c4dfa38 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Fri, 29 Sep 2023 10:01:04 +0200 Subject: [PATCH 06/20] Pinned images shouldn't be wiped Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 63c17c7abc..d5f9765ed5 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -116,6 +116,13 @@ request for details). That capability will be used to pin all the images required for the upgrade, so that they aren't garbage collected by kubelet and CRI-O. +In addition when the CRI-O service is upgraded and restarted it removes all the +images. This used to be done by the `crio-wipe` service, but is now done +internally by CRI-O. It can be avoided setting the `version_file_persist` +configuration parameter to "", but that would affect all images, not just the +pinned ones. This behavior needs to be changed so that pinned images aren't +removed, regardless of the value of `version_file_persist`. + The changes to pin the images will be done in a `/etc/crio/crio.conf.d/pin.conf` file, something like this: From fba408f7224505075c151570f352d96f896b1b7b Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 2 Oct 2023 08:43:55 +0200 Subject: [PATCH 07/20] Disk space exhaustion risk and mitigations Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 22 ++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index d5f9765ed5..9d9908b63e 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -175,7 +175,27 @@ status: ### Risks and Mitigations -None. +Pre-loading disk images can consume a large amount of disk space. For example, +pre-loading all the images required for an upgrade can consume more 32 GiB. +This is a risk because disk exhaustion can affect other workloads and the +control plane. There is already mechanisms to mitigate that: the kubelet +garbage collection. To ensure disk space issues are reported and that garbage +collection doesn't interfere we will introduced new mechanisms: + +1. If disk exhaustion happens while trying to pre-load images the issues will +be reported explicitly via the status of the `ContainerRuntimeConfig` +status. Note typically these issues are reported as failures to pull images in +the status of pods, but in this case there are no pods pulling the images. CRI-O +will still report these issues in the log (if there is space for that), like +for any other image pull. + +1. If disk exhaustion happens after the images have been pulled, then we will +ensure that the kubelet garbage collector doesn't select these images. That +will be handled by the image pinning support in CRI-O: even if kubelet asks +CRI-O to delete a pinned image CRI-O will not delete it. + +1. The recovery steps in the documentation will be amended to ensure that these +images aren't deleted to recover disk space. ### Drawbacks From 82fc4cf3b191f4d251cd335ffe384e030f1877a1 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 2 Oct 2023 11:29:38 +0200 Subject: [PATCH 08/20] Add `PinnedImageSet` object This patch changes the enhacement adding a new `PinnedImageSet` object instead of modifying the `ContainerRuntimeConfig` object. Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 175 +++++++++++++----- 1 file changed, 127 insertions(+), 48 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 9d9908b63e..2057198248 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -80,32 +80,34 @@ parts of https://github.com/openshift/enhancements/pull/1432. ### Workflow Description -1. The administrator of a cluster uses the `ContainerRuntimeConfig` object to +1. The administrator of a cluster uses the new `PinnedImageSet` object to request that a set of container images are pinned and pre-loaded: ```yaml - apiVersion: machineconfiguration.openshift.io/v1 - kind: ContainerRuntimeConfig + apiVersion: machineconfiguration.openshift.io/v1alpha1 + kind: PinnedImageSet metadata: - name: ... + name: my-pinned-images spec: - containerRuntimeConfig: - pinnedImages: - - quay.io/openshift-release-dev/ocp-release@sha256:... - - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - ... + nodeSelector: + matchLabels: + node-role.kubernetes.io/control-plane: "" + pinnedImages: + - quay.io/openshift-release-dev/ocp-release@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + ... ``` 1. The machine config operators ensures that all the images are pinned and -pulled in all the nodes of the cluster. +pulled in all the nodes that match the node selector. ### API Extensions -There are no new object kinds introduced by this enhancement, but new fields -will be added to existing `ContainerRuntimeConfig` objects. +A new `PinnedImageSet` object will be added to the +`machineconfiguration.openshift.io` API group. -The new fields for the `ContainerRuntimeConfig` object are defined in detail in +The new object is defined in detail in https://github.com/openshift/machine-config-operator/pull/3839. ### Implementation Details/Notes/Constraints @@ -123,8 +125,34 @@ configuration parameter to "", but that would affect all images, not just the pinned ones. This behavior needs to be changed so that pinned images aren't removed, regardless of the value of `version_file_persist`. -The changes to pin the images will be done in a `/etc/crio/crio.conf.d/pin.conf` -file, something like this: +The changes to pin the images will be done in a file inside the +`/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files +manually created by the administrator the name of this file will be the name of +the `PinnedImageSet` object concatenated with the UUID assiged by the API +server. For example, if the object is this: + +```yaml +apiVersion: machineconfiguration.openshift.io/v1alpha1 +kind: PinnedImageSet +metadata: + name: my-pinned-images + uuid: 550a1d88-2976-4447-9fc7-b65e457a7f42 +spec: + pinnedImages: + - quay.io/openshift-release-dev/ocp-release@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... + ... +``` + +Then the complete path will be this: + +``` +/etc/crio/crio.conf.d/my-pinned-images-550a1d88-2976-4447-9fc7-b65e457a7f42.conf +``` + +The content of the file will the `pinned_images` parameter containing the list +of images: ```toml pinned_images=[ @@ -135,42 +163,83 @@ pinned_images=[ ] ``` -The images need to be pre-loaded and the CRI-O service needs to be reloaded -when this configuration changes. To support that a new field will be added to -the `ContainerRuntimeConfig` object: +In addition to the list of images to be pinned, the `PinnedImageSet` object +will also contain a node selector. This is intended to support different sets +of images for different kinds of nodes. For example, to pin different images +for control plane and worker nodes the user could create two `PinnedImageSet` +objects: ```yaml -apiVersion: machineconfiguration.openshift.io/v1 -kind: ContainerRuntimeConfig +# For control plane nodes: +apiVersion: machineconfiguration.openshift.io/v1alpha1 +kind: PinnedImageSet metadata: - name: ... + name: my-control-plane-pinned-images spec: - containerRuntimeConfig: - pinnedImages: - - quay.io/openshift-release-dev/ocp-release@sha256:... - - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - ... + nodeSelector: + matchLabels: + node-role.kubernetes.io/control-plane: "" + pinnedImages: + ... + +--- + +# For worker nodes: +apiVersion: machineconfiguration.openshift.io/v1alpha1 +kind: PinnedImageSet +metadata: + name: my-control-plane-pinned-images +spec: + nodeSelector: + matchLabels: + node-role.kubernetes.io/worker: "" + pinnedImages: + ... ``` -When the new `pinnedImages` field is added or changed the machine config -operator will need to pull those images (with the equivalent of `crictl pull`), -create or update the corresponding `/etc/crio/crio.conf.d/pin.conf` file and ask -CRI-O reload its configuration (with the equivalent of `systemctl reload -crio.service`). +This is specially convenient for pinning images for upgrades: there are many +images that are needed only by the control plane nodes and there is no need to +have them consuming disk space in worker nodes. + +When no node selector is specified the images will be pinned in all the nodes +of the cluster. -The machine config operator will then will use the gRPC API of CRI-O to run the -equivalent of `crictl pull` for each of the images. When that is completed the -machine config operator will update the new `status.pinnedImages` field of the -rendered machine config: +When a `PinnedImageSet` object is added, modified or deleted the machine config +operator will create, modify or delete the configuration file, reload the CRI-O +configuration (with the equivalent of `systemctl reload crio`) and then it will +use the CRI-O gRPC API to run the equivalent of `crictl pull` for each of the +images. + +Note that this will happen in all the nodes of the cluster that match the node +selector. + +When all the images have been succesfully pinned and pulled in all the matching +nodes the `Ready` condition will be set to `True`: + +```yaml +status: + conditions: + - type: Ready + status: "True" + - type: Failed + status: "False" +``` + +If something fails the `Failed` condition will be set to `True`, and the +details of the error will be in the message. For example if `node12` fails to +pull an image: ```yaml status: pinnedImages: - quay.io/openshift-release-dev/ocp-release@sha256:... - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... - ... + conditions: + - type: Ready + status: "False" + - type: Failed + status: "True" + message: Node 'node12' failed to pull image `quay.io/...` because ... ``` ### Risks and Mitigations @@ -178,16 +247,16 @@ status: Pre-loading disk images can consume a large amount of disk space. For example, pre-loading all the images required for an upgrade can consume more 32 GiB. This is a risk because disk exhaustion can affect other workloads and the -control plane. There is already mechanisms to mitigate that: the kubelet +control plane. There is already a mechanism to mitigate that: the kubelet garbage collection. To ensure disk space issues are reported and that garbage collection doesn't interfere we will introduced new mechanisms: 1. If disk exhaustion happens while trying to pre-load images the issues will -be reported explicitly via the status of the `ContainerRuntimeConfig` -status. Note typically these issues are reported as failures to pull images in -the status of pods, but in this case there are no pods pulling the images. CRI-O -will still report these issues in the log (if there is space for that), like -for any other image pull. +be reported explicitly via the status of the `PinnedImageSet` status. Note +typically these issues are reported as failures to pull images in the status of +pods, but in this case there are no pods pulling the images. CRI-O will still +report these issues in the log (if there is space for that), like for any other +image pull. 1. If disk exhaustion happens after the images have been pulled, then we will ensure that the kubelet garbage collector doesn't select these images. That @@ -233,7 +302,13 @@ Not applicable, no feature will be removed. ### Upgrade / Downgrade Strategy -Not applicable. +Upgrades from versions that don't support the `PinnedImageSet` object don't +require any special handling because the object is optional: there will be no +such objects in the upgraded cluster. + +Downgrades to versions that don't support the `PinnedImageSet` object don't +require any changes. The existing pinned images will be ignored in the +downgraded cluster, and will eventually be garbage collected. ### Version Skew Strategy @@ -241,12 +316,16 @@ Not applicable. ### Operational Aspects of API Extensions -Not applicable, there are no API extensions. - #### Failure Modes +Image pulling may fail due to lack of disk space or other reasons. This will be +reported via the conditions in the `PinnedImageSet` objects. See the risks and +mitigations section for details. + #### Support Procedures +Nothing. + ## Implementation History There is an initial prototype exploring some of the implementation details From 0c6fe1d26ffee08ffac907aad93e853a7922fc8a Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 2 Oct 2023 11:42:47 +0200 Subject: [PATCH 09/20] Fix lint issues Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 2057198248..2d279fc8c2 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -147,7 +147,7 @@ spec: Then the complete path will be this: -``` +```txt /etc/crio/crio.conf.d/my-pinned-images-550a1d88-2976-4447-9fc7-b65e457a7f42.conf ``` From 54b515b6922621581ddb9d8ba73fdf2495e99662 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Fri, 6 Oct 2023 17:15:38 +0200 Subject: [PATCH 10/20] Move API PR to openshift/api Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 2d279fc8c2..f9b2c12efc 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -22,7 +22,7 @@ tracking-link: - https://issues.redhat.com/browse/RFE-4482 see-also: - https://github.com/openshift/enhancements/pull/1432 -- https://github.com/openshift/machine-config-operator/pull/3839 +- https://github.com/openshift/api/pull/1609 replaces: [] superseded-by: [] --- @@ -108,7 +108,7 @@ A new `PinnedImageSet` object will be added to the `machineconfiguration.openshift.io` API group. The new object is defined in detail in -https://github.com/openshift/machine-config-operator/pull/3839. +https://github.com/openshift/api/pull/1609. ### Implementation Details/Notes/Constraints From 7946c21d8f6972f5a8271297f1fcb99ac0d1c67f Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Fri, 6 Oct 2023 17:22:55 +0200 Subject: [PATCH 11/20] Clarify that image wiping needs to be changed in CRI-O Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index f9b2c12efc..4436085c39 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -122,8 +122,8 @@ In addition when the CRI-O service is upgraded and restarted it removes all the images. This used to be done by the `crio-wipe` service, but is now done internally by CRI-O. It can be avoided setting the `version_file_persist` configuration parameter to "", but that would affect all images, not just the -pinned ones. This behavior needs to be changed so that pinned images aren't -removed, regardless of the value of `version_file_persist`. +pinned ones. This behavior needs to be changed in CRI-O so that pinned images +aren't removed, regardless of the value of `version_file_persist`. The changes to pin the images will be done in a file inside the `/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files From a9ff7e714fb3b978d49884f94f20517cd92c7d91 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Fri, 6 Oct 2023 17:28:39 +0200 Subject: [PATCH 12/20] Explain the new logic will be part of a new MCC sub-controller Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 4436085c39..1083fb19ca 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -242,6 +242,9 @@ status: message: Node 'node12' failed to pull image `quay.io/...` because ... ``` +This logic to handle the pinned image sets and generate the configuration files +will be part of a new sub-controller in MCC. + ### Risks and Mitigations Pre-loading disk images can consume a large amount of disk space. For example, From 2ef880c78ddaad3c1536f8a8ad69c7357bb7eff6 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 9 Oct 2023 20:24:22 +0200 Subject: [PATCH 13/20] Use "custom resource" instead of "object" This patch changes the document to use "custom resource" instead of "object" to make clear that we will be adding a new `PinnedImageSet` custom resource to the API. Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 46 +++++++++---------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 1083fb19ca..9d0f2a8efd 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -80,8 +80,8 @@ parts of https://github.com/openshift/enhancements/pull/1432. ### Workflow Description -1. The administrator of a cluster uses the new `PinnedImageSet` object to -request that a set of container images are pinned and pre-loaded: +1. The administrator of a cluster uses the new `PinnedImageSet` custom resource +to request that a set of container images are pinned and pre-loaded: ```yaml apiVersion: machineconfiguration.openshift.io/v1alpha1 @@ -104,10 +104,10 @@ pulled in all the nodes that match the node selector. ### API Extensions -A new `PinnedImageSet` object will be added to the +A new `PinnedImageSet` custom resource definition will be added to the `machineconfiguration.openshift.io` API group. -The new object is defined in detail in +The new custom resource definition is described in detail in https://github.com/openshift/api/pull/1609. ### Implementation Details/Notes/Constraints @@ -128,8 +128,8 @@ aren't removed, regardless of the value of `version_file_persist`. The changes to pin the images will be done in a file inside the `/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files manually created by the administrator the name of this file will be the name of -the `PinnedImageSet` object concatenated with the UUID assiged by the API -server. For example, if the object is this: +the `PinnedImageSet` custom resource concatenated with the UUID assiged by the +API server. For example, if the custom resource is this: ```yaml apiVersion: machineconfiguration.openshift.io/v1alpha1 @@ -163,11 +163,11 @@ pinned_images=[ ] ``` -In addition to the list of images to be pinned, the `PinnedImageSet` object -will also contain a node selector. This is intended to support different sets -of images for different kinds of nodes. For example, to pin different images -for control plane and worker nodes the user could create two `PinnedImageSet` -objects: +In addition to the list of images to be pinned, the `PinnedImageSet` custom +resource will also contain a node selector. This is intended to support +different sets of images for different kinds of nodes. For example, to pin +different images for control plane and worker nodes the user could create two +`PinnedImageSet` custom resources: ```yaml # For control plane nodes: @@ -204,11 +204,11 @@ have them consuming disk space in worker nodes. When no node selector is specified the images will be pinned in all the nodes of the cluster. -When a `PinnedImageSet` object is added, modified or deleted the machine config -operator will create, modify or delete the configuration file, reload the CRI-O -configuration (with the equivalent of `systemctl reload crio`) and then it will -use the CRI-O gRPC API to run the equivalent of `crictl pull` for each of the -images. +When a `PinnedImageSet` custom resource is added, modified or deleted the +machine config operator will create, modify or delete the configuration file, +reload the CRI-O configuration (with the equivalent of `systemctl reload crio`) +and then it will use the CRI-O gRPC API to run the equivalent of `crictl pull` +for each of the images. Note that this will happen in all the nodes of the cluster that match the node selector. @@ -305,12 +305,12 @@ Not applicable, no feature will be removed. ### Upgrade / Downgrade Strategy -Upgrades from versions that don't support the `PinnedImageSet` object don't -require any special handling because the object is optional: there will be no -such objects in the upgraded cluster. +Upgrades from versions that don't support the `PinnedImageSet` custom resource +don't require any special handling because the custom resource is optional: +there will be no such custom resource in the upgraded cluster. -Downgrades to versions that don't support the `PinnedImageSet` object don't -require any changes. The existing pinned images will be ignored in the +Downgrades to versions that don't support the `PinnedImageSet` custom resource +don't require any changes. The existing pinned images will be ignored in the downgraded cluster, and will eventually be garbage collected. ### Version Skew Strategy @@ -322,8 +322,8 @@ Not applicable. #### Failure Modes Image pulling may fail due to lack of disk space or other reasons. This will be -reported via the conditions in the `PinnedImageSet` objects. See the risks and -mitigations section for details. +reported via the conditions in the `PinnedImageSet` custom resource. See the +risks and mitigations section for details. #### Support Procedures From f379ab182383767af963f54ab7721168e63e0af7 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Wed, 11 Oct 2023 20:29:30 +0200 Subject: [PATCH 14/20] CRI-O will need to reset pinned images on reload Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 9d0f2a8efd..706ce7afba 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -210,6 +210,9 @@ reload the CRI-O configuration (with the equivalent of `systemctl reload crio`) and then it will use the CRI-O gRPC API to run the equivalent of `crictl pull` for each of the images. +Note that currently CRI-O doesn't reset pinned images on reload, support for +that will need to be added. + Note that this will happen in all the nodes of the cluster that match the node selector. From a16ccece8f94e179a2ef41be53d554259766d96a Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Fri, 13 Oct 2023 09:13:35 +0200 Subject: [PATCH 15/20] Fix name of worker `PinnedImageSet` Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 706ce7afba..76bce8e932 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -188,7 +188,7 @@ spec: apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: PinnedImageSet metadata: - name: my-control-plane-pinned-images + name: my-worker-pinned-images spec: nodeSelector: matchLabels: From 25fc0321b69a0f8f54b1009a16f7e6fbaeb0147b Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Mon, 16 Oct 2023 10:16:57 +0200 Subject: [PATCH 16/20] Add details about `PinnedImageSetController` sub controller Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 59 +++++++++++++++---- 1 file changed, 47 insertions(+), 12 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 76bce8e932..c239b3b5bf 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -204,20 +204,35 @@ have them consuming disk space in worker nodes. When no node selector is specified the images will be pinned in all the nodes of the cluster. -When a `PinnedImageSet` custom resource is added, modified or deleted the -machine config operator will create, modify or delete the configuration file, -reload the CRI-O configuration (with the equivalent of `systemctl reload crio`) -and then it will use the CRI-O gRPC API to run the equivalent of `crictl pull` -for each of the images. +The _machine-config-controller_ will grow a new `PinnedImageSetController` sub +controller that will watch the `PinnedImageSet` custom resources. When one of +those is created, updated or deleted the controller will start a daemon set that +will do the following in each node of the cluster: -Note that currently CRI-O doesn't reset pinned images on reload, support for -that will need to be added. +1. Create, modify or delete the CRI-O pinning configuration file. -Note that this will happen in all the nodes of the cluster that match the node -selector. +1. Reload the CRI-O configuration, with the equivalent of `systemctl reload crio`. + + Note that currently CRI-O doesn't recalculate the set of pinned images on + reload, support for that will need to be added. + +1. Use the CRI-O gRPC API to run the equivalent of `crictl pull` for each of the +images. + +This needs to be a daemon set running in each node of the cluster because it +needs to create configuration files on the node, and use the CRI-O gRPC API, +which is only accessible via `localhost`. + +This logic could be part of a new daemon set, specific for this, or else part of +the _machine-config-daemon_. + +The steps above will happen in all the nodes of the cluster. The daemon set will +be provided with enough information to ensure that each pod applies only the +changes required for the node where it runs, according to the node selectors in +the `PinnedImageSet` custom resources. When all the images have been succesfully pinned and pulled in all the matching -nodes the `Ready` condition will be set to `True`: +nodes the `PinnedImageSetController` will set the `Ready` condition to `True`: ```yaml status: @@ -245,8 +260,28 @@ status: message: Node 'node12' failed to pull image `quay.io/...` because ... ``` -This logic to handle the pinned image sets and generate the configuration files -will be part of a new sub-controller in MCC. +Additional information may be needed inside the `status` field of the +`PinnedImageSet` custom resources in order to implement the mechanisms described +above. For example, it may be necessary to have conditions per node, something +like this: + +```yaml +status: + node0: + - type: Ready + status: "False" + - type: Failed + status: "True" + node1: + - type: Ready + status: "True" + - type: Failed + status: "True" + ... +``` + +Those details are explicitly left out of this document, because they are mostly +implementation details, and't not relevant for the user of the API. ### Risks and Mitigations From 0f37ea046361420cc784fe0665026145a66b8a6b Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Wed, 18 Oct 2023 15:26:56 +0200 Subject: [PATCH 17/20] Check disk space before pulling images Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 58 ++++++++++++++++--- 1 file changed, 49 insertions(+), 9 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index c239b3b5bf..838ce2fead 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -128,7 +128,7 @@ aren't removed, regardless of the value of `version_file_persist`. The changes to pin the images will be done in a file inside the `/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files manually created by the administrator the name of this file will be the name of -the `PinnedImageSet` custom resource concatenated with the UUID assiged by the +the `PinnedImageSet` custom resource concatenated with the UUID assigned by the API server. For example, if the custom resource is this: ```yaml @@ -209,6 +209,9 @@ controller that will watch the `PinnedImageSet` custom resources. When one of those is created, updated or deleted the controller will start a daemon set that will do the following in each node of the cluster: +1. Verify that there is enough disk space available in the machine to pull the + images. + 1. Create, modify or delete the CRI-O pinning configuration file. 1. Reload the CRI-O configuration, with the equivalent of `systemctl reload crio`. @@ -226,12 +229,46 @@ which is only accessible via `localhost`. This logic could be part of a new daemon set, specific for this, or else part of the _machine-config-daemon_. +In order to verify that there is enough disk space we will first check which of +the images in the `PinnedImageSet` have not yet been downloaded, using CRI-O +gRPC API. For those that haven't been download we will then fetch from the +registry server the size of the blobs of the layers (only the size, not the +actual data). + +Calculating the exact total size of the layers once uncompressed and written to +disk is difficult without deep knowledge of the underplaying containers storage +library. Instead of that we will assume that the disk space required is twice +the size of the blobs of the layers. That is an heuristic that works well for +complete OpenShift releases: blobs take 16 GiB and when they are written to disk +they take 32 GiB. + +If the calculate disk space exceeds the available disk space then the +`PinnedImageSet` will be marked as failed and the process will not move forward. +For example: + +```yaml +status: + conditions: + - type: Ready + status: "False" + - type: Failed + status: "True" + message: | + Pulling the images in `node12` requires at least 16 GiB of disk + space, but there are only 10 GiB available +``` + +Note that even with this check it will still be possible (but less likely) to +have failures to pull images due to disk space: the heuristic could be wrong, +and there may be other components pulling images or consuming disk space in some +other way. Those failures will be detected and reported in the status of the `PinnedImageSet` when CRI-O fails to pull the image. + The steps above will happen in all the nodes of the cluster. The daemon set will be provided with enough information to ensure that each pod applies only the changes required for the node where it runs, according to the node selectors in the `PinnedImageSet` custom resources. -When all the images have been succesfully pinned and pulled in all the matching +When all the images have been successfully pinned and pulled in all the matching nodes the `PinnedImageSetController` will set the `Ready` condition to `True`: ```yaml @@ -281,7 +318,7 @@ status: ``` Those details are explicitly left out of this document, because they are mostly -implementation details, and't not relevant for the user of the API. +implementation details, and not relevant for the user of the API. ### Risks and Mitigations @@ -292,12 +329,15 @@ control plane. There is already a mechanism to mitigate that: the kubelet garbage collection. To ensure disk space issues are reported and that garbage collection doesn't interfere we will introduced new mechanisms: -1. If disk exhaustion happens while trying to pre-load images the issues will -be reported explicitly via the status of the `PinnedImageSet` status. Note -typically these issues are reported as failures to pull images in the status of -pods, but in this case there are no pods pulling the images. CRI-O will still -report these issues in the log (if there is space for that), like for any other -image pull. +1. Disk space will be checked before trying to pre-load images, and if there are +issues they will be reported explicitly via the status of the `PinnedImageSet` +status. + +1. If disk space is exhausted while pulling the images, then the issue will be +reported via the status of the `PinnedImageSet` as well. Typically these issues +are reported as failures to pull images in the status of pods, but in this case +there are no pods pulling the images. CRI-O will still report these issues in +the log (if there is space for that), like for any other image pull. 1. If disk exhaustion happens after the images have been pulled, then we will ensure that the kubelet garbage collector doesn't select these images. That From 014d66c29e28f33cb5684ceb86e51abdeb468859 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 19 Oct 2023 09:05:13 +0200 Subject: [PATCH 18/20] Fix line lengths Signed-off-by: Juan Hernandez --- .../machine-config/pin-and-pre-load-images.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index 838ce2fead..f7e743474e 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -260,13 +260,14 @@ status: Note that even with this check it will still be possible (but less likely) to have failures to pull images due to disk space: the heuristic could be wrong, -and there may be other components pulling images or consuming disk space in some -other way. Those failures will be detected and reported in the status of the `PinnedImageSet` when CRI-O fails to pull the image. - -The steps above will happen in all the nodes of the cluster. The daemon set will -be provided with enough information to ensure that each pod applies only the -changes required for the node where it runs, according to the node selectors in -the `PinnedImageSet` custom resources. +and there may be other components pulling images or consuming disk space in +some other way. Those failures will be detected and reported in the status of +the `PinnedImageSet` when CRI-O fails to pull the image. + +The steps above will happen in all the nodes of the cluster. The daemon set +will be provided with enough information to ensure that each pod applies only +the changes required for the node where it runs, according to the node +selectors in the `PinnedImageSet` custom resources. When all the images have been successfully pinned and pulled in all the matching nodes the `PinnedImageSetController` will set the `Ready` condition to `True`: From a73e6c19a0dcde59a3ed33caff2528b8c5337620 Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 19 Oct 2023 09:15:24 +0200 Subject: [PATCH 19/20] Clarify what pulls images Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index f7e743474e..a7a68c2e5a 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -99,7 +99,8 @@ to request that a set of container images are pinned and pre-loaded: ... ``` -1. The machine config operators ensures that all the images are pinned and +1. A new `PinnedImageSetController` sub-controller of the +_machine-config-controller_ ensures that all the images are pinned and pulled in all the nodes that match the node selector. ### API Extensions From ca63f0684b33bde71866eefa764c689824535dfc Mon Sep 17 00:00:00 2001 From: Juan Hernandez Date: Thu, 19 Oct 2023 09:27:42 +0200 Subject: [PATCH 20/20] Consider using the new MCO state reporting mechanism Signed-off-by: Juan Hernandez --- enhancements/machine-config/pin-and-pre-load-images.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md index a7a68c2e5a..d1541afa28 100644 --- a/enhancements/machine-config/pin-and-pre-load-images.md +++ b/enhancements/machine-config/pin-and-pre-load-images.md @@ -299,6 +299,10 @@ status: message: Node 'node12' failed to pull image `quay.io/...` because ... ``` +We will consider using the new _machine-config-operator_ state reporting +mechanism introduced [https://github.com/openshift/api/pull/1596](here) to +report additional details about the progress or issues of each node. + Additional information may be needed inside the `status` field of the `PinnedImageSet` custom resources in order to implement the mechanisms described above. For example, it may be necessary to have conditions per node, something