
Add PinnedImageSet crd, controller and prefetch manager #4094

Closed
wants to merge 5 commits into from

Conversation

hexfusion
Contributor

@hexfusion hexfusion commented Jan 3, 2024

This PR implements openshift/enhancements#1481

This PR adds:

  • PinnedImageSet Controller
  • PinnedImageSet CRD
  • Prefetch Manager

The PinnedImageSetController reconciles two desired states:

  1. Defining the CRI-O pinned_images configuration via MachineConfig. This is populated from the PinnedImageSet CRD[2].
  2. Prefetching images: a secondary controller located in the MCD is tasked with ensuring that the images defined by the PinnedImageSet are pulled. Currently the results of this operation are reported via a node annotation; in the future, this will probably be reported via MachineConfigNode status. Once all nodes in the pool targeted by the CR have completed, the PinnedImageSet status is updated to reflect that.

The Prefetch Manager worker pool ensures that:

  • Adequate storage is available for the images before they are pulled.
  • Images that are already available locally are not requested again.
  • A single worker is deployed on control-plane nodes to reduce I/O disruption.
  • Pull failures are retried a maximum of 5 times.
  • Image pull requests are made via the CRI gRPC client, using the same method as the kubelet (see the sketch after this list).
  • Authentication is provided where appropriate for images.
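
A minimal sketch of the pull path described in the list above, assuming the upstream k8s.io/cri-api v1 client and the default CRI-O socket path; the function and constant names are illustrative and not taken from this PR, only the retry count and 1s cool-down mirror the description.

```go
// Hedged sketch, not the PR's code: check for a locally available image and pull
// it over the CRI gRPC API with bounded retries, the way a prefetch worker might.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

const (
	crioSocket  = "unix:///var/run/crio/crio.sock" // assumed CRI-O socket path
	maxRetries  = 5                                // "retried a maximum of 5 times"
	pullTimeout = 2 * time.Minute                  // illustrative per-pull timeout
)

func prefetchImage(ctx context.Context, client runtimeapi.ImageServiceClient, ref string, auth *runtimeapi.AuthConfig) error {
	// Skip the pull if the image is already present in local storage.
	status, err := client.ImageStatus(ctx, &runtimeapi.ImageStatusRequest{
		Image: &runtimeapi.ImageSpec{Image: ref},
	})
	if err == nil && status.Image != nil {
		return nil
	}

	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		pullCtx, cancel := context.WithTimeout(ctx, pullTimeout)
		_, lastErr = client.PullImage(pullCtx, &runtimeapi.PullImageRequest{
			Image: &runtimeapi.ImageSpec{Image: ref},
			Auth:  auth, // pull secret resolved elsewhere, where appropriate
		})
		cancel()
		if lastErr == nil {
			return nil
		}
		time.Sleep(time.Second) // cool-down between attempts
	}
	return fmt.Errorf("pulling %s failed after %d attempts: %w", ref, maxRetries, lastErr)
}

func main() {
	conn, err := grpc.Dial(crioSocket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewImageServiceClient(conn)
	ref := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7aa95f32af51fc7892546a1e028808ec1bab1e507cf671b88d8280d2521e61d6"
	if err := prefetchImage(context.Background(), client, ref, nil); err != nil {
		fmt.Println(err)
	}
}
```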

Additional logic:

  • The postAction of the configuration being written is a CRI-O reload (see the sketch below).
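
A minimal sketch of that reload step, assuming it is done by shelling out to systemctl (CRI-O's unit maps reload to SIGHUP, which re-reads the drop-in config); whether the MCD uses systemctl or D-Bus here is an assumption, not something shown in this excerpt.

```go
// Hedged sketch of a CRI-O reload post-write action; not the PR's implementation.
package postaction

import (
	"fmt"
	"os/exec"
)

// reloadCRIO asks systemd to reload CRI-O so a new pinned_images drop-in takes
// effect without rebooting the node.
func reloadCRIO() error {
	out, err := exec.Command("systemctl", "reload", "crio.service").CombinedOutput()
	if err != nil {
		return fmt.Errorf("reloading CRI-O: %s: %w", string(out), err)
	}
	return nil
}
```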

Considerations

Because we may be pulling a large number of images, there is a concern about how that could affect the control plane. For this reason only a single worker is deployed on a master node, and each image is pulled serially with a 1s cool-down period between pulls. This still results in noticeable I/O; the screenshot below is from a basic idle AWS cluster. While this latency on its own is not an issue, under load it should be a consideration. Current proposed mitigations include exposing knobs for concurrency and the throttle duration.

[screenshot: disk I/O latency observed on an idle AWS cluster during prefetch]
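
A sketch of the proposed mitigation under stated assumptions: hypothetical MaxWorkers and Throttle knobs (neither exists in this PR), with golang.org/x/sync/errgroup bounding worker concurrency.

```go
// Hedged sketch of configurable concurrency/throttle knobs for the prefetch pool.
package prefetch

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// PrefetchConfig is a hypothetical knob set; names are illustrative only.
type PrefetchConfig struct {
	MaxWorkers int           // 1 on control-plane nodes to limit I/O pressure
	Throttle   time.Duration // cool-down between pulls (1s in this PR)
}

func prefetchAll(ctx context.Context, cfg PrefetchConfig, images []string, pull func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(cfg.MaxWorkers)
	for _, img := range images {
		img := img
		g.Go(func() error {
			if err := pull(ctx, img); err != nil {
				return err
			}
			time.Sleep(cfg.Throttle) // spread out disk I/O between pulls
			return nil
		})
	}
	return g.Wait()
}
```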

Example CR:

apiVersion: machineconfiguration.openshift.io/v1
kind: PinnedImageSet
metadata:
  name: worker-test
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  pinnedImages:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7aa95f32af51fc7892546a1e028808ec1bab1e507cf671b88d8280d2521e61d6
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d98ddbe73bda2ffed4d1aeb52be0500b8f8fe870cb465a8bb0cb113f7ed5ade3

ref.
[1] MCO-838 https://issues.redhat.com//browse/MCO-838
[2] openshift/api#1713

Blocked by
https://issues.redhat.com/browse/OCPNODE-1986

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2024
Contributor

openshift-ci bot commented Jan 3, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@hexfusion
Contributor Author

/test all

1 similar comment
@hexfusion
Contributor Author

/test all

@hexfusion
Contributor Author

/test all

@hexfusion hexfusion changed the base branch from master to release-4.16 January 10, 2024 23:27
@hexfusion hexfusion changed the base branch from release-4.16 to master January 10, 2024 23:28
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2024
Contributor

@cdoern cdoern left a comment

this looks pretty clean, do you think the controller will need any new RBAC?

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2024
@hexfusion hexfusion force-pushed the hack/pinned-set branch 5 times, most recently from 9964213 to cd48caa Compare January 30, 2024 05:55
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2024
@hexfusion hexfusion marked this pull request as ready for review January 30, 2024 06:05
@hexfusion hexfusion changed the title from "[wip]: PinnedImageSet" to "[MCO-838] Add PinnedImageSet crd, controller and prefetch manager" Jan 31, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
Contributor

@cdoern cdoern left a comment

looks good! Left a few comments about API calls. Will give this another pass soon.

if isNotFound {
    _, err = ctrl.mcfgClient.MachineconfigurationV1().MachineConfigs().Create(context.TODO(), mc, metav1.CreateOptions{})
} else {
    _, err = ctrl.mcfgClient.MachineconfigurationV1().MachineConfigs().Update(context.TODO(), mc, metav1.UpdateOptions{})
Contributor

the preferred mechanism is patch, I believe. You can look around at how to make a jsonmergepatch.CreateThreeWayJSONMergePatch(curJSON, modJSON, curJSON) and then pass this output to .Patch rather than .Update.

There are some scenarios where you want to use update but I am forgetting if this falls into those.
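
For illustration, a minimal sketch of the patch flow being suggested, assuming the openshift/client-go machineconfiguration clientset import path (the PR may use the in-repo generated clientset); the function and variable names are hypothetical.

```go
// Hedged sketch: build a three-way JSON merge patch between the current and
// desired MachineConfig and apply it with Patch instead of Update.
package pinnedimageset

import (
	"context"
	"encoding/json"

	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	mcfgclientset "github.com/openshift/client-go/machineconfiguration/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/jsonmergepatch"
)

func patchMachineConfig(ctx context.Context, client mcfgclientset.Interface, curMC, desiredMC *mcfgv1.MachineConfig) error {
	curJSON, err := json.Marshal(curMC)
	if err != nil {
		return err
	}
	modJSON, err := json.Marshal(desiredMC)
	if err != nil {
		return err
	}
	// original == current here, mirroring the call shape referenced in the review comment.
	patch, err := jsonmergepatch.CreateThreeWayJSONMergePatch(curJSON, modJSON, curJSON)
	if err != nil {
		return err
	}
	_, err = client.MachineconfigurationV1().MachineConfigs().Patch(ctx, desiredMC.Name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```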

Contributor Author

@hexfusion hexfusion Jan 31, 2024

Sounds good, I will dig into it to ensure correctness; this is copy-pasta from an existing controller in this repo.

return nil
}

_, err = ctrl.mcfgClient.MachineconfigurationV1().PinnedImageSets().UpdateStatus(context.TODO(), newImageSet, metav1.UpdateOptions{})
Contributor

UpdateStatus is right here, as opposed to patch. Though you might need the RBAC for pinnedimagesets/status specifically? I have run into this before where it does not allow me to update status unless I have this role.

Contributor Author

OK, the MCC RBAC I believe is inclusive, but I will double-check.

- apiGroups: ["machineconfiguration.openshift.io"]
  resources: ["*"]
  verbs: ["*"]

pkg/daemon/daemon.go (outdated review thread, resolved)
pkg/daemon/update.go (outdated review thread, resolved)

// minFreeStorageAfterPrefetch is the minimum amount of storage in bytes available on the root filesystem
// after prefetching images.
minFreeStorageAfterPrefetch int64 = 32 * 1024 * 1024 * 1024 // 32GB
Contributor Author

thoughts?

Contributor

What would use cases for this feature look like environment-wise? Is the expectation that if they are resource-limited, they really shouldn't be pre-pulling images?

It's good to have a safeguard, I think, but maybe 32 is a bit high?

Contributor

Not sure if we already collect any metrics around available free space for the clusters this is targeted at. If they exist, that would help us pick this value better. Free space is relative to what kind of application is running on a cluster; a storage-hungry application can run out of space sooner.
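
A minimal sketch of the kind of storage guard being discussed, assuming a statfs check on the root filesystem; only the 32 GiB constant comes from the PR, while the function name, mount point, and required-bytes parameter are illustrative.

```go
// Hedged sketch: refuse to prefetch when the root filesystem would drop below a
// minimum free-space threshold after pulling the requested images.
package prefetch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

const minFreeStorageAfterPrefetch uint64 = 32 * 1024 * 1024 * 1024 // 32 GiB, as in the PR

func ensureFreeStorage(rootPath string, requiredBytes uint64) error {
	var stat unix.Statfs_t
	if err := unix.Statfs(rootPath, &stat); err != nil {
		return fmt.Errorf("statfs %s: %w", rootPath, err)
	}
	available := stat.Bavail * uint64(stat.Bsize)
	if available < requiredBytes+minFreeStorageAfterPrefetch {
		return fmt.Errorf("insufficient storage on %s: %d bytes available, need %d plus %d reserve",
			rootPath, available, requiredBytes, minFreeStorageAfterPrefetch)
	}
	return nil
}
```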

@hexfusion
Copy link
Contributor Author

/assign

Contributor

@yuqi-zhang yuqi-zhang left a comment

the overall controller/daemon logic seems sound so far, some initial questions inline

Haven't dug into the details of how the prefetch manager actually works, but I assume it's relatively disruption-proof?

Since it somewhat runs independently, I'm just curious what happens if e.g. a machineconfig update comes in midway through image pulls and stops the daemon/reboots the node. I assume the aborted pull will just retry from the start?



}

func (p *PrefetchManager) sync(key string) error {
klog.Infof("Syncing PinnedImageSet %q", key)
Contributor

reminder to remove before merge or change verbosity


for _, node := range nodes {
    if !ctrl.isPrefetchCompleteForNode(node, imageSet) {
        // If prefetch is not complete fail fast and requeue the PinnedImageSet
Contributor

Could you help me understand this a bit? In the controller logic, you first ensurePinnedImageSet, which deploys the MachineConfig, then immediately after that you sync this.

I assume the expectation is that the image pulls will take a while to complete, so is the expectation that the controller will be in an error state until the daemons are done?

Contributor Author

Good point, the expectation is that it should be in a Progressing state, as the error is expected. So we should adjust that.

Contributor

+1. We may want to show an InProgress state to indicate that the image prefetch is progressing, instead of surfacing an error.
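
A hedged sketch of what surfacing such a condition could look like, assuming the PinnedImageSet status carries standard metav1.Conditions; the condition type, reason, and helper name are assumptions, not this PR's API.

```go
// Hedged sketch: set a Progressing condition while daemons are still prefetching,
// instead of returning an error from the controller sync.
package pinnedimageset

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const conditionProgressing = "Progressing" // assumed condition type

func setProgressing(conditions *[]metav1.Condition, pendingNodes int) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    conditionProgressing,
		Status:  metav1.ConditionTrue,
		Reason:  "ImagePrefetchInProgress",
		Message: fmt.Sprintf("%d node(s) still prefetching pinned images", pendingNodes),
	})
}
```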

p.taskManager.add(imageSet, cancel)
defer p.taskManager.cancel(imageSet)

err = p.startWorkerPool(ctx, prefetchImages)
Contributor

So, if I understand this correctly, the daemons are reacting to the pinnedimageset objects directly, and self-determining whether they should be pulling an image. Thus there are two processes happening in parallel:

  1. the MCC rendering the new machineconfig and the MCD main node sync reacting to that
  2. the MCD reading newly added pinnedimagesets and pulling images

Is there a strict dependency on that ordering? Is there any existing guard (sorry if I missed it) for that? And is there any additional pinnedimageset correctness needed to be processed by the controller before the daemon starts?

I guess the thought experiment is a large cluster with hundreds of nodes. While you only reload the crio daemon, each node is still sequentially processing the update, with some built-in delay, of the machineconfig that enables pinnedimagesets (the crio toml file). This can take hours on large enough clusters, but I assume each daemon process running this would start the image pull already and potentially finish by the time the crio config updates.

}

// getMachineConfigKey returns the managed key for the machine config
func getMachineConfigKey(pool *mcfgv1.MachineConfigPool, client mcfgclientset.Interface, imageSetOrig *mcfgv1.PinnedImageSet) (string, error) {
Contributor

I think this will work like the crio/kubelet configuration rendering, meaning that custom pool config > worker pool config now (but if you don't define a pinned image set for, say, your infra node, it will inherit worker configs and still try to pull as if it was a worker).

That's probably the expected behaviour but wanted to check explicitly

Contributor Author

Right, since configs today are deployed at the pool level, I don't feel it makes sense for this controller to act in a different way. My understanding is that you can create a custom pool dedicated to a certain purpose, e.g. "infra"? In that case the config could be deployed to only those nodes which are pool members.

go.mod (review thread, resolved)
pkg/daemon/update.go (outdated review thread, resolved)
Contributor

@sinnykumari sinnykumari left a comment

Overall this looks great. A few overall questions, as I may have missed things while briefly skimming through the code:

  1. What happens to prefetched pinnedImages which are no longer referenced when the user removes some from the PinnedImageSet CR?
  2. Do we want to add some sort of validation check to ensure that all images referenced in the PinnedImageSet are by hash and not tag?
  3. This can happen in a separate PR, but how about adding an e2e test for this feature? It could go in the existing e2e-gcp-op. If it adds a considerable amount of time to the test, we can do it as a separate e2e test.

@hexfusion
Contributor Author

hexfusion commented Feb 26, 2024

1.) What happens to prefetched pinnedImages which are no longer referenced when the user removes some from the PinnedImageSet CR?

Unpinned images are subject to future pruning/wipe. The scope of this feature does not include a pruning mechanism.

2.) Do we want to add some sort of validation check to ensure that all images referenced in the PinnedImageSet are by hash and not tag?

This is built into the API-level validation pattern.

// +kubebuilder:validation:Pattern:=`@sha256:[a-fA-F0-9]{64}$`
type PinnedImageRef string
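
For illustration, the same digest-only constraint expressed as a client-side Go check mirroring the kubebuilder pattern above; this helper is hypothetical and not part of the PR.

```go
// Hedged sketch: reject any image reference that is not pinned by sha256 digest.
package pinnedimageset

import "regexp"

var pinnedImageRefPattern = regexp.MustCompile(`@sha256:[a-fA-F0-9]{64}$`)

// isPinnedByDigest reports whether ref is pinned by digest rather than by tag.
func isPinnedByDigest(ref string) bool {
	return pinnedImageRefPattern.MatchString(ref)
}
```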

3.) This can happen in a separate PR, but how about adding an e2e test for this feature? It could go in the existing e2e-gcp-op. If it adds a considerable amount of time to the test, we can do it as a separate e2e test.

Sounds good

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2024
Contributor

openshift-ci bot commented Feb 28, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hexfusion
Once this PR has been reviewed and has the lgtm label, please assign sinnykumari for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Feb 28, 2024

@hexfusion: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/unit | e177445 | link | true | /test unit |
| ci/prow/verify | e177445 | link | true | /test verify |
| ci/prow/okd-scos-images | e177445 | link | true | /test okd-scos-images |
| ci/prow/e2e-aws-ovn-upgrade | e177445 | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn | e177445 | link | true | /test e2e-aws-ovn |
| ci/prow/okd-images | e177445 | link | false | /test okd-images |
| ci/prow/e2e-hypershift | e177445 | link | true | /test e2e-hypershift |
| ci/prow/images | e177445 | link | true | /test images |
| ci/prow/e2e-gcp-op-single-node | e177445 | link | true | /test e2e-gcp-op-single-node |
| ci/prow/e2e-gcp-op | e177445 | link | true | /test e2e-gcp-op |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | e177445 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/okd-scos-e2e-aws-ovn | e177445 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-upgrade-out-of-change | e177445 | link | false | /test e2e-aws-ovn-upgrade-out-of-change |
| ci/prow/e2e-gcp-op-techpreview | e177445 | link | false | /test e2e-gcp-op-techpreview |
| ci/prow/bootstrap-unit | e177445 | link | false | /test bootstrap-unit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hexfusion
Contributor Author

updating api deps

@rioliu-rh

rioliu-rh commented Mar 12, 2024

FYI, when the code is ready for testing, let us know (@sergiordlr @rioliu-rh @ptalgulk01) and hold this PR. THX

@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2024
@hexfusion
Contributor Author

This PR was a WIP test. A new PR with updated apis and intent will follow shortly

/close

@openshift-ci openshift-ci bot closed this Mar 13, 2024
Contributor

openshift-ci bot commented Mar 13, 2024

@hexfusion: Closed this PR.

In response to this:

This PR was a WIP test. A new PR with updated apis and intent will follow shortly

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion hexfusion changed the title from "MCO-1017: MCO-1018 MCO-1019: MCO-1020: MCO-1021 Add PinnedImageSet crd, controller and prefetch manager" to "Add PinnedImageSet crd, controller and prefetch manager" Apr 15, 2024
@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 15, 2024
@openshift-ci-robot
Contributor

@hexfusion: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
