Add PinnedImageSet crd, controller and prefetch manager #4094
Conversation
Skipping CI for Draft Pull Request.
/test all
1 similar comment
/test all
/test all
this looks pretty clean, do you think the controller will need any new RBAC?
looks good! Left a few comments about API calls. Will give this another pass soon.
```go
if isNotFound {
	_, err = ctrl.mcfgClient.MachineconfigurationV1().MachineConfigs().Create(context.TODO(), mc, metav1.CreateOptions{})
} else {
	_, err = ctrl.mcfgClient.MachineconfigurationV1().MachineConfigs().Update(context.TODO(), mc, metav1.UpdateOptions{})
}
```
The preferred mechanism is patch, I believe. You can look at how to build a jsonmergepatch.CreateThreeWayJSONMergePatch(curJSON, modJSON, curJSON) and then pass this output to .Patch rather than .Update.
There are some scenarios where you want to use update, but I am forgetting whether this falls into those.
Sounds good, I will dig into it to ensure correctness; this is copy-pasta from an existing controller in this repo.
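For reference, a minimal sketch of the patch-based flow being suggested, assuming the `jsonmergepatch` helper from apimachinery and the same client shown in the quoted hunk; the function and variable names (and the openshift import paths) are illustrative, not the controller's actual code:

```go
package ctrl // illustrative sketch only

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/jsonmergepatch"

	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	// assumed import path; mirrors the mcfgclientset alias used elsewhere in this PR
	mcfgclientset "github.com/openshift/client-go/machineconfiguration/clientset/versioned"
)

// patchMachineConfig applies the desired MachineConfig with a three-way JSON
// merge patch instead of a blind Update.
func patchMachineConfig(client mcfgclientset.Interface, cur, desired *mcfgv1.MachineConfig) error {
	curJSON, err := json.Marshal(cur)
	if err != nil {
		return err
	}
	modJSON, err := json.Marshal(desired)
	if err != nil {
		return err
	}
	// original = current = curJSON, modified = modJSON, matching the pattern
	// suggested in the review comment above.
	patchBytes, err := jsonmergepatch.CreateThreeWayJSONMergePatch(curJSON, modJSON, curJSON)
	if err != nil {
		return err
	}
	_, err = client.MachineconfigurationV1().MachineConfigs().Patch(
		context.TODO(), desired.Name, types.MergePatchType, patchBytes, metav1.PatchOptions{})
	return err
}
```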
```go
	return nil
}

_, err = ctrl.mcfgClient.MachineconfigurationV1().PinnedImageSets().UpdateStatus(context.TODO(), newImageSet, metav1.UpdateOptions{})
```
UpdateStatus is right here, as opposed to patch. Though you might need the RBAC for pinnedimagesets/status specifically? I have run into this before, where it would not allow me to update status unless I had this role.
OK, the MCC RBAC I believe is inclusive, but I will double-check:
```yaml
- apiGroups: ["machineconfiguration.openshift.io"]
  resources: ["*"]
  verbs: ["*"]
```
pkg/daemon/daemon.go (Outdated)
```go
// minFreeStorageAfterPrefetch is the minimum amount of storage in bytes available on the root filesystem
// after prefetching images.
minFreeStorageAfterPrefetch int64 = 32 * 1024 * 1024 * 1024 // 32GB
```
thoughts?
What would the use cases for this feature look like environment-wise? Is the expectation that if they are resource-limited, they really shouldn't be pre-pulling images?
It's good to have a safeguard, I think, but maybe 32GB is a bit high?
Not sure if we already collect metrics around available free space for the clusters this is targeted at. If they exist, that would help us pick this value better. Free space is relative to what kind of application is running on a cluster; a storage-hungry application can run out of space sooner.
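To make the threshold discussion concrete, here is a minimal sketch of how a free-space guard could be implemented, assuming a simple statfs check on the root filesystem; the helper name and approach are hypothetical, not necessarily how the PR implements it:

```go
package daemon // illustrative sketch, not the PR's implementation

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// minFreeStorageAfterPrefetch mirrors the constant under discussion (32GB).
const minFreeStorageAfterPrefetch int64 = 32 * 1024 * 1024 * 1024

// checkFreeStorage is a hypothetical guard: it returns an error if the root
// filesystem would not retain at least minFreeStorageAfterPrefetch bytes of
// free space after pulling approximately requiredBytes of image data.
func checkFreeStorage(requiredBytes int64) error {
	var stat unix.Statfs_t
	if err := unix.Statfs("/", &stat); err != nil {
		return fmt.Errorf("statfs /: %w", err)
	}
	free := int64(stat.Bavail) * int64(stat.Bsize)
	if free-requiredBytes < minFreeStorageAfterPrefetch {
		return fmt.Errorf("prefetch would leave %d bytes free, below the %d byte minimum",
			free-requiredBytes, minFreeStorageAfterPrefetch)
	}
	return nil
}
```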
/assign
The overall controller/daemon logic seems sound so far; some initial questions inline.
I haven't dug into the details of how the prefetch manager actually works, but I assume it's relatively disruption-proof?
Since it somewhat runs independently, I'm just curious what happens if e.g. a machineconfig update comes in midway through the image pulls and stops the daemon/reboots the node. I assume the aborted pull will just retry from the start?
pkg/daemon/prefetch_manager.go (Outdated)
```go
func (p *PrefetchManager) sync(key string) error {
	klog.Infof("Syncing PinnedImageSet %q", key)
```
reminder to remove before merge or change verbosity
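If the log line stays, one lighter option is to gate it behind a higher klog verbosity instead of removing it, e.g. (a sketch, assuming the daemon logs via k8s.io/klog/v2 as the quoted code suggests):

```go
// Emitted only when the daemon runs with verbosity 4 or higher (-v=4).
klog.V(4).Infof("Syncing PinnedImageSet %q", key)
```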
```go
for _, node := range nodes {
	if !ctrl.isPrefetchCompleteForNode(node, imageSet) {
		// If prefetch is not complete fail fast and requeue the PinnedImageSet
```
Could you help me understand this a bit: in the controller logic, you first call ensurePinnedImageSet, which deploys the machineconfig, and then immediately after that you sync this.
I assume the image pulls will take a while to complete, so is the expectation that the controller will be in an error state until the daemons are done?
Good point. The expectation is that it should be in a Progressing state, since the error is expected, so we should adjust that.
+1. We may want to show an InProgress state to indicate that the image prefetch is progressing, rather than an error.
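As a reading aid, a hedged sketch of what reporting Progressing instead of an error could look like, assuming the PinnedImageSet status uses standard metav1 conditions; the condition type, reason, and helper name are assumptions, not the PR's actual status API:

```go
package ctrl // illustrative sketch only

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setPrefetchProgressing records a Progressing condition on the PinnedImageSet
// status instead of surfacing "prefetch not complete" as a sync error, so the
// controller can requeue without looking degraded while nodes are still pulling.
func setPrefetchProgressing(conditions *[]metav1.Condition, remaining int) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Progressing", // hypothetical condition type
		Status:  metav1.ConditionTrue,
		Reason:  "ImagePrefetchInProgress",
		Message: fmt.Sprintf("waiting for %d node(s) to finish prefetching images", remaining),
	})
}
```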
pkg/daemon/prefetch_manager.go (Outdated)
```go
p.taskManager.add(imageSet, cancel)
defer p.taskManager.cancel(imageSet)

err = p.startWorkerPool(ctx, prefetchImages)
```
So, if I understand this correctly, the daemons are reacting to the pinnedimageset objects directly and self-determining whether they should be pulling an image. Thus there are two processes happening in parallel:
- the MCC rendering the new machineconfig and the MCD main node sync reacting to that
- the MCD reading newly added pinnedimagesets and pulling images
Is there a strict dependency on that ordering? Is there any existing guard for that (sorry if I missed it)? And is there any additional pinnedimageset correctness that needs to be processed by the controller before the daemon starts?
I guess the thought experiment is a large cluster with hundreds of nodes. While you only reload the crio daemon, each node still processes the machineconfig that enables pinnedimagesets (the crio TOML file) sequentially, with some built-in delay. This can take hours on large enough clusters, but I assume each daemon process would start the image pull already and potentially finish by the time the crio config updates.
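On the disruption question: the PR vendors k8s.io/cri-api and gRPC, so a cancellable, throttled pull loop over the CRI image service might look roughly like the sketch below; the function shape and names are illustrative, not the actual prefetch manager. If the context is cancelled mid-pull (daemon shutdown, reboot), the loop exits and the next sync simply retries; layers already committed to storage should make that retry cheaper than starting from scratch.

```go
package daemon // illustrative sketch only

import (
	"context"
	"time"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// prefetchImages pulls each image serially with a short cool-down and stops
// early when ctx is cancelled (for example when the daemon is shut down for a
// MachineConfig update). A later resync runs the loop again from the top.
func prefetchImages(ctx context.Context, client runtimeapi.ImageServiceClient, images []string) error {
	for _, img := range images {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if _, err := client.PullImage(ctx, &runtimeapi.PullImageRequest{
			Image: &runtimeapi.ImageSpec{Image: img},
		}); err != nil {
			return err
		}
		time.Sleep(1 * time.Second) // throttle to limit I/O pressure on the node
	}
	return nil
}
```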
```go
// getMachineConfigKey returns the managed key for the machine config
func getMachineConfigKey(pool *mcfgv1.MachineConfigPool, client mcfgclientset.Interface, imageSetOrig *mcfgv1.PinnedImageSet) (string, error) {
```
I think this will work like the crio/kubelet configuration rendering, meaning that custom pool config > worker pool config (but if you don't define a pinned image set for, say, your infra nodes, they will inherit the worker config and still try to pull as if they were workers).
That's probably the expected behaviour, but I wanted to check explicitly.
Right; since configs today are deployed at the pool level, I don't feel it makes sense for this controller to act in a different way. My understanding is that you can create a custom pool dedicated to a certain purpose, e.g. "infra"? In that case the config could be deployed to only those nodes which are members of that pool.
Overall this looks great. A few overall questions, as I may have missed things while briefly skimming through the code:
1. What happens to prefetched pinned images which are no longer referenced when a user removes some entries from the PinnedImageSet CR?
2. Do we want to add some sort of validation check to ensure that all images referenced in the PinnedImageSet are by hash and not tag?
3. This can happen in a separate PR, but how about adding an e2e test for this feature? It could go in the existing e2e-gcp-op job; if it adds a considerable amount of time, we can make it a separate e2e test.
1. Unpinned images are subject to future pruning/wipe. The scope of this feature does not include a pruning mechanism.
2. This is built into the API-level validation pattern (see the sketch below).
3. Sounds good.
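For illustration only, a minimal Go check for the "by digest, not tag" rule; a real implementation would live in the API-level validation mentioned above and use a proper image-reference parser rather than this simplified string match:

```go
package validation // illustrative sketch only

import (
	"fmt"
	"strings"
)

// validatePinnedImageRef rejects image references that are not pinned by
// digest. Simplified stand-in for proper reference parsing / CRD validation.
func validatePinnedImageRef(image string) error {
	if !strings.Contains(image, "@sha256:") {
		return fmt.Errorf("pinned image %q must be referenced by digest, not tag", image)
	}
	return nil
}
```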
- add gRPC-go
- add k8s.io/cri-api
Signed-off-by: Sam Batschelet <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: hexfusion. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
@hexfusion: The following tests failed, say …
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
updating api deps
FYI, when the code is ready for testing, let us know @sergiordlr @rioliu-rh @ptalgulk01 and hold this PR, THX
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR was a WIP test. A new PR with updated APIs and intent will follow shortly.
/close
@hexfusion: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@hexfusion: No Jira issue is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
This PR implements openshift/enhancements#1481
This PR adds the PinnedImageSet CRD, controller, and prefetch manager.
The PinnedImageSetController reconciles against two desired states (see the sketch below):
1. pinned_images configuration via MachineConfig. This is populated via the PinnedImageSet CRD [2].
2. MachineConfigNode status. Once the nodes in the pool defined by the CR have completed, the status is updated to reflect that.
The prefetch manager worker pool ensures that the images defined by the PinnedImageSet are prefetched on each node in the pool.
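As a reading aid, a high-level sketch of that two-part reconciliation; every type and helper below is a hypothetical stand-in, not the controller's actual code:

```go
package ctrl // high-level reading aid only; all names are hypothetical stand-ins

import "context"

// PinnedImageSet stands in for the real CRD type in this sketch.
type PinnedImageSet struct{ Name string }

// Controller stands in for the real PinnedImageSetController.
type Controller struct{}

// ensureMachineConfig would render and apply the MachineConfig that enables
// the pinned_images CRI-O configuration for the targeted pool.
func (c *Controller) ensureMachineConfig(ctx context.Context, is *PinnedImageSet) error { return nil }

// allNodesPrefetched would inspect per-node state (e.g. MachineConfigNode
// status) and report whether every node in the pool finished prefetching.
func (c *Controller) allNodesPrefetched(ctx context.Context, is *PinnedImageSet) (bool, error) {
	return false, nil
}

// updateStatus would roll the result up into the PinnedImageSet status.
func (c *Controller) updateStatus(ctx context.Context, is *PinnedImageSet, done bool) error {
	return nil
}

// syncPinnedImageSet mirrors the two desired states described above: first the
// MachineConfig-carried configuration, then status derived from node progress.
func (c *Controller) syncPinnedImageSet(ctx context.Context, is *PinnedImageSet) error {
	if err := c.ensureMachineConfig(ctx, is); err != nil {
		return err
	}
	done, err := c.allNodesPrefetched(ctx, is)
	if err != nil {
		return err
	}
	return c.updateStatus(ctx, is, done)
}
```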
Additional Logic.
Considerations
Because we may be pulling a large number of images, there is a concern about how that could affect the control plane. For this reason only a single worker is deployed on a master node. Each image is pulled serially with a 1s cool-down period, but this still results in noticeable I/O (observed on a basic idle AWS cluster). While this latency on its own is not an issue, under load it should be a consideration. Current proposed mitigations include exposing knobs around the concurrency and the throttle duration.
example CR
ref.
[1] MCO-838 https://issues.redhat.com//browse/MCO-838
[2] openshift/api#1713
Blocked by
https://issues.redhat.com/browse/OCPNODE-1986