
Adjust update strategy for NNF DaemonSets #118



bdevcich commented Jan 11, 2024

From @behlendorf:

It took a significant amount of time for it to kill and restart all the pods when I added in the merced filesystem since it ran through it sequentially. Thankfully that's a one time thing, but it seems it will make redeploying slow.

When you add/remove a lustrefilesystem resource, the nnf-dm-manager-controller sees that and then adds/removes a Volume and VolumeMount to the nnf-dm-worker DaemonSet. Kubernetes then handles it from there and restarts the nnf-dm-worker pods on each rabbit to mount/umount that filesystem change. The DaemonSet defines this for the updateStrategy:

    updateStrategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 1
      type: RollingUpdate

That maxUnavailable: 1 setting is what causes the sequential behavior. We'll need to tweak this.
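A minimal sketch of the tweak under discussion, assuming a percentage is used (the exact value is settled later in this thread): with a percentage, Kubernetes can take several pods down per rollout round instead of one.

```yaml
# Sketch only: the same DaemonSet updateStrategy, with maxUnavailable
# expressed as a percentage so multiple rabbit nodes restart in parallel.
updateStrategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 25%   # illustrative value, tunable per system
  type: RollingUpdate
```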

@bdevcich (Contributor Author)

We need to consider the same for lustre-csi-driver. There may be other areas where this applies as well.

@bdevcich (Contributor Author)

@behlendorf, I'm making changes to set this to a sane default for our 3 daemonsets:

  • nnf-dm-worker
  • lustre-csi-node
  • nnf-node-manager

maxUnavailable can be set to a number of nodes/pods or to a percentage. Setting it to 100% would attempt a restart on all the nodes at the same time; 50% would do half, 25% a quarter, and so on. Do you have a preference on what the percentage (or hard number) should be?

This value will be adjustable for each system.
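For intuition on how a percentage maps to pods, here is a small sketch of the rounding involved. For DaemonSet rolling updates, the absolute number is computed from the percentage by rounding up; the node counts below are assumptions for illustration only.

```python
import math

def parallel_restarts(node_count: int, max_unavailable: str) -> int:
    """Pods a DaemonSet rolling update may restart at once.

    Sketch of the percentage math only: a percentage is converted to an
    absolute pod count by rounding up, and at least one pod is restarted.
    """
    if max_unavailable.endswith("%"):
        pct = int(max_unavailable.rstrip("%"))
        return max(1, math.ceil(node_count * pct / 100))
    return int(max_unavailable)

# With 16 rabbit nodes (an assumed count):
print(parallel_restarts(16, "1"))     # 1  -> sequential, one pod at a time
print(parallel_restarts(16, "25%"))   # 4  -> a quarter of the pods in parallel
```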

@behlendorf (Collaborator)

@bdevcich as an initial swag how about 25%. This seems like it may be a reasonable compromise between propagating the changes rapidly and potentially overwhelming the system / container repository / other. Then we can tune on a per system basis, and revisit the default as needed.

@bdevcich (Contributor Author)

> @bdevcich as an initial swag how about 25%. This seems like it may be a reasonable compromise between propagating the changes rapidly and potentially overwhelming the system / container repository / other. Then we can tune on a per system basis, and revisit the default as needed.

Perfect, that's the percentage that I've been playing around with in my testing.

@bdevcich bdevcich moved this from 📋 Open to 👀 In review in Issues Dashboard Feb 13, 2024
@bdevcich bdevcich changed the title Adjust update strategy for nnf-dm worker pods Adjust update strategy for NNF DaemonSets Feb 13, 2024
@bdevcich (Contributor Author)

bdevcich commented Feb 13, 2024

PRs here to default these all to 25%:

For nnf-dm, the NnfDataMovementManager resource is edited rather than the DaemonSet directly; the manager is responsible for managing the DaemonSet.

@bdevcich (Contributor Author)

@behlendorf I am comfortable closing this issue after we implemented this manually today on El Cap. Do you agree?

@behlendorf (Collaborator)

Yup, things are looking much better after these changes.
