Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle update network status failed after add and del attachments #237

Merged

Conversation

Longchuanzheng
Copy link
Contributor

@Longchuanzheng Longchuanzheng commented Jul 23, 2024

What this PR does / why we need it:
This pr delete all attachments added in this processNextWorkItem when set NetworkStatus failed in handleResult. Because DEL is Idempotence we can always return to the state before dynamic handle.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #235

Special notes for your reviewer (optional):

@Longchuanzheng Longchuanzheng requested a review from maiqueb as a code owner July 23, 2024 11:02
@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch from 3d1610d to 0320270 Compare July 23, 2024 11:16
@Longchuanzheng Longchuanzheng changed the title handle Interface already exists due to SetNetworkStatus failed Handle update network status failed after add and del attachments Jul 23, 2024
@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch from 0320270 to c1a00ec Compare July 24, 2024 01:48
@Longchuanzheng
Copy link
Contributor Author

This pr will cancel(delete) the added attachment in two cases:

  1. When update network status failed in handleResult, we delete all attachment added in this reconcile.
  2. When add attachments failed in reconcile, we try to delete add failed and all subsequent attachments that need to be added.

Copy link
Collaborator

@LionelJouin LionelJouin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feature makes sense. I don't know if it would be possible to add unit tests for it?

pkg/controller/pod.go Outdated Show resolved Hide resolved
pkg/controller/pod.go Outdated Show resolved Hide resolved
pkg/controller/pod.go Outdated Show resolved Hide resolved
@Longchuanzheng
Copy link
Contributor Author

I think this feature makes sense. I don't know if it would be possible to add unit tests for it?

ok, I will try.

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch from 61ec648 to c2829d1 Compare July 27, 2024 05:55
@Longchuanzheng
Copy link
Contributor Author

Right now, it will return an err if we want to delet net1 in ut. If we change this, some previous ut will failed. So my ut expect FailedRemovingInterface for remove the net1.
Or do we need do more in this pr to improve ut (I think it will change more codes in ut.)?

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 79be62f to 1c54179 Compare July 30, 2024 09:11
@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch from 1c54179 to 24c5dbe Compare August 7, 2024 02:27
if shouldDeleteattachmentsAdded {
attachmentsToCancel = attachmentsToAdd
} else if addfailedIndex != -1 {
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[:addfailedIndex])

Why having reverted this changes? All non failing attachment added must be reverted no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point ...

I agree with @LionelJouin; shouldn't you delete all the attachments you've just added ? i.e. setting the pod back to its original state ?

otherwise, the pod's network-status and the pod's actual interfaces will not be consistent.

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))
})

It("the pod network-status dosen't change", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It("the pod network-status dosen't change", func() {
It("the pod network-status doesn't change", func() {

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))

expectedEventPayload = fmt.Sprintf(
"Warning FailedRemovingInterface pod [%s]: failed removing interface %s from network: %s",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add networkConfig(multuscni.CmdDel, "net1", macAddr), to the fake multus client line 126 to test the simple case if you want (the simple case I mean: everything works except status update).

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))
})

It("the pod network-status dosen't change", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is required as the status update has been prevented, so this might test Kubernetes.

Comment on lines +644 to +663
expectedError := errors.New("someerror")
k8sClient.PrependReactor("update", "pods", func(_ k8stesting.Action) (bool, runtime.Object, error) {
return true, nil, expectedError
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be moved to the JustBeforeEach

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 13eaa7b to 2cd1a8d Compare August 30, 2024 06:31
@Longchuanzheng
Copy link
Contributor Author

@maiqueb Hi, can you review this pr, thank you in advance. I think this #235 will cause pod add/del attachnet stuck.
It is not very friendly if we stuck when try to hotplugin nic for vm with kubevirt.

Copy link
Collaborator

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. I agree we should keep the pod's interfaces consistent with the reported network-status at all times. Extra thanks for complying and adding the unit test @LionelJouin requested.

I'm requesting some changes to improve the overall code quality, and I also wonder if your approach (only deleting one interface) wouldn't break the pod interface / pod network status consistency.

Comment on lines 246 to 248
defer func() {
pnc.handleResult(err, podNamespacedName, pod, results, attachmentsToAdd, netnsPath, podSandboxID, addFailedIndex)
}()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how this looks.

How about we define a Rollback interface and pass that to the handle result func ?

It would be something like

type interface Rollback {
    Rollback() error
}

type asd struct {
    netnsPath string
    podSandboxID string
    attachmentsToRollback []nadv1.NetworkSelectionElement
}

func (a *asd) Rollback() error {
//     ... invoke DEL for attachments on asd.attachments
}

Then, on handleResult you just invoke rollback if err != nil

pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
if shouldDeleteattachmentsAdded {
attachmentsToCancel = attachmentsToAdd
} else if addfailedIndex != -1 {
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point ...

I agree with @LionelJouin; shouldn't you delete all the attachments you've just added ? i.e. setting the pod back to its original state ?

otherwise, the pod's network-status and the pod's actual interfaces will not be consistent.

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 13eaa7b to 2aa70ec Compare September 27, 2024 12:33
Copy link
Collaborator

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good.

@LionelJouin could you please review ?
@oshoval could you help me out in this one ?

I think this is close to a merge, thanks ! And sorry it took so long for me to review it.

Copy link
Member

@oshoval oshoval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks nice
just a nit so far

pkg/controller/pod.go Outdated Show resolved Hide resolved
@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 13eaa7b to b2044a4 Compare November 8, 2024 01:51
Copy link
Collaborator

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending CI results, this looks good.

I'll take a last quick review on Monday, then, assuming I don't see anything weird, I'll proceed with the merge.

Huge thanks for the contribution, and for your patience. I'm aware this took more than anticipated ...

@maiqueb
Copy link
Collaborator

maiqueb commented Nov 29, 2024

I see your PR now fails the linter tests ... Sorry about that.

Let me take a quick look; FWIW, I'm considering merging it and fixing it myself afterwards.

Really sorry this is taking so long.

Copy link
Collaborator

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's bring this in.

I'll fix the main branch on a follow up PR.

Huge thanks.

@maiqueb maiqueb merged commit 34ae165 into k8snetworkplumbingwg:main Nov 29, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Dynamic network plugin stuck due to updating pod network status failed
4 participants