Fix duplicated Traceflow tag allocation due to Traceflow CRD updates #1094
Conversation
for _, n := range c.runningTraceflows {
    if n == name {
        // The Traceflow request has been processed already.
I guess using the name as the key might not be safe. For example, what will happen if the controller is disconnected from the K8s apiserver, and a processed CRD is deleted but a new CRD with the same name is created before the controller reconnects to the apiserver?
But maybe it is OK to let the customer delete the CRD that was created while the controller was disconnected or down.
Yes, to solve this case we need to store the tf name and UID in c.runningTraceflows, and improve the allocate/deallocate logic.
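For illustration, here is a minimal sketch of what keying the in-flight map by UID could look like; the type, field, and constant names are assumptions made for the sketch, not Antrea's actual code:

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// Illustrative limit on concurrently running Traceflows; not necessarily Antrea's value.
const maxTagNum uint8 = 14

// tagAllocator is a hypothetical stand-in for the controller's tag bookkeeping,
// mapping dataplane tag -> UID of the Traceflow that currently owns it.
type tagAllocator struct {
	sync.Mutex
	runningTraceflows map[uint8]types.UID
}

// allocate reuses the tag already owned by this Traceflow (same UID) or grabs a
// free one. A re-created CRD with the same name but a new UID no longer matches
// the old entry, so it cannot be mistaken for the already-processed request.
func (a *tagAllocator) allocate(uid types.UID) (uint8, error) {
	a.Lock()
	defer a.Unlock()
	// Reuse the tag already held by this exact Traceflow, if any.
	for tag, owner := range a.runningTraceflows {
		if owner == uid {
			return tag, nil
		}
	}
	// Otherwise grab the first free tag.
	for tag := uint8(1); tag <= maxTagNum; tag++ {
		if _, taken := a.runningTraceflows[tag]; !taken {
			a.runningTraceflows[tag] = uid
			return tag, nil
		}
	}
	return 0, fmt.Errorf("on-going Traceflow operations already reached the upper limit: %d", maxTagNum)
}
```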
Right. It feels like too big a change, and maybe not worthwhile.
Probably worth adding a note about this in the Traceflow doc.
Because in this phase only the controller updates these status properties, we use the PATCH method here. I'm OK to change this to UpdateStatus with RetryOnConflict.
Sorry, I did not get why PATCH is better if there is no other writer? Could you explain?
In a PATCH request, we only need to construct the body for what we want to change, and there is no need to use RetryOnConflict.
Maybe I missed something, but in antrea-controller, when your code calls UpdateStatus() I did not see RetryOnConflict either?
RetryOnConflict is not used in the controller but in the agent, in traceflow/packetin.go.
In the controller, checkTraceflowStatus also calls UpdateStatus, but it does not do RetryOnConflict. Could you also explain to me what the controller func RetryOnConflict is for?
What does "controller func RetryOnConflict" mean? There is only one RetryOnConflict usage, and it is in the agent.
Force-pushed from 483ba46 to c41cf37.
patchData := Traceflow{Status: opsv1alpha1.TraceflowStatus{Phase: tf.Status.Phase, DataplaneTag: dataPlaneTag}}
payloads, _ := json.Marshal(patchData)
return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})
Can we use RetryOnConflict to catch this, like what we did in the agent?
https://github.com/vmware-tanzu/antrea/blob/880794f5589b641fd77cb60e9ca8b71d7202c69a/pkg/agent/controller/traceflow/packetin.go#L46
I thought we would always retry on any error; even a patch can fail, and we should retry then too.
I can add a retry, but what is the problem with retrying in processTraceflowItem() to follow the standard controller retry model, instead of adding a new retry mechanism?
I just checked the code, and I feel we should always retry in processTraceflowItem() on any error, except in the case where the TF CRD has already been deleted. Would you agree?
I'm OK as long as the controller won't retry indefinitely.
Please note we need to retry periodically for a Running TF, to check whether the work is done or has timed out; that logic is in checkTraceflowStatus.
I actually feel the controller should retry indefinitely for the current errors, as they should be recoverable.
@tnqn: thoughts?
Agree, infinite retries with a rate-limited queue is a common controller pattern. The NotFound error is usually handled after tf, err := c.traceflowLister.Get(traceflowName) by not returning an error (then the item will be removed from the queue).
For the current errors, there seems to be no real reason to stop retrying: even for the error of no available tag, it could keep retrying until the Traceflow times out.
To keep it simpler, maybe we could return a single error value to indicate whether it should be retried.
AFAIK, both retrying immediately and relying on the workqueue to retry are used in K8s. I think the former avoids some repeated processing before the API call (in the next rounds) and finishes the task earlier, while the latter processes all tasks more fairly by not blocking on any specific task (the retry comes with backoff). I don't have a strong opinion on which mode should be used here.
Considering TF is an on-demand diagnostic tool, we may not see that many conflicts, so relying on the work queue for retries should be okay.
Thanks for the review. I will update the PR to retry for all errors.
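As a reference for the retry model agreed on here, a minimal sketch of the standard workqueue pattern; the type and function names (traceflowController, syncTraceflow) are illustrative stand-ins, not the PR's exact code. Any error requeues the item with rate-limited backoff; a nil error drops it from the queue:

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog"
)

// traceflowController is a minimal stand-in for the real controller;
// syncTraceflow represents the per-item work (tag allocation, status update, ...).
type traceflowController struct {
	queue         workqueue.RateLimitingInterface
	syncTraceflow func(name string) error
}

func (c *traceflowController) processTraceflowItem() bool {
	obj, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(obj)

	key := obj.(string)
	if err := c.syncTraceflow(key); err != nil {
		// Recoverable error (e.g. no available tag, API conflict):
		// requeue with rate-limited backoff and keep retrying.
		klog.Errorf("Error syncing Traceflow %s, requeuing: %v", key, err)
		c.queue.AddRateLimited(key)
		return true
	}
	// Success, or the Traceflow no longer exists: stop tracking retries for this key.
	c.queue.Forget(key)
	return true
}
```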
payloads, _ := json.Marshal(patchData)
return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
tf.Status.Reason = reason
_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})
ditto
return i, nil
}
}
return 0, errors.New("Too much traceflow currently")
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit: %d", maxTagNum)
Fixed.
Looks like you still need to push this fix?
Yes, will update later today.
The PR itself LGTM, but there's a potential bug that existed before this PR; I'm fine with addressing it in this PR or in a separate one.
}

func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) (*opsv1alpha1.Traceflow, error) {
func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) error {
tf.Status.DataplaneTag = dataPlaneTag
Just noticed a potential bug, though not introduced by this PR:
The controller is modifying the object returned by the store's "Get" method, which is highly discouraged: https://github.com/kubernetes/kubernetes/blob/3f579d8971fcce96d6b01b968a46c720f10940b8/staging/src/k8s.io/client-go/tools/cache/thread_safe_store.go#L31-L40.
Although it might be fine for now, as we don't build indices on the two fields it mutates and don't operate on a single object concurrently, it's still safer to treat the object as read-only, as other controllers do. It may also affect the update event the eventHandler will receive.
Sure. Let me change it to copy the object.
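A minimal sketch of that change, mirroring the signature quoted above (the Controller stub, the client field, and the import paths are assumptions; the actual fix may differ): deep-copy the cached object before mutating its status.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import paths based on the repository referenced in this thread.
	opsv1alpha1 "github.com/vmware-tanzu/antrea/pkg/apis/ops/v1alpha1"
	clientset "github.com/vmware-tanzu/antrea/pkg/client/clientset/versioned"
)

// Controller is a minimal stand-in for the controller type quoted above.
type Controller struct {
	client clientset.Interface
}

func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) error {
	// Never mutate the object held by the shared informer cache; work on a copy.
	tfCopy := tf.DeepCopy()
	tfCopy.Status.DataplaneTag = dataPlaneTag
	_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tfCopy, metav1.UpdateOptions{})
	return err
}
```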
Typo in commit message: proceessed -> processed
Fixed. Thanks!
A typo fix is missing, otherwise LGTM
For a CRD update, the controller checks whether the CRD has been processed and already has a tag allocated; if it has been processed already, the update is ignored by the controller. The controller also releases the allocated tag when it fails to update the CRD status.
LGTM
/test-all