
Fix duplicated Traceflow tag allocation due to Traceflow CRD updates #1094

Merged 1 commit into antrea-io:master from tf-update on Aug 20, 2020

Conversation


@jianjuns jianjuns commented Aug 14, 2020

For a CRD update, the controller checks whether the CRD has already been
processed and has a tag allocated; if it has been processed already,
the update is ignored by the controller.
The controller also releases the allocated tag when it fails to update
the CRD status.

@antrea-bot (Collaborator)

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on Linux.
  • /skip-whole-conformance: to skip all conformance tests on Linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@jianjuns (Contributor, Author)

@gran-vmv @abhiraut I did not really root-cause it, but this fix resolves the issue reported by Dumlu, where the Traceflow CRD Status is not set to Running. My guess is that it might be due to duplicated patches by the controller. Maybe you have a theory.


for _, n := range c.runningTraceflows {
if n == name {
// The Traceflow request has been processed already.
Contributor Author

I guess using the name as the key might not be safe. For example, what happens if the controller is disconnected from the K8s apiserver, and a processed CRD is deleted but a new CRD with the same name is created before the controller reconnects?
But maybe it is OK to let the customer delete the CRD that was created while the controller was disconnected/down.

Contributor

Yes, to solve this case we would need to store the tf name and UID in c.runningTraceflows, and improve the allocate/deallocate logic.

Contributor Author

Right. It feels like too big a change, and maybe not worthwhile.

Contributor

Probably worth adding a note in the Traceflow doc about this.
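The name-vs-UID concern in this thread can be made concrete with a short Go sketch (names and the tfEntry shape are hypothetical, not Antrea's actual code): keying the running set on both name and UID distinguishes an update to an already-processed CRD from a CRD that was deleted and re-created with the same name while the controller was disconnected.

```go
package main

import "fmt"

// tfEntry records which UID a Traceflow name was processed under,
// along with its allocated dataplane tag. (Hypothetical layout.)
type tfEntry struct {
	uid string
	tag uint8
}

type controller struct {
	runningTraceflows map[string]tfEntry
}

// alreadyProcessed reports whether a Traceflow with this name AND UID
// already had a tag allocated. A same-name entry with a different UID
// means the CRD was deleted and re-created, so it is not a duplicate.
func (c *controller) alreadyProcessed(name, uid string) bool {
	e, ok := c.runningTraceflows[name]
	return ok && e.uid == uid
}

func main() {
	c := &controller{runningTraceflows: map[string]tfEntry{
		"tf-a": {uid: "uid-1", tag: 1},
	}}
	fmt.Println(c.alreadyProcessed("tf-a", "uid-1")) // update of a processed CRD
	fmt.Println(c.alreadyProcessed("tf-a", "uid-2")) // re-created CRD, must be re-processed
}
```

With only names as keys (as discussed above), the second case would wrongly be treated as already processed.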

@antoninbas (Contributor)

@jianjuns I understand that there is a tag leak, but do you know why avoiding the duplicate Patch request fixes the status not set issue?

@gran-vmv I'm also curious as to why we use Patch as opposed to UpdateStatus in runningTraceflowCRD? It seems that the agents should see the updates either way?

@gran-vmv (Contributor)

> @gran-vmv I'm also curious as to why we use Patch as opposed to UpdateStatus in runningTraceflowCRD? It seems that the agents should see the updates either way?

Because in this phase only the controller updates these status properties, we use the PATCH method here. I'm OK to change this to UpdateStatus with RetryOnConflict.

@jianjuns (Contributor, Author)

> Because in this phase, only controller will update this status properties, thus we use PATCH method here. I'm OK to change this to UpdateStatus with RetryOnConflict.

Sorry, I did not get why PATCH is better if there is no other writer? Could you explain?

@jianjuns jianjuns changed the title WIP - Fix duplicated Traceflow tag allocation due to Traceflow CRD updates Fix duplicated Traceflow tag allocation due to Traceflow CRD updates Aug 17, 2020
@gran-vmv (Contributor)

> Sorry, I did not get why PATCH is better if no other writer? Could you explain?

With a PATCH request, we only need to construct the body for the fields we want to change, and there is no need for RetryOnConflict.

@jianjuns (Contributor, Author)

> In a PATCH request, we only need to construct the body for what we want to change, and no need to use RetryOnConflict.

Maybe I missed something, but in antrea-controller, when your code calls UpdateStatus() I did not see RetryOnConflict either?
I do see a RetryOnConflict func in controller.go, but it is never called, and it does not seem to be about RetryOnConflict for updates?

@gran-vmv (Contributor)

> Maybe I missed something, but in antrea-controller, when your code calls UpdateStatus() I did not see RetryOnConflict either?

RetryOnConflict is not used in the controller but in the agent, in traceflow/packetin.go.

@jianjuns (Contributor, Author)

> RetryOnConflict is not used in controller but in agent, traceflow/packetin.go

In the controller, checkTraceflowStatus also calls UpdateStatus, but it does not do RetryOnConflict.

Could you also explain to me what the controller func RetryOnConflict is for?

@gran-vmv (Contributor)

> Could you also explain to me what the controller func RetryOnConflict is for?

What does "controller func RetryOnConflict" mean? There is only one RetryOnConflict usage, in the agent.
If an object is updated by multiple clients, we need to use RetryOnConflict to handle the conflicts.

@jianjuns jianjuns force-pushed the tf-update branch 2 times, most recently from 483ba46 to c41cf37 Compare August 18, 2020 04:25
- patchData := Traceflow{Status: opsv1alpha1.TraceflowStatus{Phase: tf.Status.Phase, DataplaneTag: dataPlaneTag}}
- payloads, _ := json.Marshal(patchData)
- return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
+ _, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})

Contributor Author

I thought we would always retry on any error; even a patch can fail, and we should retry then too.

I can add a retry, but what is the problem with retrying in processTraceflowItem(), following the standard controller retry model, instead of adding a new retry mechanism?

Contributor Author

I just checked the code, and I feel processTraceflowItem() should always retry on any error, except when the TF CRD is already deleted. Would you agree?

Contributor

I'm OK as long as the controller won't retry indefinitely.
Please note we need to retry periodically for a Running TF, to check whether the work is done or has timed out; the logic is in checkTraceflowStatus.

Contributor Author

I actually feel the controller should retry indefinitely for the current errors, as they should be recoverable.
@tnqn : thoughts?

Member

Agree, infinite retries with a rate-limiting queue is a common pattern for controllers. The NotFound error is usually handled after tf, err := c.traceflowLister.Get(traceflowName) by not returning an error (then the item is removed from the queue).
For the current errors, there seems to be no real reason to stop retrying: even for the no-available-tag error, the controller could keep retrying until the Traceflow times out.
Maybe we could return a single error value to indicate whether it should be retried, to keep it simpler.
AFAIK, both retrying immediately and relying on the workqueue to retry are used in K8s. I think the former can avoid some repeated processing before the API call (in the next rounds) and finish the task earlier, while the latter processes all tasks more fairly by not blocking on any specific task (the retry comes with backoff). I don't have a strong opinion on which mode should be used here.

Contributor

Considering TF is a diagnostic tool for on-demand purposes, we may not see many conflicts, so relying on the work queue for retries should be okay.

Contributor Author

Thanks for the review. I will update the PR to retry for all errors.

- payloads, _ := json.Marshal(patchData)
- return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
+ tf.Status.Reason = reason
+ _, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})
Contributor

ditto

return i, nil
}
}
- return 0, errors.New("Too much traceflow currently")
+ return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
Member

Suggested change:
- return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
+ return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit: %d", maxTagNum)

Contributor Author

Fixed.

Contributor

Looks like you still need to push this fix?

Contributor Author

Yes, will update later today.
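The allocator behavior behind the diff above can be sketched as follows. This is illustrative only: maxTagNum, the tag range, and the map layout are assumptions, not Antrea's exact implementation; the point is that tags come from a small fixed range and allocation fails with the error message discussed above once every tag is in use.

```go
package main

import "fmt"

// maxTagNum is the assumed size of the dataplane tag range; a running
// Traceflow holds one tag until it completes or times out.
const maxTagNum = 16

// allocateTag returns the first free tag in [1, maxTagNum] and records
// which Traceflow owns it, or an error when the range is exhausted.
func allocateTag(inUse map[uint8]string, name string) (uint8, error) {
	for i := uint8(1); i <= maxTagNum; i++ {
		if _, ok := inUse[i]; !ok {
			inUse[i] = name
			return i, nil
		}
	}
	return 0, fmt.Errorf("on-going Traceflow operations already reached the upper limit: %d", maxTagNum)
}

func main() {
	inUse := map[uint8]string{}
	for i := 0; i < maxTagNum; i++ {
		if _, err := allocateTag(inUse, fmt.Sprintf("tf-%d", i)); err != nil {
			panic(err)
		}
	}
	_, err := allocateTag(inUse, "tf-overflow")
	fmt.Println(err) // every tag is taken, so the 17th request fails
}
```

This also shows why the duplicate-allocation bug mattered: a CRD update that allocated a second tag for the same Traceflow would leak a slot from this small range.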

abhiraut
abhiraut previously approved these changes Aug 18, 2020

@tnqn (Member) left a comment

The PR itself LGTM, but there is a potential bug that existed before the PR; I'm fine with addressing it in this PR or a separate one.

pkg/controller/traceflow/controller.go (outdated, resolved)
}

- func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) (*opsv1alpha1.Traceflow, error) {
+ func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) error {
  tf.Status.DataplaneTag = dataPlaneTag
Member

Just noticed a potential bug, though not one introduced by this PR:
the controller is modifying the object returned by the store's "Get" method, which is highly discouraged: https://github.com/kubernetes/kubernetes/blob/3f579d8971fcce96d6b01b968a46c720f10940b8/staging/src/k8s.io/client-go/tools/cache/thread_safe_store.go#L31-L40.
Although it might be fine for now, since we don't build indices on the two fields it mutates and don't operate on a single object concurrently, it is still safer to treat the object as read-only, as recommended and as other controllers do. It may also affect the update event the eventHandler receives.

Contributor Author

Sure. Let me change to copy the object.
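The read-only rule discussed above can be illustrated with a short sketch. Traceflow here is a simplified stand-in with a hand-written DeepCopy (the real type's DeepCopy is generated by deepcopy-gen): the point is to mutate a copy, never the cached object the lister returned.

```go
package main

import "fmt"

// Simplified stand-in for opsv1alpha1.TraceflowStatus.
type TraceflowStatus struct {
	Phase        string
	DataplaneTag uint8
}

// Simplified stand-in for opsv1alpha1.Traceflow.
type Traceflow struct {
	Name   string
	Status TraceflowStatus
}

// DeepCopy returns an independent copy. For this flat struct a value
// copy suffices (no pointers, maps, or slices to share).
func (tf *Traceflow) DeepCopy() *Traceflow {
	out := *tf
	return &out
}

func main() {
	cached := &Traceflow{Name: "tf-a"} // what the lister's Get returned
	tf := cached.DeepCopy()            // mutate the copy, not the cache
	tf.Status.Phase = "Running"
	tf.Status.DataplaneTag = 7
	fmt.Println(cached.Status.DataplaneTag, tf.Status.DataplaneTag) // 0 7
}
```

Mutating `cached` directly would silently change the informer cache shared with every other consumer, which is exactly what the review comment warns against.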

@tnqn (Member)

tnqn commented Aug 19, 2020

typo in commit message: proceessed -> processed

@jianjuns (Contributor, Author)

> typo in commit message: proceessed -> processed

Fixed. Thanks!

@tnqn (Member) left a comment

A typo fix is missing, otherwise LGTM

pkg/controller/traceflow/controller.go (outdated, resolved)
@tnqn (Member) left a comment

LGTM

@tnqn (Member)

tnqn commented Aug 20, 2020

/test-all

@jianjuns jianjuns merged commit f2cdeed into antrea-io:master Aug 20, 2020
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Aug 21, 2020
antoninbas pushed a commit that referenced this pull request Aug 21, 2020
@jianjuns jianjuns deleted the tf-update branch August 25, 2020 21:54
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
7 participants