Fix duplicated Traceflow tag allocation due to Traceflow CRD updates #1094
Conversation
for _, n := range c.runningTraceflows {
    if n == name {
        // The Traceflow request has been processed already.
I guess using the name as the key might not be safe. For example, what will happen if the controller is disconnected from the K8s apiserver, and a processed CRD is deleted but a new CRD with the same name is created before the controller reconnects to the apiserver?
But maybe it is OK to let the customer delete the CRD that was created while the controller was disconnected or down.
Yes, to solve this case we need to store the tf name and UID in c.runningTraceflows, and improve the allocate/deallocate logic.
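For illustration, here is a minimal sketch of what keying the in-flight map by UID could look like; the type, field, and constant names are assumptions made for the sketch, not Antrea's actual code:

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// Illustrative limit on concurrently running Traceflows; not necessarily Antrea's value.
const maxTagNum uint8 = 14

// tagAllocator is a hypothetical stand-in for the controller's tag bookkeeping,
// mapping dataplane tag -> UID of the Traceflow that currently owns it.
type tagAllocator struct {
	sync.Mutex
	runningTraceflows map[uint8]types.UID
}

// allocate reuses the tag already owned by this Traceflow (same UID) or grabs a
// free one. A re-created CRD with the same name but a new UID no longer matches
// the old entry, so it cannot be mistaken for the already-processed request.
func (a *tagAllocator) allocate(uid types.UID) (uint8, error) {
	a.Lock()
	defer a.Unlock()
	// Reuse the tag already held by this exact Traceflow, if any.
	for tag, owner := range a.runningTraceflows {
		if owner == uid {
			return tag, nil
		}
	}
	// Otherwise grab the first free tag.
	for tag := uint8(1); tag <= maxTagNum; tag++ {
		if _, taken := a.runningTraceflows[tag]; !taken {
			a.runningTraceflows[tag] = uid
			return tag, nil
		}
	}
	return 0, fmt.Errorf("on-going Traceflow operations already reached the upper limit: %d", maxTagNum)
}
```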
Right. It feels like too big a change, and maybe not worthwhile.
Probably worth adding a note about this in the Traceflow doc.
Because in this phase only the controller updates these status properties, we use the PATCH method here. I'm OK to change this to UpdateStatus with RetryOnConflict.
Sorry, I did not get why PATCH is better if there is no other writer? Could you explain?
In a PATCH request, we only need to construct the body for what we want to change, and there is no need to use RetryOnConflict.
Maybe I missed something, but in antrea-controller, when your code calls UpdateStatus() I did not see RetryOnConflict either?
RetryOnConflict is not used in the controller but in the agent, in traceflow/packetin.go.
In the controller, checkTraceflowStatus also calls UpdateStatus, but it does not do RetryOnConflict. Could you also explain to me what the controller func RetryOnConflict is for?
What does "controller func RetryOnConflict" mean? There is only one RetryOnConflict usage, and it is in the agent.
Force-pushed from 483ba46 to c41cf37.
patchData := Traceflow{Status: opsv1alpha1.TraceflowStatus{Phase: tf.Status.Phase, DataplaneTag: dataPlaneTag}}
payloads, _ := json.Marshal(patchData)
return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})
Can we use RetryOnConflict to catch this, like what we did in the agent?
https://github.com/vmware-tanzu/antrea/blob/880794f5589b641fd77cb60e9ca8b71d7202c69a/pkg/agent/controller/traceflow/packetin.go#L46
I thought we would always retry on any error; even a patch can fail, and we should retry then too.
I can add a retry, but what is the problem with retrying in processTraceflowItem() to follow the standard controller retry model, instead of adding a new retry mechanism?
I just checked the code, and I feel we should always retry in processTraceflowItem() on any error, except in the case where the TF CRD has already been deleted. Would you agree?
I'm OK as long as the controller won't retry indefinitely.
Please note we need to retry periodically for a Running TF, to check whether the work is done or has timed out; that logic is in checkTraceflowStatus.
I actually feel the controller should retry indefinitely for the current errors, as they should be recoverable.
@tnqn: thoughts?
Agree, infinite retries with a rate-limited queue is a common controller pattern. The NotFound error is usually handled after tf, err := c.traceflowLister.Get(traceflowName) by not returning an error (then the item will be removed from the queue).
For the current errors, there seems to be no real reason to stop retrying: even for the error of no available tag, it could keep retrying until the Traceflow times out.
To keep it simpler, maybe we could return a single error value to indicate whether it should be retried.
AFAIK, both retrying immediately and relying on the workqueue to retry are used in K8s. I think the former avoids some repeated processing before the API call (in the next rounds) and finishes the task earlier, while the latter processes all tasks more fairly by not blocking on any specific task (the retry comes with backoff). I don't have a strong opinion on which mode should be used here.
Considering TF is an on-demand diagnostic tool, we may not see that many conflicts, so relying on the work queue for retries should be okay.
Thanks for the review. I will update the PR to retry for all errors.
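As a reference for the retry model agreed on here, a minimal sketch of the standard workqueue pattern; the type and function names (traceflowController, syncTraceflow) are illustrative stand-ins, not the PR's exact code. Any error requeues the item with rate-limited backoff; a nil error drops it from the queue:

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog"
)

// traceflowController is a minimal stand-in for the real controller;
// syncTraceflow represents the per-item work (tag allocation, status update, ...).
type traceflowController struct {
	queue         workqueue.RateLimitingInterface
	syncTraceflow func(name string) error
}

func (c *traceflowController) processTraceflowItem() bool {
	obj, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(obj)

	key := obj.(string)
	if err := c.syncTraceflow(key); err != nil {
		// Recoverable error (e.g. no available tag, API conflict):
		// requeue with rate-limited backoff and keep retrying.
		klog.Errorf("Error syncing Traceflow %s, requeuing: %v", key, err)
		c.queue.AddRateLimited(key)
		return true
	}
	// Success, or the Traceflow no longer exists: stop tracking retries for this key.
	c.queue.Forget(key)
	return true
}
```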
payloads, _ := json.Marshal(patchData)
return c.client.OpsV1alpha1().Traceflows().Patch(context.TODO(), tf.Name, types.MergePatchType, payloads, metav1.PatchOptions{}, "status")
tf.Status.Reason = reason
_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tf, metav1.UpdateOptions{})
ditto
return i, nil
}
}
return 0, errors.New("Too much traceflow currently")
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit :%d", maxTagNum)
return 0, fmt.Errorf("On-going Traceflow operations already reached the upper limit: %d", maxTagNum)
Fixed.
Looks like you still need to push this fix?
Yes, will update later today.
The PR itself LGTM, but there's a potential bug that existed before this PR; I'm fine with addressing it in this PR or in a separate one.
}

func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) (*opsv1alpha1.Traceflow, error) {
func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) error {
tf.Status.DataplaneTag = dataPlaneTag
Just noticed a potential bug, though not introduced by this PR:
The controller is modifying the object returned by the store's "Get" method, which is highly discouraged: https://github.com/kubernetes/kubernetes/blob/3f579d8971fcce96d6b01b968a46c720f10940b8/staging/src/k8s.io/client-go/tools/cache/thread_safe_store.go#L31-L40.
Although it might be fine for now, as we don't build indices on the two fields it mutates and don't operate on a single object concurrently, it's still safer to treat the object as read-only, as other controllers do. It may also affect the update event the eventHandler will receive.
Sure. Let me change it to copy the object.
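A minimal sketch of that change, mirroring the signature quoted above (the Controller stub, the client field, and the import paths are assumptions; the actual fix may differ): deep-copy the cached object before mutating its status.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import paths based on the repository referenced in this thread.
	opsv1alpha1 "github.com/vmware-tanzu/antrea/pkg/apis/ops/v1alpha1"
	clientset "github.com/vmware-tanzu/antrea/pkg/client/clientset/versioned"
)

// Controller is a minimal stand-in for the controller type quoted above.
type Controller struct {
	client clientset.Interface
}

func (c *Controller) runningTraceflowCRD(tf *opsv1alpha1.Traceflow, dataPlaneTag uint8) error {
	// Never mutate the object held by the shared informer cache; work on a copy.
	tfCopy := tf.DeepCopy()
	tfCopy.Status.DataplaneTag = dataPlaneTag
	_, err := c.client.OpsV1alpha1().Traceflows().UpdateStatus(context.TODO(), tfCopy, metav1.UpdateOptions{})
	return err
}
```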
Typo in commit message: proceessed -> processed
Fixed. Thanks!
A typo fix is missing, otherwise LGTM
For a CRD update, the controller checks whether the CRD has been processed and already has a tag allocated; if it has been processed already, the update is ignored by the controller. The controller also releases the allocated tag when it fails to update the CRD status.
LGTM
/test-all