Add startTime to the Traceflow Status #2952
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #2952      +/-   ##
==========================================
+ Coverage   59.84%   61.07%    +1.23%
==========================================
  Files         289      289
  Lines       24551    24562       +11
==========================================
+ Hits        14693    15002      +309
+ Misses       8250     7941      -309
- Partials     1608     1619       +11
Flags with carried forward coverage won't be shown.
LGTM
/test-all
@wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check whether the clock on the control plane Node is at least 10s ahead of the clock on the Node where antrea-controller runs.
/test-integration
It seems you forgot to update the generated code.
Force-pushed from 571d38f to 514bc56
I think that other people using the Vagrant-based cluster may experience the same issue if they keep it running for a long time.
@tnqn thanks for the reminder, I have updated the generated code.
/test-all
LGTM
/test-e2e
Thanks, I got it now.
Yes, you are right. The clock on the control plane Node is at least 19s ahead of the clock on the Node where antrea-controller runs.
The clock on the control plane Node:
The Node where antrea-controller runs:
/test-e2e
2 Traceflow tests are failing with this change; I am investigating the failures.
The startTime is used to determine if a Traceflow has timed out and should be reported as failed.
Until now we were relying on the CreationTimestamp set by the kube-apiserver. However, by chance, I was testing Traceflow on a cluster with a clock skew between the control plane Node (running the kube-apiserver) and the worker Node running the antrea-controller. Because of the clock skew, each Traceflow request was tagged as failed as soon as I created it (with reason timeout). It took me a while to figure out the reason.
By introducing the startTime field (set & used by the antrea-controller), we avoid the possibility of such issues. If startTime is not available for any reason, we fall back to the CreationTimestamp.
This API change is backwards-compatible.
Signed-off-by: Antonin Bas <[email protected]>
Signed-off-by: Antonin Bas <[email protected]>
To avoid issues with OpenAPI validation in K8s versions prior to v1.20. See kubernetes/kubernetes#86811.
The recommendation to avoid issues with metav1.Time being serialized as null when the field is unset seems to be to make it a pointer. A nil pointer is them omitted during serialization. As far as I can tell, it should also be possible to stick with a metav1.Time and make the property "nullable" in the OpenAPI schema, but I am not sure whether this can create other issues.
Signed-off-by: Antonin Bas <[email protected]>
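For illustration, here is a minimal sketch of the serialization behavior that commit message describes. The struct below is hypothetical (not the actual Antrea API type); it only shows how a *metav1.Time with omitempty is dropped from the JSON when unset, instead of being rendered as null.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical, trimmed-down status struct: with a *metav1.Time pointer and
// `omitempty`, an unset startTime is simply omitted from the serialized
// object instead of appearing as "startTime": null, which is what triggers
// the OpenAPI validation issue mentioned in the commit message.
type traceflowStatus struct {
	StartTime *metav1.Time `json:"startTime,omitempty"`
}

func main() {
	unset, _ := json.Marshal(traceflowStatus{})
	fmt.Println(string(unset)) // {} -- nil pointer is omitted entirely

	now := metav1.NewTime(time.Now())
	set, _ := json.Marshal(traceflowStatus{StartTime: &now})
	fmt.Println(string(set)) // {"startTime":"..."} -- RFC3339 timestamp
}
```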
Force-pushed from 514bc56 to 070e010
/test-all
1 similar comment
/test-all
@tnqn could you please review the last commit? I had to make a change in order to support K8s versions older than 1.20.
LGTM, a typo in the message of 3rd commit: s/them/then
/test-integration
This reverts commit f679f04.
The startTime is used to determine if a Traceflow has timed out and
should be reported as failed.
Until now we were relying on the CreationTimestamp set by the
kube-apiserver. However, by chance, I was testing Traceflow on a cluster
with a clock skew between the control plane Node (running the
kube-apiserver) and the worker Node running the
antrea-controller. Because of the clock skew, each Traceflow request was
tagged as failed as soon as I created it (with reason timeout). It took
me a while to figure out the reason.
By introducing the startTime field (set & used by the
antrea-controller), we avoid the possibility of such issues. If
startTime is not available for any reason, we fall back to the
CreationTimestamp.
This API change is backwards-compatible.
Signed-off-by: Antonin Bas [email protected]
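For illustration, a minimal sketch of the timeout check the commit message describes, assuming a simplified Traceflow type; the type and function names below are hypothetical and not the actual antrea-controller code.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Simplified, hypothetical view of a Traceflow: only the fields needed to
// illustrate the timeout logic described in the commit message.
type traceflow struct {
	metav1.ObjectMeta
	StartTime *metav1.Time // set by antrea-controller; may be nil
}

// startTime prefers the controller-owned Status.startTime and falls back to
// the kube-apiserver CreationTimestamp when it is not available.
func startTime(tf *traceflow) time.Time {
	if tf.StartTime != nil {
		return tf.StartTime.Time
	}
	return tf.CreationTimestamp.Time
}

// hasTimedOut measures the timeout against the controller's own clock
// reference, so a skew between the control plane Node and the Node running
// antrea-controller no longer marks fresh Traceflows as failed.
func hasTimedOut(tf *traceflow, timeout time.Duration, now time.Time) bool {
	return now.Sub(startTime(tf)) > timeout
}

func main() {
	start := metav1.NewTime(time.Now().Add(-5 * time.Second))
	tf := &traceflow{StartTime: &start}
	fmt.Println(hasTimedOut(tf, 10*time.Second, time.Now())) // false: only ~5s elapsed
}
```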