Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add startTime to the Traceflow Status #2952

Merged

Conversation

antoninbas
Copy link
Contributor

The startTime is used to determine if a Traceflow has timed out and
should be reported as failed.

Until now we were relying on the CreationTimestamp set by the
kube-apiserver. However, by chance, I was testing Traceflow on a cluster
with a clock skew between the control plane Node (running the
kube-apiserver) and the worker Node running the
antrea-controller. Because of the clock skew, each Traceflow request was
tagged as failed as soon as I created it (with reason timeout). It took
me a while to figure out the reason.

By introducing the startTime field (set & used by the
antrea-controller), we avoid the possibility of such issues. If
startTime is not available for any reason, we fall back to the
CreationTimestamp.

This API change is backwards-compatible.

Signed-off-by: Antonin Bas [email protected]

@antoninbas antoninbas added area/ops/traceflow Issues or PRs related to the Traceflow feature kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API. labels Oct 30, 2021
@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2021

Codecov Report

Merging #2952 (070e010) into main (5db792d) will increase coverage by 1.23%.
The diff coverage is 73.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2952      +/-   ##
==========================================
+ Coverage   59.84%   61.07%   +1.23%     
==========================================
  Files         289      289              
  Lines       24551    24562      +11     
==========================================
+ Hits        14693    15002     +309     
+ Misses       8250     7941     -309     
- Partials     1608     1619      +11     
Flag Coverage Δ
kind-e2e-tests 48.21% <53.33%> (+1.41%) ⬆️
unit-tests 40.18% <66.66%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/controller/traceflow/controller.go 74.07% <73.33%> (-0.65%) ⬇️
pkg/apiserver/handlers/endpoint/handler.go 58.82% <0.00%> (-11.77%) ⬇️
...g/controller/networkpolicy/store/appliedtogroup.go 86.36% <0.00%> (-3.04%) ⬇️
...gent/controller/networkpolicy/status_controller.go 72.60% <0.00%> (-2.74%) ⬇️
...gent/controller/noderoute/node_route_controller.go 54.91% <0.00%> (-1.10%) ⬇️
pkg/agent/controller/networkpolicy/reconciler.go 77.19% <0.00%> (+0.20%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 71.34% <0.00%> (+0.91%) ⬆️
pkg/agent/nodeportlocal/k8s/npl_controller.go 62.02% <0.00%> (+1.04%) ⬆️
pkg/monitor/controller.go 29.10% <0.00%> (+1.49%) ⬆️
pkg/ovs/openflow/ofctrl_action.go 69.58% <0.00%> (+1.66%) ⬆️
... and 11 more

tnqn
tnqn previously approved these changes Nov 2, 2021
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Nov 2, 2021

/test-all

@wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check if the clock on controlplane node is at least 10s sooner than the clock on the Node antrea-controller runs.

@tnqn
Copy link
Member

tnqn commented Nov 2, 2021

/test-integration

@tnqn
Copy link
Member

tnqn commented Nov 2, 2021

It seems you forgot to update generated code.

diff --git a/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go b/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
index a2a892fd..435977f4 100644
--- a/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
+++ b/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
@@ -732,6 +732,7 @@ func (in *TraceflowSpec) DeepCopy() *TraceflowSpec {
 // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
 func (in *TraceflowStatus) DeepCopyInto(out *TraceflowStatus) {
        *out = *in
+       in.StartTime.DeepCopyInto(&out.StartTime)
        if in.Results != nil {
                in, out := &in.Results, &out.Results
                *out = make([]NodeResult, len(*in))

@antoninbas
Copy link
Contributor Author

I think that other people using the Vagrant-based cluster may experience the same issue if they keep it running for a long time.

@tnqn thanks for the reminder, I updated the generated code

@antoninbas antoninbas added this to the Antrea v1.5 release milestone Nov 2, 2021
@antoninbas
Copy link
Contributor Author

/test-all

tnqn
tnqn previously approved these changes Nov 3, 2021
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor Author

/test-e2e
/test-networkpolicy

@wenqiq
Copy link
Contributor

wenqiq commented Nov 5, 2021

@wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check if the clock on controlplane node is at least 10s sooner than the clock on the Node antrea-controller runs.

Thanks for this information. I will take a look at it. Related issue #2944

@tnqn
Copy link
Member

tnqn commented Nov 5, 2021

@wenqiq #2944 is not caused by this. I meant the issue when we used "antctl traceflow" to debug connection issue, which failed immediately with Timeout error.

@wenqiq
Copy link
Contributor

wenqiq commented Nov 5, 2021

@wenqiq #2944 is not caused by this. I meant the issue when we used "antctl traceflow" to debug connection issue, which failed immediately with Timeout error.

Thanks, I got it now.

@wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check if the clock on controlplane node is at least 10s sooner than the clock on the Node antrea-controller runs.

Yes, you are right. The clock on controlplane node is at least 19s sooner than the clock on the Node antrea-controller runs.

The clock on controlplane node:

[core@sunq-antrea-scale-01-6w6g6-master-0 ~]$ date
Fri Nov  5 07:18:05 UTC 2021

The Node antrea-controller runs:

[core@sunq-antrea-scale-01-6w6g6-worker-z22ws ~]$ date
Fri Nov  5 07:18:24 UTC 2021

@antoninbas
Copy link
Contributor Author

/test-e2e

@antoninbas
Copy link
Contributor Author

2 Traceflow tests are failing with this change, I am investigating the failures

            --- FAIL: TestTraceflow/testTraceflowIntraNode/traceflowGroupTest/hostNetworkSrcPodIPv4 (15.02s)
            --- FAIL: TestTraceflow/testTraceflowIntraNode/traceflowGroupTest/nonExistingSrcPodIPv4 (15.03s)

The startTime is used to determine if a Traceflow has timed out and
should be reported as failed.

Until now we were relying on the CreationTimestamp set by the
kube-apiserver. However, by chance, I was testing Traceflow on a cluster
with a clock skew between the control plane Node (running the
kube-apiserver) and the worker Node running the
antrea-controller. Because of the clock skew, each Traceflow request was
tagged as failed as soon as I created it (with reason timeout). It took
me a while to figure out the reason.

By introducing the startTime field (set & used by the
antrea-controller), we avoid the possibility of such issues. If
startTime is not available for any reason, we fall back to the
CreationTimestamp.

This API change is backwards-compatible.

Signed-off-by: Antonin Bas <[email protected]>
Signed-off-by: Antonin Bas <[email protected]>
To avoid issues with OpenAPI validation in K8s versions prior to v1.20.
See kubernetes/kubernetes#86811
The recommendation to avoid issues with metav1.Time being serialized as
null when the field is unset seems to be to make it a pointer. A nil
pointer is them omitted during serialization. As far as I can tell, it
should also be possible to stick with a metav1.Time and make the
property "nullable" in the OpenAPI schema, but I am not sure whether
this can create other issues.

Signed-off-by: Antonin Bas <[email protected]>
@antoninbas
Copy link
Contributor Author

/test-all

1 similar comment
@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor Author

@tnqn could you please review the last commit? I had to make a change in order to support K8s versions older than 1.20.

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a typo in the message of 3rd commit: s/them/then

@antoninbas
Copy link
Contributor Author

/test-integration

@antoninbas antoninbas merged commit f679f04 into antrea-io:main Nov 9, 2021
@antoninbas antoninbas deleted the add-startTime-to-traceflow-status branch November 9, 2021 18:02
qiyueyao added a commit to Dyanngg/antrea that referenced this pull request Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ops/traceflow Issues or PRs related to the Traceflow feature kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants