Add startTime to the Traceflow Status #2952

antoninbas · 2021-10-30T01:34:27Z

The startTime is used to determine if a Traceflow has timed out and
should be reported as failed.

Until now we were relying on the CreationTimestamp set by the
kube-apiserver. However, by chance, I was testing Traceflow on a cluster
with a clock skew between the control plane Node (running the
kube-apiserver) and the worker Node running the
antrea-controller. Because of the clock skew, each Traceflow request was
tagged as failed as soon as I created it (with reason timeout). It took
me a while to figure out the reason.

By introducing the startTime field (set & used by the
antrea-controller), we avoid the possibility of such issues. If
startTime is not available for any reason, we fall back to the
CreationTimestamp.

This API change is backwards-compatible.
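
As a rough illustration of the fallback described above (not the actual controller code), the sketch below uses a hypothetical, trimmed-down `traceflow` type, an assumed 10s timeout, and an assumed 19s clock skew. It shows how a timeout check based only on the apiserver's CreationTimestamp can fire immediately when the controller's clock is ahead, and how a controller-set startTime avoids that.

```go
package main

import (
	"fmt"
	"time"
)

// traceflow is a hypothetical stand-in for the Traceflow resource; only the
// two timestamps relevant to the timeout check are modeled.
type traceflow struct {
	CreationTimestamp time.Time  // set by kube-apiserver (control plane clock)
	StartTime         *time.Time // set by antrea-controller (its own clock); nil if unset
}

// timeout is an assumed value chosen for illustration only.
const timeout = 10 * time.Second

// effectiveStart prefers StartTime and falls back to CreationTimestamp.
func effectiveStart(tf *traceflow) time.Time {
	if tf.StartTime != nil {
		return *tf.StartTime
	}
	return tf.CreationTimestamp
}

// timedOut reports whether the request has exceeded the timeout according to
// the controller's own clock.
func timedOut(tf *traceflow, now time.Time) bool {
	return now.Sub(effectiveStart(tf)) > timeout
}

// demo evaluates a just-created request with and without StartTime, assuming
// the control plane clock is 19s behind the controller's clock.
func demo() (withoutStartTime, withStartTime bool) {
	controllerNow := time.Date(2021, 11, 5, 7, 18, 24, 0, time.UTC)
	created := controllerNow.Add(-19 * time.Second) // skewed apiserver timestamp

	legacy := &traceflow{CreationTimestamp: created}
	fixed := &traceflow{CreationTimestamp: created, StartTime: &controllerNow}
	return timedOut(legacy, controllerNow), timedOut(fixed, controllerNow)
}

func main() {
	withoutStartTime, withStartTime := demo()
	fmt.Println("timed out without startTime:", withoutStartTime) // true
	fmt.Println("timed out with startTime:", withStartTime)       // false
}
```

Because startTime is both written and read using the same clock, the comparison no longer depends on clocks being synchronized across Nodes.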

Signed-off-by: Antonin Bas [email protected]

codecov-commenter · 2021-10-30T01:55:41Z

Codecov Report

Merging #2952 (070e010) into main (5db792d) will increase coverage by 1.23%.
The diff coverage is 73.33%.

@@            Coverage Diff             @@
##             main    #2952      +/-   ##
==========================================
+ Coverage   59.84%   61.07%   +1.23%     
==========================================
  Files         289      289              
  Lines       24551    24562      +11     
==========================================
+ Hits        14693    15002     +309     
+ Misses       8250     7941     -309     
- Partials     1608     1619      +11

Flag	Coverage Δ
kind-e2e-tests	`48.21% <53.33%> (+1.41%)`	⬆️
unit-tests	`40.18% <66.66%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
pkg/controller/traceflow/controller.go	`74.07% <73.33%> (-0.65%)`	⬇️
pkg/apiserver/handlers/endpoint/handler.go	`58.82% <0.00%> (-11.77%)`	⬇️
...g/controller/networkpolicy/store/appliedtogroup.go	`86.36% <0.00%> (-3.04%)`	⬇️
...gent/controller/networkpolicy/status_controller.go	`72.60% <0.00%> (-2.74%)`	⬇️
...gent/controller/noderoute/node_route_controller.go	`54.91% <0.00%> (-1.10%)`	⬇️
pkg/agent/controller/networkpolicy/reconciler.go	`77.19% <0.00%> (+0.20%)`	⬆️
...ntroller/networkpolicy/networkpolicy_controller.go	`71.34% <0.00%> (+0.91%)`	⬆️
pkg/agent/nodeportlocal/k8s/npl_controller.go	`62.02% <0.00%> (+1.04%)`	⬆️
pkg/monitor/controller.go	`29.10% <0.00%> (+1.49%)`	⬆️
pkg/ovs/openflow/ofctrl_action.go	`69.58% <0.00%> (+1.66%)`	⬆️
... and 11 more

tnqn

LGTM

tnqn · 2021-11-02T11:42:13Z

/test-all

@wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check whether the clock on the control plane Node is at least 10s behind the clock on the Node where antrea-controller runs.

tnqn · 2021-11-02T11:42:43Z

/test-integration

tnqn · 2021-11-02T11:52:37Z

It seems you forgot to update generated code.

diff --git a/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go b/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
index a2a892fd..435977f4 100644
--- a/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
+++ b/pkg/apis/crd/v1alpha1/zz_generated.deepcopy.go
@@ -732,6 +732,7 @@ func (in *TraceflowSpec) DeepCopy() *TraceflowSpec {
 // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
 func (in *TraceflowStatus) DeepCopyInto(out *TraceflowStatus) {
        *out = *in
+       in.StartTime.DeepCopyInto(&out.StartTime)
        if in.Results != nil {
                in, out := &in.Results, &out.Results
                *out = make([]NodeResult, len(*in))

antoninbas · 2021-11-02T19:01:50Z

I think that other people using the Vagrant-based cluster may experience the same issue if they keep it running for a long time.

@tnqn thanks for the reminder, I updated the generated code

antoninbas · 2021-11-02T20:26:42Z

/test-all

tnqn

LGTM

antoninbas · 2021-11-04T21:53:34Z

/test-e2e
/test-networkpolicy

wenqiq · 2021-11-05T02:35:00Z

> @wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check whether the clock on the control plane Node is at least 10s behind the clock on the Node where antrea-controller runs.

Thanks for this information. I will take a look at it. Related issue #2944

tnqn · 2021-11-05T02:46:07Z

@wenqiq #2944 is not caused by this. I meant the issue we hit when we used "antctl traceflow" to debug a connection issue, which failed immediately with a Timeout error.

wenqiq · 2021-11-05T07:28:26Z

> @wenqiq #2944 is not caused by this. I meant the issue we hit when we used "antctl traceflow" to debug a connection issue, which failed immediately with a Timeout error.

Thanks, I got it now.

> @wenqiq This may be the reason why Traceflow failed immediately in your testbed. You can check whether the clock on the control plane Node is at least 10s behind the clock on the Node where antrea-controller runs.

Yes, you are right. The clock on the control plane Node is at least 19s behind the clock on the Node where antrea-controller runs.

The clock on the control plane Node:

[core@sunq-antrea-scale-01-6w6g6-master-0 ~]$ date
Fri Nov  5 07:18:05 UTC 2021

The clock on the Node where antrea-controller runs:

[core@sunq-antrea-scale-01-6w6g6-worker-z22ws ~]$ date
Fri Nov  5 07:18:24 UTC 2021

antoninbas · 2021-11-05T16:24:13Z

/test-e2e

antoninbas · 2021-11-05T18:53:30Z

Two Traceflow tests are failing with this change; I am investigating the failures:

            --- FAIL: TestTraceflow/testTraceflowIntraNode/traceflowGroupTest/hostNetworkSrcPodIPv4 (15.02s)
            --- FAIL: TestTraceflow/testTraceflowIntraNode/traceflowGroupTest/nonExistingSrcPodIPv4 (15.03s)

To avoid issues with OpenAPI validation in K8s versions prior to v1.20.
See kubernetes/kubernetes#86811

The recommendation to avoid issues with metav1.Time being serialized as null when the field is unset seems to be to make it a pointer. A nil pointer is them omitted during serialization.

As far as I can tell, it should also be possible to stick with a metav1.Time and make the property "nullable" in the OpenAPI schema, but I am not sure whether this can create other issues.

Signed-off-by: Antonin Bas <[email protected]>
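
The serialization difference described in this commit message can be sketched with the standard library alone. The `stamp` type below is a hypothetical stand-in for metav1.Time (a struct wrapping time.Time whose zero value marshals to JSON null), and the two status structs are illustrative, not the actual Antrea types. The point it shows is that Go's `omitempty` never omits a struct value, so an unset value field is emitted as `null`, whereas a nil pointer is dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// stamp is a hypothetical stand-in for metav1.Time: its zero value is
// marshaled as JSON null.
type stamp struct{ time.Time }

// MarshalJSON emits null for the zero time and an RFC 3339 string otherwise.
func (s stamp) MarshalJSON() ([]byte, error) {
	if s.IsZero() {
		return []byte("null"), nil
	}
	return json.Marshal(s.UTC().Format(time.RFC3339))
}

// statusByValue holds the timestamp as a struct value; omitempty has no
// effect on struct values, so the field is always serialized.
type statusByValue struct {
	StartTime stamp `json:"startTime,omitempty"`
}

// statusByPointer holds the timestamp as a pointer; a nil pointer is omitted.
type statusByPointer struct {
	StartTime *stamp `json:"startTime,omitempty"`
}

// encode marshals v to a JSON string and panics on error, for brevity.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(statusByValue{}))   // {"startTime":null}
	fmt.Println(encode(statusByPointer{})) // {}

	set := stamp{time.Date(2021, 11, 5, 7, 18, 24, 0, time.UTC)}
	fmt.Println(encode(statusByPointer{StartTime: &set})) // {"startTime":"2021-11-05T07:18:24Z"}
}
```

An explicit `null` is what a schema without a "nullable" property would reject, which is why omitting the field entirely sidesteps the validation problem.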

antoninbas · 2021-11-05T20:59:18Z

/test-all

antoninbas · 2021-11-08T18:13:19Z

/test-all

antoninbas · 2021-11-09T05:21:51Z

@tnqn could you please review the last commit? I had to make a change in order to support K8s versions older than 1.20.

tnqn

LGTM, a typo in the message of 3rd commit: s/them/then

antoninbas · 2021-11-09T17:49:50Z

/test-integration

antoninbas added the area/ops/traceflow and kind/api-change labels Oct 30, 2021

tnqn previously approved these changes Nov 2, 2021

antoninbas dismissed tnqn’s stale review via 514bc56 November 2, 2021 18:58

antoninbas force-pushed the add-startTime-to-traceflow-status branch from 571d38f to 514bc56 Compare November 2, 2021 18:58

antoninbas added this to the Antrea v1.5 release milestone Nov 2, 2021

tnqn previously approved these changes Nov 3, 2021

antoninbas added 3 commits November 5, 2021 12:27

Regenerate code

0b8432d

Signed-off-by: Antonin Bas <[email protected]>

antoninbas dismissed tnqn’s stale review via 070e010 November 5, 2021 20:58

antoninbas force-pushed the add-startTime-to-traceflow-status branch from 514bc56 to 070e010 Compare November 5, 2021 20:58

tnqn approved these changes Nov 9, 2021

antoninbas merged commit f679f04 into antrea-io:main Nov 9, 2021

antoninbas deleted the add-startTime-to-traceflow-status branch November 9, 2021 18:02

qiyueyao added a commit to Dyanngg/antrea that referenced this pull request Nov 9, 2021

Revert "Add startTime to the Traceflow Status (antrea-io#2952)"

7fb2c3c

This reverts commit f679f04.
