Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Drop unnecessary listing for the sake of watch reinitialization #616

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tosi3k
Copy link

@tosi3k tosi3k commented Jul 23, 2024

This change addresses the performance issue existing in the cluster cache described in argoproj/argo-cd#18838.

kube-apiserver logs for the Pods resource (supposed super-low latency logged for the WATCH requests is due to a bug in Kubernetes: kubernetes/kubernetes#125614):

INFO 2024-07-23T05:26:40.330372Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500&resourceVersion=0" latency="95.186516ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="976ba7bb-40cc-4c2b-9739-e15c5e78415f" srcIP="10.64.11.10:35741" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="94.722114ms" resp=200
INFO 2024-07-23T05:26:40.516139Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=10636&timeoutSeconds=600&watch=true" latency="1.212642ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="26593f98-914a-4711-8eea-db9f284a8520" srcIP="10.64.11.10:35741" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="529.67µs" apf_execution_time="533.174µs" resp=0
INFO 2024-07-23T05:36:40.518449Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=17070&timeoutSeconds=600&watch=true" latency="1.104058ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="6e495250-eb2b-4af4-bb09-d999197d7e73" srcIP="10.64.11.10:35741" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="531.187µs" apf_execution_time="532.709µs" resp=0
INFO 2024-07-23T05:46:40.522146Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=23542&timeoutSeconds=600&watch=true" latency="988.866µs" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="05329b6e-8bea-44c0-b22a-42b7ce65b796" srcIP="10.64.11.10:35741" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="499.414µs" apf_execution_time="500.85µs" resp=0
INFO 2024-07-23T05:56:40.524950Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=29971&timeoutSeconds=600&watch=true" latency="995.693µs" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="fe66a06c-1329-4739-90f0-f0cdd4dd821d" srcIP="10.64.11.10:35741" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="455.428µs" apf_execution_time="456.954µs" resp=0

@tosi3k tosi3k force-pushed the clustercache-list-watch branch from 0c6045d to 4029ec6 Compare July 23, 2024 06:18
@@ -666,8 +656,10 @@ func (c *clusterCache) watchEvents(ctx context.Context, api kube.APIResourceInfo

obj, ok := event.Object.(*unstructured.Unstructured)
if !ok {
resourceVersion = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see if we can get away without resetting this and using whatever resourceVersion we already had. If we find we get into some nasty loop, we can always re-add it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't drop resetting the locally held RV here as well - put appropriate rationale in a comment in the code there as well.

// re-synchronize API state and restart watch if retry watcher failed to continue watching using provided resource version
case <-w.Done():
resourceVersion = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe see if we can get away without this? If we need it, we'll probably want to add a comment exactly why we needed it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't drop it - I put appropriate comment in the code to explain the rationale behind resetting resourceVersion.

This is the place where the retry watcher exits early and we are forced to relist in order to get a fresher RV from the server: https://github.com/kubernetes/kubernetes/blob/0fc167103189a4d462c3cc7a17b360c3d998f4bf/staging/src/k8s.io/client-go/tools/watch/retrywatcher.go#L208-L211.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - this has to remain here.

This should generally be rare, so relist from etcd is fine in those situations.

pkg/cache/cluster.go Show resolved Hide resolved
@tosi3k tosi3k force-pushed the clustercache-list-watch branch from 4029ec6 to b4fa2a7 Compare July 25, 2024 09:58
@tosi3k tosi3k changed the title feat: List resources from the watch cache feat: Drop unnecessary listing for the sake of watch reinitialization Jul 25, 2024
@tosi3k
Copy link
Author

tosi3k commented Jul 25, 2024

@crenshaw-dev thanks for the review - I'll respond to the comments here later on.

FWIW as agreed offline during yesterday's sync, I split the fix into two PRs - this one would just drop unnecessary listing after watch expiry and #617 would make the list API calls target the watch cache instead of etcd.

I guess we can proceed with the latter one.

@tosi3k tosi3k force-pushed the clustercache-list-watch branch from 409990b to f7c5e6e Compare July 30, 2024 15:58
Copy link

@wojtek-t
Copy link

This LGTM from k8s perspective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants