Investigate effective storage limits for completed runs #3175
Comments
tektoncd/experimental#479, which proposes a cronjob to clean up these completed resources, is also related.
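For reference, here is a minimal sketch of the kind of pruning such a cronjob could perform, written against the Kubernetes dynamic client. This is not the tektoncd/experimental#479 implementation; the v1beta1 TaskRun API, the `default` namespace, the out-of-cluster kubeconfig, and the 24-hour retention window are all assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: run out-of-cluster against the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// TaskRun resource for the v1beta1 API (assumption; adjust for your version).
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}
	taskRuns := client.Resource(gvr).Namespace("default")

	list, err := taskRuns.List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	ttl := 24 * time.Hour // hypothetical retention window
	deleted := 0
	for _, tr := range list.Items {
		// status.completionTime is only set once the run has finished.
		completion, found, err := unstructured.NestedString(tr.Object, "status", "completionTime")
		if err != nil || !found {
			continue
		}
		finished, err := time.Parse(time.RFC3339, completion)
		if err != nil || time.Since(finished) < ttl {
			continue
		}
		if err := taskRuns.Delete(context.TODO(), tr.GetName(), metav1.DeleteOptions{}); err != nil {
			log.Printf("failed to delete %s: %v", tr.GetName(), err)
			continue
		}
		deleted++
	}
	fmt.Printf("deleted %d completed TaskRuns older than %v\n", deleted, ttl)
}
```

A real cleanup job would run in-cluster on a schedule, with the retention window exposed as an operator-configurable setting.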
Anecdotally, I started seeing some minor sluggishness at about 1600 completed taskruns (this was an artificial test on minikube, where I just spammed it with new taskruns of the "hello world" task from the tutorial). Going to spend some time over the next couple of days doing more rigorous testing and documenting the results in this issue.
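A rough load-generation sketch along the lines of the minikube test described above, assuming a Task named `echo-hello-world` already exists in the `default` namespace (the Task name and the count of 1600 are placeholders; substitute whatever the tutorial Task is called in your cluster):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}
	taskRuns := client.Resource(gvr).Namespace("default")

	const n = 1600 // roughly where the sluggishness was first noticed
	for i := 0; i < n; i++ {
		tr := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "tekton.dev/v1beta1",
			"kind":       "TaskRun",
			"metadata":   map[string]interface{}{"generateName": "storage-test-"},
			"spec": map[string]interface{}{
				// "echo-hello-world" is a placeholder Task name.
				"taskRef": map[string]interface{}{"name": "echo-hello-world"},
			},
		}}
		if _, err := taskRuns.Create(context.TODO(), tr, metav1.CreateOptions{}); err != nil {
			log.Fatalf("create %d failed: %v", i, err)
		}
	}
	fmt.Printf("created %d TaskRuns\n", n)
}
```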
/assign @psschwei
Issues go stale after 90d of inactivity. /lifecycle stale
Stale issues rot after 30d of inactivity. /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@tekton-robot: Closing this issue.
/reopen
Excessive etcd storage is still an issue when results and pruning aren't configured. We should run some tests to get a rough idea of how many resources can be reliably stored in etcd at different resource levels, both as documented guidance for operators and as a sales pitch for enabling results and/or pruning to avoid these issues. While we're doing this, we should collect some symptoms of an overloaded cluster (what behavior the cluster exhibits under excessive etcd load, and what error messages people can google to find our docs).
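One way to get a rough resources/bytes number to correlate with those symptoms: page through all TaskRuns and sum the size of their JSON serialization. This is only a client-side proxy for actual etcd usage (etcd also keeps revision history), but it gives an order-of-magnitude figure. The paging limit and API version below are assumptions.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}

	var count, totalBytes int
	// Page through all namespaces 500 at a time so the listing itself doesn't
	// hammer an already-struggling API server.
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.Resource(gvr).List(context.TODO(), opts)
		if err != nil {
			log.Fatal(err)
		}
		for _, tr := range list.Items {
			raw, err := json.Marshal(tr.Object)
			if err != nil {
				continue
			}
			count++
			totalBytes += len(raw)
		}
		if list.GetContinue() == "" {
			break
		}
		opts.Continue = list.GetContinue()
	}
	fmt.Printf("%d TaskRuns, ~%.1f MiB serialized\n", count, float64(totalBytes)/(1<<20))
}
```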
We know that storing details of completed Runs eventually results in too many stored resources/bytes and unresponsive behavior from etcd and the K8s API server. We don't have a good idea of exactly how many resources/bytes it takes to start causing problems.
We should explore this on a standard GKE cluster and document (even if it's just in this issue) our findings about what symptoms we observed, roughly how many resources it took to see them, etc.
If anybody else has experienced this on their own clusters and could contribute data, even anecdata, that would be helpful.
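For anyone who wants to contribute data, a hypothetical symptom probe like the sketch below could be run periodically while loading a cluster: it times a full LIST of TaskRuns, an operation that tends to degrade as the number of stored objects grows. The namespace and API version are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}

	start := time.Now()
	list, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	elapsed := time.Since(start)
	if err != nil {
		// Timeouts and errors returned here are themselves symptoms worth recording.
		log.Fatalf("LIST failed after %v: %v", elapsed, err)
	}
	fmt.Printf("listed %d TaskRuns in %v\n", len(list.Items), elapsed)
}
```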
Related #454
cc @wlynch