Performance enhancements #599
Comments
Do we have reason to believe that it is the
Not sure if it will be helpful, but this KEP elaborates on the thought process and goals of API fairness and priority in detail.
In particular, both
Is the one change listed here for the initial list the only performance change needed? Do you need help setting up a test environment?
Hi @evankanderson, I didn't mean to close it; it was closed automatically along with the PR. We are still working on some of the items from the list, although we cannot spend many cycles on it. Thank you so much for the help :)
This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.
Describe the problem/challenge you have
We rely on list API calls to get information from the cluster, which can put a burden on the cluster if the number of objects returned is high. As the number of apps deployed using kapp increases (for example, with kapp-controller packages), this becomes a problem: past a certain point, the time taken to deploy the apps increases even though the CPU and memory of the cluster nodes are not under load.
The error socket: too many open files also occurs when ulimit is set to a low number (256).
Describe the solution you'd like
We need to minimise the list calls as much as possible (Replacing them with get or watch is also an option).
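To make the trade-off between list and get concrete, here is a minimal sketch that counts calls and returned objects against an in-memory stand-in for the API server. It is a toy model, not kapp's code: FakeCluster, fetch_with_list, fetch_with_get and the object counts are all hypothetical, and it ignores label selectors, pagination and the real server-side cost of each call.

```python
from collections import Counter


class FakeCluster:
    """Hypothetical in-memory stand-in for an API server.

    It only counts calls and returned objects; it does not model latency.
    """

    def __init__(self, objects):
        # objects: iterable of (group_kind, name) tuples
        self._objects = set(objects)
        self.calls = Counter()
        self.objects_returned = 0

    def list(self, group_kind):
        # Returns every object of the given group/kind, wanted or not.
        self.calls["list"] += 1
        found = sorted(o for o in self._objects if o[0] == group_kind)
        self.objects_returned += len(found)
        return found

    def get(self, group_kind, name):
        # Returns exactly one object, or None if it does not exist.
        self.calls["get"] += 1
        key = (group_kind, name)
        if key in self._objects:
            self.objects_returned += 1
            return key
        return None


def fetch_with_list(cluster, wanted):
    """One list call per unique group/kind, filtered on the client side."""
    wanted = set(wanted)
    result = []
    for gk in sorted({gk for gk, _ in wanted}):
        result.extend(o for o in cluster.list(gk) if o in wanted)
    return result


def fetch_with_get(cluster, wanted):
    """One get call per resource."""
    result = []
    for gk, name in sorted(set(wanted)):
        obj = cluster.get(gk, name)
        if obj is not None:
            result.append(obj)
    return result


# Assumed workload: 500 ConfigMaps and 50 Deployments, 3 of which belong to the app.
all_objects = [("ConfigMap", f"cm-{i}") for i in range(500)]
all_objects += [("apps/Deployment", f"deploy-{i}") for i in range(50)]
app_resources = [
    ("ConfigMap", "cm-1"),
    ("ConfigMap", "cm-2"),
    ("apps/Deployment", "deploy-7"),
]

by_list = FakeCluster(all_objects)
by_get = FakeCluster(all_objects)
list_result = fetch_with_list(by_list, app_resources)
get_result = fetch_with_get(by_get, app_resources)

print("list:", dict(by_list.calls), "objects returned:", by_list.objects_returned)
print("get: ", dict(by_get.calls), "objects returned:", by_get.objects_returned)
```

In this model the list approach makes fewer calls (2) but transfers 550 objects, while the get approach makes 3 calls and transfers 3 objects. Which one is cheaper on a real cluster depends on how many resources an app owns relative to the total per kind, so it would need to be measured.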
Tasks
Instead of trying to get all the server resources, get only the ones that are related to the available GKs (the others are discarded later anyway, so they are never used) - Cancelled for the time being
When we deploy an app, we first list the labeled resources (GVs) and then try to get, one by one, the non-labeled resources that were not found in the first step. When an app is deployed for the first time, the first step always returns nil, so maybe we could skip it?
Use watch instead of get and list while waiting for resources to reconcile. Using watch will be helpful for resources that take more time to reconcile (for example deployments), but for resources that reconcile almost immediately (for example configmap), it might bring some overhead.
We increased wait-check-interval to 3s, as this reduces API calls without affecting deployment time much.
The PR for this is here, and the data collected during the spike can be found here
When we have a CRD and a CR present in the same manifest, we try to fetch the server resources again to find the CRD (since it wasn't present in the cached server resources). We should avoid doing this, as we won't find the CRD this time either. (No need to work on this if we already work on the first one)
Now that we have added the resource namespaces to the fallbackAllowedNamespaces, should we always use fallbackAllowedNamespaces instead of checking resources cluster wide?
Currently we store the unique GKs in the meta ConfigMap and do a list on those GKs. Since list calls are more expensive, we can check whether doing get calls for all the resources is less expensive than list calls for the unique GKs.
Improve performance specifically during the diff stage. Go profiling showed too many calls to deepCopy and AsYAMLBytes. PR
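The effect of the wait-check-interval task above can be estimated with simple arithmetic. The sketch below assumes a hypothetical polling model (one check at t=0, then one check per interval until the first check at or after the moment the resource has reconciled); the function names, the 1s baseline and the reconcile times are illustrative assumptions, not measurements from kapp.

```python
import math


def polling_calls(reconcile_seconds, interval_seconds):
    """Status checks made per resource before it is observed as reconciled.

    Assumes one check at t=0 and then one check every interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 1 + math.ceil(reconcile_seconds / interval_seconds)


def observed_wait(reconcile_seconds, interval_seconds):
    """Time at which the polling loop first sees the reconciled state."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return math.ceil(reconcile_seconds / interval_seconds) * interval_seconds


# Assumed example: a resource that takes 30s to reconcile.
for interval in (1, 3):
    print(
        f"interval={interval}s",
        f"calls={polling_calls(30, interval)}",
        f"observed wait={observed_wait(30, interval)}s",
    )

# Assumed example: a resource that takes 10s to reconcile, polled every 3s.
print(polling_calls(10, 3), observed_wait(10, 3))
```

Under these assumptions, going from a 1s to a 3s interval cuts the checks for a 30s reconcile from 31 to 11, while the added delay is at most one interval (a 10s reconcile is observed at 12s). A resource that is ready immediately costs a single check under either interval, which is where a watch would add overhead rather than save calls.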
Anything else you would like to add:
It might be worth understanding the API priority and fairness.
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help working on this issue.