Support for label selection in watt/kubewatch #1292
Comments
I decided to investigate optimizing the YAML parsing path anyway, and it turns out that we can get a big speedup by using the C loader instead of the standard Python implementation. The original issue I ran into was that ~20k Service objects serialized as a YAML snapshot would take around 70 seconds to parse back into diagd's memory. With the C loader, this time is down to 6.5 seconds. Combining these two optimizations, my multi-tenant workload now allows a single Ambassador instance to process relevant updates in around 100 ms (cutting the 20k Services down to around 200-300, and getting a 10x speedup from the C loader). I'm happy with these results, and I think each optimization has value on its own. I'll open a separate issue for using the C loader - #1294
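For reference, here is a minimal sketch of the loader swap described above, assuming PyYAML was built against libyaml; the `parse_snapshot` helper is hypothetical and just for illustration, not diagd's actual code path:

```python
import yaml

# CSafeLoader is the libyaml-backed C loader; it is only available when
# PyYAML was built with libyaml, so fall back to the pure-Python SafeLoader.
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

def parse_snapshot(path):
    # Parse a YAML snapshot file with the fastest available loader.
    with open(path, "r") as f:
        return yaml.load(f, Loader=Loader)
```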
I've been searching the issues and it seems my problem is related. In our case, I would postulate that it's the Secret count that is problematic, since all our Helm/Tiller history is stored in Secrets.
Bump: thoughts? This optimization is critical for my use case, and I think others may eventually run into this, too.
@esmet, so sorry for the delay here! I did in fact switch us to the C YAML parser, and I'd be very interested in seeing your patch to kubewatch. Want to open a PR in the Teleproxy repo? Also, are you on the Datawire OSS Slack? There's an #ambassador-dev channel there which is a great place for discussions like this.
Please describe your use case / problem.
I run multiple deployments of ambassador on a multi-tenant Kubernetes cluster, using ambassador_id to separate them into non-overlapping "environments". There can be hundreds of different environments running at the same time, and each environment can define dozens of Service objects.
In this scenario, Ambassador (kubewatch) uses a fair amount of memory (2-4 GB) and substantial CPU to process watcher updates. I've seen Ambassador take 60-70 seconds to process a single 15 MB YAML snapshot. Worse, when any single Service object changes, every Ambassador instance performs that 60-70 second update again. Ideally, Ambassador would use memory and CPU proportional to the set of Services relevant to its particular ambassador_id, and could then scale well even in a massively multi-tenant cluster.
Describe the solution you'd like
I propose adding an environment variable `KUBEWATCH_LABEL_SELECTOR` to kubewatch (https://github.com/datawire/teleproxy), which would be a raw label selector string passed through to the List/Watch implementation.
For example, if my architecture guarantees that all Service objects in an environment carry a consistent `environment` label, then I could pass `KUBEWATCH_LABEL_SELECTOR="environment=qa123"` to limit the set of objects kubewatch must operate on (i.e., only the ones in the qa123 environment; see the sketch below). This would limit the amount of memory and CPU required by Ambassador overall. I have a patch that implements this behavior for kubewatch (https://github.com/datawire/teleproxy).
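To illustrate the idea, here is a hypothetical sketch using the official Python Kubernetes client (not the actual teleproxy/kubewatch Go code): the raw selector string is passed straight to both the List and the Watch calls, so the API server filters objects server-side and only matching Services ever reach the process.

```python
import os

from kubernetes import client, config, watch

# Assumes the process runs inside the cluster with a suitable service account.
config.load_incluster_config()
v1 = client.CoreV1Api()

# e.g. KUBEWATCH_LABEL_SELECTOR="environment=qa123"
selector = os.environ.get("KUBEWATCH_LABEL_SELECTOR", "")

# Initial list: only Services carrying the matching label are returned.
services = v1.list_service_for_all_namespaces(label_selector=selector)
print(f"tracking {len(services.items)} services")

# Watch: subsequent updates are filtered by the same selector.
w = watch.Watch()
for event in w.stream(v1.list_service_for_all_namespaces, label_selector=selector):
    print(event["type"], event["object"].metadata.name)
```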
I chose to open the issue here, at least for starters, since this feels mostly about an Ambassador scalability use case.
Describe alternatives you've considered
I considered investigating the hot path for YAML parsing in diagd to see if we could make it faster. However, I think this problem is best solved by letting an Ambassador operator tell the system which objects it should look at, rather than by making the "everything" case faster. Even better, this approach would let an operator add guard rails against user mistakes (e.g., have "ambassador-staging" only consider Services labeled "staging", for even better isolation from "production").
Additional context
I observed a few crash stacks in diagd when it was under performance pressure.
Unfortunately I seem to have misplaced my notes on this, but I remember it was within `load_from_filesystem`, where open() returned None and the subsequent read() failed.