Global Throttling Across Multiple Aggregators #18190
Comments
Thanks for this request @bchen32! As you note, this will require sharing state across Vector instances, which is not something we have plans for in the near future since it greatly complicates the setup, but it might be something we do. The workaround you identified, partitioning the input along the fields you need to throttle on, is the usual one we recommend. A related workaround would be to separate just that step of processing into its own set of Vector aggregators that only apply the throttling but do no other processing. This could help you manage their resources better.
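A minimal sketch of that second workaround, i.e. a tier of Vector instances that only receive, throttle, and forward, might look like the config below. It assumes a vector source/sink pair and that the app name is available on each event as an `app_name` field; the addresses, ports, and threshold are placeholders.

```yaml
# Dedicated throttle-only tier: receive from agents, throttle per app,
# forward everything downstream untouched.
sources:
  from_agents:
    type: vector
    address: 0.0.0.0:6000                 # assumed listen address

transforms:
  throttle_by_app:
    type: throttle
    inputs: [from_agents]
    key_field: "{{ app_name }}"           # assumes `app_name` exists on the event
    threshold: 1000                       # events allowed per key per window (placeholder)
    window_secs: 1

sinks:
  to_main_aggregators:
    type: vector
    inputs: [throttle_by_app]
    address: vector-aggregator.observability:6000   # assumed downstream address
```

Note this only isolates the throttling work onto its own instances; each instance still keeps its own throttle state, so the cross-instance problem described in this issue remains.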
@jszwedko That makes sense. In terms of partitioning the input, what's the recommended way to implement that? Currently using a vector sink -> vector source and it's not entirely clear to me how to even partition the logs by app-name, given that a single agent could be batching and sending logs from multiple apps at the same time.
Ah, yeah, this recommendation is generally given when the partitioning is by client, in which case you can use sticky load balancing. If you have multiple clients that are sending events with the same app-name, one option would be to have them attach that value as a request header and configure the load balancer to route on it.
That makes sense, we did consider that option. From what I can understand though, each http request is a batch of multiple logs. So you can't guarantee that all logs in a request are even from the same app-name, which makes it difficult to use that as a header. I guess you could limit the batch max_events to 1, but I'm assuming that might come with some negative performance/network traffic implications.
Ah, actually, I meant to mention that the header would be involved in batch partitioning too. That is, if we allow dynamic headers to be set, the batches themselves would be partitioned by the header value, so each request would only contain events that share the same app-name.
Oh interesting, does the sink support setting dynamic headers like that today?
Not yet unfortunately. That is being tracked by #201 |
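Purely for illustration, a hypothetical configuration along the lines discussed above might look like the sketch below. Templated (dynamic) header values and header-based batch partitioning are not existing Vector features, and the sink type, address, and `app_name` field here are assumptions as well.

```yaml
# HYPOTHETICAL sketch only: templated header values and batch partitioning
# by header are the ideas discussed above, not current Vector behavior.
sinks:
  to_aggregators:
    type: http                                        # assumed sink type
    inputs: [enrich_logs]                             # assumed upstream component
    uri: http://vector-aggregator.observability:8080  # assumed address
    encoding:
      codec: json
    request:
      headers:
        # NOT valid today: the idea is that this templated header would also
        # drive batch partitioning, so each request carries only one app's
        # events and the load balancer can route on the header.
        x-app-name: "{{ app_name }}"
```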
Just thought I'd chime in here because I would find this extremely useful. There are a bunch of workarounds we can employ, like the ones discussed above.

Istio (the service mesh we use in the cluster running the Vector components) has a compelling solution they call a Global Rate Limit. The short version is that it defers rate-limiting decisions to a gRPC call with a specific API (they even provide a reference implementation that uses Redis). This approach could be appropriate here, as it defers the complexity to an external service, and could be tailored to high-throughput scenarios by making the gRPC calls on configurable batches of data rather than on every message. I understand that despite the apparent simplicity this could still be a big lift, but it would provide huge value.
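For reference, the Redis-backed reference implementation mentioned above (envoyproxy/ratelimit) is configured with per-descriptor limits; a minimal sketch keyed by app name might look like the following (the `logs` domain and `app_name` key are assumptions for this example):

```yaml
# Sketch of an envoyproxy/ratelimit descriptor config: 1 request per second
# for each distinct app_name value. Domain and key names are assumptions.
domain: logs
descriptors:
  - key: app_name
    rate_limit:
      unit: second
      requests_per_unit: 1
```

Vector would then need something (not currently built in) to consult such a service before emitting events, which is essentially what this issue is asking for.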
Use Cases
Context:
I'm currently building an observability pipeline with the unified architecture across several k8s clusters. A bunch of agents are deployed as a DaemonSet, sending logs to centralized aggregators fronted by a load balancer.
Problem:
I'm trying to implement the throttle transform, but I'm realizing that it doesn't quite work as expected once the number of aggregator pods is scaled up. The goal is to have logs throttled by app-name. However, the load balancer might spread agents from the same app across different aggregator pods, resulting in the effective throttle limit being higher than expected.
Example:
Let's say we have an app foo with 2 replicas across 2 nodes, and we have 2 aggregator pods. Throttle is configured to rate limit at 1 log per second per app. The load balancer decides that node 1 is sending logs to aggregator 1, and node 2 is sending logs to aggregator 2. So, aggregator 1 rate limits node 1 to 1 log per second, and aggregator 2 rate limits node 2 to 1 log per second. Therefore, the total output from the aggregators is 2 logs per second for foo, which is not what we want.
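For concreteness, the throttle configuration in this example would look roughly like the sketch below (it assumes the app name is available on each event as an `app_name` field and that the aggregator's source is named `from_agents`). Because each aggregator replica keeps its own in-memory throttle state, each replica independently allows 1 log per second for foo.

```yaml
transforms:
  throttle_by_app:
    type: throttle
    inputs: [from_agents]        # assumed source name
    key_field: "{{ app_name }}"  # assumes `app_name` exists on the event
    threshold: 1                 # 1 event per key...
    window_secs: 1               # ...per 1-second window, per aggregator replica
```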
Attempted Solutions
One workaround would be to configure the load balancer to hash by app-name to always connect nodes with the same app-name to the same aggregator pod. The issue is that some apps may have hundreds of replicas and generate more logs than a single aggregator can handle.
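If the app name could be exposed to the load balancer, for example as a request header (which circles back to the dynamic-header limitation discussed in the comments above), consistent-hash routing could pin each app to one aggregator. A sketch using an Istio DestinationRule, with the header name, service host, and namespace as assumptions:

```yaml
# Sketch: hash on an x-app-name header so requests for the same app always
# land on the same aggregator pod. Header and names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vector-aggregator
  namespace: observability
spec:
  host: vector-aggregator.observability.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-app-name
```

As noted above, this still concentrates all of an app's traffic on a single aggregator, which doesn't help for very chatty apps.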
Proposal
What seems like the natural solution is just for the aggregator pods to have some sort of global synchronization for throttling.
References
No response
Version
v0.31.0