Measure the latency of the different "happy" scenarios #62

Closed
16 of 19 tasks
alexsnaps opened this issue Feb 26, 2024 · 2 comments

alexsnaps commented Feb 26, 2024

In order to better understand the implications of leveraging the cloud providers' Redis instances, we want to build a latency distribution matrix for all the different providers, starting out with AWS's ElastiCache.

We need a distribution for an LRT that:

  • Increments a counter, without ever reaching the limit (i.e. an RW use case for Redis)
  • Reaches a limit and consistently responds with a 429 HTTP response code for a period
  • The same as the above, but with qualified counters (which we'll use to introduce data locality)

We then need to test these in multiple deployment scenarios:

  • From one location: the same location as the "master" (if applicable), a different AZ, or different regions across the globe
  • From multiple locations, distributed according to the above

That use:

  • the same limit for all (a single, globally shared counter)
  • a qualified counter each, for their AZ/region

Performance tests

Latencies in multiple deployment scenarios

Using cloudping as latency data source

EDIT: no need to test further. We confirmed the expected understanding of Limitador's request-to-Redis model: the N+1 model.

Scenarios with transient gw & redis_cached

  • Test with 3 limits across 2 regions
  • Test with 3 limits & redis_cached instead across these 2 same regions
  • Do we want to test the bin encoding as well? It should reduce the number of bytes sent over the wire...

Scenario with write-behind-lock Kuadrant/limitador#276

alexsnaps commented Mar 18, 2024

What happens (currently)

Today, Limitador does n + 1 round-trips to Redis (or in this case the cloud provider's Redis-alike, e.g. ElastiCache) when on the "happy path", where n is the number of limits being hit.

sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    
    E->>+L: should_rate_limit?
    L->>R: MGET
    R->>L: counter values
    loop for every counter
        L->>R: LUA script increment
        R->>L: ()
    end
    L->>-E: Ok!

So say you have 3 limits (per hour, per day and per month) and your user is currently below the threshold of all 3, we'd need 4 round-trips to Redis to let them through and increment the counters. Based on the testing above, we can pretty much confirm that a single round-trip to Redis roughly matches the figures from cloudping.
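
For reference, the n + 1 pattern above maps roughly onto the sketch below, written against the Rust redis crate. This is an illustration of the access pattern only, not Limitador's actual code: the key layout, the TTL handling and the Lua body are assumptions.

```rust
// Illustrative sketch of the current n + 1 access pattern (not Limitador's code).
// One MGET reads every counter, then one Lua EVAL per counter increments it.
use redis::Script;

fn check_and_increment(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttl_secs: usize,
) -> redis::RedisResult<()> {
    // Round-trip 1: read all counter values at once.
    let _values: Vec<Option<i64>> = redis::cmd("MGET").arg(counter_keys).query(con)?;
    // (This is where the values would be checked against the configured limits.)

    // Round-trips 2..n+1: increment each counter with a small server-side script,
    // setting the TTL only when the key is first created.
    let incr = Script::new(
        r#"
        local v = redis.call('INCRBY', KEYS[1], ARGV[1])
        if v == tonumber(ARGV[1]) then
            redis.call('EXPIRE', KEYS[1], ARGV[2])
        end
        return v
        "#,
    );
    for key in counter_keys {
        let _new_value: i64 = incr.key(*key).arg(1).arg(ttl_secs).invoke(con)?;
    }
    Ok(())
}
```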

What we could do about it (for now)

Improve on the current design

Instead of going to Redis n + 1 times, we could invoke a server-side Lua script that increments all counters in one go and returns the latest values in a single round-trip. While 1 is obviously much better than the current n + 1, a single round-trip time quickly climbs into double-digit milliseconds across regions. So going to Redis on every single rate-limited request seems unreasonable.
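
A minimal sketch of what such a single-round-trip script could look like, again against the Rust redis crate; the Lua body, the TTL handling and the return shape are assumptions rather than Limitador's implementation:

```rust
// Sketch: one server-side Lua script increments every counter and returns the
// updated values, so the whole check-and-increment costs a single round-trip.
use redis::Script;

fn increment_all(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttls_secs: &[usize],
) -> redis::RedisResult<Vec<i64>> {
    let script = Script::new(
        r#"
        local values = {}
        for i, key in ipairs(KEYS) do
            local v = redis.call('INCRBY', key, 1)
            if v == 1 then
                redis.call('EXPIRE', key, ARGV[i])
            end
            values[i] = v
        end
        return values
        "#,
    );
    let mut invocation = script.prepare_invoke();
    for (key, ttl) in counter_keys.iter().zip(ttls_secs) {
        invocation.key(*key).arg(*ttl);
    }
    // One round-trip: the caller compares the returned values to the limits and
    // decides whether to let the request through.
    invocation.invoke(con)
}
```

Even then, that one round-trip is still paid on every request, which is what the cached storage below tries to sidestep.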

Leverage the existing CachedRedisStorage

We do have an alternate Redis-backed AsyncCounterStorage implementation that caches counter values in Limitador's memory, avoiding the need for a round-trip to the backend Redis storage.

sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    participant C as Cached counter
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    
    E->>+L: should_rate_limit?
    L->>C: Read local values
    L->>R: Fetch missing counters 
    R->>L: values & ttl
    L->>C: (Store and) Increment counters
    L->>-E: Ok!
        loop Every second (by default)
        loop Per counter
           L->>R: LUA script increment
           R->>L: ()
        end
    end

While this addresses most of the cases, some jitter remains on the user's request, capped at one round-trip to Redis' region from the call site (though right now it's worse than one RTT, because of the blocking nature of the write-behind task). At least it's paid once, rather than once plus an additional round-trip per counter.

While this is probably the easiest way forward to address what we've now seen happening in AWS, there are a few items to tackle:

  • The "write-behind" task is currently blocking reads for the entire duration of writing each counter
  • Bulk update counters instead of one by one (see how big that batch can become; a batching sketch follows this list)
  • Possibly returning the latest value and update the cached values
  • Fix --ratio on the command line…
  • Look into asynchronously pre-reloading the counters
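
As a rough illustration of the bulk-update item, a batched write-behind flush could send all pending deltas in a single pipeline, so the periodic flush costs one round-trip regardless of how many counters are dirty. This is a sketch under assumed names and bookkeeping (and it omits TTL handling), not the current CachedRedisStorage code.

```rust
// Sketch of a batched write-behind flush: every pending delta is pushed in one
// pipeline instead of one call per counter. TTL handling is omitted here.
use std::collections::HashMap;

fn flush_pending(
    con: &mut redis::Connection,
    pending_deltas: &HashMap<String, i64>,
) -> redis::RedisResult<()> {
    if pending_deltas.is_empty() {
        return Ok(());
    }
    let mut pipe = redis::pipe();
    for (key, delta) in pending_deltas {
        // Still one INCRBY per counter, but batched into a single request/response cycle.
        pipe.cmd("INCRBY").arg(key).arg(*delta).ignore();
    }
    pipe.query::<()>(con)
}
```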

alexsnaps commented Mar 18, 2024

Obviously, the list above is a set of small improvements to get us closer to something sustainable in such a deployment. A better caching strategy (rather than time-based expiry), e.g. Hi/Lo, would probably be preferable, as we know we are dealing with counters that tend towards a known limit. Eventually we'd want to completely avoid the need for coordination across the different Limitador instances, probably by leveraging CRDTs, like a grow-only counter.
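
For reference, a grow-only counter (G-Counter) is about the simplest CRDT that fits this shape: each instance only ever increments its own slot, and replicas converge by taking the per-node maximum when they merge. The sketch below is purely illustrative; node identifiers and how state is exchanged are assumptions, not a design proposal for Limitador.

```rust
// Illustrative grow-only counter (G-Counter) CRDT sketch.
use std::collections::HashMap;

#[derive(Default, Clone, Debug)]
struct GCounter {
    // One monotonically increasing slot per node (e.g. per Limitador instance).
    counts: HashMap<String, u64>,
}

impl GCounter {
    /// Each node only ever bumps its own slot.
    fn increment(&mut self, node: &str, by: u64) {
        *self.counts.entry(node.to_string()).or_insert(0) += by;
    }

    /// The observed total is the sum of every node's slot.
    fn value(&self) -> u64 {
        self.counts.values().sum()
    }

    /// Merging takes the per-node maximum, so replicas converge regardless of
    /// the order in which they exchange state; no coordination required.
    fn merge(&mut self, other: &GCounter) {
        for (node, &count) in &other.counts {
            let slot = self.counts.entry(node.clone()).or_insert(0);
            *slot = (*slot).max(count);
        }
    }
}
```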

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kuadrant Apr 22, 2024