Measure the latency of the different "happy" scenarios #62

Closed
16 of 19 tasks
alexsnaps opened this issue Feb 26, 2024 · 2 comments

alexsnaps commented Feb 26, 2024

In order to better understand the implications of leveraging the cloud providers' Redis instances, we want to build a latency distribution matrix for all the different providers, starting out with AWS's ElastiCache.

We need a distribution for an LRT that:

  • Increments a counter, without ever reaching the limit (i.e. an RW use case for Redis)
  • Reaches a limit and consistently responds with a 429 HTTP response code for a period
  • The same as the above, but with qualified counters (which we'll use to introduce data locality)

We then need to test these in multiple deployment scenarios:

  • From one location: the same location as the "master" (if applicable), a different AZ, or different regions across the globe
  • From multiple locations, distributed according to the above

That use:

  • the same limit for all (a single, globally shared counter)
  • a qualified counter each, for their AZ/region

Performance tests

Latencies in multiple deployment scenarios

Using cloudping as latency data source

EDIT: no need to test further. We confirmed the expected understanding of Limitador's request-to-Redis model: the N+1 model.

Scenarios with transient gw & redis_cached

  • Test with 3 limits across 2 regions
  • Test with 3 limits & redis_cached instead across these 2 same regions
  • Do we want to test the bin encoding as well? It should reduce the number of bytes sent over the wire...

Scenario with write-behind-lock Kuadrant/limitador#276

alexsnaps commented Mar 18, 2024

What happens (currently)

Today, Limitador does n + 1 round-trips to Redis (or in this case the cloud provider's Redis-alike, e.g. ElastiCache) when on the "happy path", where n is the number of limits being hit.

sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    
    E->>+L: should_rate_limit?
    L->>R: MGET
    R->>L: counter values
    loop for every counter
        L->>R: LUA script increment
        R->>L: ()
    end
    L->>-E: Ok!

So say you have 3 limits (per hour, per day and per month) and your user is currently below the threshold of all 3, we'd need 4 round-trips to Redis to let them through and increment the counters. Based on the testing above, we can pretty much confirm that a single round-trip to Redis roughly matches the figures from cloudping.
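
For reference, the n + 1 pattern above maps roughly onto the sketch below, written against the Rust redis crate. This is an illustration of the access pattern only, not Limitador's actual code: the key layout, the TTL handling and the Lua body are assumptions.

```rust
// Illustrative sketch of the current n + 1 access pattern (not Limitador's code).
// One MGET reads every counter, then one Lua EVAL per counter increments it.
use redis::Script;

fn check_and_increment(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttl_secs: usize,
) -> redis::RedisResult<()> {
    // Round-trip 1: read all counter values at once.
    let _values: Vec<Option<i64>> = redis::cmd("MGET").arg(counter_keys).query(con)?;
    // (This is where the values would be checked against the configured limits.)

    // Round-trips 2..n+1: increment each counter with a small server-side script,
    // setting the TTL only when the key is first created.
    let incr = Script::new(
        r#"
        local v = redis.call('INCRBY', KEYS[1], ARGV[1])
        if v == tonumber(ARGV[1]) then
            redis.call('EXPIRE', KEYS[1], ARGV[2])
        end
        return v
        "#,
    );
    for key in counter_keys {
        let _new_value: i64 = incr.key(*key).arg(1).arg(ttl_secs).invoke(con)?;
    }
    Ok(())
}
```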

What we could do about it (for now)

Improve on the current design

Instead of going to Redis n + 1 times, we could invoke a server-side Lua script that increments all counters in one go and returns the latest values in a single round-trip. While 1 is obviously much better than the current n + 1, a single round-trip time quickly climbs into double-digit milliseconds across regions. So going to Redis on every single rate-limited request seems unreasonable.
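
A minimal sketch of what such a single-round-trip script could look like, again against the Rust redis crate; the Lua body, the TTL handling and the return shape are assumptions rather than Limitador's implementation:

```rust
// Sketch: one server-side Lua script increments every counter and returns the
// updated values, so the whole check-and-increment costs a single round-trip.
use redis::Script;

fn increment_all(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttls_secs: &[usize],
) -> redis::RedisResult<Vec<i64>> {
    let script = Script::new(
        r#"
        local values = {}
        for i, key in ipairs(KEYS) do
            local v = redis.call('INCRBY', key, 1)
            if v == 1 then
                redis.call('EXPIRE', key, ARGV[i])
            end
            values[i] = v
        end
        return values
        "#,
    );
    let mut invocation = script.prepare_invoke();
    for (key, ttl) in counter_keys.iter().zip(ttls_secs) {
        invocation.key(*key).arg(*ttl);
    }
    // One round-trip: the caller compares the returned values to the limits and
    // decides whether to let the request through.
    invocation.invoke(con)
}
```

Even then, that one round-trip is still paid on every request, which is what the cached storage below tries to sidestep.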

Leverage the existing CachedRedisStorage

We do have an alternate Redis-backed AsyncCounterStorage implementation that caches counter values in Limitador's memory, avoiding the need for a round-trip to the backend Redis storage.

sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    participant C as Cached counter
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    
    E->>+L: should_rate_limit?
    L->>C: Read local values
    L->>R: Fetch missing counters 
    R->>L: values & ttl
    L->>C: (Store and) Increment counters
    L->>-E: Ok!
        loop Every second (by default)
        loop Per counter
           L->>R: LUA script increment
           R->>L: ()
        end
    end

While this addresses most of the cases, some jitter remains on the user's request, capped at one round-trip to Redis' region from the call site (though right now it's worse than one RTT, because of the blocking nature of the write-behind task). At least it's paid once, rather than once plus an additional round-trip per counter.

While this is probably the easiest way forward to address what we've now seen happening in AWS, there are a few items to tackle:

  • The "write-behind" task is currently blocking reads for the entire duration of writing each counter
  • Bulk update counters instead of one by one (see how big that batch can become; a batching sketch follows this list)
  • Possibly returning the latest value and update the cached values
  • Fix --ratio on the command line…
  • Look into asynchronously pre-reloading the counters
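
As a rough illustration of the bulk-update item, a batched write-behind flush could send all pending deltas in a single pipeline, so the periodic flush costs one round-trip regardless of how many counters are dirty. This is a sketch under assumed names and bookkeeping (and it omits TTL handling), not the current CachedRedisStorage code.

```rust
// Sketch of a batched write-behind flush: every pending delta is pushed in one
// pipeline instead of one call per counter. TTL handling is omitted here.
use std::collections::HashMap;

fn flush_pending(
    con: &mut redis::Connection,
    pending_deltas: &HashMap<String, i64>,
) -> redis::RedisResult<()> {
    if pending_deltas.is_empty() {
        return Ok(());
    }
    let mut pipe = redis::pipe();
    for (key, delta) in pending_deltas {
        // Still one INCRBY per counter, but batched into a single request/response cycle.
        pipe.cmd("INCRBY").arg(key).arg(*delta).ignore();
    }
    pipe.query::<()>(con)
}
```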

alexsnaps commented Mar 18, 2024

Obviously, the list above is a set of small improvements to get us closer to something sustainable in such a deployment. A better caching strategy (rather than time-based expiry), e.g. Hi/Lo, would probably be preferable, as we know we are dealing with counters that tend towards a known limit. Eventually we'd want to completely avoid the need for coordination across the different Limitador instances, probably by leveraging CRDTs, like a grow-only counter.
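
For reference, a grow-only counter (G-Counter) is about the simplest CRDT that fits this shape: each instance only ever increments its own slot, and replicas converge by taking the per-node maximum when they merge. The sketch below is purely illustrative; node identifiers and how state is exchanged are assumptions, not a design proposal for Limitador.

```rust
// Illustrative grow-only counter (G-Counter) CRDT sketch.
use std::collections::HashMap;

#[derive(Default, Clone, Debug)]
struct GCounter {
    // One monotonically increasing slot per node (e.g. per Limitador instance).
    counts: HashMap<String, u64>,
}

impl GCounter {
    /// Each node only ever bumps its own slot.
    fn increment(&mut self, node: &str, by: u64) {
        *self.counts.entry(node.to_string()).or_insert(0) += by;
    }

    /// The observed total is the sum of every node's slot.
    fn value(&self) -> u64 {
        self.counts.values().sum()
    }

    /// Merging takes the per-node maximum, so replicas converge regardless of
    /// the order in which they exchange state; no coordination required.
    fn merge(&mut self, other: &GCounter) {
        for (node, &count) in &other.counts {
            let slot = self.counts.entry(node.clone()).or_insert(0);
            *slot = (*slot).max(count);
        }
    }
}
```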

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kuadrant Apr 22, 2024