
perf: add cache to get evaluation rules storage method #1910

Merged (2 commits into flipt-io:main) Jul 27, 2023

Conversation

@rudineirk (Contributor) commented Jul 26, 2023:

This improves flag evaluation performance. After some investigation, we found the main cause of delays in evaluation requests was the rules query. In our case, running Flipt with a PostgreSQL DB on AWS RDS, we were seeing delays of up to 1s; with this fix they dropped to a few milliseconds.

This structure could also be used later to implement the cache for `ListFlags` mentioned in issue #935.

flipt=> explain (analyze, buffers) SELECT r.id, r.namespace_key, r.flag_key, r.segment_key, s.match_type, r."rank", c.id, c.type, c.property, c.operator, c.value FROM rules r JOIN segments s ON (r.segment_key = s."key" AND r.namespace_key = s.namespace_key) LEFT JOIN constraints c ON (s."key" = c.segment_key AND s.namespace_key = c.namespace_key) WHERE (r.flag_key = 'test-flag' AND r.namespace_key = 'production') GROUP BY r.id, c.id, s.match_type ORDER BY r."rank" ASC;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=22.17..22.32 rows=59 width=2134) (actual time=0.227..0.230 rows=6 loops=1)
   Sort Key: r.rank
   Sort Method: quicksort  Memory: 26kB
   Buffers: shared hit=14
   ->  HashAggregate  (cost=19.85..20.44 rows=59 width=2134) (actual time=0.218..0.222 rows=6 loops=1)
         Group Key: r.id, c.id, s.match_type
         Batches: 1  Memory Usage: 32kB
         Buffers: shared hit=14
         ->  Hash Join  (cost=2.14..19.40 rows=59 width=2134) (actual time=0.036..0.212 rows=6 loops=1)
               Hash Cond: ((s.key)::text = (r.segment_key)::text)
               Buffers: shared hit=14
               ->  Hash Right Join  (cost=1.06..17.96 rows=59 width=1098) (actual time=0.016..0.163 rows=235 loops=1)
                     Hash Cond: (((c.namespace_key)::text = (s.namespace_key)::text) AND ((c.segment_key)::text = (s.key)::text))
                     Buffers: shared hit=13
                     ->  Seq Scan on constraints c  (cost=0.00..14.94 rows=235 width=107) (actual time=0.005..0.072 rows=235 loops=1)
                           Filter: ((namespace_key)::text = 'production'::text)
                           Buffers: shared hit=12
                     ->  Hash  (cost=1.05..1.05 rows=1 width=1036) (actual time=0.006..0.007 rows=4 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           Buffers: shared hit=1
                           ->  Seq Scan on segments s  (cost=0.00..1.05 rows=1 width=1036) (actual time=0.003..0.005 rows=4 loops=1)
                                 Filter: ((namespace_key)::text = 'production'::text)
                                 Buffers: shared hit=1
               ->  Hash  (cost=1.06..1.06 rows=1 width=2068) (actual time=0.017..0.017 rows=2 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     Buffers: shared hit=1
                     ->  Seq Scan on rules r  (cost=0.00..1.06 rows=1 width=2068) (actual time=0.013..0.014 rows=2 loops=1)
                           Filter: (((namespace_key)::text = 'production'::text) AND ((flag_key)::text = 'test-flag'::text))
                           Rows Removed by Filter: 2
                           Buffers: shared hit=1
 Planning Time: 0.240 ms
 Execution Time: 0.282 ms
(32 rows)

@rudineirk rudineirk requested a review from a team as a code owner July 26, 2023 17:25
@rudineirk rudineirk force-pushed the feat/add-evaluation-rules-cache branch 2 times, most recently from 347e65e to 9e86c1b Compare July 26, 2023 17:40
@rudineirk rudineirk changed the title feat: add cache to get evaluation rules storage method perf: add cache to get evaluation rules storage method Jul 26, 2023
@markphelps (Collaborator) left a comment:

Thanks for the PR @rudineirk !! Definitely love any changes that help improve performance.

I've been thinking about whether moving the existing caching back to wrap the storage layer makes sense instead of the current interceptor/server cache that we have, so this is a good first step!

I'm trying to remember why I decided to move the caching to the server layer in the first place, as it was initially in the storage layer, similar to this approach 🤔. Here is when it changed: https://github.com/flipt-io/flipt/pull/954/files#diff-ee870b8fa81722aba3a872f0a81e22f5e100f4595dda3a4de59b39e23e8e882bR120

I think it makes sense to rely on TTL cache expiration (like you are doing), since that's already what we rely on for the existing entityID evaluation cache.

Also wondering if we even need the existing evaluation cache after this change; I'm not sure how much performance benefit we get there, since this is likely the slowest part of the process as you described. Something for us to benchmark, I suppose.

Have you seen any performance overhead from the marshalling to/from JSON for large sets of EvaluationRules? I'm guessing most flags don't actually have that many rules to begin with, so maybe it's a non-issue?

@rudineirk rudineirk force-pushed the feat/add-evaluation-rules-cache branch from 9e86c1b to 5dc0558 Compare July 26, 2023 18:31
codecov bot commented Jul 26, 2023:

Codecov Report

Merging #1910 (a58d4eb) into main (112e86d) will increase coverage by 0.09%.
The diff coverage is 83.72%.

@@            Coverage Diff             @@
##             main    #1910      +/-   ##
==========================================
+ Coverage   71.21%   71.30%   +0.09%     
==========================================
  Files          60       61       +1     
  Lines        5777     5820      +43     
==========================================
+ Hits         4114     4150      +36     
- Misses       1431     1436       +5     
- Partials      232      234       +2     
| Files Changed | Coverage | Δ |
| --- | --- | --- |
| internal/cache/memory/cache.go | 100.00% <ø> | (ø) |
| internal/cache/redis/cache.go | 63.63% <ø> | (ø) |
| internal/cmd/grpc.go | 0.00% <0.00%> | (ø) |
| internal/server/middleware/grpc/middleware.go | 69.96% <ø> | (ø) |
| internal/storage/cache/cache.go | 85.71% <85.71%> | (ø) |


@rudineirk (Contributor, Author) commented Jul 26, 2023:

> Thanks for the PR @rudineirk !! Definitely love any changes that help improve performance.
>
> Ive been thinking about if moving the existing caching back to wrap the storage layer makes sense instead of the current interceptor/server cache that we have, so this is a good first step!
>
> I'm trying to remember why I decided to move the caching to the server layer in the first place, as it was initially in the storage layer similar to this approach..🤔. Here is when it changed: https://github.com/flipt-io/flipt/pull/954/files#diff-ee870b8fa81722aba3a872f0a81e22f5e100f4595dda3a4de59b39e23e8e882bR120
>
> I think it makes sense to rely on TTL cache expiration (like you are doing) since that's already what we rely on for the existing entityID evaluation cache.
>
> Also wondering if we even need the existing evaluation cache after this change, as im not sure how much performance benefit we get there since this is likely the slowest part of the process as you described.. something for us to benchmark I suppose.
>
> Have you seen any overhead performance-wise of the marshalling to/from JSON for large sets of EvaluationRules? Im guessing most flags don't actually have that many rules to begin with so maybe its a non-issue?

I think the final evaluation result cache is still valid. In our use case we make multiple evaluation requests for the same flag/entityId, and the evaluation result cache helps with this, shaving a few ms off each request. The cache added in this PR helps with evaluation requests for different entityIDs on the same flag, which is where we saw the worst delays.

As for the overhead of decoding a large JSON payload: it has some CPU impact, but I think it's still far better than querying a lot of data from the DB/storage.

@rudineirk rudineirk requested a review from markphelps July 26, 2023 18:42
@yquansah (Contributor) left a comment:

Thanks for the PR indeed! Awesome to see users running the software and making improvements @rudineirk.

I am wondering if we should keep the same original semantics of serializing the rules as protobuf, as we did for EvaluationResponse, but if the JSON serialization is not too big of a CPU hit, as you've stated, then maybe it is not worth worrying about right now.

Code looks good though.

@rudineirk rudineirk force-pushed the feat/add-evaluation-rules-cache branch from 5dc0558 to 7333573 Compare July 26, 2023 19:57
@rudineirk (Contributor, Author) commented:
> Thanks for the PR indeed! Awesome to see users running the software and making improvements @rudineirk.
>
> I am wondering if we should keep the same original semantics of serializing the rules as protobuf as we did EvaluationResponse, but if the JSON serialization is not too big of a CPU hit as you've stated, then maybe it is not worth worrying about that right now.
>
> Code looks good though.

I thought about using protobuf to serialize the cache too, but it would require converting all the storage structs into protobuf, and I don't know whether the performance gains over JSON would be worth it.

@rudineirk rudineirk requested review from markphelps and yquansah July 26, 2023 20:02
@markphelps (Collaborator) left a comment:

Thanks again @rudineirk! Added some thoughts around error handling.

@rudineirk rudineirk force-pushed the feat/add-evaluation-rules-cache branch from 7333573 to 5135f2c Compare July 27, 2023 13:39
@rudineirk rudineirk requested a review from markphelps July 27, 2023 13:40
@markphelps (Collaborator) left a comment:

LGTM! Thank you again @rudineirk!!

@markphelps (Collaborator):

@all-contributors please add @rudineirk for code

@allcontributors (Contributor):

@markphelps

I've put up a pull request to add @rudineirk! 🎉

@markphelps markphelps enabled auto-merge (squash) July 27, 2023 17:46
@markphelps (Collaborator):

will go out in v1.24.0, likely Monday 🎉

@markphelps markphelps merged commit ef8c772 into flipt-io:main Jul 27, 2023
markphelps added a commit that referenced this pull request Jul 27, 2023
* 'main' of https://github.com/flipt-io/flipt:
  docs: add rudineirk as a contributor for code (#1913)
  perf: add cache to get evaluation rules storage method (#1910)
  fix: renable pprof (#1912)