
[x/cache] Fix LRU cache mem leak (when used with no loader) #3806

Merged · 7 commits into master from linasm/fix-lru-cache-mem-leak-no-loader · Oct 4, 2021

Conversation

@linasm (Collaborator) commented on Oct 2, 2021

What this PR does / why we need it:
A regression in the LRU cache was introduced here:

m3/src/x/cache/lru_cache.go

Lines 356 to 366 in e2c6903

if getWithNoLoader && !exists {
	// If we're not using a loader then return entry not found
	// rather than creating a loading channel since we are not trying
	// to load an element we are just attempting to retrieve it if and
	// only if it exists.
	return nil, false, nil, ErrEntryNotFound
}
if !exists {
	// The entry doesn't exist, clear enough space for it and then add it
	if err := c.reserveCapacity(1); err != nil {
Cache entries were never evicted when the cache was used with no loader: because of the early return under getWithNoLoader = true, the code that invokes reserveCapacity was unreachable, and that was the only place where eviction happened.

Special notes for your reviewer:
I have added a call to reserveCapacity from Put (which assumes no loader is in use).
Another possibility would have been to fix the tryCache implementation to reserve space on a cache miss, but then the Put would not happen under the same lock, creating a race between Gets (freeing up space in advance) and Puts that would permit exceeding the cache limit (while still being eventually consistent). I chose to evict from Put as this seemed more semantically correct.
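
For illustration, the following is a minimal, self-contained Go sketch of the idea, not the actual m3 implementation (miniLRU and everything in it is invented for this example): eviction happens inside Put, under the same lock that guards the insert, so concurrent Puts cannot exceed the limit.

package main

import (
	"container/list"
	"fmt"
	"sync"
)

// miniLRU is a hypothetical, stripped-down cache illustrating the fix:
// Put evicts least-recently-used entries under the same mutex that
// protects the insert, so the cache cannot grow past maxSize.
type miniLRU struct {
	mut      sync.Mutex
	maxSize  int
	entries  map[string]*list.Element
	byAccess *list.List // front = most recently used
}

type kv struct {
	key   string
	value interface{}
}

func newMiniLRU(maxSize int) *miniLRU {
	return &miniLRU{
		maxSize:  maxSize,
		entries:  make(map[string]*list.Element),
		byAccess: list.New(),
	}
}

func (c *miniLRU) Put(key string, value interface{}) {
	c.mut.Lock()
	defer c.mut.Unlock()
	if elt, ok := c.entries[key]; ok {
		elt.Value.(*kv).value = value
		c.byAccess.MoveToFront(elt)
		return
	}
	// Reserve capacity before inserting: this is the step that was
	// unreachable in the no-loader path before the fix.
	for len(c.entries) >= c.maxSize {
		oldest := c.byAccess.Back()
		delete(c.entries, oldest.Value.(*kv).key)
		c.byAccess.Remove(oldest)
	}
	c.entries[key] = c.byAccess.PushFront(&kv{key: key, value: value})
}

func main() {
	c := newMiniLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Put("c", 3)               // evicts "a" instead of leaking
	fmt.Println(len(c.entries)) // 2
}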

Does this PR introduce a user-facing and/or backwards incompatible change?:
NONE

Does this PR require updating code package or user-facing documentation?:
NONE

codecov bot commented on Oct 2, 2021

Codecov Report

Merging #3806 (6b1a24d) into master (6b1a24d) will not change coverage.
The diff coverage is n/a.

❗ Current head 6b1a24d differs from pull request most recent head 6845b33. Consider uploading reports for the commit 6845b33 to get more accurate results


@@          Coverage Diff           @@
##           master   #3806   +/-   ##
======================================
  Coverage    56.8%   56.8%           
======================================
  Files         552     552           
  Lines       63077   63077           
======================================
  Hits        35883   35883           
  Misses      23996   23996           
  Partials     3198    3198           
Flag        Coverage Δ
aggregator  63.3% <0.0%> (ø)
cluster     ∅ <0.0%> (∅)
collector   58.4% <0.0%> (ø)
dbnode      60.5% <0.0%> (ø)
m3em        46.4% <0.0%> (ø)
metrics     19.7% <0.0%> (ø)
msg         74.3% <0.0%> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 437 to 441
if enforceLimit && c.reserveCapacity(1) != nil {
	// Silently skip adding the new entry if we fail to free up space for it
	// (which should never be happening).
	return value, err
}
@robskillington (Collaborator) commented on Oct 2, 2021

Should return the error from the reserve capacity call yeah? Otherwise the err would be nil yeah?

Suggested change
if enforceLimit && c.reserveCapacity(1) != nil {
	// Silently skip adding the new entry if we fail to free up space for it
	// (which should never be happening).
	return value, err
}
if enforceLimit {
	if err := c.reserveCapacity(1); err != nil {
		// Silently skip adding the new entry if we fail to free up space for it
		// (which should never be happening).
		return value, err
	}
}

@linasm (Collaborator, Author) replied:

Well, this aspect is a bit vague. The contract of updateCacheEntryWithLock is to return its arguments (value interface{}, err error) unmodified, and I've tried to preserve that, commenting that we "silently skip" such a situation (being unable to evict from the cache, which from what I can see would only happen if there is a code bug).
Also, in the only place that calls updateCacheEntryWithLock with enforceLimit = true, those return values are ignored completely:

m3/src/x/cache/lru_cache.go

Lines 223 to 233 in c5f6237

func (c *LRU) PutWithTTL(key string, value interface{}, ttl time.Duration) {
	var expiresAt time.Time
	if ttl > 0 {
		expiresAt = c.now().Add(ttl)
	}
	c.mut.Lock()
	defer c.mut.Unlock()
	_, _ = c.updateCacheEntryWithLock(key, expiresAt, value, nil, true)
}

So I'm not really sure if I should change this (but I don't have a strong opinion either, even after jumping around this code for quite a long time). I guess it's the support of two modes (with/without a loader) in a single data structure that makes it hard to come up with an elegant implementation. Perhaps @ryanhall07 will chime in with his insights as well.

@robskillington (Collaborator) commented on Oct 2, 2021

I see yeah, that makes sense. Feel free to ignore my suggestion then 👍

The other question is, this only happens when there is a code bug yeah?

If so, does that mean we have a remaining code bug we're not aware of? Which is fine if so, we can chase this up after the fact. Just wanted to understand the current state of the world (post-merging this change).

@linasm (Collaborator, Author) replied:

No, what I meant is that reserveCapacity is not supposed to return errors, unless there is some bug that we are not yet aware of.

Collaborator:

Ah ok, that makes sense.

Collaborator:

left another comment. maybe it's just best to change reserveCapacity to return a bool so it's clear it can't be an error.

@robskillington (Collaborator) left a comment:

LGTM

) (interface{}, error) {
	entry := c.entries[key]
	if entry == nil {
		if enforceLimit && c.reserveCapacity(1) != nil {
Collaborator:

nit: maybe check that the returned error type is ErrCacheFull, in case a future contributor changes reserveCapacity to return some other kind of error. Or change the signature of reserveCapacity to return a bool instead, to be more explicit that it's not really an error.

@linasm (Collaborator, Author) replied:

Indeed, replacing the error return value with a bool for reserveCapacity makes a lot of sense. I should have thought of this myself.
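
For what it's worth, a hypothetical, self-contained sketch of that signature change (the real body of reserveCapacity is not shown in this thread, so the eviction logic below is purely illustrative):

package main

import "fmt"

// Hypothetical cache sketch showing the proposed signature change:
// reserveCapacity returns a bool ("was space freed?") instead of an
// error, making it explicit at call sites that there is no real error
// condition to classify or propagate.
type cache struct {
	size, maxSize int
}

func (c *cache) reserveCapacity(n int) bool {
	// Illustrative only: evict oldest entries until n slots are free.
	for c.size+n > c.maxSize && c.size > 0 {
		c.size-- // evict one entry
	}
	return c.size+n <= c.maxSize
}

func main() {
	c := &cache{size: 3, maxSize: 3}
	if !c.reserveCapacity(1) {
		fmt.Println("skip the insert: could not free space")
		return
	}
	c.size++ // insert proceeds
	fmt.Println("inserted, size =", c.size)
}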

@@ -229,7 +229,7 @@ func (c *LRU) PutWithTTL(key string, value interface{}, ttl time.Duration) {
 	c.mut.Lock()
 	defer c.mut.Unlock()

-	_, _ = c.updateCacheEntryWithLock(key, expiresAt, value, nil)
+	_, _ = c.updateCacheEntryWithLock(key, expiresAt, value, nil, true)
Collaborator:

I hate to over-optimize, but I do wonder if scanning the entire cache for evictions on every put is going to be too much. I guess we can try this, and if it causes performance issues we can add some kind of eviction every N puts.

@linasm (Collaborator, Author) replied:

That should not be a problem: reserveCapacity removes entries while scanning them, so we either free up exactly one slot for the current Put value, or free up more, in which case some subsequent Puts are handled for free.
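
To make the amortization argument concrete, here is how the inline eviction loop from the hypothetical miniLRU sketch above could be factored into a reserveCapacity with the bool signature discussed earlier. Every element it visits is also removed, so an entry is scanned at most once after its insertion, and across N Puts the total eviction work is bounded by the N insertions, i.e. amortized O(1) per Put:

// Hypothetical continuation of the miniLRU sketch: frees space by
// evicting from the least-recently-used end until n slots are
// available, removing each element it visits.
func (c *miniLRU) reserveCapacity(n int) bool {
	for len(c.entries)+n > c.maxSize {
		oldest := c.byAccess.Back()
		if oldest == nil {
			return false // nothing left to evict; n exceeds maxSize
		}
		delete(c.entries, oldest.Value.(*kv).key)
		c.byAccess.Remove(oldest)
	}
	return true
}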

@linasm linasm merged commit 4a4559d into master Oct 4, 2021
@linasm linasm deleted the linasm/fix-lru-cache-mem-leak-no-loader branch October 4, 2021 17:24