Unintuitive behavior (bug?) with put()
#568
Thanks for the report and for isolating where the problem is likely stemming from. From reviewing that setter logic, I think you're right about where the bug is. I can confirm that if I remove the clear() of the old reference, your repro no longer fails.
Do you have a preference or recommendation on which strategy we should prefer? I can try to get a bug fix out tonight (or this week if not) for the 2.x and 3.x development lines.
I don't think I know the architecture well enough to make a recommendation. Generally we just want some value instead of null. Can you explain where the extra GC churn would come from? Maybe there's a way to special-case and do nothing if the value is the same (kind of like putIfAbsent). Thanks for the quick attention.
The GC has to clear out all of the stale References itself, which is where the extra churn comes from. Since GC is magical, we'd need an expert like Jeremy to tell us the optimal incarnation. The other option is to check on read whether a stale reference was observed. If null, then a new read barrier ensures that the thread sees the prior write (and the next iteration's plain read cannot go back in time). That would look something like,

public final V getValue() {
for (;;) {
Reference<V> ref = ((Reference<V>) VALUE.get(this));
V value = ref.get();
if ((value != null) || (ref == VALUE.getVolatile(this))) {
return value;
}
}
}
public final void setValue(V value, ReferenceQueue<V> referenceQueue) {
Reference<V> ref = ((Reference<V>) getValueReference());
VALUE.set(this, new WeakValueReference<V>(getKeyReference(), value, referenceQueue));
ref.clear();
}

The slightly higher cost in user code might be preferable to adding GC work. This is very much a micro-optimization, but generally worth considering in data structure classes.
Unfortunately I'm finding that while the proposed solutions fix my provided repro, they don't actually fix the issue in our tests. I don't have a full explanation for what's going on yet. Upon adding more debugging info, I am seeing removal notifications with the COLLECTED cause for values that are still live.
Can you narrow it down to a unit test? Let's debug with only the removal of the clear() call as the patch.
I'll try to get a better explanation tomorrow, but I'm definitely seeing a COLLECTED removal notification for a live value. I grabbed a stack trace of what's submitting the weird removal notification:
We're using strong keys & weak values, version 2.8.5.
Line numbers are probably off because I added some debugging code.
Hmm, I do know that there is a bug in Guava where that can occur. Nothing jumps out when scanning over the code, and it seems correct since it checks for null first before supplying that cause. Thanks for digging into this.
Thinking about solution 1 of not calling clear()…
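For reference, a minimal sketch of what that first option could look like, reusing the setValue shown earlier and assuming the same VALUE handle, getKeyReference helper, and WeakValueReference type from that snippet; this is only an illustration, not the actual patch:

public final void setValue(V value, ReferenceQueue<V> referenceQueue) {
  // Install the new Reference without clearing the old one; a reader still holding the
  // stale Reference keeps seeing the previous value, and the GC reclaims the old referent
  // on its own (the extra GC work discussed above).
  VALUE.set(this, new WeakValueReference<V>(getKeyReference(), value, referenceQueue));
}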
Please try this patch in your tests and see if it resolves the original problem. This implements the retry loop to resolve the race.
I think that the issue as originally described is not actually affecting our tests; I just discovered it by accident. Instead, my understanding is that we're encountering a race between cleanup of an evicted value and a call to put() that resurrects the entry.
With that understanding, I modified the eviction logic along these lines:

synchronized (n) {
value[0] = n.getValue();
if (cause == RemovalCause.COLLECTED && value[0] != null) {
resurrect[0] = true;
return n;
}
}

Does this explanation and solution make sense to you? I still think we should fix the originally reported issue as well.
That's another excellent find! Yes, let's fix both. I think that we can write a unit test by downcasting to the implementation (BoundedLocalCacheTest), clearing the reference to simulate a garbage collection, and throwing threads to catch the race.
Here's a stress test that seems to reproduce the 2nd issue:

AtomicReference<Exception> e = new AtomicReference<>();
Cache<String, Object> cache =
Caffeine.newBuilder()
.weakValues()
.removalListener(
(k, v, cause) -> {
if (cause == RemovalCause.COLLECTED && v != null) {
e.compareAndSet(null, new IllegalStateException("Evicted a live value: " + v));
}
})
.build();
String key = "key";
cache.put(key, new Object());
List<Thread> threads = new ArrayList<>();
AtomicBoolean missing = new AtomicBoolean();
for (int i = 0; i < 100; i++) {
Thread t =
new Thread(
() -> {
for (int j = 0; j < 1000; j++) {
if (e.get() != null) {
break;
}
if (Math.random() < .01) {
System.gc();
cache.cleanUp();
} else if (cache.getIfPresent(key) == null && !missing.getAndSet(true)) {
cache.put(key, new Object());
missing.set(false);
}
}
});
threads.add(t);
t.start();
}
for (Thread t : threads) {
t.join();
}
if (e.get() != null) {
throw e.get();
}

As for the fix for the first issue, I don't understand why a loop is necessary. Can we just follow up the non-volatile read with a volatile read? I'm not sure of the best way to write the tests with the existing framework, so I'm hoping you can help me get these ideas merged.
Thank you @justinhorvitz. Sorry that I haven't been able to focus on this much (it's a meeting day for me). My plan is to translate your work into our test framework and a commit, and add you as a co-author.
The problem is that we have an indirection: value -> Reference -> referent. At any point the thread may be context switched out and the environment changed (the referent was collected, the value was replaced). This indirection means that even if we only used a volatile read, a race can occur, because our decision depends on dereferencing the state and assuming nothing changed. We can only get away with not validating on a non-null read because non-determinism is expected there, but as you observed, a null may never have been an accurately visible result. The validation loop allows us to detect that. If we did not loop and the volatile read differed, we still couldn't trust the new referent, as it too might have been collected; the context switch duration might stretch into the distant future. The non-volatile read is there because we can piggyback on the Map.get's volatile read. That is a memory read barrier, which enforces point-in-time visibility within the MESI cache coherence protocol. The plain read means that we know the value up until the time of the Map.get, so as a micro-optimization we don't need the stronger consistency of a few instructions later. Yet that can be inconsistent, so we need a volatile read to set a new point-in-time barrier for the validation check.
Makes sense. We're trying to distinguish between a GC clear (returning null is valid) and a stale read (need to read the updated reference). I guess I would write it as such, to avoid the double read when it loops:

public final V getValue() {
Reference<V> ref = (Reference<V>) VALUE.get(this);
for (;;) {
V value = ref.get();
if (value != null) {
return value;
}
Reference<V> newRef = (Reference<V>) VALUE.getVolatile(this);
if (newRef == ref) {
return null; // Reference is up to date - it must have been cleared by GC.
}
ref = newRef;
}
}

Also wonder if there are any cases where…
Yeah, I suppose it's a preference for which is more readable? The shorter code is concise, and I'd assume that a plain read is free since a VarHandle is an intrinsic. But perhaps it is less obvious compared to your longer form. I'll review for style before merging the fixes, as I'm not sure which I prefer. It looks like…
1. When an entry is updated then a concurrent reader should observe either the old or new value. This operation replaces the j.l.Reference instance stored on the entry and the old referent becomes eligible for garbage collection. A reader holding the stale Reference may therefore return a null value, which is more likely due to the cache proactively clearing the referent to assist the garbage collector. When a null value is read then an extra volatile read is used to validate that the Reference instance is still held by the entry. This retry loop has negligible cost.
2. When an entry is eligible for removal due to its value being garbage collected, then during the eviction's atomic map operation this eligibility must be verified. If concurrently the entry was resurrected and a new value set, then the cache writer has already dispatched the removal notification and established a live mapping. If the evictor does not detect that the cause is no longer valid, then it would incorrectly discard the mapping with a removal notification containing a non-null key, non-null value, and collected removal cause. Like expiration and size policies, the reference eviction policy will now validate and no-op if the entry is no longer eligible.
3. When the fixed expiration setting is dynamically adjusted, an expired entry may be resurrected as no longer eligible for removal. While the map operation detected this case, stemming from the entry itself being updated and its lifetime reset, the outer eviction loop could retry indefinitely due to a stale read of the fixed duration. This caused the loop to retry the ineligible entry, but instead it can terminate when eviction fails because it scans a queue ordered by the expiration timestamp.

Co-authored-by: Justin Horvitz <[email protected]>
When adding more coverage on resurrected entries, I found a bug if the fixed expiration time is updated. This was stuck in a retry loop since it did not re-read the duration but rather kept it in a stale variable outside of the loop. Thus it kept evaluating the next entry as expired. Since this queue is ordered, the loop can terminate when eviction of the head entry fails. I'll backport these changes to 2.x tomorrow and cut a release. You should have a new version (2.x, 3.x preferred) by Friday morning.
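For anyone following along, here is a self-contained sketch of that stale-duration bug and its fix in isolation; the names (ExpirationSketch, writeOrderQueue, expireAfterWriteNanos, tryEvict) are illustrative stand-ins rather than Caffeine's internals:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

final class ExpirationSketch {
  // Hypothetical entry with a write timestamp, standing in for the cache's internal node.
  record Entry(String key, long writeTimeNanos) {}

  final Deque<Entry> writeOrderQueue = new ArrayDeque<>();   // ordered by write time
  final AtomicLong expireAfterWriteNanos = new AtomicLong(); // may be adjusted dynamically

  void expireEntries(long now) {
    for (Entry head = writeOrderQueue.peekFirst(); head != null;
        head = writeOrderQueue.peekFirst()) {
      // The bug: the duration was read once outside the loop, so a dynamic adjustment was never
      // observed and the same head entry kept being evaluated as expired. Re-read it each pass.
      long duration = expireAfterWriteNanos.get();
      if ((now - head.writeTimeNanos()) < duration) {
        return; // the queue is ordered by write time, so no later entry can be expired either
      }
      if (!tryEvict(head)) {
        return; // eviction failed because the entry was resurrected; safe to stop scanning
      }
    }
  }

  boolean tryEvict(Entry entry) {
    // Placeholder for the atomic map operation that re-validates eligibility before removal.
    return writeOrderQueue.remove(entry);
  }
}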
Released in 2.9.2 and 3.0.3. Thanks again for all your help on this.
Thank you!
Contains critical fixes for ben-manes/caffeine#568. Closes #13651. Signed-off-by: Philipp Wollermann <[email protected]>
Upon migrating from Guava cache to Caffeine in bazel we are seeing some flakiness in tests that heavily exercise weak/soft-valued caches. It appears that the behavior of put() for an already present key in Caffeine diverges from Guava cache (and ConcurrentHashMap) in that a concurrent thread calling getIfPresent can get null, as opposed to either the old value or the new value. Note that this strange behavior occurs even if the put call does not change the existing value. I believe it's the result of non-atomically clearing the existing value reference before setting it.

The following reproduces the issue within a few iterations: