Refresh cache with refreshAfterWrite doesn't update value #714

bschlie · 2022-05-24T10:12:04Z

Hi.

We are experiencing some issues when trying to manually refresh a cache with refreshAfterWrite.

I have created a minimal project to demonstrate the problem:

public class main {
    public static void main(String[] args) throws Exception {

        ExecutorService executor = Executors.newCachedThreadPool();

        LoadingCache<Object, String> cache = Caffeine
            .newBuilder()
            .executor(executor)
            .refreshAfterWrite(Duration.ofSeconds(100))
            .buildAsync((key -> {
                //      Issue only occur when computations are slow
                Thread.sleep(100);
                return String.valueOf(new Random().nextInt());
            }))
            .synchronous();

        Object obj = new Object();

        String firstGet = cache.get(obj);
        String reloadedValue = cache.refresh(obj).get(); // wait for it to finish reloading
        String secondGet = cache.get(obj);
        String thirdGet = cache.get(obj);

        System.out.println(firstGet);
        System.out.println(reloadedValue);
        System.out.println(secondGet);
        System.out.println(thirdGet);

        executor.shutdown();

        assert !firstGet.equals(secondGet);
    }
}

Output

> Task :main.main() FAILED
1764635026
1549536415
1764635026
1764635026
Exception in thread "main" java.lang.AssertionError
	at main.main.main(main.java:43)

By calling refresh on the cache and waiting for the future to be complete, a new value has been returned. However the subsequently .get call (both secondGet and thirdGet) to the cache still returns the old value. I would expect that the cache would provide the new value.

It seems like the issue only occur when the computations in the buildAsync callback are slow (hence the Thread.sleep).

I'm using caffeine version 3.1.0.

Is this is a bug or is it something I have misunderstood?

The text was updated successfully, but these errors were encountered:

ben-manes · 2022-05-24T17:37:36Z

This is due to how CompetableFuture works and other races between threads. Unfortunately I don't think there is anything I can do except explain the problem. There are multiple challenges that makes your test go awry.

CompetableFuture returns a new instance when registering dependent actions and operations on those futures are not back propagated to the source. In other words, you can cancel or complete a dependent future but those upstream will not receive this signal. If for example you wanted to cancel an http request then this must be on the original future or else the client would not be aware to close the connection by its own dependent action. The cache always returns what was produced by the attached AsyncCacheLoader which avoids a surprise of disconnected operations. Ideally the dependent actions would return the same future instance and one would use copy() to explicitly disconnect.

The cache attaches a whenComplete action to update the cache, either removing if a failure or else updating the policy metadata (e.g. assigning capacity and triggering an eviction). Since we return the loader's future, waiting on future's value does not mean the cache's dependent actions have completed. As the futures are executed asynchronously, the client can observe the result before the other thread runs the dependents. In your case that race causes the reloadedValue to be started before the firstGet could update the policy metadata. When the refresh completes, it validates that the entry was not modified since then and, if so, discards the refresh as an ABA conflict. This is done by comparing the write timestamps before and after refreshing, which due to the races are in conflict and the cache must be drop the new value to maintain linearizability (e.g. put(k, a), refresh(k), put(k, b) => get(k) == b).

The problems are exhausterated in application by the fact that dependent actions are processed in LIFO order. A slow action from the application will delay the cache's to be delayed, which could make this happen more frequently.

Ideally if we returned the cache's dependent future some of these problems would go away, but at the cost of whomever produced the future not receiving any downstream signals. The write conflict is a less pronounced race that I'd need some more time to revisit, but I don't think that would be easy to weaken without losing linearizability which is a more important to keep.

…714) Linearization means that the refresh should be dropped if another write for that entry occurs, as we may populate the cache with stale data. The cache uses a write timestamp to help detect this. However, when this could be too aggressive when a refresh runs concurrently with the async load completing. An inflight load is given an infinite timestamp so that it never expires. A whenComplete callback then updates the entry with its policy metadata, such as its expiration time. That changes the write timestamp and if a refresh runs concurrently with the callback then the refresh may see the infinite timestamp, the replace updates it, and the refresh drops the entry as a conflict. Instead we can detect this case by seeing if the timestamp delta exceeds our maximum value (120yrs). IF so, then we know this race occurred and can trust the update. Of course intermediate writes might still have occurred, so we have to check that our refresh future was sucessfully unregistered, which would have been discarded in had any other write occurred.

ben-manes · 2022-05-25T03:21:07Z

I might have found a refinement for the write timestamp conflict which will resolve the problem. We'll see if the test suite passes, as this is a complex area. If so, then the changes allow your test case to pass by being a little less aggressive in this race condition. However I am unsure if there is any way to write a reliable unit test, so it might remain best-effort.

Since a refresh is an optimistic read, you can still observe that different behavior if you replace the entry during its reload. In that case while the cache will discard the refresh, the future will return the new value. Here I am trying to be less aggressive by smarter detection of when we must discard it, but that will always be possible given the linearization requirements.

Hopefully I can make this a bit better for you, though.

…714) Linearization means that the refresh should be dropped if another write for that entry occurs, as we may populate the cache with stale data. The cache uses a write timestamp to help detect this. However, when this could be too aggressive when a refresh runs concurrently with the async load completing. An inflight load is given an infinite timestamp so that it never expires. A whenComplete callback then updates the entry with its policy metadata, such as its expiration time. That changes the write timestamp and if a refresh runs concurrently with the callback then the refresh may see the infinite timestamp, the replace updates it, and the refresh drops the entry as a conflict. Instead we can detect this case by seeing if the timestamp delta exceeds our maximum value (120yrs). IF so, then we know this race occurred and can trust the update. Of course intermediate writes might still have occurred, so we have to check that our refresh future was sucessfully unregistered, which would have been discarded in had any other write occurred.

bschlie · 2022-05-25T11:45:54Z

Hi @ben-manes.

Thanks very much for the detailed description of the cause of this issue. I will try to revisit this issue when/if your latest change is applied.

As I understand, due to the nature of CompletableFuture there is currently no way around this. I understand that the problem is two folded:

refresh() immediately after get() can cause the refreshed value to be discarded as the meta data from the first computation might not be set.
get() immediately after refresh() might still provide me with the old value, as the meta data has yet to be updated. Subsequent get()'s might return the new value though when the cache has finished processing.

I which to have a cache with similar behavior to refreshAfterWrite, but I want to be able to manually discard/refresh the old value from the cache and be guaranteed that next time the key is requested, a new value is computed. Is there an alternative to my approach?

ben-manes · 2022-05-25T19:58:35Z

Yes, fundamentally we cannot work around this. I might be able to make it handle some edge cases better, but there will always be races that violate the initial assumption.

You can guarantee by invalidating or explicitly replacing the value. The refresh tries to optimistically load it while serving the latest copy, and then try to swap it if no changes occurred that would invalidate data consistency. This is good for hiding latency instead of an explicit removal and load (e.g. expiration), but it may have to discard the optimistic load. As reads are lock-free, you could use asMap().compute to reload and replace the value while blocking writers, which is likely acceptable. Would either of those work for you?

…714) [no ci] Linearization means that the refresh should be dropped if another write for that entry occurs, as we may populate the cache with stale data. The cache uses a write timestamp to help detect this. However, when this could be too aggressive when a refresh runs concurrently with the async load completing. An inflight load is given an infinite timestamp so that it never expires. A whenComplete callback then updates the entry with its policy metadata, such as its expiration time. That changes the write timestamp and if a refresh runs concurrently with the callback then the refresh may see the infinite timestamp, the replace updates it, and the refresh drops the entry as a conflict. Instead we can detect this case by seeing if the timestamp delta exceeds our maximum value (120yrs). IF so, then we know this race occurred and can trust the update. Of course intermediate writes might still have occurred, so we have to check that our refresh future was sucessfully unregistered, which would have been discarded in had any other write occurred.

ben-manes · 2022-05-26T03:25:46Z

My idea to detect and handle one of the race conditions is not strong enough to fix this case. Other races still exist and when I do get your test to pass, it is touchy enough to not be relied upon (e.g. must be in a debugger to ensure certain event ordering). Since the optimization only helps in a rare case and not fully, I don't think it is worth adding if it risks that I accidentally violate linearizability.

The idea was to detect if the original read was the infinite timestamp and that the current is the same future. If the latest timestamp was the same then no writes or if greater than the user allowed maximum then the metadata write back then it is okay, else an explicit ABA-style of write would be discarded. I am not positive that this would pass linearizability as I think an ABA could still happen in the delta case. That can't be detected because we discard the refresh future on a write, so a conditional removal for the future during the reload's callback would always fail as the metadata write discarded it.

Anyway, I won't be making any improvements here. You have to be explicit by removing or replacing the entry, which will ensure the new value is visible. This optimistic reload can't offer that strictness as it is best-effort so it has to discard in those racy cases.

bschlie · 2022-05-27T10:34:52Z

Alright @ben-manes, thanks a lot for your effort. At least I got a better understanding of the issue. I will go with your other suggestions instead 👍

Linearization means that the refresh should be dropped if another write for that entry occurs, as we may populate the cache with stale data. The cache used a write timestamp to help detect this. However, when this could be too aggressive when a refresh runs concurrently with the async load completing. An inflight load is given an infinite timestamp so that it never expires. A whenComplete callback then updates the entry with its policy metadata, such as its expiration time. That changes the write timestamp and if a refresh runs concurrently with the callback then the refresh may see the infinite timestamp, the replace updates it, and the refresh drops the entry as a conflict. Instead, if the refresh is successfully unregistered within the entry's compute to swap the value, then it can be assumed to be valid. Any other write will discard the refresh under the entry's compute, so we retain linearizability. This requires that the load callback to update the metadata does not discard the refresh, since that is not needed for a psuedo write. The write timestamp trick for ABA detection is a layover from the previous implementation (2.x) which did not promise linearization. Note that a refresh is optimistic and races with other writes, so it may be discarded. As it relies on callbacks to write back into the cache, one cannot expect the entry to be updated after the reload completes (e.g. see #714).

ben-manes · 2022-05-29T19:35:24Z

Thanks for the prompt to revisit this. I was able to improve the implementation to handle conflicts better and simplify away from the timestamp check. That doesn’t change our discussion, but did help make things better. 😁

ben-manes mentioned this issue May 26, 2022

asyncLoad occasionally called with Object due to race between refreshAfterWrite and expireAfterAccess #715

Closed

ben-manes closed this as completed May 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refresh cache with refreshAfterWrite doesn't update value #714

Refresh cache with refreshAfterWrite doesn't update value #714

bschlie commented May 24, 2022 •

edited

Loading

ben-manes commented May 24, 2022

ben-manes commented May 25, 2022

bschlie commented May 25, 2022 •

edited

Loading

ben-manes commented May 25, 2022

ben-manes commented May 26, 2022

bschlie commented May 27, 2022

ben-manes commented May 29, 2022

Refresh cache with refreshAfterWrite doesn't update value #714

Refresh cache with refreshAfterWrite doesn't update value #714

Comments

bschlie commented May 24, 2022 • edited Loading

ben-manes commented May 24, 2022

ben-manes commented May 25, 2022

bschlie commented May 25, 2022 • edited Loading

ben-manes commented May 25, 2022

ben-manes commented May 26, 2022

bschlie commented May 27, 2022

ben-manes commented May 29, 2022

bschlie commented May 24, 2022 •

edited

Loading

bschlie commented May 25, 2022 •

edited

Loading