Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh cache with refreshAfterWrite doesn't update value #714

Closed
bschlie opened this issue May 24, 2022 · 7 comments
Closed

Refresh cache with refreshAfterWrite doesn't update value #714

bschlie opened this issue May 24, 2022 · 7 comments

Comments

@bschlie
Copy link

bschlie commented May 24, 2022

Hi.

We are experiencing some issues when trying to manually refresh a cache with refreshAfterWrite.

I have created a minimal project to demonstrate the problem:

public class main {
    public static void main(String[] args) throws Exception {

        ExecutorService executor = Executors.newCachedThreadPool();

        LoadingCache<Object, String> cache = Caffeine
            .newBuilder()
            .executor(executor)
            .refreshAfterWrite(Duration.ofSeconds(100))
            .buildAsync((key -> {
                //      Issue only occur when computations are slow
                Thread.sleep(100);
                return String.valueOf(new Random().nextInt());
            }))
            .synchronous();

        Object obj = new Object();

        String firstGet = cache.get(obj);
        String reloadedValue = cache.refresh(obj).get(); // wait for it to finish reloading
        String secondGet = cache.get(obj);
        String thirdGet = cache.get(obj);

        System.out.println(firstGet);
        System.out.println(reloadedValue);
        System.out.println(secondGet);
        System.out.println(thirdGet);

        executor.shutdown();

        assert !firstGet.equals(secondGet);
    }
}

Output

> Task :main.main() FAILED
1764635026
1549536415
1764635026
1764635026
Exception in thread "main" java.lang.AssertionError
	at main.main.main(main.java:43)

By calling refresh on the cache and waiting for the future to be complete, a new value has been returned. However the subsequently .get call (both secondGet and thirdGet) to the cache still returns the old value. I would expect that the cache would provide the new value.

It seems like the issue only occur when the computations in the buildAsync callback are slow (hence the Thread.sleep).

I'm using caffeine version 3.1.0.

Is this is a bug or is it something I have misunderstood?

@ben-manes
Copy link
Owner

This is due to how CompetableFuture works and other races between threads. Unfortunately I don't think there is anything I can do except explain the problem. There are multiple challenges that makes your test go awry.

CompetableFuture returns a new instance when registering dependent actions and operations on those futures are not back propagated to the source. In other words, you can cancel or complete a dependent future but those upstream will not receive this signal. If for example you wanted to cancel an http request then this must be on the original future or else the client would not be aware to close the connection by its own dependent action. The cache always returns what was produced by the attached AsyncCacheLoader which avoids a surprise of disconnected operations. Ideally the dependent actions would return the same future instance and one would use copy() to explicitly disconnect.

The cache attaches a whenComplete action to update the cache, either removing if a failure or else updating the policy metadata (e.g. assigning capacity and triggering an eviction). Since we return the loader's future, waiting on future's value does not mean the cache's dependent actions have completed. As the futures are executed asynchronously, the client can observe the result before the other thread runs the dependents. In your case that race causes the reloadedValue to be started before the firstGet could update the policy metadata. When the refresh completes, it validates that the entry was not modified since then and, if so, discards the refresh as an ABA conflict. This is done by comparing the write timestamps before and after refreshing, which due to the races are in conflict and the cache must be drop the new value to maintain linearizability (e.g. put(k, a), refresh(k), put(k, b) => get(k) == b).

The problems are exhausterated in application by the fact that dependent actions are processed in LIFO order. A slow action from the application will delay the cache's to be delayed, which could make this happen more frequently.

Ideally if we returned the cache's dependent future some of these problems would go away, but at the cost of whomever produced the future not receiving any downstream signals. The write conflict is a less pronounced race that I'd need some more time to revisit, but I don't think that would be easy to weaken without losing linearizability which is a more important to keep.

ben-manes added a commit that referenced this issue May 25, 2022
…714)

Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache uses a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead we can detect this case by seeing if the timestamp delta exceeds
our maximum value (120yrs). IF so, then we know this race occurred and
can trust the update. Of course intermediate writes might still have
occurred, so we have to check that our refresh future was sucessfully
unregistered, which would have been discarded in had any other write
occurred.
@ben-manes
Copy link
Owner

I might have found a refinement for the write timestamp conflict which will resolve the problem. We'll see if the test suite passes, as this is a complex area. If so, then the changes allow your test case to pass by being a little less aggressive in this race condition. However I am unsure if there is any way to write a reliable unit test, so it might remain best-effort.

Since a refresh is an optimistic read, you can still observe that different behavior if you replace the entry during its reload. In that case while the cache will discard the refresh, the future will return the new value. Here I am trying to be less aggressive by smarter detection of when we must discard it, but that will always be possible given the linearization requirements.

Hopefully I can make this a bit better for you, though.

ben-manes added a commit that referenced this issue May 25, 2022
…714)

Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache uses a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead we can detect this case by seeing if the timestamp delta exceeds
our maximum value (120yrs). IF so, then we know this race occurred and
can trust the update. Of course intermediate writes might still have
occurred, so we have to check that our refresh future was sucessfully
unregistered, which would have been discarded in had any other write
occurred.
@bschlie
Copy link
Author

bschlie commented May 25, 2022

Hi @ben-manes.

Thanks very much for the detailed description of the cause of this issue. I will try to revisit this issue when/if your latest change is applied.

As I understand, due to the nature of CompletableFuture there is currently no way around this. I understand that the problem is two folded:

  1. refresh() immediately after get() can cause the refreshed value to be discarded as the meta data from the first computation might not be set.
  2. get() immediately after refresh() might still provide me with the old value, as the meta data has yet to be updated. Subsequent get()'s might return the new value though when the cache has finished processing.

I which to have a cache with similar behavior to refreshAfterWrite, but I want to be able to manually discard/refresh the old value from the cache and be guaranteed that next time the key is requested, a new value is computed. Is there an alternative to my approach?

@ben-manes
Copy link
Owner

Yes, fundamentally we cannot work around this. I might be able to make it handle some edge cases better, but there will always be races that violate the initial assumption.

You can guarantee by invalidating or explicitly replacing the value. The refresh tries to optimistically load it while serving the latest copy, and then try to swap it if no changes occurred that would invalidate data consistency. This is good for hiding latency instead of an explicit removal and load (e.g. expiration), but it may have to discard the optimistic load. As reads are lock-free, you could use asMap().compute to reload and replace the value while blocking writers, which is likely acceptable. Would either of those work for you?

ben-manes added a commit that referenced this issue May 26, 2022
…714) [no ci]

Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache uses a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead we can detect this case by seeing if the timestamp delta exceeds
our maximum value (120yrs). IF so, then we know this race occurred and
can trust the update. Of course intermediate writes might still have
occurred, so we have to check that our refresh future was sucessfully
unregistered, which would have been discarded in had any other write
occurred.
@ben-manes
Copy link
Owner

My idea to detect and handle one of the race conditions is not strong enough to fix this case. Other races still exist and when I do get your test to pass, it is touchy enough to not be relied upon (e.g. must be in a debugger to ensure certain event ordering). Since the optimization only helps in a rare case and not fully, I don't think it is worth adding if it risks that I accidentally violate linearizability.

The idea was to detect if the original read was the infinite timestamp and that the current is the same future. If the latest timestamp was the same then no writes or if greater than the user allowed maximum then the metadata write back then it is okay, else an explicit ABA-style of write would be discarded. I am not positive that this would pass linearizability as I think an ABA could still happen in the delta case. That can't be detected because we discard the refresh future on a write, so a conditional removal for the future during the reload's callback would always fail as the metadata write discarded it.

Anyway, I won't be making any improvements here. You have to be explicit by removing or replacing the entry, which will ensure the new value is visible. This optimistic reload can't offer that strictness as it is best-effort so it has to discard in those racy cases.

@bschlie
Copy link
Author

bschlie commented May 27, 2022

Alright @ben-manes, thanks a lot for your effort. At least I got a better understanding of the issue. I will go with your other suggestions instead 👍

ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
ben-manes added a commit that referenced this issue May 29, 2022
Linearization means that the refresh should be dropped if another write
for that entry occurs, as we may populate the cache with stale data. The
cache used a write timestamp to help detect this. However, when this
could be too aggressive when a refresh runs concurrently with the async
load completing.

An inflight load is given an infinite timestamp so that it never expires.
A whenComplete callback then updates the entry with its policy metadata,
such as its expiration time. That changes the write timestamp and if a
refresh runs concurrently with the callback then the refresh may see the
infinite timestamp, the replace updates it, and the refresh drops the
entry as a conflict.

Instead, if the refresh is successfully unregistered within the entry's
compute to swap the value, then it can be assumed to be valid. Any other
write will discard the refresh under the entry's compute, so we retain
linearizability. This requires that the load callback to update the
metadata does not discard the refresh, since that is not needed for a
psuedo write.

The write timestamp trick for ABA detection is a layover from the
previous implementation (2.x) which did not promise linearization.

Note that a refresh is optimistic and races with other writes, so it
may be discarded. As it relies on callbacks to write back into the
cache, one cannot expect the entry to be updated after the reload
completes (e.g. see #714).
@ben-manes
Copy link
Owner

Thanks for the prompt to revisit this. I was able to improve the implementation to handle conflicts better and simplify away from the timestamp check. That doesn’t change our discussion, but did help make things better. 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants