
Unexpected expireAfterWrite behavior when computation of entry duration > expireAfterWrite value #191

Closed
perlambaek opened this issue Oct 10, 2017 · 9 comments


@perlambaek

We are experiencing some behavior we cannot really make sense of. When running the following test, we expect:

  • For the first 2 seconds, cache entries are returned instantly (until expireAfterWrite is reached).
  • Thereafter, each of the 50 threads takes 3 seconds to get a new entry (since recomputing now takes 3 seconds).
  • For the next 2 seconds, entries are again returned instantly (since a new entry was just written and is valid for the expireAfterWrite duration).
  • Rinse and repeat.

Instead we see non-deterministic behavior, with threads waiting anywhere from 3 to 12 seconds with no real pattern. The output of the test shows the timings drifting off fairly quickly.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.google.common.base.Stopwatch;
import org.junit.Test;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CacheTest {
    int count = 0;

    @Test
    public void caffeine() throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();

        // Loader: the first load returns immediately; every subsequent load sleeps
        // 3 seconds, i.e. longer than the 2-second expireAfterWrite used below.
        com.github.benmanes.caffeine.cache.CacheLoader<String, String> loadingFunc = (String key) -> {
            count++;
            if (count > 1) {
                Thread.sleep(3000);
            }
            return "foo";
        };


        LoadingCache<String, String> testCache = Caffeine.newBuilder()
                                                         .executor(executor)
                                                         .expireAfterWrite(2, TimeUnit.SECONDS)
                                                         .build(loadingFunc);

        // 50 reader threads each request the same key every 100 ms and print how long the lookup took.
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int i = 0; i < 50; i++) {
            pool.execute(() -> {
                while (true) {
                    Stopwatch stopwatch = Stopwatch.createStarted();
                    testCache.get("");
                    stopwatch.stop();
                    System.out.println("normal: " + stopwatch.elapsed(TimeUnit.MILLISECONDS));
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException ignored) {
                    }

                }
            });
        }

        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
@ben-manes
Owner

ben-manes commented Oct 10, 2017

My initial guess is that it is due to the compute time exceeding the expiration time. Currently the clock is read once to minimize expensive calls to System.nanoTime(); that single reading is used both to decide whether the cached value has expired and as the writeTime of the newly computed entry. Thus your threads may have to recompute immediately, which causes the next one in line to recompute as well. A chain of C->B->A would lead to ~9 seconds, and printouts that vary by ~3s depending on how long the chain was.

We could re-read the current time on the write so the entry isn't penalized for wait time and computation. For loads that are quick relative to the expiration time it doesn't matter; for very tight cases like yours it does. But I am also unsure how common that usage is and, if it does occur, whether it isn't already problematic for other reasons. It would be nice to understand the use-case better.
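As a rough illustration of the chain described above (a hedged sketch of the timing only, not the actual Caffeine internals): the stored write time is the clock value read before the load started, so a 3-second load under a 2-second expireAfterWrite produces an entry that is already expired by the time it becomes visible.

import java.util.concurrent.TimeUnit;

public class ExpiredOnArrival {
    public static void main(String[] args) {
        long expireAfterWrite = TimeUnit.SECONDS.toNanos(2);
        long loadStart = 0;                                // clock read once, at the start of the load
        long loadDuration = TimeUnit.SECONDS.toNanos(3);   // the 3-second sleep in the test above
        long writeTime = loadStart;                        // timestamp recorded for the new entry
        long firstReadAfterLoad = loadStart + loadDuration;
        // The entry is already older than expireAfterWrite when it is installed, so the
        // next reader starts another 3-second load and the delays chain together.
        System.out.println(firstReadAfterLoad - writeTime >= expireAfterWrite); // true
    }
}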

@perlambaek
Author

So we kind of stumbled on this while trying to debug an issue where our caches seem to get stuck, unable to ever really return any value. We are not entirely sure this is the cause, but it still felt odd; we will do more testing around expireAfterWrite and see if the issue persists.

As for the use case: we have some work that is normally pretty fast (<1 second), but during certain peak times it may suddenly take up to 90 seconds to complete instead of 1 second, due to unavailable database connections, I/O and so on. This makes us exceed our expireAfterWrite (not by design, and not something we want), even though 99.999% of the time we are well within it. My guess is that we then end up with a whole chain of threads wanting to recompute the value, bringing our system to a grinding halt, as no cache value ever really seems to be produced in that case. (We are taking steps to reduce the 90 seconds, as that is also not desired behavior, but the contract of expireAfterWrite still seems broken.)

We often support peak loads of >500 pageviews/sec, so having 50 threads hitting the caches at a 100 ms interval is not at all unusual.

I am not familiar with the inner workings of this, but it seems to me that when one thread finishes recomputing, all others waiting for the result should be given that value regardless of whether the expiration has been reached: the value was just produced, so it cannot have exceeded expireAfterWrite unless the duration was effectively 0. Subsequent reads would then see a correct writeTime if it were set at the end of the computation.

Please let me know if you need more details or something is unclear.

@ben-manes
Owner

I think that is fair. I don't know if I will get a chance to walk through the code during the week, but will try to get a release for you over the weekend.

If you use an AsyncLoadingCache then the behavior will be as you expect. Is that an okay short-term fix? You merely change build(loadingFunc) to buildAsync(loadingFunc).synchronous().

The async cache stores a future that subsequent callers block on, and the write time is set when it completes, so the entry doesn't expire while in-flight like you are seeing now. If we set it the same way in the synchronous case too, then a waiter or subsequent read will get the full duration as expected. It only impacts narrow cases like yours, but the cost is tiny, so I just need to work out the tests to assert the change.
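For reference, a minimal sketch of that workaround applied to the test above (executor, loadingFunc, and the 2-second expiry are taken from the reporter's test; only the build call changes):

        LoadingCache<String, String> testCache = Caffeine.newBuilder()
                                                         .executor(executor)
                                                         .expireAfterWrite(2, TimeUnit.SECONDS)
                                                         .buildAsync(loadingFunc)  // waiters share one in-flight future
                                                         .synchronous();           // blocking LoadingCache view

Since the async cache records the write time when the future completes, the freshly loaded value keeps its full 2-second lifetime even when the load itself takes longer than that.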

@perlambaek
Author

Yea, I think we can work around it, so no rush on our account. Thank you very much for the swift response, and for a great library!

@ben-manes
Owner

Sorry that I didn't get a chance to work on this over the weekend. Hopefully I'll be able to catch up during my commute this week. Thanks for being patient.

@ben-manes
Owner

Really sorry that I'm not on top of this. I usually try to have a one-week turnaround on bugs. It's just been hectic lately and hard to decompress enough to switch contexts.

@ben-manes
Owner

I found a little time Sunday night and I have the fix passing your test. I'll try to work on the unit tests on the train tomorrow. Again, apologies for not being able to work on this more promptly.

@ben-manes
Owner

I stole some time at work to wrap this up. I'll release this evening if the CI passes.

ben-manes added a commit that referenced this issue Oct 30, 2017
Previously, the current time was read once at the beginning of the
computation. This was used to determine if the entry had expired and,
when computed, the associated timestamp. For long computations and short
expiration times, this could result in the newly computed entry being
expired. A chain of computations would build up, each with timestamps
older than the last due to waiting for its predecessor.

Now the ticker is read again after the computation completes and the
timestamps are set. This means there are two reads on a computation,
but only 1 in the happy path of a cache hit. Note that obtaining the
time can be an expensive operation, so minimizing this is useful for
high loads.

Updated dependencies and fixed issues discovered by Spotbugs and
ErrorProne.
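A hedged sketch of the change described in this commit message (illustrative only, not the actual Caffeine source): the ticker is read a second time once the loader returns, and that later value becomes the entry's write time.

import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class WriteTimeSketch {
    // Previous behavior: one clock read, taken before the load, reused as the write time.
    static long singleRead(LongSupplier ticker, Runnable load) {
        long startTime = ticker.getAsLong();
        load.run();
        return startTime;
    }

    // Fixed behavior: re-read the clock after the load so the entry gets its full lifetime.
    static long reRead(LongSupplier ticker, Runnable load) {
        ticker.getAsLong();               // pre-load read, still used for the expiration check
        load.run();
        return ticker.getAsLong();        // post-load read, stored as the entry's write time
    }

    public static void main(String[] args) {
        long[] clock = {0};
        Runnable slowLoad = () -> clock[0] += TimeUnit.SECONDS.toNanos(3); // each load advances time by 3s
        long oldWriteTime = singleRead(() -> clock[0], slowLoad); // 0s: already older than a 2s expiry
        long newWriteTime = reRead(() -> clock[0], slowLoad);     // 6s: the moment this load finished
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(oldWriteTime) + " vs "
                + TimeUnit.NANOSECONDS.toSeconds(newWriteTime));  // prints: 0 vs 6
    }
}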
@ben-manes
Owner

Released. Thank you for being patient and apologies for being slow to resolve this.
