
Expiry javadoc is a bit misleading #217

Closed
facboy opened this issue Jan 31, 2018 · 19 comments


@facboy
Contributor

facboy commented Jan 31, 2018

The javadoc for com.github.benmanes.caffeine.cache.Expiry states:

To indicate no expiration an entry may be given an excessively long period, such as Long#MAX_VALUE.

This isn't strictly accurate: if you use Long.MAX_VALUE, the expiry (now + duration) overflows and you end up with a negative eviction time, i.e. the entry gets evicted almost immediately.

It looks like this scenario is guarded against in the Async code by restricting the effective duration to (Long.MAX_VALUE >> 1), but this isn't done in com.github.benmanes.caffeine.cache.BoundedLocalCache.
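To make the arithmetic concrete, a minimal sketch of the failure mode (illustrative only, not Caffeine's actual internals):

long now = System.nanoTime();              // assume a positive reading
long evictionTime = now + Long.MAX_VALUE;  // wraps around to a negative value

// A naive absolute comparison now treats the entry as already expired:
boolean naiveExpired = evictionTime < now;          // true after the wrap

// The overflow-tolerant form survives the wrap when both readings come
// from the same thread's timeline:
boolean tolerantExpired = now - evictionTime >= 0;  // false: remaining time is Long.MAX_VALUE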

@ben-manes
Owner

Sorry about that. We should add some unit tests against this then.

The time calculation should be tolerant of overflows by following the advice in System.nanoTime. It should use t1 - t0 < 0 instead of t1 < t0 to avoid sign assumptions. It's an easy mistake to make, so if I made it again then we should fix it. Otherwise, an overflow should be okay.
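For reference, the idiom from the System.nanoTime() javadoc looks like this (a sketch; the work in between is a placeholder):

long t0 = System.nanoTime();
// ... some work happens between the two readings ...
long t1 = System.nanoTime();

boolean t1BeforeT0 = t1 - t0 < 0;  // overflow-tolerant ordering check
boolean broken = t1 < t0;          // sign assumption; misbehaves across a wrap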

If I recall correctly, the Async situation is slightly different. It is used as a signal that the future is in-flight and should not be evaluated for expiration. Since the future can complete at any time, it is helpful to have that flag separated for querying the future's state. Capping the expiry that a user can set lets us differentiate between the states. But I don't recall whether that safeguard was necessary, as there is a lot of mental overhead to keep track of these details.

Can you write a unit test so we can fix your issue?

@ben-manes
Owner

Hi @facboy,

I did a quick sanity check and was unable to reproduce this issue. We'll keep digging and make sure there is test coverage. Did you observe this bug or was it a mistake when reading the JavaDoc?

@facboy
Contributor Author

facboy commented Feb 4, 2018

sorry i've not had a chance to write a unit test yet. i observed the bug.

@ben-manes
Owner

In case this helps, here is my passing test. Hopefully you can tweak it to fail and we can understand the problem.

// Imports assumed for this snippet: JUnit 4, Hamcrest, and Guava testlib's FakeTicker.
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.junit.Assert;
import org.junit.Test;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import com.google.common.testing.FakeTicker;

@Test
public void issue217() {
  FakeTicker ticker = new FakeTicker();
  ticker.advance(ThreadLocalRandom.current().nextLong());
  Cache<Integer, Integer> cache = Caffeine.newBuilder()
      .ticker(ticker::read)
      .executor(Runnable::run)
      .expireAfter(new Expiry<Integer, Integer>() {
        @Override public long expireAfterCreate(Integer key, Integer value, long currentTime) {
          return Long.MAX_VALUE;
        }
        @Override public long expireAfterUpdate(Integer key, Integer value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
        @Override public long expireAfterRead(Integer key, Integer value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }})
      .removalListener((k, v, c) -> {
        Assert.fail(k + " " + v + " " + c);
      })
      .build();
  cache.put(1, 2);
  assertThat(cache.getIfPresent(1), is(2));
  assertThat(cache.estimatedSize(), is(1L));

  ticker.advance(1, TimeUnit.MINUTES);
  assertThat(cache.getIfPresent(1), is(2));
  assertThat(cache.estimatedSize(), is(1L));
}

@facboy
Contributor Author

facboy commented Feb 5, 2018

well this is pretty annoying, i cannot reproduce it in a unit test, so i started wondering if i'd imagined the whole thing. back in our production code, when i use Long.MAX_VALUE it definitely breaks, so i'll have to whittle back from there.

@facboy
Contributor Author

facboy commented Feb 5, 2018

ah, the simple approach: https://github.com/facboy/caffeine-test

@ben-manes
Owner

Thanks! That is very odd, but something we can debug with. Interestingly, it takes 1.5M operations to fail, though this is using real time instead of a fake source, so that may be understandable. Since my test suite primarily uses TestNG, here is a slight adaptation to run there.

@rash67 wants to get his hands dirty, so hopefully he can take a look. If not, I'll try to resolve it over the weekend.

// Imports assumed for this snippet: TestNG, Hamcrest, and Guava's Stopwatch.
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.sameInstance;

import java.util.concurrent.TimeUnit;

import org.testng.annotations.Test;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.google.common.base.Stopwatch;

@Test
@SuppressWarnings({"unchecked", "RedundantStringConstructorCall", "InfiniteLoopStatement"})
void maxExpiryShouldNotExpireImmediately() {
  LoadingCache<Integer, String> cache = Caffeine.newBuilder()
      .expireAfter(new Expiry<Integer, String>() {
        @Override
        public long expireAfterCreate(Integer key, String value, long currentTime) {
          return Long.MAX_VALUE;
        }
        @Override
        public long expireAfterUpdate(Integer key, String value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
        @Override
        public long expireAfterRead(Integer key, String value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
      })
      .removalListener((k, v, c) -> {
        //Assert.fail(k + " " + v + " " + c);
      }).build(key -> new String("10"));

  int counter = 0;
  String expected = cache.get(10);
  Stopwatch stopwatch = Stopwatch.createStarted();
  while (stopwatch.elapsed(TimeUnit.MINUTES) < 1) {
    String val = cache.get(10);
    assertThat("Failed after " + counter + " iterations", val, sameInstance(expected));
    counter++;
  }
}

@facboy
Contributor Author

facboy commented Feb 9, 2018

I don't think it's related, but is there also a bug with the Expiry.expireAfterUpdate behaviour? Perhaps it's not a supported use-case, but I expect an entry to expire (i.e. be evicted and invoke the removalListener) after the delay returned from Expiry.expireAfterUpdate() has passed. It seems that this doesn't happen if I do not further access the cache. I didn't follow all the ins-and-outs of it, but it seems that Caffeine.expiresAfterWrite() doesn't return true if Caffeine.expireAfter(Expiry) is used, so no eviction is scheduled on a write.

Possibly the error is here in the generated LocalCacheFactory:

    if (builder.expiresAfterAccess() || builder.expiresVariable()) {
      sb.append('A');
    }
    if (builder.expiresAfterWrite()) {
      sb.append('W');
    }

Should it append 'W' if builder.expiresVariable() is true as well?

@ben-manes
Owner

To expire without any interaction with the cache would require a thread to schedule on. When Guava's Cache was written, some environments like JavaEE and AppEngine disallowed creating threads. A dedicated thread is a fairly high cost when promptness is typically not necessary.

Java 9 added a JVM-wide scheduler (via CompletableFuture), which we might leverage as an option. I am way behind on a JDK9 version, unfortunately.

@facboy
Contributor Author

facboy commented Feb 9, 2018

oh i see, i thought that was what the timerWheel did but I didn't look at it in great detail. i guess the wheel only advances on interaction with the cache?

EDIT: doesn't it schedule evictions etc using the ForkJoinPool?

@ben-manes
Owner

ben-manes commented Feb 9, 2018

Yes, the timerwheel is a passive data structure. There is a nice article regarding Kafka's implementation, which schedules on a thread. They combine the timerwheel with a ScheduledExecutorService (heap-based) and schedule the bucket instead of the entry. This way the heap is bounded to O(lg # of buckets) instead of O(lg # of timers), which in our case would be at most O(lg 165) ~= 7.

@facboy
Contributor Author

facboy commented Feb 13, 2018

would be interesting if this were added as an option :). for my current use-case i don't need them to expire in a particularly timely manner, so i've just added an external background thread that invokes cleanUp() periodically, along the lines of the sketch below.
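A minimal sketch of that workaround, assuming a cache variable is in scope (the interval and thread name are arbitrary):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically trigger the cache's maintenance work from a daemon thread.
ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor(runnable -> {
  Thread thread = new Thread(runnable, "caffeine-cleanup");
  thread.setDaemon(true);
  return thread;
});
cleaner.scheduleAtFixedRate(cache::cleanUp, 30, 30, TimeUnit.SECONDS);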

@ben-manes
Owner

#195 is basically asking for this. It has definitely come up many times for Guava, etc. The cost of maintaining dedicated threads is high for a library, and while we can use the FJP in JDK8, it doesn't support any scheduling semantics.

The hope is that we can optionally enable using CompletableFuture#delayedExecutor to trigger a cleanUp() on the next expiration event. This uses a global JDK-wide scheduling thread, so we do not impose our own. We might want to offer a simple Scheduler abstraction for unit tests, etc. (like we did with Ticker for time). I don't know what we'd call the enablement of this feature, as it would be off by default.

That's the idea of how to do it properly, at least. I am pretty behind (on this bug fix and v3.0 release in general).
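As a rough sketch of that idea (not Caffeine's API; the delay value is illustrative and would come from the cache's next expiration event, with cache assumed in scope):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

long delayNanos = TimeUnit.SECONDS.toNanos(5);  // illustrative; from the timer wheel in practice

// JDK 9+: borrow the JVM-wide scheduling thread instead of owning one.
Executor afterDelay = CompletableFuture.delayedExecutor(delayNanos, TimeUnit.NANOSECONDS);
afterDelay.execute(cache::cleanUp);  // sweep expired entries when the delay elapses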

@ben-manes
Owner

@facboy Can you try testing with .executor(Runnable::run) added to the builder? It doesn't fail for me then.

I think the issue might be that threads read different timestamps when calling System.nanoTime(). When enough reads occur, they are replayed on the policy asynchronously. Then it somehow overflows and goes negative. I haven't debugged more than that, but it seems like a clock synchronization issue if my theory holds.

@facboy
Contributor Author

facboy commented Feb 20, 2018

hmm. so it fixes the test-case i gave you, but it doesn't fix the 'actual' problem i was having, which is still in private code. if it helps, this is the partial stack trace of the removalListener when it is expired (the reason cause[0] is EXPIRED):

	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$notifyRemoval$1(BoundedLocalCache.java:283)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:289)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2080)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2010)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:65)

there might be an additional issue. the code where I first observed this (i.e. not the test-case) is not doing a lot of reads... probably in the low 10s at most, as it's a small integration test.

@ben-manes
Owner

It's likely that the async thread has an older timestamp and the overflow occurs since it's a tight boundary. I just need to refamiliarize myself with the details. I think that the solution you gave will be the right fix, though.

ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calculation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
@ben-manes
Owner

I generalized the test to a pass/fail scenario, showing that cross-thread concurrency was the cause. When the read buffer is full (~16 reads per stripe), the async maintenance is triggered. The entry continues to be read, bumping its expiration time beyond the overflow point. This means that the maintenance thread's currentTime read is less than the entry's currentTime read, so t1 - (t0 + duration) will overflow with Long.MAX_VALUE. If all operations use a single thread, this cannot occur because program order yields monotonic timestamps. Since the unit tests default to same-thread cleanup for simplicity, it was an oversight on my part.

I updated the internals to always enforce a Long.MAX_VALUE / 2 cap, as you suggested, to provide enough leeway for the overflow not to occur in practical code.
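Numerically, the race looks like this (timestamps contrived for illustration):

long t0 = 101;  // entry's expiration base, stamped by a later user read
long t1 = 100;  // maintenance thread's slightly older nanoTime reading

long expirationTime = t0 + Long.MAX_VALUE;  // wraps negative
long remaining = expirationTime - t1;       // Long.MIN_VALUE: treated as expired

// With the cap, the skew is absorbed and nothing wraps:
long cappedRemaining = (t0 + Long.MAX_VALUE / 2) - t1;  // positive: entry survives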

I'll make a release tomorrow and will let the CI do its thing tonight. Thanks for reporting the issue.

@facboy
Contributor Author

facboy commented Feb 21, 2018

ah great, thanks for that!

@ben-manes
Owner

Released. Thanks!
