`asyncLoad` occasionally called with `Object` due to race between `refreshAfterWrite` and `expireAfterAccess` #715

wiggzz · 2022-05-24T16:31:27Z

We're seeing some ClassCastExceptions (in our code) after our cache loader asyncLoad seems to be invoked with an Object instead of the expected key type K. We are using the cache configured with refreshAfterWrite and expireAfterAccess and I suspect this may be happening because the expireAfterAccess will cause the cache Node to eventually be evicted and then die, which will set the Node key to DEAD_STRONG_KEY (new Object()). At the same time, the node is being refreshed, and so the asyncLoad method is called with the dead key.

I'll take a look if I can reproduce in a unit test, but wanted to open this issue to see if you had seen this before or if we are using it wrong.

Our cache is roughly configured like this

Caffeine.newBuilder()
        .refreshAfterWrite(10, TimeUnit.MINUTES)
        .expireAfterAccess(15, TimeUnit.MINUTES)
        .maximumSize(325_000)
        .recordStats(/* custom StatsCounter, not doing anything really special */ )
        .scheduler(forScheduledExecutorService(/* scheduled executor for the application */))
        .buildAsync(/* our loader, which does coalescing similar to the `CoalescingBulkLoader` */);

The text was updated successfully, but these errors were encountered:

ben-manes · 2022-05-24T17:38:47Z

Thanks! If you can put a unit test together that would be really great. It hasn't been reported before, but your description makes sense.

ben-manes · 2022-05-24T17:58:56Z

The logic to trigger the refresh reads the key but doesn't check afterwards that Node.isAlive() is true. The bug makes sense, testing could be hard due to races.

caffeine/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java

Lines 1265 to 1268 in fa552a5

    
           if (((now - writeTime) > refreshAfterWriteNanos()) && (keyReference != null) 
        
               && ((key = node.getKey()) != null) && ((oldValue = node.getValue()) != null) 
        
               && ((writeTime & 1L) == 0L) && !(refreshes = refreshes()).containsKey(keyReference) 
        
               && node.casWriteTime(writeTime, refreshWriteTime)) {

wiggzz · 2022-05-24T19:10:31Z

The logic to trigger the refresh reads the key but doesn't check afterwards that Node.isAlive() is true. The bug makes sense, testing could be hard due to races.

caffeine/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java

Lines 1265 to 1268 in fa552a5

if (((now - writeTime) > refreshAfterWriteNanos()) && (keyReference != null)

&& ((key = node.getKey()) != null) && ((oldValue = node.getValue()) != null)

&& ((writeTime & 1L) == 0L) && !(refreshes = refreshes()).containsKey(keyReference)

&& node.casWriteTime(writeTime, refreshWriteTime)) {

Yep that is exactly what I was thinking. And also agree testing is tricky. Need to get the node to be evicted while the read is happening. Could possibly hook into the StatsCounter to tick the ticker and then call cleanup all during a read?

wiggzz · 2022-05-24T19:16:54Z

Actually, this theory can't be 100% correct because the value is set to null in the same place that the key is set to the DEAD_STRONG_KEY. So the refresh logic shouldn't process the dead node just because the value should be null... 🤔

wiggzz · 2022-05-24T19:18:52Z

Ah - but it could be RETIRED_STRONG_KEY. Thinking out loud here.

ben-manes · 2022-05-24T19:19:47Z

right. Since those are two different fields so one could be set before the other is and cause it to pass the visibility check. Those nulls are mostly for weak/soft reference collection and not liveliness.

When a read triggers a refresh, it captures the key and value into local variables and performs validation checks. This should have included verifying that the entry is still alive (vs retired, dead). Otherwise, the key may be an internal sentinel and the refresh will fail. This has no harmful effects due to being logged and swallowed.

ben-manes · 2022-05-25T00:30:11Z

I was able to write a unit test using your idea of stats as a hook. I also made the sentinel keys proper classes so that it will be easier to debug in the future. Would you be able to canary 3.1.1-SNAPSHOT? If not, I can release and we can verify as a follow-up.

Thankfully this should be a benign bug since the ClassCastExceptions is logged, swallowed, and the entry left discarded by its removal. Other than log noise it shouldn't be causing any problems.

When a read triggers a refresh, it captures the key and value into local variables and performs validation checks. This should have included verifying that the entry is still alive (vs retired, dead). Otherwise, the key may be an internal sentinel and the refresh will fail. This has no harmful effects due to being logged and swallowed.

wiggzz · 2022-05-25T01:55:17Z

Interesting. In our case the error is not being fully swallowed by caffeine because we are coalescing the refreshes and normal loads. So once an invalid key is requested to be refreshed, it is added to a batch, then every other key that is requested with it also fails because of the class cast exception because the batch is polluted.

So in our case it is more than noise, although it's probably not having much of a customer impact since it is fairly infrequent.

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

ben-manes · 2022-05-26T04:11:58Z

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

Any updates? I'm all set on my side

Refresh cache with refreshAfterWrite doesn't update value #714 as won't do (working as intended)
Problem when adding Caffeine to a Tar in Gradle #716 fixed (weird gradle metadata stuff)

wiggzz · 2022-05-26T04:48:11Z

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

Any updates? I'm all set on my side

Refresh cache with refreshAfterWrite doesn't update value #714 as won't do (working as intended)

Problem when adding Caffeine to a Tar in Gradle #716 fixed (weird gradle metadata stuff)

I wasn't able to get to this today and will be on leave for a week, so feel free to cut the release and I will followup when I get a chance.

ben-manes · 2022-05-26T04:48:59Z

thanks, will do!

ben-manes · 2022-05-26T18:54:40Z

Released 3.1.1

wiggzz · 2022-06-17T02:16:13Z

@ben-manes Hey sorry for the delayed follow up on this. We finally got this out into production and it's fixed the issue. Thanks for the quick resolution!

ben-manes · 2022-06-17T02:17:00Z

awesome 🥳

ben-manes closed this as completed in 5a31bcb May 25, 2022

ben-manes mentioned this issue May 25, 2022

Problem when adding Caffeine to a Tar in Gradle #716

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`asyncLoad` occasionally called with `Object` due to race between `refreshAfterWrite` and `expireAfterAccess` #715

`asyncLoad` occasionally called with `Object` due to race between `refreshAfterWrite` and `expireAfterAccess` #715

wiggzz commented May 24, 2022

ben-manes commented May 24, 2022 •

edited

Loading

ben-manes commented May 24, 2022

wiggzz commented May 24, 2022

wiggzz commented May 24, 2022

wiggzz commented May 24, 2022

ben-manes commented May 24, 2022

ben-manes commented May 25, 2022

wiggzz commented May 25, 2022

ben-manes commented May 26, 2022

wiggzz commented May 26, 2022

ben-manes commented May 26, 2022

ben-manes commented May 26, 2022

wiggzz commented Jun 17, 2022

ben-manes commented Jun 17, 2022

asyncLoad occasionally called with Object due to race between refreshAfterWrite and expireAfterAccess #715

asyncLoad occasionally called with Object due to race between refreshAfterWrite and expireAfterAccess #715

Comments

wiggzz commented May 24, 2022

ben-manes commented May 24, 2022 • edited Loading

ben-manes commented May 24, 2022

wiggzz commented May 24, 2022

wiggzz commented May 24, 2022

wiggzz commented May 24, 2022

ben-manes commented May 24, 2022

ben-manes commented May 25, 2022

wiggzz commented May 25, 2022

ben-manes commented May 26, 2022

wiggzz commented May 26, 2022

ben-manes commented May 26, 2022

ben-manes commented May 26, 2022

wiggzz commented Jun 17, 2022

ben-manes commented Jun 17, 2022

`asyncLoad` occasionally called with `Object` due to race between `refreshAfterWrite` and `expireAfterAccess` #715

`asyncLoad` occasionally called with `Object` due to race between `refreshAfterWrite` and `expireAfterAccess` #715

ben-manes commented May 24, 2022 •

edited

Loading