Make a dedicated reclaim thread #802

Closed
ryao opened this issue Jul 3, 2012 · 4 comments

@ryao
Contributor

ryao commented Jul 3, 2012

Currently, the kernel calls arc_shrinker_func() whenever it needs to free memory. Unfortunately, concurrent threads operating under memory pressure can invoke this function simultaneously, which causes lock contention and causes the ARC to free excessive amounts of memory.

For example, it is possible for 5 threads to block on direct reclaim, run through their reclaim paths, and each conclude that ZFS needs to free 100MB of RAM. ZFS will then be invoked and the threads will be serialized, with each thread unblocking after 100MB of RAM has been freed. The total amount of memory freed will be 500MB, and an excessive amount of time will have been spent waiting.

My basic idea to resolve this is as follows. I propose that we rework ARC to maintain a dedicated shrinker thread that will sleep until signaled to free memory. When the kernel invokes the arc_shrinker_func() callback, the thread that invokes it will signal the shrinker thread, which will wake up and proceed to free memory. If additional threads signal it during this time, their requests will be aggregated, with all of them being notified when the maximum of their free memory requests has been freed. After they are notified, the shrinker thread will go back to sleep.

In the above scenario, only 100MB would be freed, all 5 threads would wait only as long as the first thread would have waited, and lock contention would decrease. My plan is to produce a patch for this when time permits, but I am posting this here so that it can be discussed in parallel.
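
A rough sketch of the idea, to make the discussion concrete. This is not a patch: the arc_reclaim_* names and arc_free_bytes() are placeholders invented here, mutex/condvar initialization is omitted, and the usual SPL primitives are assumed.

```c
/*
 * Sketch only -- not a patch.  arc_reclaim_* and arc_free_bytes() are
 * placeholder names; mutex/cv initialization is omitted.
 */
static kmutex_t arc_reclaim_lock;
static kcondvar_t arc_reclaim_cv;       /* wakes the reclaim thread */
static kcondvar_t arc_reclaim_done_cv;  /* wakes the waiting callers */
static uint64_t arc_reclaim_needed;     /* max of outstanding requests */

/*
 * Called from arc_shrinker_func(): record the request, wake the reclaim
 * thread, and sleep until the batch containing this request has been freed.
 */
static void
arc_reclaim_request(uint64_t bytes)
{
	mutex_enter(&arc_reclaim_lock);
	if (bytes > arc_reclaim_needed)
		arc_reclaim_needed = bytes;     /* aggregate by taking the max */
	cv_signal(&arc_reclaim_cv);
	while (arc_reclaim_needed != 0)
		cv_wait(&arc_reclaim_done_cv, &arc_reclaim_lock);
	mutex_exit(&arc_reclaim_lock);
}

/*
 * Dedicated thread: sleeps until signaled, frees once per aggregated batch,
 * then notifies every waiter before going back to sleep.
 */
static void
arc_reclaim_thread(void *unused)
{
	mutex_enter(&arc_reclaim_lock);
	for (;;) {
		uint64_t bytes;

		while (arc_reclaim_needed == 0)
			cv_wait(&arc_reclaim_cv, &arc_reclaim_lock);

		bytes = arc_reclaim_needed;
		arc_reclaim_needed = 0;
		mutex_exit(&arc_reclaim_lock);

		arc_free_bytes(bytes);  /* placeholder for the real eviction path */

		mutex_enter(&arc_reclaim_lock);
		cv_broadcast(&arc_reclaim_done_cv);
	}
}
```

Requests are aggregated by taking the maximum, and a waiter that arrives while a batch is already being freed simply waits through the next batch, so no caller returns before its request has been satisfied.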

@behlendorf
Contributor

Is this a problem you're able to reproduce? If so, how?

When I originally wrote arc_shrinker_func() I was careful to try to handle these concerns. For example, you'll notice that the function uses mutex_tryenter() instead of just mutex_enter(), so if the lock is contended we don't block. We just return -1 to the upper layers, which prompts the kernel to try to reclaim memory from another cache. Before this cache is tried again the allocation will be retried. So we shouldn't have excessive lock contention here.

Additionally, arc_shrinker_func() will never be called with a ->nr_to_scan value larger than 128 objects. The callback treats each object as a page, so that should limit it to 512K chunks on x86 architectures. So I would expect arc_shrinker_func() to adjust things fairly gradually.
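
Simplified, the pattern looks roughly like this when written against the plain Linux shrinker interface. The real callback goes through the SPL shrinker compatibility wrappers and differs in detail; arc_reclaim_lock and arc_evictable_pages() are stand-in names here.

```c
/*
 * Simplified illustration only; the real callback goes through the SPL
 * shrinker compatibility wrappers.  arc_reclaim_lock and
 * arc_evictable_pages() are stand-in names.
 */
static int
arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
{
	/* nr_to_scan == 0 means the kernel only wants an object count. */
	if (sc->nr_to_scan == 0)
		return (arc_evictable_pages());

	/*
	 * If the lock is contended, don't block: return -1 so the kernel
	 * moves on to another cache and retries the allocation later.
	 */
	if (!mutex_tryenter(&arc_reclaim_lock))
		return (-1);

	/*
	 * sc->nr_to_scan is capped at 128 objects and each object is
	 * treated as a page, so at most ~512K is reclaimed per call on a
	 * 4K-page x86 system.
	 */
	arc_kmem_reap_now();    /* stand-in for the actual eviction step */

	mutex_exit(&arc_reclaim_lock);
	return (arc_evictable_pages());
}
```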

I think that before we go designing a new solution for this we need to clearly identify exactly what is happening and why. It may be that the right fix is to create a dedicated thread for this, but we need to know more before diving into that. Also keep in mind that in the medium to long term I really want to back the ARC with page cache pages. When that happens we won't need to depend so heavily on the shrinkers to perform reclaim. We'll just drop page cache pages, which is much more in line with how the kernel is designed to work.

@ryao
Contributor Author

ryao commented Jul 3, 2012

The lags only occur under heavy memory pressure and result in things like a terminal or an X Windows console freezing temporarily. Reproducing it is a matter of loading the system heavily enough that it has to enter direct reclaim.

@DeHackEd and I discussed this issue in IRC. He found my patches to remove PF_MEMALLOC alleviated issue #676, which occurs on his system under heavy loads. Stabilizing the system exposed it to an extended period of memory pressure, which resulted in random lags. He found that the kernel still responded to interrupts during lags and used the Magic System Request key to get backtraces during them:

http://pastebin.com/pH7KyBKa

From the backtrace, threads appeared to be contending for locks, and my main thought was that preventing the contention would resolve the issue. It was late at night and I seem to have looked at the ARC code by mistake. Looking at this again, the contention involves slabs, not the ARC. Another thing he told me was that his system consistently had 1GB of RAM free despite direct reclaim running, which, combined with my mistake, led me to conjecture that the above scenario was occurring.

It would seem some variant of the above scenario occurs in the slab code. I am unfamiliar with it, so I will need to familiarize myself with the slab code before thinking of ways to improve it.

On the topic of arc_shrinker_func(), I am not certain that returning -1 to the upper layers is the right thing to do. If many other threads are trying to clear caches, they will continue to do locking elsewhere, and that locking will incur the cost of atomic instructions. It will also exacerbate the situation in the backtrace where many kernel threads try to shrink the slabs simultaneously. Having the threads block would minimize the number of atomic instructions used and reduce contention, which might be better.

@behlendorf
Contributor

We should keep this in mind, but I believe that once we integrate more tightly with the VFS page cache this will be less of an issue. With that in mind I'm moving this feature to the 0.7.0 milestone where that work will occur.

@ryao ryao mentioned this issue Jul 27, 2013
@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 6, 2014
@behlendorf behlendorf added the Bug - Minor label and removed the Type: Feature label Oct 6, 2014
@behlendorf
Contributor

Things are significantly improved after the multilist ARC support was merged. Closing.

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023