Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid memory allocations in the ARC eviction thread #12985

Merged
merged 1 commit into from
Jan 21, 2022

Conversation

markjdb
Copy link
Contributor

@markjdb markjdb commented Jan 17, 2022

This change mitigates a low-memory deadlock observed on FreeBSD, where the ARC eviction thread fails to allocate memory and sleeps, blocking the page reclamation thread.

Motivation and Context

The change removes some memory allocations from the ARC eviction thread, thus helping ensure that arc_wait_for_eviction() won't block a thread that's attempting to reclaim pages on behalf of the rest of the system.

Description

The change modifies arc_init() to pre-allocate a set of marker buffer headers for use as placeholders in arc_evict_state(). When the ARC eviction thread calls arc_evict_state(), it uses the pre-allocated markers rather than performing a blocking memory allocation.

How Has This Been Tested?

I was able to reproduce the deadlock occasionally using parallel pgbench and FreeBSD kernel builds to incur memory pressure. With this patch I haven't been able to trigger any hangs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Signed-off-by: Mark Johnston <[email protected]>
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agree that memory allocations in evict thread is a bad idea. I don't see any problems in this patch, but generally in perspective I'd really like eviction to be multi-threaded, since single thread is a known bottleneck. That would require some changes to this code, but that may be handled later.

@behlendorf
Copy link
Contributor

If we can avoid the allocation here, I agree that would be a good thing. On a related note, we may have avoided this particular deadlock on Linux because of the spl_fstrans_mark() / spl_fstrans_unmark() calls in arc_evict_cb(). They're wired up on Linux to allow us to mark critical sections where we need to ensure the memory allocator won't ask the VFS layer to free up memory to satisfy the allocation (which might deadlock). I'm not sure if there's a need for this on FreeBSD, but I thought it worth mentioning.

Copy link
Contributor

@gamanakis gamanakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the proposed change, don't see any problems with it, thank you. As mentioned in #12987, it seems to help with the arc_evict and arc_prune contention when there is pressure on the ARC.

@markjdb
Copy link
Contributor Author

markjdb commented Jan 21, 2022

If we can avoid the allocation here, I agree that would be a good thing. On a related note, we may have avoided this particular deadlock on Linux because of the spl_fstrans_mark() / spl_fstrans_unmark() calls in arc_evict_cb(). They're wired up on Linux to allow us to mark critical sections where we need to ensure the memory allocator won't ask the VFS layer to free up memory to satisfy the allocation (which might deadlock). I'm not sure if there's a need for this on FreeBSD, but I thought it worth mentioning.

I think the larger difference is that FreeBSD doesn't do direct reclaim: if a page allocation fails, the allocating thread must either handle the failure or sleep waiting for system threads (the "page daemon") to reclaim memory. That approach has some advantages - it makes system behaviour in low memory conditions much easier to reason about for one - but it does mean that threads responsible for reclaiming memory have to be disciplined and avoid memory allocations (or pre-allocate memory reserves to avoid deadlocks). This is easy to handle within the page daemon, but gets tricky once the page daemon has to request work from other threads (the ARC eviction thread in this case; system I/O threads if the page daemon is writing dirty pages to swap; etc.). For memory reclamation purposes, FreeBSD doesn't maintain a real distinction between anonymous pages and file-backed pages the way Linux does, so there's no real way to implement spl_fstrans_mark() in any case.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jan 21, 2022
@behlendorf behlendorf merged commit 6e2a591 into openzfs:master Jan 21, 2022
markjdb added a commit to markjdb/zfs that referenced this pull request Feb 1, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
tonyhutter pushed a commit that referenced this pull request Feb 3, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #12985
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 22, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 22, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 23, 2022
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.

This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
   arc_wait_for_eviction()

In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.

Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time.  When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#12985
bsdjhb pushed a commit to CTSRD-CHERI/zfs that referenced this pull request Jul 13, 2023
This is not associated with a specific upstream commit but apparently
a local diff applied as part of:

commit e92ffd9b626833ebdbf2742c8ffddc6cd94b963e
Merge: 3c3df3660072 17b2ae0
Author: Martin Matuska <[email protected]>
Date:   Sat Jan 22 23:05:15 2022 +0100

    zfs: merge openzfs/zfs@17b2ae0b2 (master) into main

    Notable upstream pull request merges:
      openzfs#12766 Fix error propagation from lzc_send_redacted
      openzfs#12805 Updated the lz4 decompressor
      openzfs#12851 FreeBSD: Provide correct file generation number
      openzfs#12857 Verify dRAID empty sectors
      openzfs#12874 FreeBSD: Update argument types for VOP_READDIR
      openzfs#12896 Reduce number of arc_prune threads
      openzfs#12934 FreeBSD: Fix zvol_*_open() locking
      openzfs#12947 lz4: Cherrypick fix for CVE-2021-3520
      openzfs#12961 FreeBSD: Fix leaked strings in libspl mnttab
      openzfs#12964 Fix handling of errors from dmu_write_uio_dbuf() on FreeBSD
      openzfs#12981 Introduce a flag to skip comparing the local mac when raw sending
      openzfs#12985 Avoid memory allocations in the ARC eviction thread

    Obtained from:  OpenZFS
    OpenZFS commit: 17b2ae0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Memory Management kernel memory management Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants