Lock contention on arcs mtx #3481

Closed

behlendorf
Contributor

The finalized patch stack for the ARC mutex lock contention changes.

dweeezil and others added 11 commits June 10, 2015 09:03
…loc"

This reverts commit 16fcdea in preparation
for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates
"marker" within the ARC code.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
This reverts commit 037763e in
preparation for the illumos 5497 "lock contention on arcs_mtx" patch
which includes a fix for this very problem.

ZoL had picked up a subset of the illumos 5497 patch to deal with the
l2arc compression buffer leak.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates
the need for this.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
This reverts only the l2arc_hdr part of commit
ecf3d9b in preparation for the illumos
5497 "lock contention on arcs_mtx" patch which does the same thing
but uses the newer two-level ARC structure following the Illumos 5408
"managing ZFS cache devices requires lots of RAM" patch.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Richard Lowe <[email protected]>

Porting notes:

ZoL has moved some ARC definitions into arc_impl.h.

Signed-off-by: Brian Behlendorf <[email protected]>
Ported by: Tim Chase <[email protected]>
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Don Brady <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Approved by: Garrett D'Amore <[email protected]>

Porting notes:

Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:

    6e1d727
    Fix inaccurate arcstat_l2_hdr_size calculations

        The ARC_SPACE_HDRS constant no longer exists and has been
        somewhat equivalently replaced by HDR_L2ONLY_SIZE.

    e0b0ca9
    Add visibility in to cached dbufs

        The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
        struct requires additional structure member names to be used
        when referencing the inner items.  Also, the presence of L1 or L2
        inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
        macros.

Ported by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
5497 lock contention on arcs_mtx
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>

Porting notes and other significant code changes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.

The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.

Notable ZoL commits which conflicted with this patch or whose effects
are either duplicated or undone by this patch:

    302f753 - Integrate ARC more tightly with Linux
    39e055c - Adjust arc_p based on "bytes" in arc_shrink
    f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
    77765b5 - Remove "arc_meta_used" from arc_adjust calculation
    94520ca - Prune metadata from ghost lists in arc_adjust_meta

Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:

    fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
    fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
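
The "numsublists" and "sublistidx" fields in the trace output reflect the idea behind a multilist: one logical list split into several sub-lists, each protected by its own lock, so that concurrent inserts and removals rarely contend. The following is a minimal Python sketch of that idea, not the ZFS implementation. The class name, the `index_func` parameter, and the modulo-based sub-list selection are assumptions made for illustration.

```python
import threading


class ToyMultiList:
    """Toy model of a multilist: N sub-lists, each with its own lock."""

    def __init__(self, num_sublists, index_func=hash):
        self.sublists = [[] for _ in range(num_sublists)]
        self.locks = [threading.Lock() for _ in range(num_sublists)]
        # Hypothetical hook: maps an object to an integer used to pick
        # a sub-list. The real code chooses its own index function.
        self.index_func = index_func

    def _index(self, obj):
        return self.index_func(obj) % len(self.sublists)

    def insert(self, obj):
        idx = self._index(obj)
        with self.locks[idx]:  # only this one sub-list is locked
            self.sublists[idx].append(obj)
        return idx

    def remove(self, obj):
        idx = self._index(obj)
        with self.locks[idx]:
            self.sublists[idx].remove(obj)
        return idx
```

Because each operation takes only the lock of the sub-list it touches, two threads working on objects that map to different sub-lists do not block each other.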

The following arcstats have been removed:

    recycle_miss - Used by arcstat.py and arc_summary.py, both of which
    have been updated appropriately.

    l2_writes_hdr_miss

The following arcstats have been added:

    evict_not_enough - Number of times arc_evict_state() was unable to
    evict enough buffers to reach its target amount.

    evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
    because the header was being written to the l2arc.

    l2_writes_lock_retry - Replaces l2_writes_hdr_miss.  Number of times
    l2arc_write_done() failed to acquire hash_lock and had to retry.

    arc_meta_min - Shows the value of the zfs_arc_meta_min module
    parameter (see below).

The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme.  Additional
multilist-related stats could be added in the future but would likely
require extensions to the multilist API.

The following module parameters have been added:

    zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
    before moving on to the next sub-list.

    zfs_arc_meta_min - Enforce a floor on the amount of metadata in
    the ARC.

    zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
    ARC state.

    zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
    the target size to be considered "overflowing".
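
As a rough illustration of how a shift-based threshold such as zfs_arc_overflow_shift could work, the sketch below treats the ARC as "overflowing" once its size exceeds the target by at least `target >> shift`. The function name, the default shift of 8, and the floor of one maximum-size block are assumptions made for this example and should be checked against the actual source. The 16MiB block size is inferred from the "32MiB (2 * SPA_MAXBLOCKSIZE)" figure given elsewhere in this patch stack.

```python
SPA_MAXBLOCKSIZE = 16 * 1024 * 1024  # 16 MiB, inferred from 32MiB = 2 * SPA_MAXBLOCKSIZE


def is_overflowing(arc_size, arc_c, overflow_shift=8):
    """Return True if arc_size exceeds the target arc_c by the allowed margin.

    Assumed rule: the margin is arc_c >> overflow_shift, but never less
    than one maximum-size block.
    """
    overflow = max(SPA_MAXBLOCKSIZE, arc_c >> overflow_shift)
    return arc_size >= arc_c + overflow
```

Under these assumptions, a larger shift makes the margin smaller, so the ARC is considered overflowing sooner.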

Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
SPL commit behlendorf/spl@9cef1b5 adds the taskq_wait_outstanding()
interface.  See the commit log for the full justification for this
addition.  This patch adds the required user space counterpart.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Replace taskq_wait() with taskq_wait_outstanding().  This way callers
will only block until previously submitted tasks have been completed.
This was the behavior of taskq_wait() prior to the introduction
of taskq_wait_outstanding(), so this isn't really a functional change
for these callers.
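
The difference between the two wait semantics can be shown with a small, single-threaded Python model. This is a toy, not the SPL interface: the class and method names are hypothetical, the caller runs the tasks itself instead of blocking on worker threads, and `max_steps` exists only so the demonstration terminates.

```python
from collections import deque


class ToyTaskq:
    """Deterministic toy task queue used to contrast two wait semantics."""

    def __init__(self):
        self.pending = deque()  # (task_id, func) in submission order
        self.next_id = 1

    def dispatch(self, func):
        task_id = self.next_id
        self.next_id += 1
        self.pending.append((task_id, func))
        return task_id

    def _run_one(self):
        task_id, func = self.pending.popleft()
        func(self)  # a task may dispatch further tasks
        return task_id

    def wait(self, max_steps=1000):
        """Drain until the queue is empty, including tasks added meanwhile."""
        steps = 0
        while self.pending and steps < max_steps:
            self._run_one()
            steps += 1
        return steps

    def wait_outstanding(self):
        """Run only the tasks that were already submitted when called."""
        last_id = self.next_id - 1
        steps = 0
        while self.pending and self.pending[0][0] <= last_id:
            self._run_one()
            steps += 1
        return steps
```

With a task that re-dispatches itself, `wait()` in this model keeps going until `max_steps` is hit, while `wait_outstanding()` returns as soon as the tasks submitted before the call have run.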

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock.  This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous.  Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.

To address this the arc_do_user_prune() function has been reworked
into two new functions as follows:

* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq.  This makes it
suitable to use in the context of the arc_adapt_thread().

* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete.  This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.
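
The split described above, an asynchronous dispatcher plus a synchronous wrapper built on top of it, can be sketched in Python with a thread pool standing in for the system taskq. All names here (`ToyPruner`, `register`, `make_pruner`) are hypothetical, and the real functions operate on registered kernel callbacks rather than Python callables.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class ToyPruner:
    """Toy model of an async prune dispatcher with a synchronous wrapper."""

    def __init__(self, executor):
        self.executor = executor  # stands in for the system taskq
        self.callbacks = []       # registered prune callbacks
        self.outstanding = []     # futures for dispatched callbacks

    def register(self, func):
        self.callbacks.append(func)

    def prune_async(self, nr_to_scan):
        """Dispatch every callback and return immediately without waiting."""
        futures = [self.executor.submit(cb, nr_to_scan) for cb in self.callbacks]
        self.outstanding.extend(futures)
        return futures

    def prune(self, nr_to_scan):
        """Synchronous variant: dispatch, then block until callbacks finish."""
        self.prune_async(nr_to_scan)
        pending, self.outstanding = self.outstanding, []
        wait(pending)
        return [f.result() for f in pending]


def make_pruner(workers=2):
    """Convenience constructor for the sketch."""
    return ToyPruner(ThreadPoolExecutor(max_workers=workers))
```

A caller that must never sleep would use only `prune_async()`, while a caller that expects memory to have been freed on return would use `prune()`.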

This patch additionally adds the zfs_arc_meta_strategy module option
which allows the meta reclaim strategy to be configured.  It defaults
to a balanced strategy which has proven to work well under Linux,
but the illumos meta-only strategy can be enabled instead.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny
systems such as the Raspberry Pi.  However, as of the addition of large
block support, the arc_adapt() function depends on arc_c being >= 32MiB
(2 * SPA_MAXBLOCKSIZE).

This patch raises the minimum ARC size to 32MiB and adds a VERIFY test
to arc_adapt() for future-proofing.
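
The arithmetic behind the new floor can be written out as a short Python sketch. The helper names are hypothetical, and a Python `assert` stands in for the VERIFY mentioned above. The 16MiB value of SPA_MAXBLOCKSIZE follows from the stated 32MiB = 2 * SPA_MAXBLOCKSIZE.

```python
SPA_MAXBLOCKSIZE = 16 << 20       # 16 MiB
ARC_C_MIN = 2 * SPA_MAXBLOCKSIZE  # 32 MiB, the raised minimum ARC size


def clamp_arc_min(requested_min):
    """Raise a requested minimum ARC size to the 32 MiB floor if needed."""
    return max(requested_min, ARC_C_MIN)


def check_arc_adapt_precondition(arc_c):
    """Stand-in for the VERIFY: arc_c must be at least 2 * SPA_MAXBLOCKSIZE."""
    assert arc_c >= 2 * SPA_MAXBLOCKSIZE, "arc_c below 2 * SPA_MAXBLOCKSIZE"
    return arc_c
```

In this sketch the old 4MiB minimum would be clamped up to 32MiB, and passing 4MiB directly to the precondition check would fail.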

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
@behlendorf behlendorf force-pushed the lock-contention-on-arcs_mtx-final branch from e253e18 to 21fcadc Compare June 10, 2015 16:16
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Jun 11, 2015
…account-for-ashift' into zfs_kOT_11.06.2015

adapted to patchstack from openzfs#3481
@FransUrbo
Contributor

In short, what does this all mean in plain English? :)

@behlendorf
Contributor Author

Significantly reduced lock contention in the ARC, which should translate into improved cache performance once a couple of other bottlenecks are removed. But this was a biggie.

@FransUrbo
Contributor

Ok, thanx.

@FransUrbo
Contributor

Maybe we should have cut a 0.6.5 first, before accepting this… This seems like a big thing, and it will take some time for it to 'percolate' on people's machines before we can cut a new version.

Too late now, but something to think about for next time perhaps?

@odoucet

odoucet commented Jun 11, 2015

well, it depends on when you were expecting 0.6.5 ... there is ~ one release per year, so if 0.6.5 is released in february 2016, we will have plenty of time to check this commit :)

@FransUrbo
Contributor

there is ~ one release per year

That was just hugely bad luck. We've said we'd try to do one every three months (about one month left on that :), but this one seems to depend on others (or rather, others depend on this one) and maybe that's not enough time.

@AndCycle

I can live with an unstable release that's been tested for over a week without losing data :x

@behlendorf
Contributor Author

Right, I'd like to shoot for every 3 months or so but it looks like we might run a bit over because we'd planned to also get the ABD changes in the next tag too, which is what I'm going to push to get merged. That way the next major tag will contain a big chunk of the memory / ARC restructuring, which is a big step forward.

In the meanwhile we'll definitely keep the release branch up to date with critical fixes. Which means right now we have a few build issues for newer kernels which need to be finalized.

@FransUrbo
Contributor

Right, I'd like to shoot for every 3 months or so but it looks like we might run a bit over because we'd planned to also get the ABD changes in the next tag too.

Is this safe? Only having 'a few weeks' (depending on how soon you can get the ABD changes in) of testing in master before tagged as a release?

Technically (according to the schedule), we're only about four or five weeks away from the next release…

@behlendorf
Contributor Author

@FransUrbo it's definitely ambitious and it likely means the release date will slide a month or two. But this is one of the major items planned for the next tag. Roughly the full remaining planned list looks like this. We could bump ABD to the next tag but I'm inclined for the moment to try and get it finalized and merged. If that turns out to be unworkable we can bump it.

  • ABD (optional)
  • Linux 4.1 build fixes (mandatory)
  • Fix .zfs/snapshot for newer kernels (mandatory)
  • zfs-initramfs (optional)
  • Any critical bug fixes (mandatory)

As an aside, what do you think about re-enabling the wiki and using it to post this kind of information?

@behlendorf behlendorf deleted the lock-contention-on-arcs_mtx-final branch April 19, 2021 18:16
7 participants