
Remove kmutex_t & kcondvar out of arc_buf_hdr_t #1971

Closed
wants to merge 1 commit into from

Conversation

DeHackEd
Contributor

Under Linux these data types together consume around 100 bytes of space.
Having them in the arc_buf_hdr_t makes them nearly 50% larger than they
would be otherwise. When lock debugging is enabled it gets even larger.

This patch moves them to a single global array to be shared, similar to
how the znode hold locks in zfs_sb work. We're trading some potential
for artificial lock contention in exchange for drastically smaller
kernel memory objects.
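The idea reads roughly like this user-space sketch (pthreads stand in for kmutex_t/kcondvar_t; HDR_LOCK_COUNT, hdr_lock_idx, and hdr_lock are illustrative names, not the patch's actual identifiers):

```c
#include <pthread.h>
#include <stdint.h>

#define HDR_LOCK_COUNT 256 /* illustrative; must be a power of two */

typedef struct hdr_lock {
	pthread_mutex_t hl_mutex;
	pthread_cond_t  hl_cv;
} hdr_lock_t;

/* One global array shared by every header, instead of a mutex and
 * condvar embedded in each arc_buf_hdr_t. */
static hdr_lock_t hdr_locks[HDR_LOCK_COUNT];

/* Hash the header's address down to an array slot. Discarding the
 * low bits first matters: slab allocations are aligned, so the
 * lowest bits carry no entropy. */
static inline unsigned
hdr_lock_idx(const void *hdr)
{
	uintptr_t p = (uintptr_t)hdr >> 6;
	return ((unsigned)(p & (HDR_LOCK_COUNT - 1)));
}

static inline hdr_lock_t *
hdr_lock(const void *hdr)
{
	return (&hdr_locks[hdr_lock_idx(hdr)]);
}
```

Two headers that hash to the same slot share a lock, which is exactly the artificial contention the commit message mentions.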

@DeHackEd
Contributor Author

Completely mindful that this might not be very well received, I'm offering it up anyway.

Under Linux each of my new "struct arc_buf_hdr_mutex" objects is over 100 bytes. When doing lock debugging it's even higher. This saves quite a bit of space in each arc_buf_hdr_t at the cost of having shared mutexes and condition variables.

I've been running it on my low-load workstations for a week now and haven't seen any problems with it, but low load is not what I'm worried about; it's whether contention will become an issue at high load.

TODO:

  • I'm hoping that hashing is fair and that collisions are distributed appropriately.
  • Should the structure be padded to better cache alignments?
  • Is this relevant to other OpenZFS users?
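On the padding question, one option (a sketch, pthreads analogue; CACHE_LINE and hdr_lock_padded_t are illustrative names) is to align each shared slot to a cache line so adjacent slots never false-share:

```c
#include <pthread.h>

#define CACHE_LINE 64 /* common x86-64 line size; an assumption here */

/* Aligning each slot to a cache line trades a little memory (the
 * array is global and small) for freedom from false sharing between
 * CPUs contending on neighboring slots. */
typedef struct hdr_lock_padded {
	pthread_mutex_t hlp_mutex;
	pthread_cond_t  hlp_cv;
} __attribute__((aligned(CACHE_LINE))) hdr_lock_padded_t;
```

With the alignment attribute, sizeof the struct rounds up to a multiple of the line size, so array elements start on line boundaries.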

@behlendorf
Contributor

@DeHackEd This certainly isn't the craziest thing I've seen proposed and I'm not strictly against it. But we need to definitively answer the questions you've posed. Is the hashing fair? How much contention is this causing? What's the impact on performance? How much memory are we saving by doing this? We just need some hard numbers.

@DeHackEd
Contributor Author

Here are the numbers I have. I'm running a lot of different kernels, etc., so take them with a grain of salt.

Kernel 3.4 (without patch):

# grep arc_buf_hdr_t /proc/slabinfo
arc_buf_hdr_t     2521505 2558780    288   28    2 : tunables    0    0    0 : slabdata  91385  91385      0

288 bytes per object, ~2.5 million objects exist

Kernel 3.10 (without patch, some kernel debugging is enabled)

# grep arc_buf_hdr_t /proc/spl/kmem/slab
arc_buf_hdr_t                         0x00020  51044352  16064512     8192      296   6231  6231  9983  149544 54272 239592      0     0     0

296 bytes per object

Kernel 2.6.32 (RHEL 6.4) (without patch)
= 264 bytes per object

Kernel 3.10 (with patch):

# grep arc /proc/slabinfo
arc_buf_hdr_t     155405 170478    184   21    1 : tunables  120   60    8 : slabdata   8118   8118      0

184 bytes per object, savings of 112 bytes per object.

At 2.5 million headers that's a lot.
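For scale, a back-of-the-envelope check using the 3.10 before/after sizes above (the helper name is illustrative):

```c
#include <stdint.h>

/* Savings = headers * (old size - new size). With the numbers above:
 * 2521505 * (296 - 184) = 282,408,560 bytes, roughly 269 MiB. */
static uint64_t
arc_hdr_savings_bytes(uint64_t nhdrs, uint64_t before, uint64_t after)
{
	return (nhdrs * (before - after));
}
```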

@DeHackEd
Contributor Author

(Updated: rebased to current master and fixed style and comment issues. No code changes)

@DeHackEd
Contributor Author

Just rebasing to the current version. It's not functionally changed yet.

@behlendorf
Contributor

@DeHackEd Thanks! I was going to ask that you refresh this so we could have another look at getting it merged.

@sempervictus
Contributor

@behlendorf: with #3115 in the pipeline, would the contention concern still be present, or would that ameliorate the problem?

@behlendorf
Contributor

@sempervictus this tackles a slightly different problem so the concern is still valid. But we should absolutely look at this change in the context of the #3115 work.

@dweeezil
Contributor

I should have the #3115 work tested and ready for further review by the end of this weekend. I think it's fairly close now, except for the buildbot failures under the 2.6.32 kernels, which I also need to look into.

@DeHackEd force-pushed the arc-mutexes branch 2 times, most recently from 11043bb to 89fa67e on May 15, 2015 23:49
@DeHackEd
Contributor Author

Additional rebase, plus added @behlendorf's suggestion to do collision tracking. There's a new arcstat called freeze_collisions.

@DeHackEd
Contributor Author

Gave it a test run (plus a few other patches like 2129 ABD) with 4 instances of Bonnie++.

Testing machine: KVM with 4 CPUs and 5 GB of RAM. Host has 16 GB and Core i7.

# zpool import testpool
# cd /testpool/bigfiles  # pre-created mostly-empty filesystem with recordsize=1M
# for i in 1 2 3 4; do bonnie++ -r 500 -s 2000 -u nobody -d `pwd` & done

A total of 6613 collisions occurred. They were rather rare during the main file operations and spiked up towards the end.

@sempervictus patch 3115 is about relieving pressure on the global locks on the ARC. This patch is about eliminating lock objects from each ARC header in memory in order to reduce the size of said objects. Per-object size shrinks from 296 to 184 bytes; exact size varies by kernel version and build options. The penalty is artificial contention on these objects since a lock collision does not necessarily mean two threads are accessing the same block.

@behlendorf
Contributor

@DeHackEd this turned out nicely. Let's plan on getting this in after the other significant ARC changes which are being worked on to avoid any additional conflicts.

Along similar lines to this change, it occurred to me it would probably be useful to run pahole over the module to look for additional savings. It's a tool which dumps the resulting structure alignments so you can look for holes and cache-line alignment issues. There may be significant extra memory savings to be had, although we've been pretty careful about how things are laid out in memory. For example, here's the output for the existing arc_buf_hdr.

pahole module/zfs/zfs.ko -C arc_buf_hdr
struct arc_buf_hdr {
    dva_t                      b_dva;                /*     0    16 */
    uint64_t                   b_birth;              /*    16     8 */
    uint64_t                   b_cksum0;             /*    24     8 */
    kmutex_t                   b_freeze_lock;        /*    32    48 */
    /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
    zio_cksum_t *              b_freeze_cksum;       /*    80     8 */
    arc_buf_hdr_t *            b_hash_next;          /*    88     8 */
    arc_buf_t *                b_buf;                /*    96     8 */
    uint32_t                   b_flags;              /*   104     4 */
    uint32_t                   b_datacnt;            /*   108     4 */
    arc_callback_t *           b_acb;                /*   112     8 */
    kcondvar_t                 b_cv;                 /*   120    72 */
    /* --- cacheline 3 boundary (192 bytes) --- */
    arc_buf_contents_t         b_type;               /*   192     4 */

    /* XXX 4 bytes hole, try to pack */

    uint64_t                   b_size;               /*   200     8 */
    uint64_t                   b_spa;                /*   208     8 */
    arc_state_t *              b_state;              /*   216     8 */
    list_node_t                b_arc_node;           /*   224    16 */
    clock_t                    b_arc_access;         /*   240     8 */
    uint32_t                   b_mru_hits;           /*   248     4 */
    uint32_t                   b_mru_ghost_hits;     /*   252     4 */
    /* --- cacheline 4 boundary (256 bytes) --- */
    uint32_t                   b_mfu_hits;           /*   256     4 */
    uint32_t                   b_mfu_ghost_hits;     /*   260     4 */
    uint32_t                   b_l2_hits;            /*   264     4 */

    /* XXX 4 bytes hole, try to pack */

    refcount_t                 b_refcnt;             /*   272     8 */
    l2arc_buf_hdr_t *          b_l2hdr;              /*   280     8 */
    list_node_t                b_l2node;             /*   288    16 */

    /* size: 304, cachelines: 5, members: 25 */
    /* sum members: 296, holes: 2, sum holes: 8 */
    /* last cacheline: 48 bytes */
};

@behlendorf
Contributor

Oh, one more thing. This is subjective, but the collision numbers feel high to me for only four concurrent bonnie++ processes. It would be useful to verify that you're in fact seeing a good distribution across the hash buckets. You could do this by implementing another stat similar to arcstat_hash_chains: keep track of the first time a hash bucket is used and increment a counter. It should rapidly approach the hash size if things are working properly; if it doesn't, that's a concern.
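That stat could be sketched like this (illustrative names; the real counter would live in arc_stats and need atomic updates):

```c
#include <stdint.h>

#define BUCKET_COUNT 256 /* illustrative hash size */

static uint8_t  bucket_used[BUCKET_COUNT];
static uint64_t buckets_touched; /* analogue of arcstat_hash_chains */

/* Call on every lock lookup with the bucket index; each bucket is
 * counted only the first time it is hit. A fair hash drives
 * buckets_touched toward BUCKET_COUNT quickly. */
static void
bucket_stat_note(unsigned idx)
{
	if (!bucket_used[idx]) {
		bucket_used[idx] = 1;
		buckets_touched++; /* real code: an atomic increment */
	}
}
```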

@DeHackEd
Contributor Author

I'm working on such a test, adding a check to verify that the collision is, in fact, a hash collision and not just legitimate lock contention on a single arc_buf_hdr_t. Unfortunately I think I need a spinlock in front of the whole thing to do it properly. Oh well.

@DeHackEd
Contributor Author

Here's what I came up with for validation. I hope I got the logic right. Obviously you would not actually take this newest commit.

Running 10 instances of Bonnie++ as described above on the same machine, I had 102 freeze collisions which would not have happened without this patch.

@behlendorf
Contributor

@DeHackEd considering the significant memory savings, that sounds like a level we can reasonably live with. Did you happen to verify that the addresses are being hashed fairly? Or explore increasing the hash size to further reduce this?

@DeHackEd
Contributor Author

The code exists (see arc_buf_hdr_hash_hits), but silly me forgot to put in a nice way to dump it without the assistance of a debugger, which I don't have for my running kernel. ztest results were too low on the hit count to be conclusive.

@dweeezil
Contributor

I finally got a chance to peek at this and, without making a whole bunch of disjoint code line comments, I can't help thinking it would be a bit cleaner and more congruent with the existing code to embed arc_buf_hdr_mutexes within the existing buf_hash_table and then to use the same access methods as are currently used for buf_hash_table.ht_locks (hashing method, etc.).

I've instrumented the crc64-based hashing scheme as part of my #3115 work and can say that it works very well in large-scale testing. It seems to make sense to (re)use it for the new arc_buf_hdr_mutexes.

Not related to this patch at all, but related to memory use and locking efficiency: the method by which buf_hash_table.ht_table and buf_hash_table.ht_locks are sized can create a serious imbalance between the two on systems with a lot of memory. The same holds true on the dbuf side of things with dbuf_hash_table and its inner sub-hash table. For starters, it would seem to make more sense to size each of the main hash tables relative to physmem/2 (rather than physmem) and have a tunable which allows it to be made larger for systems on which a lot more memory is used for the arc. On systems with a lot of memory, the "inner" hash tables become very small relative to their parent table. While working on #3115, I cobbled together https://github.com/dweeezil/zfs/tree/dbuf-mutex-hash to allow the inner dbuf hash table to be sized relative to its parent table. In the end, I ran into too many other lock contentions for this to make a significant difference.

@behlendorf
Contributor

@dweeezil I think that makes a lot of sense. I had a similar thought when looking at this, but I was concerned that relying on a more complicated crc64-based hashing scheme to get a good distribution might be too expensive. The current prime/modulo scheme is very cheap and known to work well for predictable values such as memory addresses.

But if in fact using crc64's does work well I completely agree it would be nicer, cleaner, and simpler to reuse that same code. Let's plan on giving it a try after #3115 is merged (soon I hope).
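The two schemes under discussion can be put side by side (a user-space sketch; LOCK_COUNT, both function names, and the multiplier constant are illustrative, and the polynomial is the reflected ECMA-182 one, which is an assumption about what the ARC's table uses):

```c
#include <stdint.h>

#define LOCK_COUNT 256 /* illustrative; power of two */

/* Cheap scheme: drop the alignment bits, multiply by a large odd
 * constant, mask down. Fine for predictable inputs such as slab
 * addresses. */
static unsigned
hash_prime(uintptr_t addr)
{
	return ((unsigned)(((addr >> 6) * 2654435761UL) &
	    (LOCK_COUNT - 1)));
}

/* crc64-based scheme, shaped like the ARC's buf_hash(): better
 * mixing at the cost of a table lookup per input byte. */
static uint64_t crc64_table[256];

static void
crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t c = (uint64_t)i;
		for (int j = 0; j < 8; j++)
			c = (c & 1) ? (c >> 1) ^ 0xC96C5795D7870F42ULL :
			    (c >> 1);
		crc64_table[i] = c;
	}
}

static unsigned
hash_crc64(uint64_t key)
{
	uint64_t crc = -1ULL;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 8) ^
		    crc64_table[(crc ^ (key >> (i * 8))) & 0xFF];
	return ((unsigned)(crc & (LOCK_COUNT - 1)));
}
```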

the method by which buf_hash_table.ht_table and buf_hash_table.ht_locks are sized...

I completely agree, and in fact I'd go one step farther. There is no known right size for these hash tables; it depends heavily on the workload. My suggestion would be that we update this code so the hash table can be dynamically resized, and have them start out fairly small. If you constrain the hash table to be a power of two this is relatively straightforward to implement efficiently. This is a case where ZFS can and should size things dynamically, and we don't need a user tunable.
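A minimal sketch of that resize scheme (illustrative user-space code, no locking shown): with a power-of-two size, rehashing on growth moves each entry either nowhere or to bucket + old_size.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct node {
	uint64_t     key;
	struct node *next;
} node_t;

typedef struct table {
	node_t **buckets;
	size_t   size;  /* always a power of two */
	size_t   count;
} table_t;

static void
table_init(table_t *t, size_t size)
{
	t->buckets = calloc(size, sizeof (node_t *));
	t->size = size;
	t->count = 0;
}

/* Double the table and rehash; the power-of-two mask means each
 * entry lands at its old index or old index + old size. */
static void
table_grow(table_t *t)
{
	size_t nsize = t->size * 2;
	node_t **nb = calloc(nsize, sizeof (node_t *));

	for (size_t i = 0; i < t->size; i++) {
		node_t *n = t->buckets[i], *next;
		for (; n != NULL; n = next) {
			next = n->next;
			size_t idx = n->key & (nsize - 1);
			n->next = nb[idx];
			nb[idx] = n;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->size = nsize;
}

static void
table_insert(table_t *t, uint64_t key)
{
	if (t->count >= t->size * 2) /* avg chain length > 2: grow */
		table_grow(t);
	node_t *n = malloc(sizeof (*n));
	n->key = key;
	size_t idx = key & (t->size - 1);
	n->next = t->buckets[idx];
	t->buckets[idx] = n;
	t->count++;
}
```

Starting small (here, four buckets) and doubling on demand is what lets the table track the workload instead of a boot-time guess.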

behlendorf referenced this pull request in dweeezil/zfs May 20, 2015
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Don Brady <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
Ported by: Tim Chase <[email protected]>

Porting notes:

Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:

    6e1d727
    Fix inaccurate arcstat_l2_hdr_size calculations

        The ARC_SPACE_HDRS constant no longer exists and has been
        somewhat equivalently replaced by HDR_L2ONLY_SIZE.

    e0b0ca9
    Add visibility in to cached dbufs

        The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
        struct requires additional structure member names to be used
        when referencing the inner items.  Also, the presence of L1 or L2
        inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
        macros.
@behlendorf
Contributor

@DeHackEd with the #3115 changes merged we're in a position where this kind of change could be merged if you can rebase it on master and incorporate the various feedback.

@DeHackEd
Contributor Author

One concern with the advice is that adding any fields, even a uint32_t array offset, increases the size of these objects.

Also I still didn't do the hash fairness verification check. I already know that the allocations are from slabs which will affect the distribution of the memory addresses.

Newest push covers:

  • Rebase to master
  • Uses crc64 to find slot to use

Does not:

  • Store a pointer in the arc_buf headers
  • Allow array resizing, dynamic or at module load time

The freeze_collisions arcstat counts both legitimate and artificial collisions making it an unreliable means of measuring the performance impact this patch can have on lock contention. See DeHackEd/zfs@b5946cd for the (old) patch that counts only artificial collisions at the expense of serializing everything through a spinlock.

@behlendorf
Contributor

Closing as stale.

@behlendorf behlendorf closed this May 13, 2016
Labels
Component: Memory Management kernel memory management Type: Performance Performance improvement or performance problem