
Remove kmutex_t & kcondvar out of arc_buf_hdr_t #1971

Closed
wants to merge 1 commit into from

Conversation

DeHackEd
Contributor

Under Linux these data types together consume around 100 bytes of space.
Having them in the arc_buf_hdr_t makes them nearly 50% larger than they
would be otherwise. When lock debugging is enabled it gets even larger.

This patch moves them to a single global array to be shared, similar to
how the znode hold locks in zfs_sb work. We're trading some potential
for artificial lock contention in exchange for drastically smaller
kernel memory objects.
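The idea reads roughly like this user-space sketch (pthreads stand in for kmutex_t/kcondvar_t; HDR_LOCK_COUNT, hdr_lock_idx, and hdr_lock are illustrative names, not the patch's actual identifiers):

```c
#include <pthread.h>
#include <stdint.h>

#define HDR_LOCK_COUNT 256 /* illustrative; must be a power of two */

typedef struct hdr_lock {
	pthread_mutex_t hl_mutex;
	pthread_cond_t  hl_cv;
} hdr_lock_t;

/* One global array shared by every header, instead of a mutex and
 * condvar embedded in each arc_buf_hdr_t. */
static hdr_lock_t hdr_locks[HDR_LOCK_COUNT];

/* Hash the header's address down to an array slot. Discarding the
 * low bits first matters: slab allocations are aligned, so the
 * lowest bits carry no entropy. */
static inline unsigned
hdr_lock_idx(const void *hdr)
{
	uintptr_t p = (uintptr_t)hdr >> 6;
	return ((unsigned)(p & (HDR_LOCK_COUNT - 1)));
}

static inline hdr_lock_t *
hdr_lock(const void *hdr)
{
	return (&hdr_locks[hdr_lock_idx(hdr)]);
}
```

Two headers that hash to the same slot share a lock, which is exactly the artificial contention the commit message mentions.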

@DeHackEd
Contributor Author

Completely mindful that this might not be very well received, I'm offering it up anyway.

Under Linux each of my new "struct arc_buf_hdr_mutex" objects is over 100 bytes. When doing lock debugging it's even higher. This saves quite a bit of space in each arc_buf_hdr_t at the cost of having shared mutexes and condition variables.

I've been running it on my low-load workstations for a week now and haven't seen any problems with it, but low load is not what I'm worried about; it's whether contention will become an issue at high load.

TODO:

  • I'm hoping that hashing is fair and that collisions are distributed appropriately.
  • Should the structure be padded to better cache alignments?
  • Is this relevant to other OpenZFS users?
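On the padding question, one option (a sketch, pthreads analogue; CACHE_LINE and hdr_lock_padded_t are illustrative names) is to align each shared slot to a cache line so adjacent slots never false-share:

```c
#include <pthread.h>

#define CACHE_LINE 64 /* common x86-64 line size; an assumption here */

/* Aligning each slot to a cache line trades a little memory (the
 * array is global and small) for freedom from false sharing between
 * CPUs contending on neighboring slots. */
typedef struct hdr_lock_padded {
	pthread_mutex_t hlp_mutex;
	pthread_cond_t  hlp_cv;
} __attribute__((aligned(CACHE_LINE))) hdr_lock_padded_t;
```

With the alignment attribute, sizeof the struct rounds up to a multiple of the line size, so array elements start on line boundaries.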

@behlendorf
Contributor

@DeHackEd This certainly isn't the craziest thing I've seen proposed and I'm not strictly against it. But we need to definitively answer the questions you've posed. Is the hashing fair? How much contention is this causing? What's the impact on performance? How much memory are we saving by doing this? We just need some hard numbers.

@DeHackEd
Contributor Author

Here are the numbers I have. I'm running a lot of different kernels, etc., so take them with a grain of salt.

Kernel 3.4 (without patch):

# grep arc_buf_hdr_t /proc/slabinfo
arc_buf_hdr_t     2521505 2558780    288   28    2 : tunables    0    0    0 : slabdata  91385  91385      0

288 bytes per object, ~2.5 million objects exist

Kernel 3.10 (without patch, some kernel debugging is enabled)

# grep arc_buf_hdr_t /proc/spl/kmem/slab
arc_buf_hdr_t                         0x00020  51044352  16064512     8192      296   6231  6231  9983  149544 54272 239592      0     0     0

296 bytes per object

Kernel 2.6.32 (RHEL 6.4) (without patch)
= 264 bytes per object

Kernel 3.10 (with patch):

# grep arc /proc/slabinfo
arc_buf_hdr_t     155405 170478    184   21    1 : tunables  120   60    8 : slabdata   8118   8118      0

184 bytes per object, savings of 112 bytes per object.

At 2.5 million headers that's a lot.
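For scale, a back-of-the-envelope check using the 3.10 before/after sizes above (the helper name is illustrative):

```c
#include <stdint.h>

/* Savings = headers * (old size - new size). With the numbers above:
 * 2521505 * (296 - 184) = 282,408,560 bytes, roughly 269 MiB. */
static uint64_t
arc_hdr_savings_bytes(uint64_t nhdrs, uint64_t before, uint64_t after)
{
	return (nhdrs * (before - after));
}
```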

@DeHackEd
Contributor Author

(Updated: rebased to current master and fixed style and comment issues. No code changes)

@DeHackEd
Contributor Author

Just rebasing to the current version. It's not functionally changed yet.

@behlendorf
Contributor

@DeHackEd Thanks! I was going to ask that you refresh this so we could have another look at getting it merged.

@sempervictus
Contributor

@behlendorf: with #3115 in the pipeline, would the contention concern still be present, or would that ameliorate the problem?

@behlendorf
Contributor

@sempervictus this tackles a slightly different problem so the concern is still valid. But we should absolutely look at this change in the context of the #3115 work.

@dweeezil
Contributor

I should have the #3115 work tested and ready for further review by the end of this weekend. I think it's fairly close now, except for the buildbot failures under the 2.6.32 kernels, which I also need to look into.

@DeHackEd force-pushed the arc-mutexes branch 2 times, most recently from 11043bb to 89fa67e on May 15, 2015 23:49
@DeHackEd
Contributor Author

Additional rebase, plus added @behlendorf's suggestion to do collision tracking. There's a new arcstat called freeze_collisions.

@DeHackEd
Contributor Author

Gave it a test run (plus a few other patches like 2129 ABD) with 4 instances of Bonnie++.

Testing machine: KVM with 4 CPUs and 5 GB of RAM. Host has 16 GB and Core i7.

# zpool import testpool
# cd /testpool/bigfiles  # pre-created mostly-empty filesystem with recordsize=1M
# for i in 1 2 3 4; do bonnie++ -r 500 -s 2000 -u nobody -d `pwd` & done

A total of 6613 collisions occurred. They were rather rare during the main file operations and spiked up towards the end.

@sempervictus patch 3115 is about relieving pressure on the global locks on the ARC. This patch is about eliminating lock objects from each ARC header in memory in order to reduce the size of said objects. Per-object size shrinks from 296 to 184 bytes; exact size varies by kernel version and build options. The penalty is artificial contention on these objects since a lock collision does not necessarily mean two threads are accessing the same block.

@behlendorf
Contributor

@DeHackEd this turned out nicely. Let's plan on getting this in after the other significant ARC changes which are being worked on to avoid any additional conflicts.

Along similar lines to this change, it occurred to me it would probably be useful to run pahole over the module to look for additional savings. It's a tool which dumps the resulting structure alignments so you can look for holes and cache-line alignment issues. There may be significant extra memory savings to be had, although we've been pretty careful about how things are laid out in memory. For example, here's the output for the existing arc_buf_hdr.

pahole module/zfs/zfs.ko -C arc_buf_hdr
struct arc_buf_hdr {
    dva_t                      b_dva;                /*     0    16 */
    uint64_t                   b_birth;              /*    16     8 */
    uint64_t                   b_cksum0;             /*    24     8 */
    kmutex_t                   b_freeze_lock;        /*    32    48 */
    /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
    zio_cksum_t *              b_freeze_cksum;       /*    80     8 */
    arc_buf_hdr_t *            b_hash_next;          /*    88     8 */
    arc_buf_t *                b_buf;                /*    96     8 */
    uint32_t                   b_flags;              /*   104     4 */
    uint32_t                   b_datacnt;            /*   108     4 */
    arc_callback_t *           b_acb;                /*   112     8 */
    kcondvar_t                 b_cv;                 /*   120    72 */
    /* --- cacheline 3 boundary (192 bytes) --- */
    arc_buf_contents_t         b_type;               /*   192     4 */

    /* XXX 4 bytes hole, try to pack */

    uint64_t                   b_size;               /*   200     8 */
    uint64_t                   b_spa;                /*   208     8 */
    arc_state_t *              b_state;              /*   216     8 */
    list_node_t                b_arc_node;           /*   224    16 */
    clock_t                    b_arc_access;         /*   240     8 */
    uint32_t                   b_mru_hits;           /*   248     4 */
    uint32_t                   b_mru_ghost_hits;     /*   252     4 */
    /* --- cacheline 4 boundary (256 bytes) --- */
    uint32_t                   b_mfu_hits;           /*   256     4 */
    uint32_t                   b_mfu_ghost_hits;     /*   260     4 */
    uint32_t                   b_l2_hits;            /*   264     4 */

    /* XXX 4 bytes hole, try to pack */

    refcount_t                 b_refcnt;             /*   272     8 */
    l2arc_buf_hdr_t *          b_l2hdr;              /*   280     8 */
    list_node_t                b_l2node;             /*   288    16 */

    /* size: 304, cachelines: 5, members: 25 */
    /* sum members: 296, holes: 2, sum holes: 8 */
    /* last cacheline: 48 bytes */
};

@behlendorf
Contributor

Oh, one more thing. This is subjective, but the collision numbers feel high to me for only four concurrent bonnie++ processes. It would be useful to verify that you're in fact seeing a good distribution across the hash buckets. You could do this by implementing another stat similar to arcstat_hash_chains: keep track of the first time a hash bucket is used and increment a counter. It should rapidly approach the hash size if things are working properly; if it doesn't, that's a concern.
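That stat could be sketched like this (illustrative names; the real counter would live in arc_stats and need atomic updates):

```c
#include <stdint.h>

#define BUCKET_COUNT 256 /* illustrative hash size */

static uint8_t  bucket_used[BUCKET_COUNT];
static uint64_t buckets_touched; /* analogue of arcstat_hash_chains */

/* Call on every lock lookup with the bucket index; each bucket is
 * counted only the first time it is hit. A fair hash drives
 * buckets_touched toward BUCKET_COUNT quickly. */
static void
bucket_stat_note(unsigned idx)
{
	if (!bucket_used[idx]) {
		bucket_used[idx] = 1;
		buckets_touched++; /* real code: an atomic increment */
	}
}
```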

@DeHackEd
Contributor Author

I'm working on such a test, adding a check to verify that the collision is, in fact, a hash collision and not just legitimate lock contention on a single arc_buf_hdr_t. Unfortunately I think I need a spinlock in front of the whole thing to do it properly. Oh well.

@DeHackEd
Contributor Author

Here's what I came up with for validation. I hope I got the logic right. Obviously you would not actually take this newest commit.

Running 10 instances of Bonnie++ as described above on the same machine, I had 102 freeze collisions which would not have happened without this patch.

@behlendorf
Contributor

@DeHackEd considering the significant memory savings, that sounds like a level we can reasonably live with. Did you happen to verify that the addresses are being hashed fairly? Or explore increasing the hash size to further reduce this?

@DeHackEd
Contributor Author

The code exists (see arc_buf_hdr_hash_hits), but silly me forgot to put in a nice way to dump it without the assistance of a debugger, which I don't have for my running kernel. ztest results were too low on the hit count to be conclusive.

@dweeezil
Contributor

I finally got a chance to peek at this and, without making a whole bunch of disjoint code line comments, I can't help thinking it would be a bit cleaner and more congruent with the existing code to embed arc_buf_hdr_mutexes within the existing buf_hash_table and then to use the same access methods as are currently used for buf_hash_table.ht_locks (hashing method, etc.).

I've instrumented the crc64-based hashing scheme as part of my #3115 work and can say that it works very well in large-scale testing. It seems to make sense to (re)use it for the new arc_buf_hdr_mutexes.

Not related to this patch at all, but related to memory use and locking efficiency: the method by which buf_hash_table.ht_table and buf_hash_table.ht_locks are sized can create a serious imbalance between the two on systems with a lot of memory. The same holds true on the dbuf side of things with dbuf_hash_table and its inner sub-hash table. For starters, it would seem to make more sense to size each of the main hash tables relative to physmem/2 (rather than physmem) and have a tunable which allows it to be made larger for systems on which a lot more memory is used for the arc. On systems with a lot of memory, the "inner" hash tables become very small relative to their parent table. While working on #3115, I cobbled together https://github.com/dweeezil/zfs/tree/dbuf-mutex-hash to allow the inner dbuf hash table to be sized relative to its parent table. In the end, I ran into too many other lock contentions for this to make a significant difference.

@behlendorf
Contributor

@dweeezil I think that makes a lot of sense. I had a similar thought when looking at this, but I was concerned that relying on a more complicated crc64-based hashing scheme to get a good distribution might be too expensive. The current prime/modulo scheme is very cheap and known to work well for predictable values such as memory addresses.

But if in fact using crc64's does work well I completely agree it would be nicer, cleaner, and simpler to reuse that same code. Let's plan on giving it a try after #3115 is merged (soon I hope).
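The two schemes under discussion can be put side by side (a user-space sketch; LOCK_COUNT, both function names, and the multiplier constant are illustrative, and the polynomial is the reflected ECMA-182 one, which is an assumption about what the ARC's table uses):

```c
#include <stdint.h>

#define LOCK_COUNT 256 /* illustrative; power of two */

/* Cheap scheme: drop the alignment bits, multiply by a large odd
 * constant, mask down. Fine for predictable inputs such as slab
 * addresses. */
static unsigned
hash_prime(uintptr_t addr)
{
	return ((unsigned)(((addr >> 6) * 2654435761UL) &
	    (LOCK_COUNT - 1)));
}

/* crc64-based scheme, shaped like the ARC's buf_hash(): better
 * mixing at the cost of a table lookup per input byte. */
static uint64_t crc64_table[256];

static void
crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t c = (uint64_t)i;
		for (int j = 0; j < 8; j++)
			c = (c & 1) ? (c >> 1) ^ 0xC96C5795D7870F42ULL :
			    (c >> 1);
		crc64_table[i] = c;
	}
}

static unsigned
hash_crc64(uint64_t key)
{
	uint64_t crc = -1ULL;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 8) ^
		    crc64_table[(crc ^ (key >> (i * 8))) & 0xFF];
	return ((unsigned)(crc & (LOCK_COUNT - 1)));
}
```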

the method by which buf_hash_table.ht_table and buf_hash_table.ht_locks are sized...

I completely agree, and in fact I'd go one step farther. There is no known right size for these hash tables; it depends heavily on the workload. My suggestion would be that we update this code so the hash table can be dynamically resized, and have them start out fairly small. If you constrain the hash table to be a power of two this is relatively straightforward to implement efficiently. This is a case where ZFS can and should size things dynamically, and we don't need a user tunable.
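A minimal sketch of that resize scheme (illustrative user-space code, no locking shown): with a power-of-two size, rehashing on growth moves each entry either nowhere or to bucket + old_size.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct node {
	uint64_t     key;
	struct node *next;
} node_t;

typedef struct table {
	node_t **buckets;
	size_t   size;  /* always a power of two */
	size_t   count;
} table_t;

static void
table_init(table_t *t, size_t size)
{
	t->buckets = calloc(size, sizeof (node_t *));
	t->size = size;
	t->count = 0;
}

/* Double the table and rehash; the power-of-two mask means each
 * entry lands at its old index or old index + old size. */
static void
table_grow(table_t *t)
{
	size_t nsize = t->size * 2;
	node_t **nb = calloc(nsize, sizeof (node_t *));

	for (size_t i = 0; i < t->size; i++) {
		node_t *n = t->buckets[i], *next;
		for (; n != NULL; n = next) {
			next = n->next;
			size_t idx = n->key & (nsize - 1);
			n->next = nb[idx];
			nb[idx] = n;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->size = nsize;
}

static void
table_insert(table_t *t, uint64_t key)
{
	if (t->count >= t->size * 2) /* avg chain length > 2: grow */
		table_grow(t);
	node_t *n = malloc(sizeof (*n));
	n->key = key;
	size_t idx = key & (t->size - 1);
	n->next = t->buckets[idx];
	t->buckets[idx] = n;
	t->count++;
}
```

Starting small (here, four buckets) and doubling on demand is what lets the table track the workload instead of a boot-time guess.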

behlendorf referenced this pull request in dweeezil/zfs May 20, 2015
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Don Brady <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
Ported by: Tim Chase <[email protected]>

Porting notes:

Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:

    6e1d727
    Fix inaccurate arcstat_l2_hdr_size calculations

        The ARC_SPACE_HDRS constant no longer exists and has been
        somewhat equivalently replaced by HDR_L2ONLY_SIZE.

    e0b0ca9
    Add visibility in to cached dbufs

        The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
        struct requires additional structure member names to be used
        when referencing the inner items.  Also, the presence of L1 or L2
        inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
        macros.
@behlendorf
Contributor

@DeHackEd with the #3115 changes merged we're in a position where this kind of change could be merged if you can rebase it on master and incorporate the various feedback.

@DeHackEd
Contributor Author

One concern with the advice is that adding any fields, even a uint32_t array offset, increases the size of these objects.

Also I still didn't do the hash fairness verification check. I already know that the allocations are from slabs which will affect the distribution of the memory addresses.

Newest push covers:

  • Rebase to master
  • Uses crc64 to find slot to use

Does not:

  • Store a pointer in the arc_buf headers
  • Allow array resizing, dynamic or at module load time

The freeze_collisions arcstat counts both legitimate and artificial collisions making it an unreliable means of measuring the performance impact this patch can have on lock contention. See DeHackEd/zfs@b5946cd for the (old) patch that counts only artificial collisions at the expense of serializing everything through a spinlock.

@behlendorf
Contributor

Closing as stale.

@behlendorf behlendorf closed this May 13, 2016
Labels
Component: Memory Management kernel memory management Type: Performance Performance improvement or performance problem