Remove kmutex_t & kcondvar out of arc_buf_hdr_t #1971
Conversation
Completely mindful that this might not be very well received, I'm offering it up anyway. Under Linux each of my new "struct arc_buf_hdr_mutex" objects is over 100 bytes; with lock debugging enabled it's even higher. This saves quite a bit of space in each arc_buf_hdr_t at the cost of having shared mutexes and condition variables. I've been running it on my low-load workstations for a week now and haven't seen any problems with it, but low load is not what I'm worried about. It's whether contention will become an issue at high load. TODO:
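For illustration, a minimal sketch of the shared-lock scheme described above, assuming SPL primitives; the names ARC_LOCK_BUCKETS, arc_shared_lock_t, and arc_hdr_lock() are hypothetical, not the patch's actual identifiers:

```c
/* Hypothetical sketch: one shared pool of lock/condvar pairs instead
 * of embedding a kmutex_t and kcondvar_t in every arc_buf_hdr_t. */
#define	ARC_LOCK_BUCKETS	256

typedef struct arc_shared_lock {
	kmutex_t	sl_mutex;
	kcondvar_t	sl_cv;
} arc_shared_lock_t;

static arc_shared_lock_t arc_shared_locks[ARC_LOCK_BUCKETS];

static arc_shared_lock_t *
arc_hdr_lock(arc_buf_hdr_t *hdr)
{
	uintptr_t addr = (uintptr_t)hdr;

	/* Slabs hand out aligned objects, so discard the low-order
	 * bits before reducing the address to a bucket index. */
	return (&arc_shared_locks[(addr >> 8) % ARC_LOCK_BUCKETS]);
}
```

The shift before the modulus is one plausible way to counteract slab alignment, which a later comment in this thread notes can skew the distribution of header addresses.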
@DeHackEd This certainly isn't the craziest thing I've seen proposed and I'm not strictly against it. But we need to definitively answer the questions you've posed. Is the hashing fair? How much contention is this causing? What's the impact on performance? How much memory are we saving by doing this? We just need some hard numbers.
Here are the numbers I have. I have a lot of different kernels, etc., so take them appropriately.

Kernel 3.4 (without patch):

```
# grep arc_buf_hdr_t /proc/slabinfo
arc_buf_hdr_t 2521505 2558780 288 28 2 : tunables 0 0 0 : slabdata 91385 91385 0
```

288 bytes per object, ~2.5 million objects exist.

Kernel 3.10 (without patch, some kernel debugging is enabled):

```
# grep arc_buf_hdr_t /proc/spl/kmem/slab
arc_buf_hdr_t 0x00020 51044352 16064512 8192 296 6231 6231 9983 149544 54272 239592 0 0 0
```

296 bytes per object.

Kernel 2.6.32 (RHEL 6.4) (without patch)

Kernel 3.10 (with patch):

```
# grep arc /proc/slabinfo
arc_buf_hdr_t 155405 170478 184 21 1 : tunables 120 60 8 : slabdata 8118 8118 0
```

184 bytes per object, a savings of 112 bytes per object. At 2.5 million headers that's a lot.
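(For scale, rough arithmetic from the figures above: 2,558,780 headers × 112 bytes ≈ 286 MB, or roughly 273 MiB, of kernel memory, assuming every header shrinks by the full 112 bytes.)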
(Updated: rebased to current master and fixed style and comment issues. No code changes)
Just rebasing to the current version. It's not functionally changed yet.
@DeHackEd Thanks! I was going to ask that you refresh this so we could have another look and get it merged.
@behlendorf: with #3115 in the pipeline, would the contention concern still be present, or would that ameliorate the problem? |
@sempervictus this tackles a slightly different problem so the concern is still valid. But we should absolutely look at this change in the context of the #3115 work. |
I should have the #3115 work tested and ready for further review by the end of this weekend. I think it's likely fairly close now except for the buildbot failures under the 2.6.32 kernels which I also need to look into. |
Force-pushed from 11043bb to 89fa67e.
Additional rebase, plus added @behlendorf's suggestion to do collision tracking. There's a new arcstat called
Gave it a test run (plus a few other patches like 2129 ABD) with 4 instances of Bonnie++. Testing machine: KVM with 4 CPUs and 5 GB of RAM. Host has 16 GB and a Core i7.

```
# zpool import testpool
# cd /testpool/bigfiles  # pre-created mostly-empty filesystem with recordsize=1M
# for i in 1 2 3 4; do bonnie++ -r 500 -s 2000 -u nobody -d `pwd` & done
```

A total of 6613 collisions occurred. They were rather rare during the main file operations and spiked up towards the end.

@sempervictus patch #3115 is about relieving pressure on the global locks on the ARC. This patch is about eliminating lock objects from each ARC header in memory in order to reduce the size of said objects. Per-object size shrinks from 296 to 184 bytes; the exact size varies by kernel version and build options. The penalty is artificial contention on these objects, since a lock collision does not necessarily mean two threads are accessing the same block.
@DeHackEd this turned out nicely. Let's plan on getting this in after the other significant ARC changes which are being worked on, to avoid any additional conflicts. Along similar lines to this change, it occurred to me it would probably be useful to run
Oh, one more thing. This is subjective, but the collision numbers feel high to me for only four concurrent bonnie++ processes. It would be useful to verify that you are in fact seeing a good distribution across the hash buckets. You could do this by implementing another stat similar to arcstat_hash_chains. Just keep track of the first time a hash bucket is used and increment a counter. It should rapidly approach the hash size if this is working properly; if it doesn't, that's a concern.
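A hedged sketch of what such a stat could look like, reusing ARC_LOCK_BUCKETS from the earlier sketch; arc_bucket_seen, arc_buckets_used, and arc_lock_track_bucket() are hypothetical names, and the Solaris-style atomics are an assumption about available primitives:

```c
static uint32_t arc_bucket_seen[ARC_LOCK_BUCKETS];	/* 0 until first use */
static uint64_t arc_buckets_used;			/* candidate arcstat */

static void
arc_lock_track_bucket(unsigned int idx)
{
	/* Count each bucket exactly once, the first time it is used;
	 * atomic_cas_32() returns the old value, so 0 means we won
	 * the race to mark this bucket as seen. */
	if (arc_bucket_seen[idx] == 0 &&
	    atomic_cas_32(&arc_bucket_seen[idx], 0, 1) == 0)
		atomic_inc_64(&arc_buckets_used);
}
```

If addresses hash uniformly, arc_buckets_used should climb toward ARC_LOCK_BUCKETS quickly under load; a counter that plateaus well below the table size would point at the distribution problem raised above.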
I'm working on making such a test, adding a check to verify that the collision is, in fact, a hash collision and not just legitimate lock contention on a single arc_buf_hdr_t. Unfortunately I think I need a spinlock in front of the whole thing to do it properly. Oh well.
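One hypothetical shape for that check, extending the struct from the earlier sketch with an owner field; arcstat_lock_collisions and the helper names are invented here, not taken from the PR's actual commit:

```c
typedef struct arc_shared_lock {
	kmutex_t	sl_mutex;
	kcondvar_t	sl_cv;
	arc_buf_hdr_t	*sl_owner;	/* header currently holding sl_mutex */
} arc_shared_lock_t;

static void
arc_hdr_lock_enter(arc_buf_hdr_t *hdr)
{
	arc_shared_lock_t *sl = arc_hdr_lock(hdr);

	if (!mutex_tryenter(&sl->sl_mutex)) {
		/* We are about to block: if the current holder is a
		 * different header we hit a hash collision; otherwise
		 * it is legitimate contention. This read races with
		 * sl_owner changing, which is why the comment above
		 * suggests a spinlock around the whole check. */
		if (sl->sl_owner != NULL && sl->sl_owner != hdr)
			ARCSTAT_BUMP(arcstat_lock_collisions); /* hypothetical */
		mutex_enter(&sl->sl_mutex);
	}
	sl->sl_owner = hdr;
}

static void
arc_hdr_lock_exit(arc_buf_hdr_t *hdr)
{
	arc_shared_lock_t *sl = arc_hdr_lock(hdr);

	sl->sl_owner = NULL;
	mutex_exit(&sl->sl_mutex);
}
```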
Here's what I came up with for validation. I hope I got the logic right. Obviously you would not actually take this newest commit. Running 10 instances of Bonnie++ as described above on the same machine, I had 102 freeze collisions which would not have happened without this patch.
@DeHackEd considering the significant memory savings, that sounds like a level we can reasonably live with. Did you happen to verify somehow that the addresses are being hashed fairly? Or explore increasing the hash size to further reduce this?
The code exists - see
I finally got a chance to peek at this and, without making a whole bunch of disjoint code line comments, I can't help but think it would be a bit cleaner and more congruent with the existing code to embed

I've instrumented the crc64-based hashing scheme as part of my #3115 work and can say that it works very well in large-scale testing. It seems to make sense to (re)use it for the new

Not related to this patch at all, but related to memory use and locking efficiency: the method by which
@dweeezil I think that makes a lot of sense. I had a similar thought when looking at this, but I was concerned that relying on a more complicated crc64-based hashing scheme to get a good distribution might be too expensive. The current prime/modulo scheme is very cheap and known to work well for predictable values such as memory addresses. But if using crc64 does in fact work well, I completely agree it would be nicer, cleaner, and simpler to reuse that same code. Let's plan on giving it a try after #3115 is merged (soon I hope).
I completely agree, and in fact I'd go one step further. There is no known right size for these hash tables; it depends heavily on the workload. My suggestion would be that we update this code such that the hash table can be dynamically resized, and have them start out fairly small. If you constrain the hash table to be a power of two, this is a relatively straightforward thing to implement efficiently. This is a case where ZFS can and should dynamically size things; we don't need a user tunable.
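A hedged sketch combining the two suggestions: reuse of zfs_crc64_table (which the existing ARC buffer hash already relies on) plus a power-of-two table so the index is a cheap mask rather than a modulus; arc_lock_table_mask and arc_hdr_addr_hash() are hypothetical names:

```c
extern uint64_t zfs_crc64_table[256];	/* already used by buf_hash() */

static uint64_t arc_lock_table_mask;	/* table size - 1; power of two */

static uint64_t
arc_hdr_addr_hash(const arc_buf_hdr_t *hdr)
{
	uintptr_t addr = (uintptr_t)hdr;
	uint64_t crc = -1ULL;
	int i;

	/* Run the pointer bytes through the existing crc64 table.
	 * With a power-of-two table the index is a mask, and growing
	 * the table only means doubling it and updating the mask. */
	for (i = 0; i < (int)sizeof (addr); i++)
		crc = (crc >> 8) ^
		    zfs_crc64_table[(crc ^ (addr >> (8 * i))) & 0xFF];

	return (crc & arc_lock_table_mask);
}
```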
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Don Brady <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
Ported by: Tim Chase <[email protected]>

Porting notes: Due to the restructuring of the ARC-related structures, this patch conflicts with at least the following existing ZoL commits:

6e1d727 Fix inaccurate arcstat_l2_hdr_size calculations
The ARC_SPACE_HDRS constant no longer exists and has been somewhat equivalently replaced by HDR_L2ONLY_SIZE.

e0b0ca9 Add visibility in to cached dbufs
The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr struct requires additional structure member names to be used when referencing the inner items. Also, the presence of an L1 or L2 inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR macros.
One concern with this advice is that adding any fields, even a uint32_t array offset, increases the size of these objects. Also, I still haven't done the hash fairness verification check. I already know that the allocations come from slabs, which will affect the distribution of the memory addresses. The newest push covers:
Does not:
Under Linux these data types together consume around 100 bytes of space. Having them in the arc_buf_hdr_t makes the header nearly 50% larger than it would be otherwise. When lock debugging is enabled it gets even larger. This patch moves them to a single global array to be shared, similar to how the znode hold locks in zfs_sb work. We're trading some potential for artificial lock contention for drastically smaller kernel memory objects.
Force-pushed from 6770280 to 4994ba6.
Closing as stale.