root zfs freezes system #154
Comments
It's memory, definitely. Free memory drops rapidly as the rsync runs, then the machine locks up. echo 3 > /proc/sys/vm/drop_caches will free the memory that got eaten, but it takes several seconds to complete.
Curiously, the cached memory went down only about a hundred MB, but the free memory went up like eight hundred MB. So I don't know exactly what is being freed that is not tallied up in the cached memory counter.
Also, if I stop the rsync as the machine is about to hang, arc_reclaim kicks in at 10000000% of CPU (exaggeration of mine). It seems the problem is that arc_reclaim is simply kicking in too late, and by that time the machine is already effectively hung. Remember, none of that memory is swappable.
This is my number 1 issue to get fixed. I've just started to look at it now that things are pretty stable. Unfortunately, because of the way ZFS has to manage the ARC, cached data isn't reported under 'cached'; instead you'll see it under 'vmalloc_used'. That won't change soon (but I have long term plans), but in the short term we should be able to do something about the thrashing.
How do I reduce the use of the ARC in the meantime?
What takes a long time is the sync prior to the drop_caches. During that time, the free memory increases.
Sorry, I take my last comment back. Dropping the caches is what takes long.
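For reference, the sequence being discussed is the standard Linux cache-drop procedure; a minimal sketch, run as root (the several-second stall noted above comes from the ARC giving memory back as part of the drop):
```
# Flush dirty data first, then ask the kernel to drop clean caches.
# On a ZFS-on-Linux system this also forces the ARC to shrink,
# which is the step that can take several seconds.
sync
echo 3 > /proc/sys/vm/drop_caches
```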
You can cap the ARC size by setting the 'zfs_arc_max' module option. However, after the module is loaded you won't be able to change this value through /sys/module/zfs/. Arguably that would be a nice thing to support.
What units does the zfs_arc_max knob take? How do I set it? Only through modprobe? So I'd have to add it to the dracut module then. I think I will put the dracut thingie under version control and push it to my repo.
My first question is: let's say I set zfs_arc_max=1024, what is that? Kilobytes?
Right now the arc cache is sized to a maximum of 3/4 of all system memory. The zfs_arc_max module option actually takes a value in bytes because that's what the same hook on Solaris expects. We could make this as fancy as we need.
The unit for zfs_arc_max is bytes, and it will only matter if it's > 64M (otherwise the code ignores the setting). I set it to 65M (68157440) right now for reliable operation, and a (very rough) look shows memory usage is at most 512MB higher compared to without zfs.
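A minimal sketch of how a module option like this is usually set persistently; the 256 MiB value is only an example, not a recommendation from the thread, and for a ZFS root the dracut initramfs would need regenerating so the option is seen at module load time:
```
# /etc/modprobe.d/zfs.conf -- value is in bytes (here 256 MiB)
options zfs zfs_arc_max=268435456
```
After a reboot the effective value should be readable (though not writable, per the comment above) under /sys/module/zfs/parameters/zfs_arc_max, assuming the parameter is exported there.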
Naaah, 64 MB is just TOO SLOW for a root file system to use. I tried that. So I went to 256 MB, and for the most part my machine only stalls two to three times a day (as opposed to right after boot with the unbounded setting). That's progress. Now, if syscall response could be made faster... I feel like reading files that are already in cache is excruciatingly slow in ZFS compared to ext4, and it really affects application performance:
An order of magnitude slower, NOT GOOD!
IIRC you shouldn't be able to use 64M unless you hack the code (64MB + 1 byte, maybe, but not 64MB) :D Anyway, related to cached files, I assume you've seen this: behlendorf@450dc14
FWIW, I tried 128, not 64. The code should recognize 64 and add a +1 to it, or change to accept 64, since many people with small systems will try that. I remember hacking zfs-fuse deeply to have it accept 16, since I wanted to run it on a 128 MB box. IT WORKED.
That commit improves performance for read()s, not for access()es, which constitute the VAST majority of file-related syscalls applications execute. Try an strace on kmail when it starts and you will quickly see what I mean. What I don't understand is, access() / stat() are supposed to be cached in the Linux kernel dentry cache. Does ZFS somehow bypass that and provide its own cache for dentries? That would be the only way I could understand access() / stat() being slower on ZFS.
As you said, the default ARC behavior is to put a minimum bound of 64MiB on the ARC; setting it to one byte larger than this should work. Changing this check to a '>=' instead of a '>' seems reasonable to me. I really have no idea how badly it will behave with less than 64MiB, but feel free to try it!
Regarding access() performance, you're right, an order of magnitude is NO GOOD! This can and will be improved, but in the interest of getting something working sooner rather than later I haven't optimized this code. While the zfs code does use the Linux dentry and inode caches, the values stored in the inode are not 100% authoritative. Currently these values still need to be pulled from the znode. Finishing the unification of the znode and inode is listed as "Inode/Znode Refactoring" in the list of development items and I expect it will help this issue. There are also a couple of memory allocations which, if eliminated, I'm sure would improve things. Finally, I suspect using oprofile to profile getattr would reveal some other ways to improve things. All of these changes move us a little further away from using the unmodified Solaris code, but I think that's a price we have to pay in the Posix layer if we want good performance. Anyway, I'm happy to have help with this work. :) In the meanwhile I want to fix the other memory issues, which I feel I have a pretty good handle on now and with luck can get fixed this week.
I'm having a horrible time reproducing this VM thrashing issue which has been reported. Nothing I've tried today has been able to recreate the issue. I've seen kswapd pop up briefly (a fraction of a second) to a large CPU percentage, but it quickly frees the required memory. I want to get this fixed, but I need a good test case. Can someone who is seeing the problem determine exactly what is needed to recreate the issue? Also, if you do manage to recreate the issue (and have a somewhat responsive system), running the following will give me a lot of what I need to get it fixed:
echo t >/proc/sysrq-trigger
echo m >/proc/sysrq-trigger
cat /proc/spl/kstat/zfs/arcstats
cat /proc/meminfo
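A sketch of one way to capture all of that into a single file while the machine is still responsive; the output path is arbitrary, and this is just a convenience wrapper around the commands listed above:
```
#!/bin/sh
# Dump task stacks and memory info into the kernel log, then save
# everything requested above into one timestamped file.
out=/root/zfs-thrash-$(date +%s).log
echo t > /proc/sysrq-trigger
echo m > /proc/sysrq-trigger
{ dmesg; cat /proc/spl/kstat/zfs/arcstats; cat /proc/meminfo; } > "$out"
echo "diagnostics written to $out"
```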
It's simple to reproduce:
I will try to upgrade to a newer kernel today. Running 2.6.37-pre7.
Okay, then I'll work on reviewing and pulling your Dracut changes tomorrow. Then I can set up zfs as a root filesystem and see if I can't hit the issue. I know others have hit the problem with it as a non-root filesystem, but I wasn't able to. That's what I get for using test systems; however, there really isn't anything quite like real usage.
Certainly!
Same thing as reported in issue 149. One way to reproduce this easily would be to boot your Linux system with a smaller amount of RAM than it actually has. Try booting with 512MB using mem=512M as a kernel parameter. You don't need ZFS as rootfs to trigger this. A simple 'find' or 'du' on a fairly large FS with a small amount of RAM will trigger this. I have run into this issue (eventual hard lockup!) so many times that I gave up on rootfs. It's just a matter of the amount of RAM. E.g., given 48GB of RAM, you will probably not run into this issue... :-)
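A rough sketch of that reproduction recipe, assuming a GRUB-style bootloader and a pool mounted at /tank (both placeholders, not details from the thread):
```
# Append the RAM cap to the kernel command line in the boot entry, e.g.:
#   kernel /vmlinuz-2.6.37 root=/dev/sda2 ro mem=512M
# After booting, hammer the filesystem with metadata-heavy traversal:
find /tank -xdev > /dev/null
du -sx /tank > /dev/null
```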
I can try less RAM; I've been booting my box with 2 GiB to try and tickle the issue, but thus far no luck. I'll try less.
Ahh, found it! Or rather, I re-found the issue. Ricardo was working on this issue with me; I thought it would only affect a Lustre/ZFS server, but it looks like that's not the case. KQ has identified the same bug and opened the following issue with the upstream kernel. That's progress. :) https://bugzilla.kernel.org/show_bug.cgi?id=30702 http://marc.info/?l=linux-mm&m=128942194520631&w=4 Ricardo had worked out a patch but it has yet to be merged into the upstream kernel. Joshi from KQ has also attached a proposed fix. Rudd-O, Devsk: since you two aren't squeamish about rebuilding your kernel, if you like you can apply the proposed fix to your kernel and rebuild it. It should resolve the deadlock.
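For anyone following along, a sketch of the rebuild workflow being suggested; the patch file name below is hypothetical (use whatever fix is attached to the bugzilla entry above), and the exact configure/install steps depend on your distribution:
```
cd /usr/src/linux
# Apply the proposed vmscan fix, then rebuild and install the kernel.
patch -p1 < ~/proposed-vmscan-fix.patch
make oldconfig
make -j"$(nproc)" && make modules_install && make install
```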
I am going to test the patch soon.
I think that patch has nothing to do with the memory issue that we are facing with ZFS. I have seen a hard lock (which could be the deadlock mentioned above) as well, but mostly thrashing. We do need to concentrate on the thrashing aspect.
Well, my new home NAS hardware just arrived, so once I get it installed (tonight maybe) hopefully I'll start seeing these real usage issues too. It's an Intel Atom with only 4GiB of memory, so I'm well motivated to make ZFS work well within those limits. :) It's hard to say for sure what the problem is without stack traces from the problem. If anyone hits it again please run echo t >/proc/sysrq-trigger; dmesg. That way we'll get stacks from all the processes on the system and we should be able to see what it's thrashing on.
One issue we may run into while doing echo t to sysrq is that there are so many tasks that the dmesg buffer may not be big enough to hold them all. And of course, the system doesn't give you much time to do diagnostics when this happens.
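One common mitigation for the truncation concern (my suggestion, not something confirmed in the thread) is to boot with a larger kernel log buffer and save the dump immediately after triggering it:
```
# Add to the kernel command line (size must be a power of two):
#   log_buf_len=16M
# Then, right after triggering the task dump:
echo t > /proc/sysrq-trigger
dmesg > /root/task-dump.txt
```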
Kernel 2.6.38 here, no preempt, no thrashing anymore. So far. Still haven't gotten to test the patch. I will keep you informed.
I have been running 2.6.38-rc's for a while and I still see thrashing. Unless something changed in the last week (between rc8 and release), I don't think thrashing is fixed.
brian: I don't see any kswapd thrashing when free memory is available, but when it starts getting tight (~20M free), it surely appears again. But that is to be expected, as kswapd is scrambling to find pages to swap out?
fajarnugraha: the problem with a 65M max ARC is that, well, eh, the system is excruciatingly slow. I'd rather have a choice to just use the Linux dentry and page caches, and forget about the ARC. I understand that presently the ZFS VFS code will still be run even in cases where the dentry cache contains a dentry that userspace is trying to access, or the ARC contains block data that userspace is trying to read, so yeah, that is probably the reason we have this huge overhead even on cached data. That is not the case in other file systems -- as soon as something is cached in either the page or the dentry caches, none of the filesystem code needs to be invoked. Bummer for ZFS here.
Also I would like to point out that this is not the case for ZFS-FUSE, for when data has already been cached, the ZFS code is never invoked again. I wrote the patch to make this possible, because without it, ZFS-FUSE was much, much slower than even ZFS in kernel.
I agree with pretty much everything you just said. :) Unfortunately, ZFS is a very different beast than every other Linux filesystem. I would love to integrate it more closely with the Linux dentry/inode/page caches to get exactly the performance improvement you suggest. I'm happy to have a detailed development discussion on exactly how this should be done; I would suggest the place for it would be the zfs-devel mailing list. But my first priority is getting the existing implementation stable.
Rudd-O: I know that running with arc=65M is slow :) My point is that currently there's a lot more to zfs memory usage than the arc, so I wouldn't be surprised if a 512M arc means much higher memory usage. Having a specified limit for all zfs usage would be good, but as Brian mentioned we don't have one right now. Since your patch to zfs-fuse was good performance-wise, can you easily port it to in-kernel zfs?
No, I cannot. The patch I wrote merely told FUSE to start caching stuff in the dentry / pagecache. ZFS in kernel has a different road to travel -- one that involves integrating znodes with inodes to enable proper dentry caching, and other types of work to enable reliance on the pagecache alone.
Gawd damn. http://pastebin.com/yZy2TVY4 Kernel BUGs galore, and it's always when checking that BAT0 file. ALWAYS. They have been happening since I patched my kernel. I will have to revert that patch and work with the vanilla kernel.
Not a lot to go on there. The kernel patch shouldn't be needed anymore with the source from master, so it will be interesting to see if you still see this with the vanilla kernel.
With your latest code AND without the kernel patch, I see no freezes... yet.
You could try increasing /proc/sys/vm/min_free_kbytes. This is the threshold used by the kernel for how much memory it wants to keep free. Bumping it up a little bit for now might help with the stuttering by leaving some more headroom.
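A minimal sketch of trying that, run as root; the 65536 figure is only an illustration, not a value suggested in the thread:
```
# Check the current threshold (in KiB), then raise it.
cat /proc/sys/vm/min_free_kbytes
echo 65536 > /proc/sys/vm/min_free_kbytes
# To persist across reboots, add "vm.min_free_kbytes = 65536" to /etc/sysctl.conf.
```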
Interesting. I will try that if I continue seeing stuttering.
Stuttering comes from the swapping code path. min_free_kbytes won't help with that. In fact, the larger min_free_kbytes is, the earlier the kernel will try to swap. Unless we fix the loop and have zfs free memory inline instead of asynchronously in a separate thread (which may have scheduling trouble with kswapd hogging the CPU trying to find free pages), the kernel will continue to swap. There is a lot of potentially conflicting machinery at work here.
Devsk, you're right. Thanks for reminding me. Before I got distracted with the GFP_NOFS badness I started to work on a patch to do direct reclaim on the ARC. Right now all reclaim is done by the arc_reclaim thread, and it basically just checks once a second and shrinks the ARC if memory looks low. That of course isn't fast enough for a dynamic environment like a desktop, so swapping kicks in. Adding the direct reclaim path (via a shrinker) should improve things... at least that's the current theory.
YES! THAT IS EXACTLY THE PROBLEM. This is why you see kswapd and arc_reclaim. So when can we have that juicy bit? :-D
I tested a build from master on RHEL with a 2.6.32 kernel; the dracut change caused an error during "make rpm". The cause is simple: RHEL5 does not recognize the %{_datarootdir} macro in zfs.spec. I had to change it manually to %{_datadir}, and then the build process completed successfully.
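A sketch of the manual workaround described above, assuming the macro is edited in the generated zfs.spec before rerunning the build (the spec may actually be produced from a .in template, in which case edit that instead):
```
# RHEL5's rpm does not define %{_datarootdir}; substitute %{_datadir}.
sed -i 's/%{_datarootdir}/%{_datadir}/g' zfs.spec
make rpm
```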
Not anytime soon, I guess...:-D Surprise me pleasantly, Brian...;-)
I have a branch now which implements much of this, but it still requires some tuning. Give me a few more days to chew on it and I'll make it public for others to test and see if it improves their workloads. I've been using your rsync/find test workload on a 2GiB Atom system; once it works there I'll verify it works well on some 16-core 128GiB memory systems. In the process I have also gotten a good handle on your memory usage issue. There's no leak, but find does cause some pretty nasty fragmentation. I'll write up my understanding, perhaps in a bug or on the mailing list, next week. There may be a few easy things which can be done to improve things. There are also some much harder things which will have to wait. :)
These changes are the result of my recent work to get a handle on vmalloc() and memory usage. They are still a work in progress but available for testing (see the checkout sketch after this comment). Once I'm happy with everything I'll make a detailed post to zfs-devel explaining the memory usage on Linux as it stands today. If you want to test these changes you must update both the spl and zfs source to the shrinker branch. https://github.com/behlendorf/spl/tree/shrinker https://github.com/behlendorf/zfs/tree/shrinker Here's a high level summary of these changes:
There is one major and one minor outstanding issue I know of which are preventing these changes from being merged into master.
kernel BUG at fs/inode.c:1333! [ in iput() ]
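A sketch of how one might check out and build those shrinker branches for testing, assuming they are still published at the URLs above; the build commands in the comments are the usual autotools flow for ZFS on Linux of that era, not instructions from the thread:
```
git clone https://github.com/behlendorf/spl.git && (cd spl && git checkout shrinker)
git clone https://github.com/behlendorf/zfs.git && (cd zfs && git checkout shrinker)
# Build and install SPL first, then ZFS against it, roughly:
#   cd spl && ./autogen.sh && ./configure && make && sudo make install
#   cd ../zfs && ./autogen.sh && ./configure && make && sudo make install
```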
Wow! That's a lot of work for a week...:-) The earliest I can test is the weekend though...:( Swamped with work of my own.
I try and keep busy. No rush to get this tested; it's going to take some time to run down the iput() issue mentioned above. I just wanted to make what I'm thinking about public for comment.
This was fixed in what will be 0.6.0-rc4. Closing the issue; see the comments in issue #149 starting here: https://github.com/behlendorf/zfs/issues/149#issuecomment-1042925
kswapd is back to thrashing again (in my observation, any time that c_max is set via kernel module option to 18 or more GB of RAM on a swapless system with 48 GB of memory). Both kswapd0 and kswapd1 will spin at 100% CPU, pegging two cores of the eight-core machine. This is bad; I have had to resort to limiting the ARC to around 15 GB and I am testing again with these params: hash_elements_max 4 1150514
Is this with the latest spl+zfs master source? Several recent VM changes were merged in there which I expected to make this sort of thing much less likely. If in fact they have had an adverse impact I'd like to get it resolved right away. In particular, are you running with the following commits? SPL ZFS
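For anyone trying to correlate reports like this, a quick way to watch the ARC target against the configured limits while reproducing; the field names come from /proc/spl/kstat/zfs/arcstats and may differ slightly between versions:
```
watch -n1 'grep -E "^(c|c_max|c_min|size) " /proc/spl/kstat/zfs/arcstats; grep -E "MemFree|SwapFree" /proc/meminfo'
```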
The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock. This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function which is usually what we prefer. Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels. Long term I would very much like to see all of the vn code removed, since what this code enables is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#154
…penzfs#154) Signed-off-by: mayank <[email protected]>
= Motivation We've recently been hitting issues where the crash-kernel runs out of memory while it's being bootstrapped after the main kernel panics. We could always keep increasing the memory reserved for the crash kernel at the expense of the memory available to the main kernel. For customers that may not be an issue, but for developer VMs generally provisioned with 8GB of memory the impact is higher. As a result, I've been working on making our crash kernel lighter. = Patch Description The DWARF info from the ZFS-related kernel modules "dwarf"s in size the sections of these binaries that are actually needed for them to run. This patch splits the debug info of all the ZFS modules into a separate debian package and strips down the original modules. This way the initramfs of the original kernel and the crash kernel is reduced in size, allowing for a potential reduction in the amount of memory needed to be reserved for the crash kernel. = Implementation Details When I first decided to tackle this I thought this was a misconfiguration on our side and `dh_strip` wasn't working as expected. After contacting the Debian folks that seem to own the Debian packaging of ZFS for Debian, we realized that this is a general issue and not specific to Delphix's packaging of ZFS. `dh_strip` doesn't work with kernel modules and it was effectively a no-op. Mo Zhou (one of the aforementioned Debian folks) suggested that the most straightforward way would be to manually strip these modules and place them in the right directories, and this is what I did. Note: The decision to install the debug info under the directory /usr/lib/debug/lib/modules was inspired by Fedora which does the same (couldn't find anything consistent for Debian). Furthermore, it seems like Ubuntu conventionally does the same.
References: [1] man debhelper & manuals referenced there [2] man strip [3] https://sourceware.org/gdb/current/onlinedocs/gdb/Separate-Debug-Files.html [4] https://wiki.debian.org/DebugPackage [5] https://wiki.debian.org/AutomaticDebugPackages = Results Initramfs before on ESX: ``` delphix@sd-drgn:~$ ls -lh /var/lib/kdump/initrd.img-5.3.0-42-generic -rw-r--r-- 1 root root 99M Apr 2 21:57 /var/lib/kdump/initrd.img-5.3.0-42-generic delphix@sd-drgn:~$ lsinitramfs -l /var/lib/kdump/initrd.img-5.3.0-42-generic | sort -k 5 -n | tail -n 5 -rw-r--r-- 1 root root 6341913 Feb 28 12:40 lib/modules/5.3.0-42-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko -rw-r--r-- 1 root root 8932776 Apr 2 21:55 lib/modules/5.3.0-42-generic/extra/lua/zlua.ko -rw-r--r-- 1 root root 13886056 Apr 2 21:55 lib/modules/5.3.0-42-generic/extra/icp/icp.ko -rwxr-xr-x 1 root root 14185264 Apr 2 21:55 lib/libzpool.so.2.0.0 -rw-r--r-- 1 root root 73782264 Apr 2 21:55 lib/modules/5.3.0-42-generic/extra/zfs/zfs.ko ``` Initramfs after on ESX: ``` delphix@sd-zfs-pkg:~$ ls -lh /var/lib/kdump/initrd.img-5.3.0-45-generic -rw-r--r-- 1 root root 62M Apr 18 02:08 /var/lib/kdump/initrd.img-5.3.0-45-generic delphix@sd-zfs-pkg:~$ lsinitramfs -l /var/lib/kdump/initrd.img-5.3.0-45-generic | sort -k 5 -n | tail -n 5 -rw-r--r-- 1 root root 2922456 Apr 18 01:33 lib/libzpool.so.2.0.0 -rw-r--r-- 1 root root 3024337 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/nouveau/nouveau.ko -rw-r--r-- 1 root root 3161065 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/i915/i915.ko -rw-r--r-- 1 root root 4065288 Apr 18 01:33 lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko -rw-r--r-- 1 root root 6341585 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko ``` This will cause a ~37% reduction to the size of initramfs in ESX. The reduction percentage is higher for AWS - from 139MB (47MB compressed) to 52MB(19MB compressed), so 62~64% reduction. 
= Testing Artifacts created by linux-pkg: ``` $ ls linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws* linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws-dbg_0.8.0-delphix+2020.04.17.21_amd64.deb linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws_0.8.0-delphix+2020.04.17.21_amd64.deb ``` Artifact content list (note the paths and sizes of *.ko files): ``` $ dpkg --contents linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws_0.8.0-delphix+2020.04.17.21_amd64.deb | grep .ko -rw-r--r-- root/root 11496 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/avl/zavl.ko -rw-r--r-- root/root 356048 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/icp/icp.ko -rw-r--r-- root/root 220232 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/lua/zlua.ko -rw-r--r-- root/root 107984 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/nvpair/znvpair.ko -rw-r--r-- root/root 157480 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/spl/spl.ko -rw-r--r-- root/root 328880 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/unicode/zunicode.ko -rw-r--r-- root/root 114504 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/zcommon/zcommon.ko -rw-r--r-- root/root 4049504 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/zfs/zfs.ko $ dpkg --contents linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws-dbg_0.8.0-delphix+2020.04.17.21_amd64.deb | grep .ko -rw-r--r-- root/root 241000 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/avl/zavl.ko -rw-r--r-- root/root 12478312 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/icp/icp.ko -rw-r--r-- root/root 8095720 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/lua/zlua.ko -rw-r--r-- root/root 1300336 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/nvpair/znvpair.ko -rw-r--r-- root/root 4055432 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/spl/spl.ko -rw-r--r-- root/root 1064112 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/unicode/zunicode.ko -rw-r--r-- root/root 3977680 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/zcommon/zcommon.ko -rw-r--r-- root/root 65634856 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/zfs/zfs.ko ``` ab-pre-push link: (AWS) http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/3356/ (ESX) http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/3357/ Ensure everything is installed in a VM created from the above pre-push: ``` delphix@sd-zfs-pkg:~$ apt search zfs-modules-5.3.0-45-generic Sorting... Done Full Text Search... Done zfs-modules-5.3.0-45-generic/now 0.8.0-delphix+2020.04.18.01 amd64 [installed,local] OpenZFS filesystem kernel modules for Linux (kernel 5.3.0-45-generic) zfs-modules-5.3.0-45-generic-dbg/now 0.8.0-delphix+2020.04.18.01 amd64 [installed,local] Debugging symbols for OpenZFS userland libraries and tools delphix@sd-zfs-pkg:~$ dpkg -L zfs-modules-5.3.0-45-generic-dbg /. 
/usr /usr/lib /usr/lib/debug /usr/lib/debug/lib /usr/lib/debug/lib/modules /usr/lib/debug/lib/modules/5.3.0-45-generic /usr/lib/debug/lib/modules/5.3.0-45-generic/extra /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/avl /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/avl/zavl.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/icp /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/icp/icp.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/lua /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/lua/zlua.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/nvpair /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/nvpair/znvpair.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/spl /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/spl/spl.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/unicode /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/unicode/zunicode.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zcommon /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zcommon/zcommon.ko /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko /usr/lib/debug/modules /usr/lib/debug/modules/5.3.0-45-generic /usr/lib/debug/modules/5.3.0-45-generic/extra /usr/share /usr/share/doc /usr/share/doc/zfs-modules-5.3.0-45-generic-dbg /usr/share/doc/zfs-modules-5.3.0-45-generic-dbg/changelog.Debian.gz /usr/share/doc/zfs-modules-5.3.0-45-generic-dbg/copyright ``` Ensure that the binaries are connected with the debuglink: ``` delphix@sd-zfs-pkg:~$ ls -lh /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko -rw-r--r-- 1 root root 3.9M Apr 18 01:33 /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko delphix@sd-zfs-pkg:~$ ls -lh /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko -rw-r--r-- 1 root root 68M Apr 18 01:33 /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko delphix@sd-zfs-pkg:~$ readelf -S /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko | grep debug_info delphix@sd-zfs-pkg:~$ readelf -S /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko | grep debug_info [51] .debug_info PROGBITS 0000000000000000 002082dc [52] .rela.debug_info RELA 0000000000000000 0266de30 delphix@sd-zfs-pkg:~$ readelf -x.gnu_debuglink /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko Hex dump of section '.gnu_debuglink': 0x00000000 7a66732e 6b6f0000 facaadd7 zfs.ko...... ``` Ensure that SDB and crash-python still work: ``` delphix@sd-zfs-pkg:~$ sudo sdb sdb> spa ADDR NAME ------------------------------------------------------------ 0xffff8c2616588000 rpool ^D delphix@sd-zfs-pkg:~$ sudo crash-kcore.sh ...<cropped>... add symbol table from file "/usr/lib/debug/boot/vmlinux-5.3.0-45-generic" with all sections offset by 0x24800000 Slick Debugger (sdb) initializing... Loading tasks....... done. (443 tasks total) Loading modules for 5.3.0-45-genericLoading /lib/modules/5.3.0-45-generic/kernel/net/connstat/connstat.ko at 0xffffffffc0ea1000 ...<cropped>... Loading /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko at 0xffffffffc0774000 Loading /lib/modules/5.3.0-45-generic/extra/unicode/zunicode.ko at 0xffffffffc0722000 Loading /lib/modules/5.3.0-45-generic/extra/lua/zlua.ko at 0xffffffffc06f3000 Loading /lib/modules/5.3.0-45-generic/extra/avl/zavl.ko at 0xffffffffc036d000 ...<cropped>... (gdb) spa ADDR NAME ------------------------------------------------------------ 0xffff8c2616588000 rpool ``` I also wanted to make sure that we can still access these files from the crash kernel if we boot into single-user mode. 
We can even run sdb as in the original kernel: ``` You are in rescue mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode. Give root password for maintenance (or press Control-D to continue): root@localhost:~# sdb sdb> spa ADDR NAME ------------------------------------------------------------ 0xffff99c16988c000 rpool ``` Finally I made sure to capture a normal crash-dump and analyze it as before: ``` delphix@sd-zfs-pkg:/var/crash/202004181853$ sudo sdb /usr/lib/debug/boot/vmlinux-5.3.0-45-generic dump.202004181853 sdb> spa ADDR NAME ------------------------------------------------------------ 0xffff899b558bc000 rpool ``` = Future Work [1] Add modifications to zfs-load so the same thing happens for developers iterating on ZFS changes internally. [2] Continue making the crash-kernel lighter by disabling unecessary subsystems and functionality that consume memory (e.g. memory hotplugging, systemd cgroup memory statistic metadata, etc..). [3] Reach out to our contacts from LKDC for further improvements.
The bincode variable-length integer encoding (varint) is more compact than fixed-length integer encoding (fixint), but uses more CPU to process. This commit changes BlockBasedLogs to use fixint encoding. This increases index merge speed by 40%, while increasing the on-disk size of the index by 7%. The change is made in a backwards-compatible way.
Merge after upstream zfs-2.2-release rc3 tag
My system freezes when rsyncing large volumes of data.
The first rsync finished just fine, and copied over 80 GB of data from my closet.
However, a second rsync pass (which is nothing more than stat()s) on the same files (my /home directory) will eventually -- about a minute or so into reading massive numbers of files -- grind the system to a halt. The top view freezes completely while kswapd is at the top of the process list, and ps ax hangs in the middle of the process listing. Obviously I cannot provide a screenshot of that.
What workarounds can I apply to tell ZFS not to use so much memory? Even if it is slower, I need to see if this is a memory problem.
Swap rests on another partition of the same SSD. It is not swapping with a file on the zfs volume.
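For context, a sketch of the kind of two-pass rsync that triggers this; the source and destination paths are placeholders, not the reporter's actual layout:
```
# First pass copies the data (~80 GB in the report above).
rsync -aHx /mnt/backup/home/ /home/
# Second pass re-walks the same tree; it is almost entirely stat() traffic
# and is the step that drives the machine into the kswapd/ARC thrash.
rsync -aHx /mnt/backup/home/ /home/
```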