
root zfs freezes system #154

Closed
Rudd-O opened this issue Mar 10, 2011 · 82 comments

@Rudd-O
Contributor

Rudd-O commented Mar 10, 2011

My system freezes when rsyncing large volumes of data.

The first rsync finished just fine, and copied over 80 GB of data from my closet.

However, a second rsync pass (which is nothing more than stat()s) over the same files (my /home directory) will eventually -- about a minute or so into reading massive numbers of files -- grind the system to a halt. The top view freezes completely while kswapd sits at the top of the process list, and ps ax hangs in the middle of the process listing. Obviously I cannot provide a screenshot of that.

What workarounds can I apply to tell ZFS not to use so much memory? Even if it is slower, I need to see whether this is a memory problem.

Swap lives on another partition of the same SSD. It is not swapping to a file on the zfs volume.

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

It's memory, definitely. Free memory drops rapidly as the rsync runs, then the machine locks up. echo 3 > /proc/sys/vm/drop_caches will free the memory that got eaten, but it takes several seconds to complete.
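
A minimal sketch of that workaround, assuming it is run as root (the preceding sync is mentioned a couple of comments further down):

    # Flush dirty data, then drop pagecache, dentries and inodes.
    # This is the standard Linux knob, nothing ZFS-specific.
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # Watch how much memory actually comes back:
    free -m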

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

Curiously, the cached memory went down by only about a hundred MB, but the free memory went up by about eight hundred MB. So I don't know exactly what is being freed that is not tallied up in the cached-memory counter.

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

Also, if I stop the rsync as the machine is about to hang, arc_reclaim kicks in at some absurd CPU percentage (exaggerating, but it pegs the CPU).

It seems the problem is that arc_reclaim is simply kicking in too late, and by that time the machine is already effectively hung. Remember, none of that memory is swappable.

@behlendorf
Contributor

This is my number one issue to get fixed. I've just started to look at it now that things are pretty stable. Unfortunately, because of the way ZFS has to manage the ARC, cached data isn't reported under 'cached'; instead you'll see it under 'vmalloc_used'. That won't change soon (though I have long-term plans), but in the short term we should be able to do something about the thrashing.

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

How do I reduce the use of the ARC in the meantime?

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

What takes a long time is the sync prior to the drop_caches. During that time, the free memory increases.

@Rudd-O
Contributor Author

Rudd-O commented Mar 10, 2011

Sorry, I take my last comment back. Dropping the caches is what takes long.

@behlendorf
Contributor

You can cap the ARC size by setting the 'zfs_arc_max' module option. However, after the module is loaded you won't be able to change this value through /sys/module/zfs/. Arguably that would be a nice thing to support.
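
A minimal sketch of what that could look like; the file name and the 512 MiB figure are only illustrative, and the option takes a value in bytes, as clarified further down the thread:

    # /etc/modprobe.d/zfs.conf -- cap the ARC at 512 MiB (value in bytes).
    # Takes effect the next time the zfs module is loaded (e.g. after a
    # reboot, or after regenerating the initramfs if zfs loads from there).
    options zfs zfs_arc_max=536870912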

@Rudd-O
Contributor Author

Rudd-O commented Mar 11, 2011

What units does the zfs_arc_max knob take, and how do I set it? Only through modprobe? Then I'd have to add it to the dracut module. I think I will put the dracut thingie under version control and push it to my repo.

@Rudd-O
Contributor Author

Rudd-O commented Mar 11, 2011

My first question is: if I set zfs_arc_max=1024, what is that? Kilobytes?

@behlendorf
Contributor

Right now the ARC is sized to a maximum of 3/4 of all system memory. The zfs_arc_max module option takes a value in bytes, because that's what the same hook on Solaris expects. We could make this as fancy as we need.

@fajarnugraha
Contributor

The unit for zfs_arc_max is bytes, and it only takes effect if it's > 64M (otherwise the code ignores the setting). I have it set to 65M (68157440) right now for reliable operation, and a (very rough) look shows memory usage is at most 512MB higher compared to running without zfs.

@Rudd-O
Contributor Author

Rudd-O commented Mar 14, 2011

~/Projects/Mine/[email protected] α:
cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_max=268435456 zfs_arc_min=0

~/Projects/Mine/[email protected] α:
echo $(( 268435456 / 1024 / 1024 ))
256

Naaah, 64 MB is just TOO SLOW for a root file system to use. I tried that. So I went to 256 MB, and for the most part my machine only stalls two to three times a day (as opposed to right after boot with the unbounded setting). That's progress.

Now, if syscall response could be made faster... I feel like reading files that are already in cache is excruciatingly slow in ZFS compared to ext4, and it really affects application performance:

~/Projects/Mine/[email protected] α:
sudo /home/rudd-o/Projects/Mine/fs-benchmarks/testers/tightloops read /boot/grub/grub.conf 
[sudo] password for rudd-o: 
Beginning to read() /boot/grub/grub.conf...
Stopping...
Speed: 3930443 read calls in 5 seconds, 786088/s

~/Projects/Mine/[email protected] α:
sudo /home/rudd-o/Projects/Mine/fs-benchmarks/testers/tightloops read /etc/localtime
Beginning to read() /etc/localtime...
Stopping...
Speed: 464700 read calls in 5 seconds, 92940/s

An order of magnitude slower, NOT GOOD!

@fajarnugraha
Contributor

IIRC you shouldn't be able to use 64M unless you hack the code (64MB + 1 byte, maybe, but not 64MB) :D

Anyway, related to cached files, I assume you've seen this: behlendorf@450dc14

@Rudd-O
Contributor Author

Rudd-O commented Mar 14, 2011

FWIW, I tried 128, not 64. The code should recognize 64 and add a +1 to it, or be changed to accept 64, since many people with small systems will try that. I remember hacking zfs-fuse deeply to have it accept 16, since I wanted to run it on a 128 MB box. IT WORKED.

@Rudd-O
Contributor Author

Rudd-O commented Mar 14, 2011

That commit improves performance for read()s, not for access()es, which constitute the VAST majority of the file-related syscalls applications execute. Try an strace on kmail when it starts and you will quickly see what I mean.

What I don't understand is this: access() / stat() are supposed to be served from the Linux kernel dentry cache. Does ZFS somehow bypass that and provide its own cache for dentries? That would be the only way I could understand access() / stat() being slower on ZFS.
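
A hedged sketch of measuring that, assuming GNU strace is available and using kmail only because it is the example named above; -c simply tallies syscall counts and times for the run:

    # Count metadata-heavy syscalls issued during application startup.
    strace -f -c -e trace=access,stat,lstat,open kmail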

@behlendorf
Contributor

As you said, the default ARC behavior is to put a minimum bound of 64MiB on the ARC, so setting it to one byte larger than this should work. Changing this check to '>=' instead of '>' seems reasonable to me. I really have no idea how badly it will behave with less than 64MiB, but feel free to try it!

    if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
            arc_c_max = zfs_arc_max;
    if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max)
            arc_c_min = zfs_arc_min;

Regarding access() performance, you're right: an order of magnitude is NO GOOD! This can and will be improved, but in the interest of getting something working sooner rather than later I haven't optimized this code yet.

While the zfs code does use the Linux dentry and inode caches, the values stored in the inode are not 100% authoritative. Currently these values still need to be pulled from the znode. Finishing the unification of the znode and inode is listed as "Inode/Znode Refactoring" in the list of development items, and I expect it will help this issue. There are also a couple of memory allocations which, if eliminated, I'm sure would improve things. Finally, I suspect using oprofile to profile getattr would reveal some other ways to improve things.

All of these changes move us a little further away from using the unmodified Solaris code, but I think that's a price we have to pay in the POSIX layer if we want good performance. Anyway, I'm happy to have help with this work. :) In the meantime I want to fix the other memory issues, which I feel I have a pretty good handle on now and with luck can get fixed this week.

@behlendorf
Contributor

I'm having a horrible time reproducing this VM thrashing issue which has been reported. Nothing I've tried today has been able to recreate it. I've seen kswapd pop up briefly (a fraction of a second) at a large CPU percentage, but it quickly frees the required memory.

I want to get this fixed, but I need a good test case. Can someone who is seeing the problem determine exactly what is needed to recreate it? Also, if you do manage to recreate the issue (and still have a somewhat responsive system), running the following will give me a lot of what I need to get it fixed.

echo t >/proc/sysrq-trigger
echo m >/proc/sysrq-trigger
cat /proc/spl/kstat/zfs/arcstats
cat /proc/meminfo
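
A minimal sketch of capturing that output to files so it can be attached to the issue (the paths are arbitrary; the sysrq task and memory dumps land in the kernel log, hence the dmesg step):

    echo t > /proc/sysrq-trigger
    echo m > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-dump.txt
    cat /proc/spl/kstat/zfs/arcstats > /tmp/arcstats.txt
    cat /proc/meminfo > /tmp/meminfo.txt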

@Rudd-O
Contributor Author

Rudd-O commented Mar 15, 2011

It's simple to reproduce:

  1. run zfs as root file system
  2. use your system normally, load a lot of applications
  3. then start a disk-intensive operation such as a yum upgrade
  4. eventually it'll choke.

I will try to upgrade to a newer kernel today. Running 2.6.37-pre7

@behlendorf
Contributor

Okay, then I'll work on reviewing and pulling your Dracut changes tomorrow. Then I can set up zfs as a root filesystem and see if I can't hit the issue. I know others have hit the problem with it as a non-root filesystem, but I wasn't able to. That's what I get for using test systems; there really isn't anything quite like real usage.

@Rudd-O
Contributor Author

Rudd-O commented Mar 15, 2011

Certainly!

@devsk

devsk commented Mar 15, 2011

Same thing as reported in issue 149.

One way to reproduce this easily would be to boot your Linux system with a smaller amount of RAM than it actually has. Try booting with 512MB using mem=512M as a kernel parameter.

You don't need ZFS as the rootfs to trigger this. A simple 'find' or 'du' on a fairly large FS with a small amount of RAM will trigger it.
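
A hedged sketch of that reproducer (the mem= value and the /tank mount point are illustrative):

    # 1. Boot with artificially limited RAM, e.g. add to the kernel command line:
    #      mem=512M
    # 2. Walk a large ZFS filesystem; a metadata-only scan is enough:
    find /tank -xdev -type f -exec stat {} + > /dev/null
    # or simply:
    du -sh /tank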

I have run into this issue (eventual hard lockup!) so many times that I gave up on ZFS as rootfs.

It's just a matter of the amount of RAM; e.g. given 48GB of RAM, you will probably not run into this issue....:-)

@behlendorf
Contributor

I can try less RAM. I've been booting my box with 2 GiB to try and tickle the issue, but thus far no luck. I'll try less.

@behlendorf
Contributor

Ahh, found it! Or rather, I re-found the issue. Ricardo was working on this with me; I thought it would only affect a Lustre/ZFS server, but it looks like that's not the case. KQ has identified the same bug and opened the following issue with the upstream kernel. That's progress. :)

https://bugzilla.kernel.org/show_bug.cgi?id=30702

http://marc.info/?l=linux-mm&m=128942194520631&w=4

Ricardo had worked out a patch, but it has yet to be merged into the upstream kernel. Joshi from KQ has also attached a proposed fix. Rudd-O, devsk, since you two aren't squeamish about rebuilding your kernel, you can apply the proposed fix to your kernel and rebuild it if you like. It should resolve the deadlock.

https://bugzilla.kernel.org/attachment.cgi?id=50802

@Rudd-O
Contributor Author

Rudd-O commented Mar 15, 2011

I am going to test the patch soon.

@devsk

devsk commented Mar 16, 2011

I think that patch has nothing to do with the memory issue we are facing with ZFS. I have seen a hard lock (which could be the deadlock mentioned above) as well, but mostly thrashing. We do need to concentrate on the thrashing aspect.

@behlendorf
Contributor

Well, my new home NAS hardware just arrived, so once I get it installed (tonight, maybe) hopefully I'll start seeing these real-usage issues too. It's an Intel Atom with only 4GiB of memory, so I'm well motivated to make ZFS work well within those limits. :)

It's hard to say for sure what the problem is without stack traces from the event. If anyone hits it again, please run echo t >/proc/sysrq-trigger; dmesg. That way we'll get stacks from all the processes on the system and we should be able to see what it's thrashing on.

@devsk

devsk commented Mar 16, 2011

One issue we may run into while doing echo t to sysrq is that there are so many tasks that the dmesg buffer may not be big enough to hold them all.

And of course, the system doesn't give you much time to do diagnostics when this happens.
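
One possible mitigation, assuming the kernel log buffer is the limiting factor: enlarge it at boot with the standard log_buf_len parameter, then pull the traces out of dmesg immediately after triggering the dump.

    # Add to the kernel command line (the size must be a power of two):
    #   log_buf_len=16M
    # After reproducing the hang and triggering the dump:
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-tasks.txt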

@Rudd-O
Contributor Author

Rudd-O commented Mar 16, 2011

Kernel 2.6.38 here, no preempt, no thrashing anymore. So far. I still haven't gotten to test the patch. I will keep you informed.

@devsk

devsk commented Mar 16, 2011

I have been running the 2.6.38-rc's for a while and I still see thrashing. Unless something changed in the last week (between rc8 and release), I don't think thrashing is fixed.

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

Brian: I don't see any kswapd thrashing when free memory is available, but when it starts getting tight (~20M free), it surely appears again. But that is to be expected, as kswapd is scrambling to find pages to swap out?

fajarnugraha: the problem with a 65M max ARC is that, well, eh, the system is excruciatingly slow.

I'd rather have the choice to just use the Linux dentry and page caches and forget about the ARC. I understand that presently the ZFS VFS code will still be run even in cases where the dentry cache contains a dentry that userspace is trying to access, or the ARC contains block data that userspace is trying to read, so that is probably the reason we have this huge overhead even on cached data. That is not the case in other file systems -- as soon as something is cached in either the page or the dentry cache, none of the filesystem code needs to be invoked. Bummer for ZFS here.

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

Also, I would like to point out that this is not the case for ZFS-FUSE: there, once data has been cached, the ZFS code is never invoked again. I wrote the patch to make this possible, because without it, ZFS-FUSE was much, much slower than even ZFS in kernel.

@behlendorf
Contributor

I agree with pretty much everything you just said. :) Unfortunately, ZFS is a very different beast than every other Linux filesystem. I would love to integrate it more closely with the Linux dentry/inode/page caches to get exactly the performance improvement you suggest. I'm happy to have a detailed development discussion on exactly how this should be done; I would suggest the place for it is the zfs-devel mailing list. But my first priority is getting the existing implementation stable.

@fajarnugraha
Contributor

Rudd-O: I know that running with arc=65M is slow. :) My point is that currently there's a lot more to zfs memory usage than the ARC, so I wouldn't be surprised if a 512M ARC means much higher overall memory usage. Having a single specified limit for all zfs memory usage would be good, but as Brian mentioned we don't have one right now.

Since your patch to zfs-fuse was good performance-wise, can you easily port it to the in-kernel zfs?

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

No, I cannot. The patch I wrote merely told FUSE to start caching stuff in the dentry cache / pagecache. ZFS in kernel has a different road to travel -- one that involves integrating znodes with inodes to enable proper dentry caching, and other work to enable relying on the pagecache alone.

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

Gawd damn. http://pastebin.com/yZy2TVY4 Kernel BUGs galore, and it's always when checking that BAT0 file. ALWAYS. They have been happening since I patched my kernel. I will have to revert that patch and work with the vanilla kernel.

@behlendorf
Contributor

Not a lot to go on there. The kernel patch shouldn't be needed anymore with the source from master so it will be interesting to see if you still see this with the vanilla kernel.

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

With your latest code AND without the kernel patch, I see no freezes... yet. I haven't done the evil rsync that kills machines (TM). I do feel the machine stuttering when memory gets low, and I also see the memory getting freed in roughly 300MB bunches when the memory watermark hits about ~15MB.


@behlendorf
Contributor

You could try increasing /proc/sys/vm/min_free_kbytes. This is the threshold used by the kernel for how much memory it wants to keep free. Bumping it up a little for now might help with the stuttering by leaving some more headroom.
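
A minimal sketch of trying that (the 65536 figure is only an example, not a recommendation):

    cat /proc/sys/vm/min_free_kbytes           # current value
    echo 65536 > /proc/sys/vm/min_free_kbytes  # as root, takes effect immediately
    # To persist across reboots, the equivalent sysctl is:
    #   vm.min_free_kbytes = 65536   (in /etc/sysctl.conf)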

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

Interesting. I will try that if I continue seeing stuttering.


@devsk

devsk commented Mar 23, 2011

The stuttering comes from the swapping code path; min_free_kbytes won't help with that. In fact, the larger min_free_kbytes is, the earlier the kernel will try to swap.

Unless we fix the loop and have ZFS free memory inline instead of asynchronously in a separate thread (which may have scheduling trouble with kswapd hogging the CPU trying to find free pages), the kernel will continue to swap. There is a lot of potentially conflicting machinery at work here.

@behlendorf
Contributor

devsk, you're right. Thanks for reminding me. Before I got distracted with the GFP_NOFS badness I started to work on a patch to do direct reclaim on the ARC. Right now all reclaim is done by the arc_reclaim thread, and it basically just checks once a second and shrinks the ARC if memory looks low. That of course isn't fast enough for a dynamic environment like a desktop, so swapping kicks in. Adding a direct reclaim path (via a shrinker) should improve things... at least that's the current theory.

@Rudd-O
Contributor Author

Rudd-O commented Mar 23, 2011

YES! THAT IS EXACTLY THE PROBLEM. This is why you see kswapd and arc_reclaim contending for 100% CPU on both cores when this happens (both are desperately trying to free memory at all costs, both enter a very contended race, neither succeeds, and the kernel oopses and says it cannot allocate memory). I am sure that when ARC reclaim is done inline as needed instead of on a separate thread, this problem will be a thing of the past.

So when can we have that juicy bit? :-D


@fajarnugraha
Contributor

I tested a build from master on RHEL with a 2.6.32 kernel, and the dracut change causes an error during "make rpm". The cause is simple: RHEL5 does not recognize the %{_datarootdir} macro in zfs.spec. I had to change it manually to %{_datadir}, and then the build process completed successfully.
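
A sketch of that workaround, assuming it is applied to the zfs.spec in the source tree before running make rpm:

    # Substitute the macro that RHEL5's rpmbuild doesn't know about.
    sed -i 's/%{_datarootdir}/%{_datadir}/g' zfs.spec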

@devsk

devsk commented Mar 26, 2011

So when can we have that juicy bit? :-D

Not anytime soon, I guess... :-D

Surprise me pleasantly, Brian... ;-)

@behlendorf
Contributor

I have a branch now which implements much of this, but it still requires some tuning. Give me a few more days to chew on it and I'll make it public for others to test and see if it improves their workloads. I've been using your rsync/find test workload on a 2GiB Atom system; once it works there, I'll verify it works well on some 16-core, 128GiB-memory systems.

In the process I have also gotten a good handle on your memory usage issue. There's no leak, but find does cause some pretty nasty fragmentation. I'll write up my understanding, perhaps in a bug or on the mailing list, next week. There may be a few easy things which can be done to improve matters. There are also some much harder things which will have to wait. :)

@behlendorf
Contributor

These changes are the result of my recent work on getting a handle on vmalloc() and memory usage. They are still a work in progress but are available for testing. Once I'm happy with everything I'll make a detailed post to zfs-devel explaining memory usage on Linux as it stands today. If you want to test these changes you must update both the spl and zfs source to the shrinker branch.

https://github.com/behlendorf/spl/tree/shrinker

https://github.com/behlendorf/zfs/tree/shrinker

Here's a high level summary of these changes:

  • Reduce spl slab fragmentation by halving the slab size. Excessive fragmentation for certain workloads (find, for example) caused lots of wasted memory. Decreasing the slab size helps, but there is still considerable overhead which will be difficult to address.
  • More useful slab statistics, including the slab size (size) and how much of it is allocated to objects (alloc). This makes it easy to see which slabs are badly fragmented and how much memory that is costing; see /proc/spl/kmem/slab. Additionally, there are now some slab usage summaries in /proc/spl/kmem/slab*.
  • Honor the arc_meta_limit. Previously this limit was not enforced, which could result in metadata consuming your entire ARC cache, which hurts performance. This limit is now enforced and can be set with the zfs_arc_meta_limit module option. It defaults to 1/4 of the ARC cache.
  • Show arc_meta_used and the associated arc stats. Previously these values were not visible in /proc/spl/kstat/zfs/arcstats. Additionally there are now memory_direct_count and memory_indirect_count stats which show how often you hit the direct and indirect reclaim paths (see the sketch after this list).
  • Added direct and indirect memory reclaim paths for the ARC. This should improve behavior under low-memory conditions and prevent OOM events and arc_reclaim/kswapd thrashing.
  • Several bug fixes for issues exposed by testing in a low-memory environment. Details are in the commit logs.
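
A minimal sketch of inspecting the new statistics described above, assuming the shrinker-branch spl/zfs modules are loaded (field names as given in the list):

    grep -E 'arc_meta_used|arc_meta_limit|memory_direct_count|memory_indirect_count' \
        /proc/spl/kstat/zfs/arcstats
    head /proc/spl/kmem/slab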

There is one major and one minor outstanding issue I know of which are preventing these changes from being merged into master.

  • Because the ARC now properly honors the arc_meta_limit, there is additional pressure on the dcache. This additional pressure causes a long-standing bug to be hit more regularly on low-memory systems (2 GiB). This needs to be fixed before this change can be merged.

kernel BUG at fs/inode.c:1333! [ in iput() ]
Putting away a reference on already cleared inode

  • There also remains the smaller issue of the ARC cache being dropped to arc_min when memory pressure is encountered. This only impacts performance, but it needs to be explained and fixed; the ARC should reach a steady state.

@devsk

devsk commented Mar 30, 2011

Wow! That's a lot of work for a week...:-)

The earliest I can test is the weekend, though... :( Swamped with work of my own.

@behlendorf
Contributor

I try to keep busy. No rush to get this tested; it's going to take some time to run down the iput() issue mentioned above. I just wanted to make what I'm thinking about public for comment.

@behlendorf
Contributor

This was fixed in what will be 0.6.0-rc4. Closing this issue; see the comments in issue #149 starting here:

https://github.com/behlendorf/zfs/issues/149#issuecomment-1042925

@Rudd-O
Contributor Author

Rudd-O commented May 8, 2012

kswapd is back to thrashing again (in my observation, any time c_max is set via the kernel module option to 18 or more GB of RAM on a swapless system with 48 GB of memory).

Both kswapd0 and kswapd1 will spin at 100% CPU, pegging two cores of the eight-core machine.

This is bad. I have had to resort to limiting the ARC to around 15 GB, and I am testing again with these parameters:

hash_elements_max 4 1150514
hash_chain_max 4 9
c 4 15372519301
c_min 4 3843129825
c_max 4 15372519301
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 17659
arc_meta_used 4 7688103744
arc_meta_limit 4 7686259650
arc_meta_max 4 7712895136

@behlendorf
Contributor

Is this with the latest spl+zfs master source? Several recent VM changes were merged in there which I expected to make this sort of thing much less likely. If in fact they have had an adverse impact I'd like to get it resolved right away. In particular, are you running with the following commits?

SPL
openzfs/spl@f90096c Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS
openzfs/spl@a9a7a01 Add SPLAT test to exercise slab direct reclaim
openzfs/spl@b78d4b9 Ensure a minimum of one slab is reclaimed
openzfs/spl@06089b9 Ensure direct reclaim forward progress
openzfs/spl@c0e0fc1 Ignore slab cache age and delay in direct reclaim
openzfs/spl@cef7605 Throttle number of freed slabs based on nr_to_scan

ZFS
518b487 Update ARC memory limits to account for SLUB internal fragmentation
302f753 Integrate ARC more tightly with Linux
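
A hedged sketch of checking whether those commits are present in a local checkout (abbreviated hashes as listed above; the spl/zfs directory layout is illustrative):

    cd spl && git log --oneline | grep -E 'f90096c|a9a7a01|b78d4b9|06089b9|c0e0fc1|cef7605'
    cd ../zfs && git log --oneline | grep -E '518b487|302f753'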

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 1, 2015
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed,
since what this code enables is generally frowned upon in the kernel.
But that can't happen until we either abandon the zpool.cache file
or implement alternate infrastructure to update it correctly in
user space.

Signed-off-by: Yuxuan Shui <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#154
prateekpandey14 pushed a commit to prateekpandey14/zfs that referenced this issue Mar 13, 2019
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Apr 30, 2020
= Motivation

We've recently been hitting issues where the crash kernel runs out
of memory while it's being bootstrapped after the main kernel panics.
We could always keep increasing the memory reserved for the crash
kernel at the expense of the memory available to the main kernel. For
customers that may not be an issue, but for developer VMs generally
provisioned with 8GB of memory the impact is higher. As a result,
I've been working on making our crash kernel lighter.

= Patch Description

The DWARF info in the ZFS-related kernel modules dwarfs (in size)
the sections of these binaries that are actually needed for them to
run. This patch splits the debug info of all the ZFS modules into a
separate Debian package and strips down the original modules. This
way the initramfs of the original kernel and the crash kernel is
reduced in size, allowing for a potential reduction in the amount of
memory that needs to be reserved for the crash kernel.

= Implementation Details

When I first decided to tackle this I thought it was a
misconfiguration on our side and `dh_strip` wasn't working as
expected. After contacting the Debian folks that seem to own
the Debian packaging of ZFS, we realized that this is a general
issue and not specific to Delphix's packaging of ZFS. `dh_strip`
doesn't work with kernel modules and was effectively a no-op.
Mo Zhou (one of the aforementioned Debian folks) suggested that
the most straightforward way would be to manually strip these
modules and place them in the right directories, and this is
what I did.

Note: The decision to install the debug info under the directory
/usr/lib/debug/lib/modules was inspired by Fedora which does the
same (couldn't find anything consistent for Debian). Furthermore,
it seems like Ubuntu conventionally does the same.

References:
[1] man debhelper & manuals referenced there
[2] man strip
[3] https://sourceware.org/gdb/current/onlinedocs/gdb/Separate-Debug-Files.html
[4] https://wiki.debian.org/DebugPackage
[5] https://wiki.debian.org/AutomaticDebugPackages

= Results

Initramfs before on ESX:
```
delphix@sd-drgn:~$ ls -lh /var/lib/kdump/initrd.img-5.3.0-42-generic
-rw-r--r-- 1 root root 99M Apr  2 21:57 /var/lib/kdump/initrd.img-5.3.0-42-generic

delphix@sd-drgn:~$ lsinitramfs -l /var/lib/kdump/initrd.img-5.3.0-42-generic | sort -k 5 -n | tail -n 5
-rw-r--r--   1 root     root      6341913 Feb 28 12:40 lib/modules/5.3.0-42-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
-rw-r--r--   1 root     root      8932776 Apr  2 21:55 lib/modules/5.3.0-42-generic/extra/lua/zlua.ko
-rw-r--r--   1 root     root     13886056 Apr  2 21:55 lib/modules/5.3.0-42-generic/extra/icp/icp.ko
-rwxr-xr-x   1 root     root     14185264 Apr  2 21:55 lib/libzpool.so.2.0.0
-rw-r--r--   1 root     root     73782264 Apr  2 21:55 lib/modules/5.3.0-42-generic/extra/zfs/zfs.ko
```

Initramfs after on ESX:
```
delphix@sd-zfs-pkg:~$ ls -lh /var/lib/kdump/initrd.img-5.3.0-45-generic
-rw-r--r-- 1 root root 62M Apr 18 02:08 /var/lib/kdump/initrd.img-5.3.0-45-generic

delphix@sd-zfs-pkg:~$ lsinitramfs -l /var/lib/kdump/initrd.img-5.3.0-45-generic | sort -k 5 -n | tail -n 5
-rw-r--r--   1 root     root      2922456 Apr 18 01:33 lib/libzpool.so.2.0.0
-rw-r--r--   1 root     root      3024337 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/nouveau/nouveau.ko
-rw-r--r--   1 root     root      3161065 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/i915/i915.ko
-rw-r--r--   1 root     root      4065288 Apr 18 01:33 lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko
-rw-r--r--   1 root     root      6341585 Mar 27 12:47 lib/modules/5.3.0-45-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
```

This will cause a ~37% reduction in the size of the initramfs on ESX.
The reduction percentage is higher for AWS: from 139MB (47MB
compressed) to 52MB (19MB compressed), so a 62-64% reduction.

= Testing

Artifacts created by linux-pkg:
```
$ ls linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws*
linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws-dbg_0.8.0-delphix+2020.04.17.21_amd64.deb
linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws_0.8.0-delphix+2020.04.17.21_amd64.deb
```

Artifact content list (note the paths and sizes of *.ko files):
```
$ dpkg --contents linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws_0.8.0-delphix+2020.04.17.21_amd64.deb | grep .ko
-rw-r--r-- root/root     11496 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/avl/zavl.ko
-rw-r--r-- root/root    356048 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/icp/icp.ko
-rw-r--r-- root/root    220232 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/lua/zlua.ko
-rw-r--r-- root/root    107984 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/nvpair/znvpair.ko
-rw-r--r-- root/root    157480 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/spl/spl.ko
-rw-r--r-- root/root    328880 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/unicode/zunicode.ko
-rw-r--r-- root/root    114504 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/zcommon/zcommon.ko
-rw-r--r-- root/root   4049504 2020-04-17 21:41 ./lib/modules/4.15.0-1063-aws/extra/zfs/zfs.ko

$ dpkg --contents linux-pkg/packages/zfs/tmp/artifacts/zfs-modules-4.15.0-1063-aws-dbg_0.8.0-delphix+2020.04.17.21_amd64.deb | grep .ko
-rw-r--r-- root/root    241000 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/avl/zavl.ko
-rw-r--r-- root/root  12478312 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/icp/icp.ko
-rw-r--r-- root/root   8095720 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/lua/zlua.ko
-rw-r--r-- root/root   1300336 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/nvpair/znvpair.ko
-rw-r--r-- root/root   4055432 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/spl/spl.ko
-rw-r--r-- root/root   1064112 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/unicode/zunicode.ko
-rw-r--r-- root/root   3977680 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/zcommon/zcommon.ko
-rw-r--r-- root/root  65634856 2020-04-17 21:41 ./usr/lib/debug/lib/modules/4.15.0-1063-aws/extra/zfs/zfs.ko
```

ab-pre-push link:
(AWS) http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/3356/
(ESX) http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/3357/

Ensure everything is installed in a VM created from the above pre-push:
```
delphix@sd-zfs-pkg:~$ apt search zfs-modules-5.3.0-45-generic
Sorting... Done
Full Text Search... Done
zfs-modules-5.3.0-45-generic/now 0.8.0-delphix+2020.04.18.01 amd64 [installed,local]
  OpenZFS filesystem kernel modules for Linux (kernel 5.3.0-45-generic)

zfs-modules-5.3.0-45-generic-dbg/now 0.8.0-delphix+2020.04.18.01 amd64 [installed,local]
  Debugging symbols for OpenZFS userland libraries and tools

delphix@sd-zfs-pkg:~$ dpkg -L zfs-modules-5.3.0-45-generic-dbg
/.
/usr
/usr/lib
/usr/lib/debug
/usr/lib/debug/lib
/usr/lib/debug/lib/modules
/usr/lib/debug/lib/modules/5.3.0-45-generic
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/avl
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/avl/zavl.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/icp
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/icp/icp.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/lua
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/lua/zlua.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/nvpair
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/nvpair/znvpair.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/spl
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/spl/spl.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/unicode
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/unicode/zunicode.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zcommon
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zcommon/zcommon.ko
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs
/usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko
/usr/lib/debug/modules
/usr/lib/debug/modules/5.3.0-45-generic
/usr/lib/debug/modules/5.3.0-45-generic/extra
/usr/share
/usr/share/doc
/usr/share/doc/zfs-modules-5.3.0-45-generic-dbg
/usr/share/doc/zfs-modules-5.3.0-45-generic-dbg/changelog.Debian.gz
/usr/share/doc/zfs-modules-5.3.0-45-generic-dbg/copyright
```

Ensure that the binaries are connected with the debuglink:
```
delphix@sd-zfs-pkg:~$ ls -lh /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko
-rw-r--r-- 1 root root 3.9M Apr 18 01:33 /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko

delphix@sd-zfs-pkg:~$ ls -lh /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko
-rw-r--r-- 1 root root 68M Apr 18 01:33 /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko

delphix@sd-zfs-pkg:~$ readelf -S /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko | grep debug_info
delphix@sd-zfs-pkg:~$ readelf -S /usr/lib/debug/lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko | grep debug_info
  [51] .debug_info       PROGBITS         0000000000000000  002082dc
  [52] .rela.debug_info  RELA             0000000000000000  0266de30

delphix@sd-zfs-pkg:~$ readelf -x.gnu_debuglink /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko

Hex dump of section '.gnu_debuglink':
  0x00000000 7a66732e 6b6f0000 facaadd7          zfs.ko......
```

Ensure that SDB and crash-python still work:
```
delphix@sd-zfs-pkg:~$  sudo sdb
sdb> spa
ADDR               NAME
------------------------------------------------------------
0xffff8c2616588000 rpool
^D

delphix@sd-zfs-pkg:~$ sudo crash-kcore.sh
...<cropped>...
add symbol table from file "/usr/lib/debug/boot/vmlinux-5.3.0-45-generic" with all sections offset by 0x24800000
Slick Debugger (sdb) initializing...
Loading tasks....... done. (443 tasks total)
Loading modules for 5.3.0-45-genericLoading /lib/modules/5.3.0-45-generic/kernel/net/connstat/connstat.ko at 0xffffffffc0ea1000
...<cropped>...
Loading /lib/modules/5.3.0-45-generic/extra/zfs/zfs.ko at 0xffffffffc0774000
Loading /lib/modules/5.3.0-45-generic/extra/unicode/zunicode.ko at 0xffffffffc0722000
Loading /lib/modules/5.3.0-45-generic/extra/lua/zlua.ko at 0xffffffffc06f3000
Loading /lib/modules/5.3.0-45-generic/extra/avl/zavl.ko at 0xffffffffc036d000
...<cropped>...
(gdb) spa
ADDR           NAME
------------------------------------------------------------
0xffff8c2616588000 rpool
```

I also wanted to make sure that we can still access these files from the
crash kernel if we boot into single-user mode. We can even run sdb as in
the original kernel:
```
You are in rescue mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
root@localhost:~# sdb
sdb> spa
ADDR               NAME
------------------------------------------------------------
0xffff99c16988c000 rpool
```

Finally I made sure to capture a normal crash-dump and analyze it as before:
```
delphix@sd-zfs-pkg:/var/crash/202004181853$ sudo sdb /usr/lib/debug/boot/vmlinux-5.3.0-45-generic dump.202004181853
sdb> spa
ADDR               NAME
------------------------------------------------------------
0xffff899b558bc000 rpool
```

= Future Work

[1] Add modifications to zfs-load so the same thing happens for
developers iterating on ZFS changes internally.

[2] Continue making the crash kernel lighter by disabling unnecessary
subsystems and functionality that consume memory (e.g. memory
hotplugging, systemd cgroup memory statistics metadata, etc.).

[3] Reach out to our contacts from LKDC for further improvements.
sdimitro added a commit to sdimitro/zfs that referenced this issue May 1, 2020
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022
The bincode variable-length integer encoding (varint) is more compact
than fixed-length integer encoding (fixint), but uses more CPU to
process.

This commit changes BlockBasedLog's to use fixint encoding.  This
increases index merge speed by 40%, while increasing the on-disk size of
the index by 7%.

The change is made in a backwards compatible way.
ixhamza added a commit to ixhamza/zfs that referenced this issue Nov 15, 2023
Merge after upstream zfs-2.2-release rc3 tag