sio_cache_0 kernel slab high memory usage during scrub #8662

Closed
mailinglists35 opened this issue Apr 23, 2019 · 17 comments

Comments

@mailinglists35

mailinglists35 commented Apr 23, 2019

System information

Type Version/Name
Distribution Name ubuntu
Distribution Version 18.04lts
Linux Kernel 4.18
Architecture amd64
ZFS Version 0.8.0-rc4
SPL Version 0.8.0-rc4

Describe the problem you're observing

Detailed attachments (arc_summary, /proc/spl/kmem/slab, dmesg) are at https://zfsonlinux.topicbox.com/groups/zfs-discuss/T225c012532a7c86c

When I scrub the pool (10TB 3-way mirror with SSD cache and SSD log), I observe that the kernel slab memory occupied by sio_cache_0 fills up to the point where the kernel starts the OOM killer on innocent apps. These slabs show as unreclaimable in the OOM killer dmesg output.

If I set spl_kmem_cache_expire to 0x01 (Illumos style, 15-second aging return of objects), memory consumption stabilizes around 3GB, down from 10GB with the default 0x02, but I still find that enormous.

Does sio_cache_0 really need all that RAM during a scrub?
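
For reference, the expire-mode change mentioned above can be applied like this (a sketch, assuming the 0.8.x SPL module, which exposes spl_kmem_cache_expire as a writable parameter; the modprobe.d file name is illustrative):

# switch SPL caches to Illumos-style 15-second object aging (0x01)
# instead of the Linux low-memory-notification default (0x02)
echo 1 > /sys/module/spl/parameters/spl_kmem_cache_expire

# or persistently, applied the next time the module loads
echo "options spl spl_kmem_cache_expire=1" > /etc/modprobe.d/spl-expire.conf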

Describe how to reproduce the problem

zpool scrub poolname

Include any warning/errors/backtraces from the system logs

@behlendorf
Contributor

@mailinglists35 thanks for reporting this. The sio_cache by default is limited to 5% of system memory during the scrub, though we may exceed this somewhat due to memory fragmentation. Could you please post the output of the following command when it's consuming 10G. That should let us determine if the core issue here is fragmentation.

cat /proc/slabinfo  | grep sio_cache

It wasn't clear from the mailing list thread how much memory is in your system. Could you include that information as well.

You can further reduce the amount of memory ZFS is allowed to use for the scan by setting the zfs_scan_mem_lim_fact and zfs_scan_mem_lim_soft_fact module options.

zfs_scan_mem_lim_fact (int)

Maximum fraction of RAM used for I/O sorting by sequential scan algorithm.
This tunable determines the hard limit for I/O sorting memory usage.
When the hard limit is reached we stop scanning metadata and start issuing
data verification I/O. This is done until we get below the soft limit.

Default value: 20 which is 5% of RAM (1/20).

zfs_scan_mem_lim_soft_fact (int)

The fraction of the hard limit used to determine the soft limit for I/O sorting
by the sequential scan algorithm. When we cross this limit from below no action
is taken. When we cross this limit from above it is because we are issuing
verification I/O. In this case (unless the metadata scan is done) we stop
issuing verification I/O and start scanning metadata again until we get to the
hard limit.

Default value: 20 which is 5% of RAM (1/20).
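
As a sketch, tightening the hard limit to roughly 1% of RAM would look something like this (assuming the parameter is set via modprobe options; the file name is illustrative):

# cap scan I/O sorting memory at ~1% of RAM (1/100) instead of the default 5% (1/20)
echo "options zfs zfs_scan_mem_lim_fact=100" > /etc/modprobe.d/zfs-scan.conf
# takes effect the next time the zfs module is loaded (or after a reboot)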

@mailinglists35
Author

The system has 16GB of physical RAM.

When slabtop reports 3GB (which is ~20% of RAM), the output of the requested command is:

sio_cache_2       121752 123120    168   48    2 : tunables    0    0    0 : slabdata   2565   2565      0
sio_cache_1       607524 628686    152   53    2 : tunables    0    0    0 : slabdata  11862  11862      0
sio_cache_0       22944850 22945260    136   60    2 : tunables    0    0    0 : slabdata 382421 382421      0
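
For reference, the /proc/slabinfo columns after the cache name are <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>, so the footprint can be estimated with a quick one-liner (a sketch, nothing ZFS-specific):

# approximate bytes held by the sio caches: num_objs * objsize
grep sio_cache /proc/slabinfo | awk '{sum += $3 * $4} END {printf "%.1f GiB\n", sum/1024/1024/1024}'

For the sio_cache_0 line above that is 22945260 * 136 bytes, roughly 2.9 GiB, which matches the ~3GB slabtop reports.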

I will update as well when it reaches the peak.

@mailinglists35
Author

I can't seem to trigger it again, but please leave this open until fragmentation (if that was the cause) gets high enough for it to occur again.

@tcaputi
Contributor

tcaputi commented Apr 24, 2019

@mailinglists35 Even 20% is 15% higher than it should be. Can you please do the following to enable dbgmsg logging and the contained dprintf messages:

echo $(($(cat /sys/module/zfs/parameters/zfs_flags) | 1)) > /sys/module/zfs/parameters/zfs_flags
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable

Then please provide all of the relevant dbgmsg logs, like this:

cat /proc/spl/kstat/zfs/dbgmsg | grep dsl_scan

@mailinglists35
Author

Thank you, will do that.

Related, I stumbled upon this:

I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.
Sidebar: Accurately tracking ZoL slab memory usage

To accurately track ZoL memory usage you must defeat SLUB slab merging somehow. You can turn it off entirely with the slub_nomerge kernel parameter or hack the spl ZoL kernel module to defeat it (see the sidebar here).

Because you can set spl_kmem_cache_slab_limit as a module parameter for the spl ZoL kernel module, I believe that you can set it to zero to avoid having any ZoL slabs be native kernel slabs. This avoids SLUB slab merging entirely and also makes it so that all ZoL slabs appear in /proc/spl/kmem/slab. It may be somewhat less efficient.

Does that still apply to current master? To accurately measure the data, should I boot with slub_nomerge, and/or should I set spl_kmem_cache_slab_limit to zero?
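
For reference, the measurement setup the quoted post describes would look roughly like this (a sketch; slub_nomerge is a stock kernel boot parameter, spl_kmem_cache_slab_limit is an SPL module option, and the modprobe.d file name is illustrative):

# option 1: boot with SLUB slab merging disabled by adding this token to the
#           kernel command line (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub,
#           then update-grub and reboot):
#   slub_nomerge

# option 2: keep SPL caches out of the native kernel slabs so they all show up
#           in /proc/spl/kmem/slab instead of being merged by SLUB
echo "options spl spl_kmem_cache_slab_limit=0" > /etc/modprobe.d/spl-slab.conf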

@tcaputi
Contributor

tcaputi commented Apr 25, 2019

I wouldn't change anything from default until we have a better understanding of what's going on. One of the statements in dbgmsg will look something like this:

current scan memory usage: 0 bytes

This number covers almost all of the memory currently being used by the scan, which is primarily the sio caches but includes other things as well, so don't expect the numbers to line up exactly. If this number is less than the total memory usage of all the sio caches, that would be a reason to start looking at the SPL and the memory allocator. At that point I would say we should try the tunables you mentioned, but I want to sanity check that the scanning code's memory limiting is working properly in the first place.
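
A quick way to make that comparison (a sketch; the grep pattern follows the dbgmsg line quoted above, and the slabinfo sum is simply object count times object size):

# what the scanning code thinks it is using
grep 'scan memory' /proc/spl/kstat/zfs/dbgmsg | tail -n 5

# what the sio caches actually occupy according to the kernel
grep sio_cache /proc/slabinfo | awk '{sum += $3 * $4} END {print sum " bytes"}'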

@tcaputi
Contributor

tcaputi commented Apr 29, 2019

@mailinglists35 any update on this?

@mailinglists35
Author

I can't make it use 10-12GB again :(

@mailinglists35
Author

If this is an issue, I can close it and then reopen it when it occurs again.

@tcaputi
Contributor

tcaputi commented May 14, 2019

@mailinglists35 should I close this issue?

@richardelling
Contributor

Related comment for the archives: I've got a telegraf collector that records /proc/slabinfo into a TSDB. It is unclear to me that there is demand beyond a few of us geeks who track down these errors, so at this point I'm not planning to upstream it to telegraf. Given enough demand, I'll do it, so let me know if it would help.

@mailinglists35
Author

@tcaputi if it prevents a release or causes other administrative trouble, please close it. But can I later reopen or restart the discussion here without having to open a new issue?

@tcaputi
Contributor

tcaputi commented May 15, 2019

Yes, that's fine.

@awnz

awnz commented Oct 9, 2023

I think I'm experiencing this on a homelab/test server now (Proxmox 7.4.17, zfs-2.1.11-pve1). It's happening during a disk resilver. It also happens during a scrub, which I had to abort (the slab freed up when I aborted it).

Admittedly this is a low-memory node (8GB), but when it is busy with the scanning part of the resilver it eats that memory up fast, in a matter of minutes. I've attached two screenshots of btop, slabtop, zpool status and meminfo output, taken about a minute apart, to show the rate at which it balloons. The load on this node is some storage (Linstor) but no active VMs or containers (evacuated because of the instability).
If I don't offline the volume that's resilvering, it will OOM and then panic.

While typing this I noticed there's a version mismatch between zfs-utils (2.1.11-pve1) and the kernel module (2.1.9-pve1), despite all packages being up to date. It's part of a three-node cluster with the same versions across all nodes, but node 2 seems to be the only node suffering from this. Nodes 1 and 3 completed their weekly scrubs on Sunday without issue, but node 2 OOMed and panicked as above.

Any suggestions on what to look at next to isolate/debug this?

[Two screenshots attached: btop, slabtop, zpool status, and meminfo output, taken about a minute apart]
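
One quick check for the userland/kernel-module mismatch mentioned above (a sketch; on 2.x, zfs version prints both the userland and the loaded module version):

# userland tools vs. loaded kernel module
zfs version

# version string of the module currently loaded
cat /sys/module/zfs/version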

@awnz

awnz commented Oct 9, 2023

In the meantime I've stumbled across the zfs_scan_mem_lim_fact and _soft_fact parameters, which seem relevant: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-scan-mem-lim-fact
sio_cache is mentioned in that documentation, and it is what I'm seeing spinning out of control above.
Mine are set to the default 20 (divisor of physical memory), which does not seem to be honoured. I'm trying 100 and will report back.
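
For reference, a runtime change like that would look something like this (a sketch; the parameter is documented as dynamically changeable, and a value written this way does not survive a reboot):

# check the current hard-limit divisor (default 20, i.e. 1/20 of RAM)
cat /sys/module/zfs/parameters/zfs_scan_mem_lim_fact

# tighten it to 1/100 of RAM for the next scrub/resilver
echo 100 > /sys/module/zfs/parameters/zfs_scan_mem_lim_fact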

@amotin
Member

amotin commented Oct 9, 2023

@awnz IIRC scrub should log some status updates to dbgmsg in procfs on every txg. It seems it should also report memory usage there if the ZFS module is built with ZFS_DEBUG and you enable dprintfs via echo 1 > /sys/module/zfs/parameters/zfs_flags. I wonder if it is an accounting issue, a memory leak, or the limits not working right.

@awnz

awnz commented Oct 9, 2023

It seems not to be compiled with debug. I've installed the relevant packages and will have another go at this this weekend.

In the meantime my dirty workaround is to watch for the memory runaway when the resilver enters the scan stage (or reboot from the kernel panic if I've missed it), then offline and online the resilvering disk to break the scans into more manageably sized chunks that actually fit in memory. The memory is eaten while the scan runs but then released as the resilver stage progresses. (Nope, that didn't work.)

Went for options zfs zfs_scan_legacy=1 instead.
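
For anyone trying the same workaround, the persistent form would be roughly this (a sketch; the file name is illustrative, and on Proxmox/Debian the initramfs may need refreshing so it applies at boot):

# fall back to the pre-0.8 non-sorting scrub/resilver code path
echo "options zfs zfs_scan_legacy=1" > /etc/modprobe.d/zfs-scan-legacy.conf
update-initramfs -u

# or flip it at runtime without reloading the module
echo 1 > /sys/module/zfs/parameters/zfs_scan_legacy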

Since a scrub seems to result in the same behaviour, I'll retest with Proxmox debug packages installed and no workload on this node this weekend and report back. (edit 19/10: sorry was unable to test last weekend, will try again when I can)
