kernel slab high memory usage during scrub OOM kill other applications #11429

Closed
ufou opened this issue Jan 4, 2021 · 1 comment
Labels
Status: Triage Needed (New issue which needs to be triaged), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments


ufou commented Jan 4, 2021

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 5.4.0-58-generic
Architecture amd64
ZFS Version 0.8.3-1ubuntu12.5
SPL Version 0.8.3-1ubuntu12.5

Describe the problem you're observing

We run the Ubuntu HWE kernel, which means we get the 0.8.* version of zfs/spl. Our issue is probably the same as #8662.

We run MySQL (MariaDB, actually) using ZFS volumes for data and backup space (separate volumes). A scrub runs from cron every 4 weeks and takes ~4 hours. On our replicas the scrub generally completes without issue, but on the primary we have seen MySQL crash (OOM killed on the most recent occurrence).
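For context, the scrub is driven by a cron entry roughly like the following (an illustrative sketch only; the actual schedule line is not part of this report):

# illustrative /etc/cron.d entry: scrub the data pool at 02:00 on the 1st of each month
0 2 1 * * root /sbin/zpool scrub mysqldata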

The servers are Intel Xeon Gold with 512 GB RAM; the disks are 6 x 3.8 TB Intel S4510 SSDs arranged as 3 mirrored pairs.
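That layout corresponds to a pool created along these lines (a sketch only; the device names here are placeholders, not taken from the report):

zpool create mysqldata \
  mirror /dev/disk/by-id/ssd-S4510-1 /dev/disk/by-id/ssd-S4510-2 \
  mirror /dev/disk/by-id/ssd-S4510-3 /dev/disk/by-id/ssd-S4510-4 \
  mirror /dev/disk/by-id/ssd-S4510-5 /dev/disk/by-id/ssd-S4510-6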

Describe how to reproduce the problem

Start a scrub on the data pool, then watch /proc/meminfo for SUnreclaim growth:

zpool scrub mysqldata
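The snapshots below come from a watch loop along these lines (inferred from the "Every 2.0s" header; not quoted verbatim from the report):

watch 'cat /proc/meminfo | grep claim'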
Every 2.0s: cat /proc/meminfo | grep claim                                                                                                                          
Mon Jan  4 19:04:27 2021

KReclaimable:    2442512 kB
SReclaimable:    2442512 kB
SUnreclaim:      1932272 kB 

About 30 seconds later:

Every 2.0s: cat /proc/meminfo | grep claim                                                                                                                          
Mon Jan  4 19:05:02 2021

KReclaimable:    2442976 kB
SReclaimable:    2442976 kB
SUnreclaim:      7637196 kB

Then stop the scrub:

zpool scrub -s mysqldata

Check again:

Every 2.0s: cat /proc/meminfo | grep claim                                                                                                                          
Mon Jan  4 19:06:05 2021

KReclaimable:    2442976 kB
SReclaimable:    2442976 kB
SUnreclaim:      1970984 kB

I was unable to alter the SUnreclaim growth by changing /sys/module/zfs/parameters/zfs_scan_mem_lim_fact or /sys/module/zfs/parameters/zfs_scan_mem_lim_soft_fact, or by writing to /sys/module/zfs/parameters/zfs_scrub_delay (permission denied even as root).
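For reference, those attempts looked roughly like this (a sketch; 40 is an arbitrary example value, not a recommendation):

# divisors controlling how much memory the sorted scrub may use for queued I/Os
cat /sys/module/zfs/parameters/zfs_scan_mem_lim_fact
cat /sys/module/zfs/parameters/zfs_scan_mem_lim_soft_fact
# tighten the hard limit, e.g. to 1/40th of RAM
echo 40 > /sys/module/zfs/parameters/zfs_scan_mem_lim_fact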

Include any warning/errors/backtraces from the system logs

cat /proc/meminfo | grep claim
KReclaimable:    2453676 kB
SReclaimable:    2453676 kB
SUnreclaim:     16378036 kB
cat /proc/slabinfo  | grep sio_cache
sio_cache_2       2310396 2310528    168   48    2 : tunables    0    0    0 : slabdata  48136  48136      0
sio_cache_1       237122 237122    152   53    2 : tunables    0    0    0 : slabdata   4474   4474      0
sio_cache_0       106508040 106508040    136   30    1 : tunables    0    0    0 : slabdata 3550268 3550268      0
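As a rough cross-check (arithmetic added for illustration, not from the original report): 106,508,040 sio_cache_0 objects at 136 bytes each is about 13.5 GiB, which accounts for the bulk of the ~15.6 GiB of SUnreclaim above. A per-cache estimate can be pulled straight from slabinfo:

# approximate memory per sio_cache slab cache (num_objs * objsize)
awk '/^sio_cache/ { printf "%-14s %8.1f MiB\n", $1, $3 * $4 / 1048576 }' /proc/slabinfo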
ufou added the Status: Triage Needed and Type: Defect labels on Jan 4, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue May 29, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to PAGE_SIZE on systems using larger pages. Since 16,384 bytes
was experimentally determined to yield the best performance on
4K page systems this is used as the cutoff. This means on 4K
page systems there is no functional change.

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 2, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 3, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures. 

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
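On a live system the cutoff the commit describes is exposed as an SPL module parameter. A sketch of checking it and applying the 16K value, assuming the running module allows runtime writes (otherwise it has to be set at module load time, and a runtime change only affects caches created afterwards):

# current cutoff: objects at or below this size use the kernel's slab allocator
cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit
# apply the value chosen by the fix
echo 16384 > /sys/module/spl/parameters/spl_kmem_cache_slab_limit
# or set it persistently for the next module load
echo "options spl spl_kmem_cache_slab_limit=16384" > /etc/modprobe.d/spl.conf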
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jun 4, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 8, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 9, 2021
tonyhutter pushed a commit that referenced this issue Jun 23, 2021
manojkumardevisetty commented

server3@server3:~$ cat /proc/meminfo | grep claim
KReclaimable: 262152 kB
SReclaimable: 262152 kB
SUnreclaim: 49721732 kB
server3@server3:~$ cat /proc/meminfo | grep claim
KReclaimable: 263072 kB
SReclaimable: 263072 kB
SUnreclaim: 49905428 kB
server3@server3:~$ zpool scrub -s pool
cannot cancel scrubbing pool: permission denied
server3@server3:~$ sudo zpool scrub -s pool
cannot cancel scrubbing pool: currently resilvering
server3@server3:~$

I want to cancel the resilvering process, because I have 512 GB of RAM and the slab is slowly eating all of it. Can you help me with this?
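A resilver cannot be cancelled the way a scrub can (the "currently resilvering" error above is ZFS refusing the stop request); the most that can be done is to watch its progress and the slab growth, along these lines (a sketch, assuming the pool is named pool as above):

zpool status -v pool            # resilver progress and estimated completion
grep sio_cache /proc/slabinfo   # the scan I/O caches behind the SUnreclaim growth
slabtop -o -s c | head -n 20    # top slab caches by cache size, printed once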
