zvol swap devices #883

behlendorf · 2012-08-23T04:41:34Z

The patch stack updates the ZFS code to handle ZVOL based swap devices. For example:

  $ zfs create -V 2G tank/swap
  $ mkswap /dev/zvol/tank/swap
  $ swapon /dev/zvol/tank/swap

  Filename              Type        Size    Used    Priority
  /dev/zvol/tank/swap  partition    2097144 1006944 -1

This was accomplished by:

Reverting all use of PF_MEMALLOC and relying instead on the improved SPL slab behavior.
Preallocating vdev aggregate I/O buffers.
Using the PF_NOFS flag to detect all instances of KM_SLEEP in critical I/O paths and changing them to KM_PUSHPAGE.
Changing one instance of vmem_alloc() to kmem_alloc() in mzap_update() to ensure the gfp flags are honored.

With these changes in place I am now able to execute 10 concurrent instances of python -c print 2**10**10. Once the swap device is fully consumed these processes will be cleanly killed by the OOM killer as expected.

The system may appear to hang for 10-15 seconds once it starts swapping heavily, but it will recover. This appears to be due to a sudden need for 8k ARC buffers which cannot be quickly allocated. Once those buffers get allocated the system is once again responsive. So while this is a step in the right direction there is still room for improvement.

behlendorf · 2012-08-23T04:46:58Z

@ryao Can you please carefully review and test these changes? They are working well for me in a RHEL 6.2 VM, but they need significantly more testing.

pyavdr · 2012-08-23T07:08:27Z

I downloaded the behlendorf/spl and zfs master and swap branches and installed them on SUSE 12.1, but it deadlocks as soon as the system uses the swap zvol with python. I don't know if I got all the patches or if they are in that downloaded code. Is there any tutorial or help on how to apply all these commits/patches to the right code base? Or any spot where it can be downloaded at once, without using a local github installation?

behlendorf · 2012-08-23T16:10:12Z

Let's make this a little easier. The following tags are for the patched source. Just use the tarballs linked below:

https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-swap
https://github.com/behlendorf/zfs/tarball/zfs-0.6.0-rc10-swap

Also be aware that while I've resolved all the deadlocks I encountered in my RHEL 6.2 VM there may be others which I just never encountered. The code has been instrumented to detect these so keep an eye on the console logs while running to see if anything gets logged.

Finally, while the code was running deadlock free for me it still needs polish. There were a few instances where it would appear to lock up but it would work itself free in my testing in about 10-15 seconds. It was just allocating a large number of emergency objects due to sudden demand to swap under low memory. I've noticed that performance improved the longer it ran.

Thus far I've had my VM running and swapping heavily for about 12 hours without encountering any new issues.

pyavdr · 2012-08-23T19:17:48Z

It is great news that there is at least one working system with swap on zvols. I installed the tarballs and checked it again. No success with the openSUSE SMP kernel 3.1.10, 16 GB RAM and a swap zvol of 50 GB. No messages in the log or on the console, just a deadlock that needed a hard reset after 5 minutes.

behlendorf · 2012-08-23T20:01:41Z

I have an OpenSuse 12.1 VM with a 3.1 kernel which I'll try this on. Hopefully I'll be able to reproduce your issue and get it fixed. This is exactly why these patches really need some testing on a variety of kernels before getting merged.

If you want to try one more time, building the spl and zfs code with the --enable-debug configure option will enable some additional debugging which might flag the deadlock.

behlendorf · 2012-08-23T22:28:12Z

@pyavdr Your kernel was deadlocking because I missed a few call paths, and because the debug code to automatically detect those cases depends on CONFIG_RT_MUTEXES being disabled in your kernel. I'll be pushing various updates to this branch to detect those cases as they are found. In the meanwhile you could disable CONFIG_RT_MUTEXES.

pyavdr · 2012-08-24T19:15:17Z

@behlendorf
That was really hard work for you and Richard. I'm happy to see that there is another kernel which can swap on zvols. I'm out of the office for the next 12 days. I will test it as time permits after I return.

behlendorf · 2012-08-25T02:29:33Z

@pyavdr That would be great. I'm hoping that with enough testing we could get this merged in a few weeks. It should be a significant improvement not just for swap but general system stability under low memory conditions.

behlendorf · 2012-08-26T01:45:47Z

This branch was refreshed to include a few more fixes and is testing very well. For those who had trouble with the first version this is worth another try. Please let me know if you have any issues. I've queued up this branch for more extensive internal testing at LLNL.

Commit eec8164 worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking things to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726

The commit, cfc9a5c, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code.

The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726

Commit 21ade34 used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0cc eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726

The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because...

1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10.

By keeping a small list of these buffers per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond aggregate I/O and into the raidz implementation. The inability to quickly allocate buffers for the parity stripes could result in similar problems.

Signed-off-by: Brian Behlendorf <[email protected]>
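To make the pre-allocation idea concrete, here is a minimal user-space sketch in C of a small fixed-size buffer pool. It is not the code from this patch: the names io_buf_pool, pool_init, pool_get, pool_put and pool_fini are hypothetical, malloc() stands in for the kernel allocator, and the locking around vq->vq_lock is left out. The slot count of 10 only mirrors the zfs_vdev_max_pending default mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical slot count, mirroring the zfs_vdev_max_pending default. */
#define POOL_SLOTS 10

/* Hypothetical per-vdev pool: a stack of buffers allocated up front. */
struct io_buf_pool {
    void  *slot[POOL_SLOTS];
    size_t nfree;
};

/* Release every buffer still held by the pool. */
static void pool_fini(struct io_buf_pool *p)
{
    while (p->nfree > 0)
        free(p->slot[--p->nfree]);
}

/* Allocate all buffers once, before any memory pressure exists. */
static int pool_init(struct io_buf_pool *p, size_t buf_size)
{
    p->nfree = 0;
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        void *buf = malloc(buf_size);
        if (buf == NULL) {
            pool_fini(p);   /* undo the partial setup */
            return -1;
        }
        p->slot[p->nfree++] = buf;
    }
    return 0;
}

/* Take a buffer without calling the allocator; NULL if none are free. */
static void *pool_get(struct io_buf_pool *p)
{
    return (p->nfree > 0) ? p->slot[--p->nfree] : NULL;
}

/* Hand a buffer back once its I/O completes so the next I/O can reuse it. */
static void pool_put(struct io_buf_pool *p, void *buf)
{
    assert(p->nfree < POOL_SLOTS);
    p->slot[p->nfree++] = buf;
}
```

The point of the design is that pool_get() and pool_put() never allocate, so the I/O path cannot stall on a slow allocation while a lock is held.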

The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard() call paths must only use KM_PUSHPAGE to avoid potential deadlocks during direct reclaim. This patch annotates these call paths so any accidental use of KM_SLEEP will be quickly detected. In the interest of stability if debugging is disabled the offending allocation will have its GFP flags automatically corrected. When debugging is enabled any misuse will be treated as a fatal error.

This patch is entirely for debugging. We should be careful to NOT become dependent on it fixing up the incorrect allocations.

Signed-off-by: Brian Behlendorf <[email protected]>
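The detect-and-correct behavior described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions, not the SPL implementation: the flag values, the struct task type, the sanitize_alloc_flags() name and the result enum are all hypothetical, and the nofs field merely stands in for the PF_NOFS annotation on a thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SPL allocation flags; real values differ. */
enum { KM_SLEEP = 0x1, KM_PUSHPAGE = 0x2 };

/* Hypothetical model of a thread; nofs plays the role of PF_NOFS. */
struct task {
    bool nofs;
};

enum sanitize_result { FLAGS_OK, FLAGS_CORRECTED, FLAGS_FATAL };

/*
 * Check an allocation request made by thread t.
 * - Unannotated threads, or requests that do not use KM_SLEEP, pass through.
 * - An annotated thread using KM_SLEEP is a bug: with debugging enabled it
 *   is reported as fatal, otherwise the flags are rewritten to KM_PUSHPAGE.
 */
static enum sanitize_result
sanitize_alloc_flags(const struct task *t, int *flags, bool debug)
{
    assert(t != NULL && flags != NULL);

    if (!t->nofs || !(*flags & KM_SLEEP))
        return FLAGS_OK;

    if (debug)
        return FLAGS_FATAL;

    *flags = (*flags & ~KM_SLEEP) | KM_PUSHPAGE;
    return FLAGS_CORRECTED;
}
```

A caller in a debug build would presumably panic on FLAGS_FATAL, while a production build would continue with the corrected flags.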

These allocations in mzap_update() used to be kmem_alloc() but were changed to vmem_alloc() due to the size of the allocation. However, since it turns out this function may be called in the context of the txg_sync thread they must be changed back to use a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <[email protected]>

Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any of the following contexts.

* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back into the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must use KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation.

Previously, this behavior was achieved by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touches more of the zfs code, but it is more consistent with the right way to handle these cases under Linux.

This patch lays the groundwork for being able to safely revert the following commits which used PF_MEMALLOC:

21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726

pyavdr · 2012-09-17T19:24:55Z

Today I tested the latest zfsonlinux/zfs and spl versions. I installed them on SUSE 12.1 (kernel 3.1 SMP) and SUSE 12.2 (default kernel 3.4 with preempt), but it deadlocks as soon as the system uses the swap zvol with python. I didn't compile with the debug option and get no kernel messages. It was a plain new installation with no special configuration, just make for spl and zfs.
As far as I understand it should run, but it doesn't. What is wrong?

behlendorf · 2012-09-17T19:47:12Z

@pyavdr Yes, it should... although I confess I have occasionally seen some odd issues in this area with just my SLES test systems. They don't build with CONFIG_RT_MUTEX_TESTER defined for the kernel which I leveraged to automatically detect deadlock situations. Now, testing on other kernels suggests we found all those cases but perhaps the default SLES build options are such that they trigger a less likely case. If you're game I'd suggest rebuilding your kernel with CONFIG_RT_MUTEX_TESTER defined.

pyavdr · 2012-09-18T09:15:26Z

@behlendorf Thank you for the hints. After 6 hours I managed to recompile the kernel with some of the RT MUTEX options enabled (make menuconfig), but I got stuck installing zfs with the new kernel afterwards (invalid module zfs.ko). There are already problems with paths and definitions for the new kernel. I can't imagine how to define CONFIG_RT_MUTEX_TESTER for a new kernel and make zfs run afterwards. There are hundreds of kernel options which may be different from others. Is there any chance to set CONFIG_RT_MUTEX_TESTER without recompiling the kernel, maybe via /proc/sys/kernel, /etc/sysctl.conf or boot.local?

behlendorf · 2012-09-18T17:01:05Z

@pyavdr Interestingly, I'm also able to reproduce similar issues but only under OpenSUSE (11.4 and 12.1). They must be building their kernel with some CONFIG_ option no other kernel uses, or perhaps they are carrying a patch which causes the issue, or even a different default sysctl tuning.

If you're game to pursue this further, please try the following SPL patch with your stock OpenSUSE kernel. It shifts the debugging from the PF_MUTEX_TESTER page bit to what should be an unused page bit in your kernel. THIS IS NOT SAFE FOR ALL KERNELS. This should allow us to safely detect any swap related deadlocks you might be hitting.

Beyond that we'll need to dig in to what's different about OpenSUSE.

diff --git a/include/sys/kmem.h b/include/sys/kmem.h
index 0149e75..1c1482e 100644
--- a/include/sys/kmem.h
+++ b/include/sys/kmem.h
@@ -72,8 +72,9 @@
  * will the PF_NOFS bit be valid.  Happily, most existing distributions
  * ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
  */
-#if !defined(CONFIG_RT_MUTEX_TESTER) && defined(PF_MUTEX_TESTER)
-# define PF_NOFS                       PF_MUTEX_TESTER
+//#if !defined(CONFIG_RT_MUTEX_TESTER) && defined(PF_MUTEX_TESTER)
+#if 1
+# define PF_NOFS                       0x00080000

 static inline void
 sanitize_flags(struct task_struct *p, gfp_t *flags)

pyavdr · 2012-09-18T19:07:02Z

@behlendorf
I changed kmem.h to set PF_NOFS directly and compiled in debug mode. I tried it several times. I got some error stacks of killed pythons when not using zfs as a swap device. But as soon as I added the zvol for swap and increased the number of running pythons so that the system used the zvol swap, the system deadlocked, without any messages in syslog. I examined everything with logwatch. I'm not sure if there are any related messages; maybe the deadlock is caused by another issue not covered by the PF_NOFS flags. So far, no success on this.

pyavdr · 2012-09-20T05:59:47Z

@behlendorf
Today I checked Ubuntu 12.04 (kernel 3.2.0-30 with the latest updates) with zfsonlinux RC11 under VMware Workstation, using a 10 GB zvol as the swap device. After starting some pythons and maybe 2 minutes, the system deadlocked while using about 3 GB of zvol swap space. It looks like the deadlock problem is not specific to openSUSE.

pyavdr · 2012-09-20T12:18:17Z

@behlendorf
Please also check #978; maybe I found a reason for this.

This was referenced Aug 23, 2012

Emergency slab objects openzfs/spl#155

Closed

Make KM_SLEEP an alias of KM_PUSHPAGE openzfs/spl#145

Closed

This was referenced Aug 23, 2012

Use Linux SLAB allocator for SPL SLAB allocations openzfs/spl#147

Closed

zfs blocking everything, out of memory, and daily lockups #860

Closed

behlendorf mentioned this pull request Aug 23, 2012

Support swap on zvol #342

Closed

ryao mentioned this pull request Aug 24, 2012

Fix deadlocks in DMU #726

Closed

This was referenced Aug 26, 2012

z_wr_int can be interrupted when CONFIG_PREEMPT_NONE=y is set #746

Closed

z_wr_iss/0: page allocation failure #540

Closed

ryao and others added 7 commits August 27, 2012 12:01

behlendorf merged commit b8d06fc into openzfs:master Aug 31, 2012

behlendorf mentioned this pull request Aug 31, 2012

ARC grows well past the zfs_arc_max #676

Closed
