zvol swap devices #883

Merged
merged 7 commits into from
Aug 31, 2012

Conversation

behlendorf
Contributor

This patch stack updates the ZFS code to handle ZVOL-based swap devices. For example:

  $ zfs create -V 2G tank/swap
  $ mkswap /dev/zvol/tank/swap
  $ swapon /dev/zvol/tank/swap

  Filename              Type        Size    Used    Priority
  /dev/zvol/tank/swap  partition    2097144 1006944 -1

This was accomplished by:

  • Reverting all use of PF_MEMALLOC and relying instead on the improved SPL slab behavior.
  • Preallocating vdev aggregate I/O buffers.
  • Using the PF_NOFS flag to detect all instances of KM_SLEEP in critical I/O paths and changing them to KM_PUSHPAGE.
  • Changing one instance of vmem_alloc() to kmem_alloc() in mzap_update() to ensure the gfp flags are honored.

With these changes in place I am now able to execute 10 concurrent instances of python -c "print 2**10**10". Once the swap device is fully consumed these processes will be cleanly killed by the OOM killer as expected.
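For a sense of why this one-liner works as a swap stress test, here is a rough back-of-the-envelope sketch in C of how much memory the final integer alone would occupy. It assumes a 64-bit CPython build that stores 30 value bits in each 4-byte digit, which is not something stated in this thread; the helper name is made up for illustration, and temporaries created during exponentiation and printing are ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes needed to hold the value 2**exponent,
 * assuming 30 value bits are stored in each 4-byte digit (an
 * assumption about 64-bit CPython builds, not taken from the patch).
 */
static uint64_t
pow2_storage_bytes(uint64_t exponent)
{
	const uint64_t bits_per_digit = 30;
	const uint64_t bytes_per_digit = 4;
	/* 2**exponent has exponent + 1 significant bits. */
	uint64_t bits = exponent + 1;
	/* Round up to a whole number of digits. */
	uint64_t digits = (bits + bits_per_digit - 1) / bits_per_digit;

	assert(digits <= UINT64_MAX / bytes_per_digit);
	return (digits * bytes_per_digit);
}
```

Under those assumptions, pow2_storage_bytes(10000000000ULL) comes to 1333333336 bytes, so ten concurrent instances would want roughly 12 GiB for the results alone, far more than the 2G swap zvol in the example above.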

The system may appear to hang for 10-15 seconds once it starts swapping heavily, but it will improve. This appears to be due to a sudden need for 8k ARC buffers which cannot be quickly allocated. Once those buffers are allocated the system is responsive again. So while this is a step in the right direction, there is still room for improvement.

@behlendorf
Contributor Author

@ryao Can you please carefully review and test these changes? They are working well for me in a RHEL 6.2 VM, but they need significantly more testing.

@pyavdr
Contributor

pyavdr commented Aug 23, 2012

I downloaded the master and swap branches of behlendorf/spl and behlendorf/zfs and installed them on SUSE 12.1, but the system deadlocks as soon as it uses the swap zvol with python. I don't know if I got all the patches or if they are in the code I downloaded. Is there a tutorial or any help on how to apply all these commits/patches to the right code base? Or a place where everything can be downloaded at once, without using a local GitHub installation?

@behlendorf
Contributor Author

Let's make this a little easier. The following tags are for the patched source. Just use the tarballs linked below:

https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-swap
https://github.com/behlendorf/zfs/tarball/zfs-0.6.0-rc10-swap

Also be aware that while I've resolved all the deadlocks I encountered in my RHEL 6.2 VM, there may be others which I just never encountered. The code has been instrumented to detect these, so keep an eye on the console logs while running to see if anything gets logged.

Finally, while the code was running deadlock free for me, it still needs polish. There were a few instances where it would appear to lock up, but in my testing it would work itself free in about 10-15 seconds. It was just allocating a large number of emergency objects due to sudden demand to swap under low memory. I've noticed that performance improved the longer it ran.

Thus far I've had my VM running and swapping heavily for about 12 hours without encountering any new issues.

@behlendorf behlendorf mentioned this pull request Aug 23, 2012
@pyavdr
Contributor

pyavdr commented Aug 23, 2012

It is great news that there is at least one working system with swap on zvols. I installed the tarballs and checked again. No success with the openSUSE SMP kernel 3.1.10, 16 GB RAM and a 50 GB swap zvol. No messages in the log or on the console, just a deadlock; it needed a hard reset after 5 minutes.

@behlendorf
Contributor Author

I have an OpenSUSE 12.1 VM with a 3.1 kernel which I'll try this on. Hopefully I'll be able to reproduce your issue and get it fixed. This is exactly why these patches really need some testing on a variety of kernels before getting merged.

If you want to try one more time, building the spl and zfs code with the --enable-debug configure option will enable some additional debugging which might flag the deadlock.

@behlendorf
Contributor Author

@pyavdr Your kernel was deadlocking because I missed a few call paths, and because the debug code to automatically detect those cases depends on CONFIG_RT_MUTEXES being disabled in your kernel. I'll be pushing various updates to this branch to detect those cases as they are found. In the meantime you could disable CONFIG_RT_MUTEXES.

@pyavdr
Contributor

pyavdr commented Aug 24, 2012

@behlendorf
That was really hard work for you and Richard. I'm happy to see that there is another kernel which can swap on zvols. I'm out of the office for the next 12 days. I will test it as time permits after I return.

@ryao ryao mentioned this pull request Aug 24, 2012
@behlendorf
Contributor Author

@pyavdr That would be great. I'm hoping that with enough testing we could get this merged in a few weeks. It should be a significant improvement not just for swap but general system stability under low memory conditions.

@behlendorf
Contributor Author

This branch was refreshed to include a few more fixes and is testing very well. For those who had trouble with the first version, this is worth another try. Please let me know if you have any issues. I've queued up this branch for more extensive internal testing at LLNL.

ryao and others added 7 commits August 27, 2012 12:01
Commit eec8164 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking things to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726
The commit, cfc9a5c, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0cc eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests.  Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...

1) These buffers are short lived.  They are only required for
the life of a single I/O at which point they can be used by
the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is
small.  It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.

By keeping a small list of these buffers per-vdev we can ensure
one is always available when we need it.  This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock.  This is
particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond
aggregate I/O and into the raidz implementation.  The inability
to quickly allocate buffers for the parity stripes could result in
similar problems.

Signed-off-by: Brian Behlendorf <[email protected]>
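
The pre-allocation idea described in this commit message can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: the agg_pool names and the fixed size of 10 are made up here (the size simply mirrors the default quoted above), malloc() stands in for the kernel allocator, and the locking that the real code does under vq->vq_lock is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BUFS 10	/* illustrative; mirrors the default of 10 quoted above */

/* Hypothetical per-vdev pool; the names are not taken from the patch. */
struct agg_pool {
	void *idle[POOL_BUFS];	/* stack of buffers not currently in use */
	int nidle;		/* number of buffers on the stack */
};

/* Allocate every buffer up front, while memory is still plentiful. */
static int
agg_pool_init(struct agg_pool *p, size_t bufsize)
{
	p->nidle = 0;
	for (int i = 0; i < POOL_BUFS; i++) {
		void *buf = malloc(bufsize);
		if (buf == NULL) {
			/* Undo the partial setup and report failure. */
			while (p->nidle > 0)
				free(p->idle[--p->nidle]);
			return (-1);
		}
		p->idle[p->nidle++] = buf;
	}
	return (0);
}

/* Hot path: hand out an idle buffer without calling the allocator. */
static void *
agg_pool_get(struct agg_pool *p)
{
	return (p->nidle > 0 ? p->idle[--p->nidle] : NULL);
}

/* Return a buffer once its I/O completes so the next I/O can reuse it. */
static void
agg_pool_put(struct agg_pool *p, void *buf)
{
	assert(p->nidle < POOL_BUFS);
	p->idle[p->nidle++] = buf;
}

/* Release the pool; every buffer must have been returned first. */
static void
agg_pool_fini(struct agg_pool *p)
{
	assert(p->nidle == POOL_BUFS);
	while (p->nidle > 0)
		free(p->idle[--p->nidle]);
}
```

The point of the design is that agg_pool_get() and agg_pool_put() never sleep or allocate, so holding a lock around them stays cheap even when the system is short on memory.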
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.

This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected.   In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected.  When debugging is enabled
any misuse will be treated as a fatal error.

This patch is entirely for debugging.  We should be careful to
NOT become dependent on it fixing up the incorrect allocations.

Signed-off-by: Brian Behlendorf <[email protected]>
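
As a rough user-space model of the behavior this commit message describes (fatal in debug builds, silently corrected otherwise), consider the sketch below. All identifiers are stand-ins invented for illustration; the real KM_* and PF_* definitions live in the SPL and the kernel, and the real check operates on GFP flags of the current task rather than on these toy types.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for KM_SLEEP and KM_PUSHPAGE. */
enum demo_km_flag { DEMO_KM_SLEEP, DEMO_KM_PUSHPAGE };

/* Hypothetical task state: set while inside an annotated call path. */
struct demo_task {
	bool nofs;
};

/*
 * Returns true when the flag was acceptable as passed.  Otherwise a
 * debug build reports a fatal error through *fatal, while a non-debug
 * build quietly rewrites the flag to DEMO_KM_PUSHPAGE.
 */
static bool
demo_sanitize_flags(const struct demo_task *t, enum demo_km_flag *flag,
    bool debug, bool *fatal)
{
	*fatal = false;

	/* Outside annotated paths, or already safe: nothing to do. */
	if (!t->nofs || *flag != DEMO_KM_SLEEP)
		return (true);

	if (debug)
		*fatal = true;			/* misuse is a hard error */
	else
		*flag = DEMO_KM_PUSHPAGE;	/* fix up for stability */

	return (false);
}
```

In this model the automatic fix-up is only a safety net; a caller that relies on it still shows up as a failure in a debug build, which matches the warning above about not depending on it.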
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.

Signed-off-by: Brian Behlendorf <[email protected]>
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any of the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must use KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was achieved by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touches more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This patch lays the groundwork for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#726
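
The deadlock described in this commit message can be pictured with a toy user-space model. Everything here is an assumption made for illustration: the demo_* names are invented, a single boolean stands in for whatever lock the caller holds, and an allocation that may not start I/O is treated as always succeeding, which glosses over memory reserves and allocation failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for KM_SLEEP and KM_PUSHPAGE. */
enum demo_km { DEMO_SLEEP, DEMO_PUSHPAGE };
enum demo_result { DEMO_OK, DEMO_DEADLOCK };

struct demo_fs {
	bool lock_held;		/* a filesystem lock the caller already holds */
	bool memory_low;	/* whether the allocation must reclaim memory */
};

/* Writing out pages needs the same lock; if it is held, nothing moves. */
static enum demo_result
demo_writeback(const struct demo_fs *fs)
{
	return (fs->lock_held ? DEMO_DEADLOCK : DEMO_OK);
}

/* Model of an allocation made from inside the filesystem. */
static enum demo_result
demo_alloc(const struct demo_fs *fs, enum demo_km flag)
{
	if (!fs->memory_low)
		return (DEMO_OK);

	/* Only the sleeping flavor may start I/O to free memory. */
	if (flag == DEMO_SLEEP)
		return (demo_writeback(fs));

	return (DEMO_OK);
}
```

Only the combination of low memory, a held lock, and the sleeping flag ends in DEMO_DEADLOCK here, which is the case the switch to KM_PUSHPAGE in these contexts is meant to rule out.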
@behlendorf behlendorf merged commit b8d06fc into openzfs:master Aug 31, 2012
@pyavdr
Contributor

pyavdr commented Sep 17, 2012

Today I tested the latest zfsonlinux/zfs and spl versions. I installed them on SUSE 12.1 (kernel 3.1 SMP) and SUSE 12.2 (default kernel 3.4 with preempt), but the system deadlocks as soon as it uses the swap zvol with python. I didn't compile with the debug option and got no kernel messages. It was a plain new installation with no special configuration, just make for spl and zfs.
As I understand it, this should work, but it doesn't. What is wrong?

@behlendorf
Contributor Author

@pyavdr Yes, it should... although I confess I have occasionally seen some odd issues in this area with just my SLES test systems. They don't build with CONFIG_RT_MUTEX_TESTER defined for the kernel which I leveraged to automatically detect deadlock situations. Now, testing on other kernels suggests we found all those cases, but perhaps the default SLES build options are such that they trigger a less likely case. If you're game I'd suggest rebuilding your kernel with CONFIG_RT_MUTEX_TESTER defined.

@pyavdr
Contributor

pyavdr commented Sep 18, 2012

@behlendorf Thank you for the hints. After 6 hours I managed to recompile the kernel with some of the RT MUTEX options enabled (make menuconfig). But I got stuck installing zfs with the new kernel afterwards (invalid module zfs.ko). There are already problems with paths and definitions for the new kernel. I can't imagine how to define CONFIG_RT_MUTEX_TESTER for a new kernel and make zfs run afterwards. There are hundreds of kernel options which may differ from other kernels. Is there any chance to set CONFIG_RT_MUTEX_TESTER without recompiling the kernel, maybe via /proc/sys/kernel, /etc/sysctl.conf or boot.local?

@behlendorf
Contributor Author

@pyavdr Interestingly, I'm also able to reproduce similar issues but only under OpenSUSE (11.4 and 12.1). They must be building their kernel with some CONFIG_ option no other kernel uses, or perhaps they are carrying a patch which causes the issue, or even a different default sysctl tuning.

If you're game to pursue this further, please try the following SPL patch with your stock OpenSUSE kernel. It shifts the debugging from the PF_MUTEX_TESTER page bit to what should be an unused page bit in your kernel. THIS IS NOT SAFE FOR ALL KERNELS. This should allow us to safely detect any swap related deadlocks you might be hitting.

Beyond that we'll need to dig in to what's different about OpenSUSE.

diff --git a/include/sys/kmem.h b/include/sys/kmem.h
index 0149e75..1c1482e 100644
--- a/include/sys/kmem.h
+++ b/include/sys/kmem.h
@@ -72,8 +72,9 @@
  * will the PF_NOFS bit be valid.  Happily, most existing distributions
  * ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
  */
-#if !defined(CONFIG_RT_MUTEX_TESTER) && defined(PF_MUTEX_TESTER)
-# define PF_NOFS                       PF_MUTEX_TESTER
+//#if !defined(CONFIG_RT_MUTEX_TESTER) && defined(PF_MUTEX_TESTER)
+#if 1
+# define PF_NOFS                       0x00080000

 static inline void
 sanitize_flags(struct task_struct *p, gfp_t *flags)

@pyavdr
Contributor

pyavdr commented Sep 18, 2012

@behlendorf
I changed kmem.h to set PF_NOFS directly and compiled in debug mode. I tried it several times. I got some error stacks of killed pythons when not using zfs as a swap device. But as soon as I added the zvol for swap and increased the number of running pythons so that the system used the zvol swap, the system deadlocked, without any messages in syslog. I examined everything with logwatch. I'm not sure if there are any related messages; maybe the deadlock is caused by another issue not covered by the PF_NOFS flags. So far, no success on this.

@pyavdr
Contributor

pyavdr commented Sep 20, 2012

@behlendorf
Today I checked Ubuntu 12.04 (kernel 3.2.0-30, latest updates) with zfsonlinux RC11 under VMware Workstation, using a 10 GB zvol as the swap device. After starting some pythons and maybe 2 minutes, the system deadlocked while using about 3 GB of zvol swap space. It looks like the deadlock problem is not specific to openSUSE.

@pyavdr
Contributor

pyavdr commented Sep 20, 2012

@behlendorf
Please also check #978; maybe I found a reason for this.
