zvol swap devices #883
@ryao Can you please carefully review and test these changes? They are working well for me in a RHEL 6.2 VM, but they need significantly more testing. |
I downloaded the behlendorf/spl and zfs master and swap branches and installed them on SUSE 12.1, but the system deadlocks as soon as it uses the swap zvol with python. I don't know whether I got all the patches or whether they are already in the code I downloaded. Is there a tutorial or other help on how to apply all these commits/patches to the right code base? Or is there a place where everything can be downloaded at once, without using a local GitHub installation? |
Let's make this a little easier. The following tags are for the patched source. Just use the tarballs linked below: https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-swap Also be aware that while I've resolved all the deadlocks I encountered in my RHEL 6.2 VM, there may be others which I just never encountered. The code has been instrumented to detect these, so keep an eye on the console logs while running to see if anything gets logged. Finally, while the code was running deadlock free for me, it still needs polish. There were a few instances where it would appear to lock up, but in my testing it would work itself free in about 10-15 seconds. It was just allocating a large number of emergency objects due to sudden demand to swap under low memory. I've noticed that performance improved the longer it ran. Thus far I've had my VM running and swapping heavily for about 12 hours without encountering any new issues. |
It is great news that there is at least one working system with swap on zvols. I installed the tarballs and checked again. No success with the openSUSE SMP kernel 3.1.10, 16 GB RAM, and a 50 GB swap zvol. There were no messages in the log or on the console, just a deadlock that needed a hard reset after 5 minutes. |
I have an OpenSUSE 12.1 VM with a 3.1 kernel that I'll try this on. Hopefully I'll be able to reproduce your issue and get it fixed. This is exactly why these patches really need some testing on a variety of kernels before getting merged. If you want to try one more time, building the spl and zfs code with the --enable-debug configure option will enable some additional debugging which might flag the deadlock. |
@pyavdr Your kernel was deadlocking because I missed a few call paths, and because the debug code to automatically detect those cases depends on CONFIG_RT_MUTEXES being disabled in your kernel. I'll be pushing various updates to this branch to detect those cases as they are found. In the meanwhile you could disable CONFIG_RT_MUTEXES. |
@behlendorf |
@pyavdr That would be great. I'm hoping that with enough testing we could get this merged in a few weeks. It should be a significant improvement not just for swap but general system stability under low memory conditions. |
This branch was refreshed to include a few more fixes and is testing very well. For those who had trouble with the first version, this is worth another try. Please let me know if you have any issues. I've queued up this branch for more extensive internal testing at LLNL. |
Commit eec8164 worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking things to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#726
The commit, cfc9a5c, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#726
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0cc eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#726
The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because...
1) These buffers are short lived. They are only required for the life of a single I/O, at which point they can be used by the next I/O.
2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable, which defaults to 10.
By keeping a small list of these buffers per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system.
It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffers for the parity stripes could result in similar problems.
Signed-off-by: Brian Behlendorf <[email protected]>
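The pre-allocation idea described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the code from the patch: the names io_pool, io_pool_get(), and io_pool_put(), the pool size of 10 (chosen to echo the zfs_vdev_max_pending default), and the 128 KiB buffer size are all hypothetical, and malloc() stands in for the kernel allocator. Locking is omitted; the sketch assumes the caller already serializes access to the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BUFS     10      /* hypothetical: echoes the default of 10 pending I/Os */
#define POOL_BUF_SIZE 131072  /* hypothetical buffer size in bytes */

/* Hypothetical per-vdev pool; every name here is illustrative. */
struct io_pool {
    void  *bufs[POOL_BUFS];  /* stack of free buffers */
    size_t nfree;            /* number of valid entries in bufs[] */
};

/* Allocate every buffer up front, while memory is still plentiful. */
int io_pool_init(struct io_pool *p)
{
    p->nfree = 0;
    for (size_t i = 0; i < POOL_BUFS; i++) {
        void *b = malloc(POOL_BUF_SIZE);
        if (b == NULL) {
            /* Undo the partial setup so the caller sees all or nothing. */
            while (p->nfree > 0)
                free(p->bufs[--p->nfree]);
            return -1;
        }
        p->bufs[p->nfree++] = b;
    }
    return 0;
}

/* Take a buffer without calling the allocator; NULL if the pool is empty. */
void *io_pool_get(struct io_pool *p)
{
    return p->nfree > 0 ? p->bufs[--p->nfree] : NULL;
}

/* Hand a buffer back once its I/O completes so the next I/O can reuse it. */
void io_pool_put(struct io_pool *p, void *buf)
{
    assert(p->nfree < POOL_BUFS);
    p->bufs[p->nfree++] = buf;
}

/* Release all buffers that are currently sitting in the pool. */
void io_pool_fini(struct io_pool *p)
{
    while (p->nfree > 0)
        free(p->bufs[--p->nfree]);
}
```

The point of the design is that io_pool_get() never enters the allocator, so taking a buffer while holding a lock cannot stall on memory pressure. A caller would pair each io_pool_get() with an io_pool_put() when the I/O finishes.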
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard() call paths must only use KM_PUSHPAGE to avoid potential deadlocks during direct reclaim. This patch annotates these call paths so any accidental use of KM_SLEEP will be quickly detected. In the interest of stability, if debugging is disabled the offending allocation will have its GFP flags automatically corrected. When debugging is enabled any misuse will be treated as a fatal error. This patch is entirely for debugging. We should be careful to NOT become dependent on it fixing up the incorrect allocations. Signed-off-by: Brian Behlendorf <[email protected]>
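The detect-or-correct behavior described in this commit message can be modeled roughly as follows. This is a user-space sketch with assumed names: the flag values, the in_pushpage_context and debug_enabled variables, and sanitize_flags() are all hypothetical and do not match the real SPL definitions. A plain global stands in for what would be per-thread state in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag values; the real SPL definitions differ. */
#define KM_SLEEP    0x1u  /* may block and may enter direct reclaim */
#define KM_PUSHPAGE 0x2u  /* may block but must not start I/O */

/* True while the caller is inside a path that must use KM_PUSHPAGE. */
static bool in_pushpage_context = false;

/* True models a build configured with debugging enabled. */
static bool debug_enabled = false;

/* Number of allocations whose flags were silently corrected. */
static unsigned int flags_fixed = 0;

/*
 * Return the flags an allocation should really use. Outside an
 * annotated path the flags pass through untouched. Inside one, a
 * KM_SLEEP request is fatal in debug builds and rewritten to
 * KM_PUSHPAGE otherwise.
 */
unsigned int sanitize_flags(unsigned int flags)
{
    if (!in_pushpage_context || !(flags & KM_SLEEP))
        return flags;

    if (debug_enabled) {
        fprintf(stderr, "KM_SLEEP used in a KM_PUSHPAGE-only context\n");
        abort();
    }

    flags_fixed++;
    flags = (flags & ~KM_SLEEP) | KM_PUSHPAGE;
    assert(!(flags & KM_SLEEP));  /* postcondition of the fix-up */
    return flags;
}
```

An annotated call path would set in_pushpage_context on entry and clear it on exit. As the commit message warns, the silent correction is a safety net only; the counter makes it visible how often callers relied on it.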
These allocations in mzap_update() used to be kmem_alloc() but were changed to vmem_alloc() due to the size of the allocation. However, since it turns out this function may be called in the context of the txg_sync thread, they must be changed back to kmem_alloc() to ensure the KM_PUSHPAGE flag is honored. Signed-off-by: Brian Behlendorf <[email protected]>
Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any of the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim, which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must use KM_PUSHPAGE, which disables performing any I/O to accomplish the memory allocation.
Previously, this behavior was achieved by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touches more of the zfs code, but it is more consistent with the right way to handle these cases under Linux.
This patch lays the groundwork for being able to safely revert the following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#726
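The re-entry problem described in this commit message can be illustrated with a toy model. The sketch below is a simplification under stated assumptions and not kernel code: struct sim, sim_alloc(), and the two modes are hypothetical stand-ins, and instead of actually hanging, the model reports when an allocation would end up waiting on a lock its own caller holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for KM_SLEEP and KM_PUSHPAGE; names are hypothetical. */
enum alloc_mode   { MODE_SLEEP, MODE_PUSHPAGE };
enum alloc_result { ALLOC_OK, ALLOC_WOULD_DEADLOCK };

struct sim {
    bool fs_lock_held;  /* caller holds a non-recursive filesystem lock */
    bool memory_low;    /* forces the allocator into direct reclaim */
};

/* Writing out pages needs the filesystem lock, so it blocks if held. */
static bool writeback_would_block(const struct sim *s)
{
    return s->fs_lock_held;
}

/*
 * Model one allocation. With plenty of memory no reclaim happens.
 * Under pressure, MODE_SLEEP lets reclaim start writeback, which
 * re-enters the filesystem; MODE_PUSHPAGE forbids that I/O, so the
 * allocation never waits on the caller's own lock.
 */
enum alloc_result sim_alloc(const struct sim *s, enum alloc_mode mode)
{
    assert(s != NULL);

    if (!s->memory_low)
        return ALLOC_OK;
    if (mode == MODE_SLEEP && writeback_would_block(s))
        return ALLOC_WOULD_DEADLOCK;
    return ALLOC_OK;
}
```

In this model the deadlock needs all three conditions at once: low memory, a held lock, and an allocation mode that permits I/O. Removing the third condition in the affected call paths is the approach the commit message describes.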
Today I tested the latest zfsonlinux/zfs and spl versions. I installed them on SUSE 12.1 (kernel 3.1 SMP) and SUSE 12.2 (default kernel 3.4 with preempt), but the system deadlocks as soon as it uses the swap zvol with python. I didn't compile with the debug option and got no kernel messages. It was a plain new installation with no special configuration, just a make of spl and zfs. |
@pyavdr Yes, it should... although I confess I have occasionally seen some odd issues in this area with just my SLES test systems. They don't build with CONFIG_RT_MUTEX_TESTER defined for the kernel, which I leveraged to automatically detect deadlock situations. Now, testing on other kernels suggests we found all those cases, but perhaps the default SLES build options are such that they trigger a less likely case. If you're game, I'd suggest rebuilding your kernel with CONFIG_RT_MUTEX_TESTER defined. |
@behlendorf Thank you for the hints. After 6 hours I managed to recompile the kernel with some of the RT MUTEX options enabled (make menuconfig). But I got stuck installing zfs with the new kernel afterwards (invalid module zfs.ko). There are already problems with paths and definitions for the new kernel. I can't imagine how to define CONFIG_RT_MUTEX_TESTER for a new kernel and make zfs run afterwards. There are some hundreds of kernel options which may be different from others. Is there any chance to set CONFIG_RT_MUTEX_TESTER without recompiling the kernel, maybe via /proc/sys/kernel, /etc/sysctl.conf, or boot.local? |
@pyavdr Interestingly, I'm also able to reproduce similar issues but only under OpenSUSE (11.4 and 12.1). They must be building their kernel with some CONFIG_ option no other kernel uses, or perhaps they are carrying a patch which causes the issue, or even a different default sysctl tuning. If you're game to pursue this further, please try the following SPL patch with your stock OpenSUSE kernel. It shifts the debugging from the PF_MUTEX_TESTER page bit to what should be an unused page bit in your kernel. THIS IS NOT SAFE FOR ALL KERNELS. This should allow us to safely detect any swap related deadlocks you might be hitting. Beyond that we'll need to dig in to what's different about OpenSUSE.
The patch stack updates the ZFS code to handle ZVOL based swap devices. For example:
This was accomplished by:
With these changes in place I am now able to execute 10 concurrent instances of python -c "print 2**10**10". Once the swap device is fully consumed these processes will be cleanly killed by the OOM killer as expected.
The system may appear to hang for 10-15 seconds once it starts swapping heavily, but it will improve. This appears to be due to a sudden need for 8k ARC buffers which cannot be quickly allocated. Once those buffers get allocated the system is once again responsive. So while this is a step in the right direction, there is still room for improvement.