Directory copy stalls #69
Similar symptoms:
Yeah, and I can add that it happens only with SMP. It works OK when booted in uniprocessor mode.
It appears the culprit is dedup. Copying a 466 MB directory from ext4 to ZFS took about 40 seconds with dedup off and 2 minutes with dedup on. Enabling compression makes it faster (fewer blocks to checksum?). Sometimes compression=on/dedup=on was as fast as dedup=off; I don't know why. Copying a 3 GB directory with many files made the machine unusable. For comparison, copying the 466 MB directory took about 20 seconds with zfs-fuse, so there is room for improvement. Using 0.6.0-rc1 on Fedora 14 x86_64 (SMP).
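The comparison above can be reproduced with a quick timing loop; the pool and dataset names below (`tank/test`) and the source path are placeholders, not from the report, and the `zfs set` properties are standard ZFS dataset properties:

```shell
# Sketch: time the same copy under different dedup/compression settings.
# "tank/test" and SRC are hypothetical; substitute your own pool and data.
SRC=/usr/src   # any directory of a few hundred MB

zfs set dedup=off compression=off tank/test
time cp -a "$SRC" /tank/test/run-nodedup

zfs set dedup=on tank/test
time cp -a "$SRC" /tank/test/run-dedup

# Compression reduces the number of blocks written, which can offset
# some of the per-block dedup checksumming cost.
zfs set compression=on tank/test
time cp -a "$SRC" /tank/test/run-dedup-compress
```

Dropping caches (or rebooting) between runs gives a fairer comparison, since the second copy may be served from the page cache.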
ChrisAzfs, it seems you're missing the point. In my case and paragw's, the copy never completes. It stalls until rebooting.
Poige, can you try again using the 0.6.0-rc1 release? There are a couple of changes in the newer code which will probably help, plus you won't be restricted to using a zvol anymore. You can use a normal ZFS filesystem for your testing now if you prefer.
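Testing on a plain ZFS filesystem instead of a zvol+ext3 stack might look like the following; the pool name and device are placeholders:

```shell
# Sketch: a normal ZFS filesystem for testing, no zvol + ext3 layering.
# "tank" and /dev/sdb are assumptions; use your own spare device.
zpool create tank /dev/sdb
zfs create tank/test
zfs set dedup=on compression=on tank/test

# The filesystem is mounted automatically under /tank/test;
# copy directly into it rather than through a block device.
cp -a /usr/src /tank/test/
```

This removes ext3 and the zvol block layer from the picture, which helps isolate whether the stall is in ZFS itself.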
behlendorf, would 2.6.32.28 fit for that?
You can use the 2.6.32.28 kernel with the newer 0.6.0-rc1 source, but the kernel itself doesn't explain the issue you've reported.
Asked just to be sure. Already compiled; gonna test it now.
OK, on zpool list:
And even a simple mount stalls:
And again: mount worked OK when booted without SMP (uniprocessor mode).
OK, this is a known issue. The problem is that your kernel is built with CONFIG_PREEMPT; it's a low-latency kernel. See issue #83. This can and will be fixed, but it isn't yet.
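You can check which preemption model your running kernel was built with; the config file paths below are the usual distro locations, but may differ on your system:

```shell
# Full preemption (the problematic setting here) shows up as CONFIG_PREEMPT=y.
# The alternatives are CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_VOLUNTARY=y.
grep '^CONFIG_PREEMPT' /boot/config-"$(uname -r)"

# Or, if the kernel was built with CONFIG_IKCONFIG_PROC:
zcat /proc/config.gz | grep '^CONFIG_PREEMPT'
```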
Damn! I had just recompiled the kernel with CONFIG_PREEMPT before you wrote that a new version was available to give it a try. :-) OK, I'll revert this.
behlendorf, I gave it a try on 2.6.35.10: copying stalled, alas…
Was there any debugging printed to 'dmesg'? That would be critical to determining what the system is stuck on. With CONFIG_PREEMPT disabled you certainly shouldn't be seeing this message anymore: BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3379
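When a copy stalls, the relevant debugging usually lands in the kernel ring buffer; a quick way to pull it out (grep patterns are just suggestions) is:

```shell
# Look for stack traces and hung-task reports around the time of the stall.
dmesg | grep -i -A 20 'BUG:\|hung task\|blocked for more than'

# The CONFIG_PREEMPT symptom mentioned above looks like:
#   BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3379
# With full preemption disabled, that line should no longer appear.
```

Triggering the hung-task detector output may require waiting out its timeout (120 seconds by default on many kernels).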
Not at the moment of the stall. Also, it's a bit strange that, having only run
The first stall-related dmesg is:
Thanks, that debugging is a good start. When I get a moment I'll try to reproduce this locally; is your test setup the same as described in comment #1? Also, exactly what commands did you run to trigger this deadlock? Finally, I notice your VM only has 1 GiB of memory, which isn't much for ZFS. In theory it should be OK, just slow, but I try to leave 2 GiB of memory in all my VMs for ZFS testing.
You're welcome! It's x86_64 Linux in VirtualBox. Actually it has 1430 MiB and, AFAIR, I varied it (no RAM starvation occurred), as well as changing the number of available CPUs. (With only one CPU everything works OK.) The trigger is simple
Hi! Any changes worth re-trying? :)
Ew, I have been using CONFIG_PREEMPT_VOLUNTARY for a while. Maybe that's why ZFS misbehaves on me every day? I am going to turn it off now.
Please let me know if that improves things. Leaving CONFIG_PREEMPT on could result in a deadlock in the slab reclaim, which could manifest itself as a spinning kswapd task.
So far no spinning kswapd task with the new kernel and CONFIG_PREEMPT disabled.
The kswapd spinning and deadlock issues should be resolved in the latest spl/zfs master source. Performance and memory-usage issues must still be addressed, but those are different issues from the kswapd thrashing/deadlock described here. The following commits resolved the issue:
691f6ac, behlendorf/spl@cb255ae - allow the kernel to perform direct reclaim regularly, which eases the pressure on kswapd (indirect reclaim)
d6bd8ea - fixes a deadlock exposed by the increased direct reclaim
behlendorf/spl@2092cf6 - works around the upstream kernel vmalloc() deadlock, so the kernel patch is no longer required; however, I will work to get it included in the upstream kernel since it is a real kernel bug
I'm closing this issue. If you're still seeing a spinning kswapd with the latest code, please let me know and we'll reopen this issue or file a new one.
Will try now.
Marked as Closed, but I can't confirm that: I managed to get another hang, though now I think it may be due to a really tight memory limit. Gonna check this out soon.
Having increased memory to 2 GiB, I was able to copy /usr/src to the ZFS volume. During the copy I noticed, from time to time, a considerable burst in the number of running processes:
quite strange…
And 4096 MiB was the lowest boundary at which /usr/src copied successfully onto a dedup=on, compress=on ZFS. Otherwise it stalled every time, and even panic-rebooted once.
That doesn't sound fixed; I'm reopening the issue.
This was likely caused by issue #287. The fix will likely be merged in the next day or two, so I'm closing this bug.
Test Setup
Host machine: 4x Xeon, 10 GB RAM, 4 disks of which 2 are in software RAID1 (standard MD), running Ubuntu Maverick x86_64 (2.6.35.xx) with SPL+ZFS built from Git.
Virtual machine: KVM/QEMU, 1 GB RAM, 3 virtio-based disk images: one for the OS install (Ubuntu Maverick Server x86_64) and two others, vdb and vdc.
ZFS standard striped pool named rpool (no RAID options specified), 39 GB allocated from vdb and vdc, with compress=on and dedup=on.
2 volumes created, rpool/one (20 G) and rpool/two (10 G), with ext3 file systems created on both.
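The setup above can be sketched as follows; the exact flags used in the original report aren't given, so this is a plausible reconstruction using standard zpool/zfs commands and the virtio device names mentioned:

```shell
# Reconstruction of the reported setup (assumed device paths /dev/vdb, /dev/vdc).
zpool create rpool /dev/vdb /dev/vdc   # striped pool, no redundancy
zfs set compression=on rpool
zfs set dedup=on rpool

# Two zvols, with ext3 created on top of each.
zfs create -V 20G rpool/one
zfs create -V 10G rpool/two
mkfs.ext3 /dev/zvol/rpool/one
mkfs.ext3 /dev/zvol/rpool/two

mkdir -p /mnt/one /mnt/two
mount /dev/zvol/rpool/one /mnt/one
mount /dev/zvol/rpool/two /mnt/two
```

Note that this layers ext3's own journaling and writeback on top of ZFS zvols with dedup enabled, which multiplies the memory and I/O pressure in a 1 GB VM.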
Trying to copy a 650 MB directory (OS install ISO contents) to either of the ext3 file systems on rpool/one (or rpool/two, for that matter) never seems to finish. Various ZFS-related processes and kdmflush/kswapd0 et al. continue to take lots of CPU and the copy never finishes; too-slow disk throughput and not enough CPU for cp are a couple of things I suspect may be the cause.
Also, while the copy is in progress no other sync operation can complete: running dpkg to install a package, for example, results in the kernel hung-task detector triggering several times, showing dpkg stuck in a sync call doing wait_for_completion.
Is this sort of VM-based setup not good enough for ZFS, or am I doing something wrong in creating the pools, or using features that aren't fully tested yet (dedup comes to mind)? I remember trying something very similar on a real machine with similar results, though.
Would be glad to provide any additional information or do further testing.