
Directory copy stalls #69

Closed
paragw opened this issue Oct 13, 2010 · 31 comments

@paragw commented Oct 13, 2010

Test Setup

Host Machine: 4x Xeon, 10 GB RAM, 4 disks of which 2 are in software RAID1 (standard MD), running Ubuntu Maverick x86_64 (2.6.35.xx) with SPL+ZFS built from Git.

Virtual Machine: KVM/QEMU, 1 GB RAM, 3 virtio-based disk images: one for the OS install (Ubuntu Maverick Server x86_64) and two others, vdb and vdc.

ZFS: a standard striped pool named rpool (no RAID options specified), 39 GB allocated from vdb and vdc, with compress=on and dedup=on.

Two volumes created, rpool/one (20 GB) and rpool/two (10 GB), with ext3 file systems created on both (roughly as sketched below).
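The commands behind this setup would have been something like the following (device names and zvol device paths are illustrative and may vary by release):

$ zpool create rpool /dev/vdb /dev/vdc    # plain stripe, no redundancy
$ zfs set compression=on rpool
$ zfs set dedup=on rpool
$ zfs create -V 20G rpool/one             # two zvols
$ zfs create -V 10G rpool/two
$ mkfs.ext3 /dev/zvol/rpool/one           # ext3 on top of each zvol
$ mkfs.ext3 /dev/zvol/rpool/two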

Copying a 650 MB directory (OS install ISO contents) to either of the ext3 file systems (on rpool/one, or rpool/two for that matter) never seems to finish. Various ZFS-related processes and kdmflush/kswapd0 et al. keep consuming lots of CPU, and the copy never completes; too little disk throughput and not enough CPU left for cp are a couple of causes I suspect.

Also, while the copy is in progress no other sync operation can complete: running dpkg to install a package, for example, triggers the kernel hung-task detector several times, showing dpkg stuck in a sync call doing wait_for_completion.

Is this sort of VM-based setup just not good enough for ZFS, or am I doing something wrong in creating the pools, or using features that aren't fully tested yet (dedup comes to mind)? I remember trying something very similar on a real machine with similar results, though.

I'd be glad to provide any additional information or do further testing.

@poige commented Feb 4, 2011

Similar symptoms:

top - 17:41:31 up 44 min,  8 users,  load average: 13.03, 13.06, 11.24
Tasks: 257 total,   1 running, 256 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 65.9%id, 33.8%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   1464756k total,   963500k used,   501256k free,    71140k buffers
Swap:  1048572k total,        0k used,  1048572k free,   330304k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1590 poige     20   0 10912 1232  812 R    0  0.1   0:02.52 top
   20 root      20   0     0    0    0 D    0  0.0   0:00.04 sync_supers
  735 root      39  19     0    0    0 D    0  0.0   0:00.72 zvol/2
 1341 root      39  19     0    0    0 D    0  0.0   0:00.03 zio_write_intr/
 1342 root      39  19     0    0    0 D    0  0.0   0:00.01 zio_write_intr/
 1343 root      39  19     0    0    0 D    0  0.0   0:00.02 zio_write_intr/
 1344 root      39  19     0    0    0 D    0  0.0   0:00.02 zio_write_intr/
 1345 root      39  19     0    0    0 D    0  0.0   0:00.01 zio_write_intr/
 1346 root      39  19     0    0    0 D    0  0.0   0:00.01 zio_write_intr/
 1347 root      39  19     0    0    0 D    0  0.0   0:00.02 zio_write_intr/
 1348 root      39  19     0    0    0 D    0  0.0   0:00.03 zio_write_intr/
 1489 root       0 -20     0    0    0 D    0  0.0   0:00.00 txg_quiesce
 1490 root       0 -20     0    0    0 D    0  0.0   0:00.21 txg_sync
 1551 root      20   0 14124  628  516 D    0  0.0   0:03.58 cp

@poige commented Feb 9, 2011

Yeah, I can also add that it happens only with SMP. It works OK when booted in uniprocessor mode.

@ChrisAzfs

It appears the culprit is dedup. Copying a 466 MB directory from ext4 to ZFS took about 40 seconds with dedup off, and 2 minutes with dedup on. Enabling compression makes it faster (fewer blocks to checksum?). Sometimes compression=on/dedup=on was as fast as dedup=off; I don't know why. Copying a 3 GB directory with many files made the machine unusable.
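The comparison was along these lines (pool/dataset names and paths are illustrative; note dedup only affects newly written blocks, so each run writes fresh data):

$ zfs set dedup=off tank/test
$ time cp -a /mnt/ext4/dir /tank/test/    # about 40 s for 466 MB
$ rm -rf /tank/test/dir
$ zfs set dedup=on tank/test
$ time cp -a /mnt/ext4/dir /tank/test/    # about 2 min for the same data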

For comparison, copying the 466 MB directory took about 20 seconds with zfs-fuse, so there is room for improvement.

Using 0.6.0-rc1 on Fedora 14 x86_64 (SMP).

@poige commented Feb 20, 2011

ChrisAzfs, it seems you're missing the point. In my case and paragw's, the copy never completes: it stalls until a reboot.

@behlendorf (Contributor)

poige, can you try again using the 0.6.0-rc1 release? There are a couple of changes in the newer code which will probably help, plus you won't be restricted to using a zvol anymore. You can use a normal ZFS filesystem for your testing now if you prefer.

@poige commented Feb 20, 2011

behlendorf, would 2.6.32.28 be suitable for that?

@behlendorf (Contributor)

You can use the 2.6.32.28 kernel with the newer 0.6.0-rc1 source, but the kernel itself doesn't explain the issue you've reported.

@poige commented Feb 20, 2011

I asked just to be sure. Already compiled; going to test it now.

@poige commented Feb 20, 2011

OK, on zpool list:
SPL: Loaded Solaris Porting Layer v0.6.0
zunicode: module license 'CDDL' taints kernel.
Disabling lock debugging due to kernel taint
BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3379
caller is txg_hold_open+0x1d/0x5a [zfs]
Pid: 3379, comm: modprobe Tainted: P 2.6.32.28 #1
Call Trace:
[] debug_smp_processor_id+0xbd/0xd8
[] txg_hold_open+0x1d/0x5a [zfs]
[] dmu_tx_assign+0x159/0x349 [zfs]
[] ? dsl_dataset_block_freeable+0x37/0x45 [zfs]
[] zvol_replay_write+0x61/0xab [zfs]
[] zil_replay_log_record+0x10c/0x162 [zfs]
[] zil_parse+0x3cb/0x4ba [zfs]
[] ? __switch_to+0xcb/0x1f2
[] ? zil_replay_log_record+0x0/0x162 [zfs]
[] ? zil_incr_blks+0x0/0xf [zfs]
[] zil_replay+0xb1/0xea [zfs]
[] __zvol_create_minor+0x2ef/0x413 [zfs]
[] zvol_create_minors_cb+0x26/0x2d [zfs]
[] dmu_objset_find_spa+0x332/0x348 [zfs]
[] ? zvol_create_minors_cb+0x0/0x2d [zfs]
[] dmu_objset_find_spa+0x131/0x348 [zfs]
[] ? zvol_create_minors_cb+0x0/0x2d [zfs]
[] zvol_create_minors+0x69/0xc0 [zfs]
[] ? blk_register_region+0x26/0x28
[] zvol_init+0x11e/0x125 [zfs]
[] ? spl__init+0x0/0x10 [zfs]
[] _init+0x1d/0xfe [zfs]
[] ? spl__init+0x0/0x10 [zfs]
[] spl__init+0xe/0x10 [zfs]
[] do_one_initcall+0x59/0x14b
[] sys_init_module+0xd0/0x227
[] system_call_fastpath+0x16/0x1b

@poige commented Feb 20, 2011

And even a simple mount stalls:

 3740 poige     20   0 10864 1344  904 R    1  0.1   0:00.13 top
 3396 root      39  19     0    0    0 D    0  0.0   0:00.14 zvol/3
 3582 root      39  19     0    0    0 D    0  0.0   0:00.00 z_wr_int/0
 3583 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/1
 3584 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/2
 3585 root      39  19     0    0    0 D    0  0.0   0:00.01 z_wr_int/3
 3586 root      39  19     0    0    0 D    0  0.0   0:00.01 z_wr_int/4
 3587 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/5
 3588 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/6
 3589 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/7
 3590 root      39  19     0    0    0 D    0  0.0   0:00.00 z_wr_int/8
 3591 root      39  19     0    0    0 D    0  0.0   0:00.03 z_wr_int/9
 3592 root      39  19     0    0    0 D    0  0.0   0:00.03 z_wr_int/10
 3593 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/11
 3594 root      39  19     0    0    0 D    0  0.0   0:00.01 z_wr_int/12
 3595 root      39  19     0    0    0 D    0  0.0   0:00.01 z_wr_int/13
 3596 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/14
 3597 root      39  19     0    0    0 D    0  0.0   0:00.02 z_wr_int/15
 3721 root       0 -20     0    0    0 D    0  0.0   0:00.00 txg_quiesce
 3722 root       0 -20     0    0    0 D    0  0.0   0:00.07 txg_sync
 3735 root      20   0  9888  880  724 D    0  0.1   0:00.13 mount

@poige commented Feb 20, 2011

And again: mount worked OK when booted without SMP (uniprocessor mode).

@behlendorf (Contributor)

OK, this is a known issue. The problem is that your kernel is built with CONFIG_PREEMPT; it's a low-latency kernel. See issue #83. This can and will be fixed, but it isn't yet.

https://github.com/behlendorf/zfs/issues/#issue/83

@poige commented Feb 21, 2011

Damn! I had just recompiled the kernel with CONFIG_PREEMPT before you wrote that the new version is available to give it a try. :-) OK, I'll revert this.

@poige commented Feb 21, 2011

behlendorf, I gave it a try on 2.6.35.10: copying stalled, alas…

$ zgrep CONFIG_PREEMPT /proc/config.gz 
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

@behlendorf (Contributor)

Was there any debugging printed to dmesg? That would be critical for determining what the system is stuck on. With CONFIG_PREEMPT disabled you certainly shouldn't be seeing this message anymore:

BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3379
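One way to capture the relevant state the next time it wedges (a sketch; assumes magic SysRq is enabled and root access):

$ dmesg | tail -n 100                            # recent kernel messages, incl. hung-task reports
$ sudo sh -c 'echo w > /proc/sysrq-trigger'      # dump stack traces of all blocked (D-state) tasks
$ dmesg                                          # the traces appear in the kernel log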

@poige commented Feb 21, 2011

Not at the moment of the stall. Also, it's a bit strange that after only running sudo zpool list I had two processes in state 'D':

1320 poige     20   0 10912 1224  812 R    1  0.1   0:00.06 top
1313 root       0 -20     0    0    0 D    0  0.0   0:00.00 txg_quiesce
1314 root       0 -20     0    0    0 D    0  0.0   0:00.00 txg_sync

@poige commented Feb 21, 2011

vmstat 1 shows a burst in the number of running processes at the moment of the stall:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0  22760  49648 471020    0    0  9612     0 1321 1963  0  8 66 27
17  1      0  13472  50028 477532    0    0  8344  4011 1221 2218  0 10 65 25
 0  1      0  14008  50456 479920    0    0 11072  8550 1839 4049  1 13 61 25
 1  2      0  12344  50660 474048    0    0  8768 47804 2862 7540  0 26 45 29
 0  2      0  16516  50868 482004    0    0  4792  8012 1150 2453  0  6 33 60
 0  2      0  16516  50872 482088    0    0     4     0  275  234  0  1 33 66
 0  2      0 252396  50880 482084    0    0     0   548  310  238  0  3 33 64
 0  2      0 252396  50880 482088    0    0     0     0  271  237  0  0 34 66
 0  2      0 252396  50880 482088    0    0     0     0  278  238  0  1 33 66
 0  2      0 252396  50880 482088    0    0     0     0  299  249  0  0 33 66
 0  2      0 252396  50880 482088    0    0     0     0  252  243  0  1 33 66

@poige commented Feb 21, 2011

The first stall-related dmesg output is:

INFO: task arc_reclaim:986 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
arc_reclaim   D 0000000000000000     0   986      2 0x00000000
 ffff880057b87cd0 0000000000000046 ffff880000000000 ffff880057b87fd8
 ffff880058e16540 0000000000014d00 0000000000014d00 ffff880057b87fd8
 0000000000014d00 0000000000014d00 ffff880057b87fd8 0000000000014d00
Call Trace:
 [<ffffffff812d21a5>] __mutex_lock_slowpath+0x145/0x247
 [<ffffffff812d22b8>] mutex_lock+0x11/0x28
 [<ffffffffa03f758d>] arc_evict+0x60/0x4d0 [zfs]
 [<ffffffff8105a725>] ? finish_wait+0x61/0x69
 [<ffffffffa03671cd>] ? __cv_timedwait_common+0xd4/0xf4 [spl]
 [<ffffffffa03f7a64>] arc_adjust+0x67/0x170 [zfs]
 [<ffffffffa03f8d1a>] arc_reclaim_thread+0x107/0x182 [zfs]
 [<ffffffffa03f8c13>] ? arc_reclaim_thread+0x0/0x182 [zfs]
 [<ffffffffa0364105>] thread_generic_wrapper+0x6c/0x79 [spl]
 [<ffffffffa0364099>] ? thread_generic_wrapper+0x0/0x79 [spl]
 [<ffffffff8105a193>] kthread+0x7a/0x82
 [<ffffffff81009824>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105a119>] ? kthread+0x0/0x82
 [<ffffffff81009820>] ? kernel_thread_helper+0x0/0x10

@behlendorf (Contributor)

Thanks, that debugging is a good start. When I get a moment I'll try to reproduce this locally; is your test setup the same as described in comment #1? Also, exactly what commands did you run to trigger this deadlock? Finally, I notice your VM only has 1 GiB of memory, which isn't much for ZFS. In theory it should be OK, just slow, but I try to leave 2 GiB of memory in all my VMs for ZFS testing.

@poige commented Feb 21, 2011

You're welcome! It's x86_64 Linux in VirtualBox. Actually it has 1430 MiB, and, AFAIR, I varied that amount (no RAM starvation occurred) as well as the number of available CPUs. (With only one CPU everything works OK.) The trigger is a simple sudo cp -xa /usr/src /mounted_zfs_volume. The volume is 10 GB with Reiser3 on top. The pool is either a simple stripe or a raidz, consisting of 3 disk devices of 8 GiB each, roughly as sketched below.
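A sketch of that reproduction (device names are illustrative; the zvol device path may differ by release):

$ zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd   # or a plain stripe of the same 3 disks
$ zfs create -V 10G tank/vol
$ mkreiserfs /dev/zvol/tank/vol
$ mount /dev/zvol/tank/vol /mounted_zfs_volume
$ sudo cp -xa /usr/src /mounted_zfs_volume             # this is what stalls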

@poige commented Mar 11, 2011

Hi! Any changes worth re-trying? :)

@Rudd-O (Contributor) commented Mar 15, 2011

Ew, I have been using CONFIG_PREEMPT_VOLUNTARY for a while. Maybe that's why ZFS misbehaves so badly on me every day? I'm going to turn it off now.

@behlendorf (Contributor)

Please let me know if that improves things. Leaving CONFIG_PREEMPT on could result in a deadlock in the slab reclaim, which could manifest itself as a spinning kswapd task.

@Rudd-O (Contributor) commented Mar 15, 2011

So far no spinning kswapd task with the new kernel and CONFIG_PREEMPT disabled.

@behlendorf (Contributor)

The kswapd spinning and deadlock issues should be resolved in the latest spl/zfs master source. Performance and memory-usage issues still need to be addressed, but those are different issues from the kswapd thrashing/deadlock described here. The following commits resolved the issue:

691f6ac, behlendorf/spl@cb255ae - Allow the kernel to perform direct reclaim regularly, which eases the pressure on kswapd (indirect reclaim).

d6bd8ea - Fixes a deadlock exposed by increased direct reclaim.

behlendorf/spl@2092cf6 - Works around an upstream kernel vmalloc() deadlock, so the kernel patch is no longer required. However, I will work to get the fix included in the upstream kernel, since it is a real kernel bug.

I'm closing this issue; if you're still seeing a spinning kswapd with the latest code, please let me know and we'll reopen this issue or file a new one.

@Rudd-O (Contributor) commented Mar 22, 2011

Will try now.

@poige commented Oct 2, 2011

Marked as Closed, but I can't confirm that: I managed to get another hang, though now I think it may be due to a really tight memory limit. Going to check this out soon.

@poige commented Oct 2, 2011

Having increased memory to 2 GiB, I was able to copy /usr/src to the ZFS volume. During the copy I noticed, from time to time, a sizable burst in the number of running processes:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
102  0      0  25112   1588 665884    0    0  2392 11177 1660 3694  0 35 46 19

quite strange…

@poige commented Oct 3, 2011

And 4096 MiB was the lowest memory bound at which /usr/src copied successfully onto a dedup=on, compress=on ZFS. Otherwise it stalled every time, and even panic-rebooted once.
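If memory really is the limiting factor, one knob worth trying is capping the ARC with the zfs module's zfs_arc_max option (a sketch; the value is illustrative):

# /etc/modprobe.d/zfs.conf: cap the ARC at 512 MiB
options zfs zfs_arc_max=536870912

$ cat /sys/module/zfs/parameters/zfs_arc_max    # verify after reloading the module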

behlendorf reopened this Oct 18, 2011
@behlendorf (Contributor)

That doesn't sound fixed; I'm reopening the issue.

@behlendorf (Contributor)

This was likely caused by issue #287. The fix will likely be merged in the next day or two, so I'm closing this bug.
