task mv:11533 blocked for more than 120 seconds #6265

Closed
ioquatix opened this issue Jun 25, 2017 · 5 comments
Labels: Status: Stale (no recent activity for issue)

Comments

ioquatix commented Jun 25, 2017

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version  Up to date rolling release
Linux Kernel          Linux hana 4.11.6-1-ARCH #1 SMP PREEMPT Sat Jun 17 08:19:42 CEST 2017 x86_64 GNU/Linux
Architecture          Intel Pentium G4560
ZFS Version           0.7.0-rc4_65_gd9ad3fea3
SPL Version           0.7.0-rc4_4_gac48361

Describe the problem you're observing

This is an entirely new system with 6x 4TB WD Red drives on a SAS3 backplane and 32 GB of memory, of which about 12 GB was free at the time this happened.

[16655.091457] perf: interrupt took too long (3144 > 3132), lowering kernel.perf_event_max_sample_rate to 63600
[23030.236885] perf: interrupt took too long (3946 > 3930), lowering kernel.perf_event_max_sample_rate to 50400
[37846.703532] INFO: task mv:11533 blocked for more than 120 seconds.
[37846.703601]       Tainted: P           O    4.11.6-1-ARCH #1
[37846.703651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[37846.703719] mv              D    0 11533  16785 0x00000000
[37846.703725] Call Trace:
[37846.703738]  __schedule+0x22e/0x8e0
[37846.703744]  schedule+0x3d/0x90
[37846.703755]  cv_wait_common+0x11c/0x130 [spl]
[37846.703763]  ? wake_bit_function+0x60/0x60
[37846.703790]  __cv_wait+0x15/0x20 [spl]
[37846.703863]  txg_wait_open+0xb0/0x100 [zfs]
[37846.703919]  dmu_free_long_range+0x295/0x410 [zfs]
[37846.703985]  zfs_rmnode+0x245/0x330 [zfs]
[37846.704042]  ? zfs_znode_hold_exit+0xf4/0x130 [zfs]
[37846.704098]  zfs_zinactive+0xd4/0xe0 [zfs]
[37846.704155]  zfs_inactive+0x7e/0x210 [zfs]
[37846.704164]  ? truncate_pagecache+0x5a/0x70
[37846.704218]  zpl_evict_inode+0x43/0x60 [zfs]
[37846.704227]  evict+0xc5/0x190
[37846.704234]  iput+0x1ae/0x240
[37846.704239]  do_unlinkat+0x1a7/0x310
[37846.704246]  SyS_unlinkat+0x1b/0x30
[37846.704252]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[37846.704257] RIP: 0033:0x7f6829416a27
[37846.704260] RSP: 002b:00007ffefe01f278 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[37846.704266] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6829416a27
[37846.704269] RDX: 0000000000000000 RSI: 0000000001f0df18 RDI: 0000000000000004
[37846.704272] RBP: 0000000001f0da60 R08: 0000000000000000 R09: 0000000000000000
[37846.704275] R10: 70303830312e3130 R11: 0000000000000246 R12: 0000000000000004
[37846.704279] R13: 00007ffefe01f354 R14: 00007ffefe01f3b0 R15: 0000000001f0e970
[37969.583533] INFO: task mv:11533 blocked for more than 120 seconds.
[37969.583617]       Tainted: P           O    4.11.6-1-ARCH #1
[37969.583694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[37969.583777] mv              D    0 11533  16785 0x00000000
[37969.583783] Call Trace:
[37969.583796]  __schedule+0x22e/0x8e0
[37969.583802]  schedule+0x3d/0x90
[37969.583812]  cv_wait_common+0x11c/0x130 [spl]
[37969.583820]  ? wake_bit_function+0x60/0x60
[37969.583827]  __cv_wait+0x15/0x20 [spl]
[37969.583880]  txg_wait_open+0xb0/0x100 [zfs]
[37969.583917]  dmu_free_long_range+0x295/0x410 [zfs]
[37969.583961]  zfs_rmnode+0x245/0x330 [zfs]
[37969.584001]  ? zfs_znode_hold_exit+0xf4/0x130 [zfs]
[37969.584040]  zfs_zinactive+0xd4/0xe0 [zfs]
[37969.584079]  zfs_inactive+0x7e/0x210 [zfs]
[37969.584085]  ? truncate_pagecache+0x5a/0x70
[37969.584128]  zpl_evict_inode+0x43/0x60 [zfs]
[37969.584137]  evict+0xc5/0x190
[37969.584144]  iput+0x1ae/0x240
[37969.584149]  do_unlinkat+0x1a7/0x310
[37969.584156]  SyS_unlinkat+0x1b/0x30
[37969.584162]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[37969.584166] RIP: 0033:0x7f6829416a27
[37969.584169] RSP: 002b:00007ffefe01f278 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[37969.584173] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6829416a27
[37969.584175] RDX: 0000000000000000 RSI: 0000000001f0df18 RDI: 0000000000000004
[37969.584177] RBP: 0000000001f0da60 R08: 0000000000000000 R09: 0000000000000000
[37969.584179] R10: 70303830312e3130 R11: 0000000000000246 R12: 0000000000000004
[37969.584181] R13: 00007ffefe01f354 R14: 00007ffefe01f3b0 R15: 0000000001f0e970
[38092.463546] INFO: task mv:11533 blocked for more than 120 seconds.
[38092.463614]       Tainted: P           O    4.11.6-1-ARCH #1
[38092.463681] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[38092.463749] mv              D    0 11533  16785 0x00000000
[38092.463755] Call Trace:
[38092.463768]  __schedule+0x22e/0x8e0
[38092.463774]  schedule+0x3d/0x90
[38092.463785]  cv_wait_common+0x11c/0x130 [spl]
[38092.463801]  ? wake_bit_function+0x60/0x60
[38092.463808]  __cv_wait+0x15/0x20 [spl]
[38092.463864]  txg_wait_open+0xb0/0x100 [zfs]
[38092.463900]  dmu_free_long_range+0x295/0x410 [zfs]
[38092.464209]  zfs_rmnode+0x245/0x330 [zfs]
[38092.464250]  ? zfs_znode_hold_exit+0xf4/0x130 [zfs]
[38092.464289]  zfs_zinactive+0xd4/0xe0 [zfs]
[38092.464328]  zfs_inactive+0x7e/0x210 [zfs]
[38092.464335]  ? truncate_pagecache+0x5a/0x70
[38092.464374]  zpl_evict_inode+0x43/0x60 [zfs]
[38092.464382]  evict+0xc5/0x190
[38092.464387]  iput+0x1ae/0x240
[38092.464391]  do_unlinkat+0x1a7/0x310
[38092.464395]  SyS_unlinkat+0x1b/0x30
[38092.464399]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[38092.464403] RIP: 0033:0x7f6829416a27
[38092.464406] RSP: 002b:00007ffefe01f278 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[38092.464410] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6829416a27
[38092.464412] RDX: 0000000000000000 RSI: 0000000001f0df18 RDI: 0000000000000004
[38092.464414] RBP: 0000000001f0da60 R08: 0000000000000000 R09: 0000000000000000
[38092.464416] R10: 70303830312e3130 R11: 0000000000000246 R12: 0000000000000004
[38092.464418] R13: 00007ffefe01f354 R14: 00007ffefe01f3b0 R15: 0000000001f0e970
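
For reference, the hung-task warning above is only a reporting threshold; it can be raised or silenced via the sysctl quoted in the log (this just quiets the message, it does not address the underlying stall). A minimal sketch:

# Show the current hung-task reporting threshold (seconds).
cat /proc/sys/kernel/hung_task_timeout_secs

# Either raise the threshold, or disable the warning entirely with 0.
echo 300 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_timeout_secs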

I have two ZFS pools:

# zpool status
  pool: old-tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 19h6m with 0 errors on Tue Apr  4 16:31:03 2017
config:

        NAME                        STATE     READ WRITE CKSUM
        old-tank                    ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee20702ef26  ONLINE       0     0     0
            wwn-0x50014ee15a0a0c46  ONLINE       0     0     0
            wwn-0x50014ee20702e46a  ONLINE       0     0     0
            wwn-0x50014ee15a0a11ce  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50014ee20eba1695  ONLINE       0     0     0
            wwn-0x50014ee20eba337e  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x50014ee2640efb3a  ONLINE       0     0     0
            wwn-0x50014ee2b964ef3f  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            wwn-0x50014ee2b964f2d5  ONLINE       0     0     0
            wwn-0x50014ee2b964f4f4  ONLINE       0     0     0

errors: No known data errors

I am copying a lot of data from the old-tank to tank.

While copying, zpool iostat 5 showed about 250 MB/s read from old-tank and 550 MB/s write to tank. It was humming along quite nicely for a long time.

I noticed some odd pauses, and another machine copying via scp reported "stalled", which indicates no data transfer for a few moments.
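
Since the stack traces above are stuck in txg_wait_open, it may be worth watching transaction group timing on the affected pool while one of these pauses is happening. A sketch, assuming the zfs_txg_history module parameter is non-zero so the kstat records history (substitute the pool name as appropriate):

# Per-txg open/quiesce/sync times, as exposed by the ZFS on Linux kstats.
watch -n 5 cat /proc/spl/kstat/zfs/old-tank/txgs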

Describe how to reproduce the problem

Copy lots of data. Perhaps this issue is isolated to old-tank, or to operations that involve it? Perhaps those drives have issues, and since they may not have TLER they are blocking the system?
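
Whether the old drives support TLER can be checked with smartctl's SCT Error Recovery Control query; a hedged sketch, using the suspect drive as an example:

# Query SCT ERC (TLER); "Disabled" or an unsupported-command error suggests the
# drive may retry a bad sector for a long time and stall the whole bus.
smartctl -l scterc /dev/disk/by-id/wwn-0x50014ee20702e46a

# If supported, cap read/write error recovery at 7 seconds (units of 0.1 s).
smartctl -l scterc,70,70 /dev/disk/by-id/wwn-0x50014ee20702e46a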

drives

lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-Samsung_SSD_850_PRO_256GB_S39KNWAJ313793W -> ../../sda
lrwxrwxrwx 1 root root    9 Jun 25 23:25 ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2144377 -> ../../sdh
lrwxrwxrwx 1 root root    9 Jun 25 23:26 ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2263503 -> ../../sdj
lrwxrwxrwx 1 root root    9 Jun 25 23:26 ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0353148 -> ../../sdk
lrwxrwxrwx 1 root root    9 Jun 25 23:26 ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0365295 -> ../../sdi
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K0EZ4UKS -> ../../sdc
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K0RLAYP1 -> ../../sdf
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K1AT8T4C -> ../../sdd
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K2NYU749 -> ../../sdg
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4VYNKV0 -> ../../sde
lrwxrwxrwx 1 root root    9 Jun 25 19:41 ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4YSVVTC -> ../../sdb
lrwxrwxrwx 1 root root    9 Jun 25 23:26 wwn-0x50014ee15a0a0c46 -> ../../sdk
lrwxrwxrwx 1 root root    9 Jun 25 23:26 wwn-0x50014ee15a0a11ce -> ../../sdi
lrwxrwxrwx 1 root root    9 Jun 25 23:25 wwn-0x50014ee20702e46a -> ../../sdh
lrwxrwxrwx 1 root root    9 Jun 25 23:26 wwn-0x50014ee20702ef26 -> ../../sdj
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee20eba1695 -> ../../sde
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee20eba337e -> ../../sdf
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee2640efb3a -> ../../sdb
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee2b964ef3f -> ../../sdd
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee2b964f2d5 -> ../../sdc
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x50014ee2b964f4f4 -> ../../sdg
lrwxrwxrwx 1 root root    9 Jun 25 19:41 wwn-0x5002538d704e9ff1 -> ../../sda

iotop while mv

Total DISK READ :      99.58 M/s | Total DISK WRITE :      51.15 M/s
Actual DISK READ:      49.47 M/s | Actual DISK WRITE:     119.54 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                         
 5702 be/0 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [z_null_int]
  474 be/0 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [z_null_int]
 5704 be/0 root        4.90 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_0]
 5705 be/0 root        5.25 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_1]
 5706 be/0 root        4.62 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_2]
 5707 be/0 root        4.97 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_3]
 5708 be/0 root        6.32 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_4]
 5709 be/0 root        6.21 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_5]
 5710 be/0 root        4.89 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_6]
 5711 be/0 root        4.97 M/s    0.00 B/s  0.00 %  0.00 % [z_rd_int_7]
26859 be/4 samuel     57.45 M/s   51.15 M/s  0.00 %  0.00 % mv /old-tank/media/Western/Animation/stuff /tank/media/video/Cartoons/

smartctl --all

For the four older drives, I've attached the output of smartctl. I think one of the drives may have started to go bad (wwn-0x50014ee20702e46a) - or it might have been a one-off failure for some reason. I'll investigate further.

wwn-0x50014ee15a0a0c46-smart.txt
wwn-0x50014ee15a0a11ce-smart.txt
wwn-0x50014ee20702e46a-smart.txt
wwn-0x50014ee20702ef26-smart.txt
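
For reference, output like the attached files can be collected with something along these lines (a sketch, using the by-id names from the drive listing above):

# Dump full SMART data for each of the four older drives.
for d in wwn-0x50014ee15a0a0c46 wwn-0x50014ee15a0a11ce \
         wwn-0x50014ee20702e46a wwn-0x50014ee20702ef26; do
    smartctl --all /dev/disk/by-id/"$d" > "$d"-smart.txt
done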

iostat

# iostat -mx 10
Linux 4.11.6-1-ARCH (hana)      26/06/17        _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.35    0.00    7.85   15.41    0.00   69.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.55    0.39    0.81     0.01     0.01    21.15     0.00    0.55    0.22    0.71   0.45   0.05
sdb               0.00     0.00    1.38  128.88     0.02    12.98   204.30     0.25    1.92   48.12    1.43   1.20  15.64
sdc               0.00     0.00    1.43  125.11     0.02    12.53   203.16     0.21    1.68   27.86    1.38   1.01  12.72
sdd               0.00     0.00    1.37  128.45     0.02    12.98   205.00     0.21    1.64   28.27    1.35   0.98  12.67
sde               0.00     0.00    1.39  130.68     0.02    13.24   205.57     0.24    1.77   35.04    1.42   1.06  14.03
sdf               0.00     0.00    1.41  130.46     0.02    13.24   205.89     0.22    1.64   27.91    1.35   0.98  12.87
sdg               0.00     0.00    1.35  125.29     0.02    12.53   202.95     0.29    2.26   73.42    1.49   1.50  18.93
sdh               0.00     0.00  163.08   21.99     9.34     0.43   108.03     1.31    7.08    7.92    0.81   2.03  37.52
sdi               0.00     0.00  164.02   18.43     9.33     0.43   109.55     1.27    6.98    7.65    1.08   2.01  36.75
sdj               0.00     0.00  160.24   21.91     9.34     0.43   109.75     1.32    7.23    8.14    0.61   2.04  37.12
sdk               0.00     0.00  163.52   19.62     9.33     0.42   109.13     1.33    7.25    8.02    0.83   2.00  36.71

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.28    0.55    0.00   99.07

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.40   10.10     0.00     0.05     9.75     0.02    1.65   38.25    0.20   1.59   1.67
sdc               0.00     0.00    0.40   10.10     0.00     0.05     9.75     0.02    1.69   40.00    0.17   1.62   1.70
sdd               0.00     0.00    0.40    9.80     0.00     0.05    10.04     0.02    1.54   35.00    0.17   1.47   1.50
sde               0.00     0.00    0.40    8.70     0.00     0.04     7.91     0.01    1.62   35.00    0.08   1.57   1.43
sdf               0.00     0.00    0.40    8.70     0.00     0.04     7.91     0.02    1.90   41.75    0.07   1.87   1.70
sdg               0.00     0.00    0.40   10.00     0.00     0.05     9.85     0.02    2.28   55.00    0.17   2.21   2.30
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

free -h

# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         12G         18G        1.3M        462M         18G
Swap:          8.0G          0B        8.0G
ioquatix (Author) commented:

Okay, so I've destroyed the old RAIDZ1 pool and made a new striped-mirror pool (RAID10-style) named backup.

(~@hana ^_^)$ zpool status
  pool: backup
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        backup                      ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50014ee15a0a0c46  ONLINE       0     0     0
            wwn-0x50014ee15a0a11ce  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x50014ee20702e46a  ONLINE       0     0     0
            wwn-0x50014ee20702ef26  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                            STATE     READ WRITE CKSUM
        tank                            ONLINE       0     0     0
          mirror-0                      ONLINE       0     0     0
            wwn-0x50014ee20eba1695      ONLINE       0     0     0
            wwn-0x50014ee20eba337e      ONLINE       0     0     0
          mirror-1                      ONLINE       0     0     0
            wwn-0x50014ee2640efb3a      ONLINE       0     0     0
            wwn-0x50014ee2b964ef3f      ONLINE       0     0     0
          mirror-2                      ONLINE       0     0     0
            wwn-0x50014ee2b964f2d5      ONLINE       0     0     0
            wwn-0x50014ee2b964f4f4      ONLINE       0     0     0
        cache
          wwn-0x5002538d704e9ff1-part4  ONLINE       0     0     0

errors: No known data errors

I ran badblocks on the four 3TB drives and they didn't seem to show any issues at all. Even performance seems okay. I assume the bad sectors got reallocated.
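
The per-drive scan would look roughly like this (a sketch only; the exact flags are an assumption, -w is destructive and overwrites the whole drive, and the device name is just one of the four as an example):

# Destructive write-mode badblocks scan: -w writes and verifies test patterns,
# -s shows progress, -v is verbose, -b 4096 matches the 4K physical sectors.
badblocks -wsv -b 4096 /dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2144377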

I decided to do a backup with zfs send -Rn tank/media@snapshot1 | zfs recv backup/media, running zpool iostat 10 at the same time.
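
Spelled out as separate steps, it looks roughly like this (a sketch; note that the -n flag makes zfs send a dry run, so the actual transfer would use -R alone):

# Take a recursive snapshot of the source tree, then replicate it; -R sends
# all descendant filesystems, snapshots and properties to the backup pool.
zfs snapshot -r tank/media@snapshot1
zfs send -R tank/media@snapshot1 | zfs recv backup/media

# In another terminal, watch per-pool throughput every 10 seconds.
zpool iostat 10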

While seeing consistently good performance:

              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup       106G  5.33T      0  3.76K      0   474M
tank        4.52T  6.36T  1.87K     38   240M   286K       
----------  -----  -----  -----  -----  -----  -----       
backup       109G  5.33T      0  3.76K      0   478M       
tank        4.52T  6.36T  1.87K     46   234M   290K       
----------  -----  -----  -----  -----  -----  -----       
backup       111G  5.33T      0  3.73K  1.60K   469M       
tank        4.52T  6.36T  1.87K     41   239M   296K       
----------  -----  -----  -----  -----  -----  -----       
backup       114G  5.33T      0  3.71K  2.00K   471M       
tank        4.52T  6.36T  1.86K     46   237M   425K       
----------  -----  -----  -----  -----  -----  -----       
backup       116G  5.32T      0  3.77K  1.60K   480M       
tank        4.52T  6.36T  1.86K     47   237M   297K       
----------  -----  -----  -----  -----  -----  -----       
backup       119G  5.32T      0  3.73K  1.60K   473M       
tank        4.52T  6.36T  1.86K     42   238M   296K       
----------  -----  -----  -----  -----  -----  -----       
backup       121G  5.32T      0  3.75K  3.60K   477M
tank        4.52T  6.36T  1.86K     27   237M   279K
----------  -----  -----  -----  -----  -----  -----

I received several warnings:

[  736.948819] INFO: task z_zvol:6566 blocked for more than 120 seconds.
[  736.949046]       Tainted: P           O    4.11.6-3-ARCH #1
[  736.949096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  736.949187] z_zvol          D    0  6566      2 0x00000000
[  736.949195] Call Trace:
[  736.949211]  __schedule+0x22e/0x8e0
[  736.949218]  schedule+0x3d/0x90
[  736.949232]  taskq_wait_outstanding+0x87/0xd0 [spl]
[  736.949241]  ? wake_bit_function+0x60/0x60
[  736.949339]  zvol_task_cb+0x2b0/0x450 [zfs]
[  736.949348]  taskq_thread+0x254/0x480 [spl]
[  736.949355]  ? wake_up_q+0x80/0x80
[  736.949360]  kthread+0x125/0x140
[  736.949367]  ? taskq_cancel_id+0x140/0x140 [spl]
[  736.949371]  ? kthread_create_on_node+0x70/0x70
[  736.949375]  ret_from_fork+0x2c/0x40
[  859.828802] INFO: task z_zvol:6566 blocked for more than 120 seconds.
[  859.828880]       Tainted: P           O    4.11.6-3-ARCH #1
[  859.828931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  859.829020] z_zvol          D    0  6566      2 0x00000000
[  859.829026] Call Trace:
[  859.829041]  __schedule+0x22e/0x8e0
[  859.829047]  schedule+0x3d/0x90
[  859.829058]  taskq_wait_outstanding+0x87/0xd0 [spl]
[  859.829067]  ? wake_bit_function+0x60/0x60
[  859.829134]  zvol_task_cb+0x2b0/0x450 [zfs]
[  859.829147]  taskq_thread+0x254/0x480 [spl]
[  859.829156]  ? wake_up_q+0x80/0x80
[  859.829163]  kthread+0x125/0x140
[  859.829172]  ? taskq_cancel_id+0x140/0x140 [spl]
[  859.829177]  ? kthread_create_on_node+0x70/0x70
[  859.829183]  ret_from_fork+0x2c/0x40
[  982.708856] INFO: task z_zvol:6566 blocked for more than 120 seconds.
[  982.708952]       Tainted: P           O    4.11.6-3-ARCH #1
[  982.709003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.709071] z_zvol          D    0  6566      2 0x00000000
[  982.709077] Call Trace:
[  982.709091]  __schedule+0x22e/0x8e0
[  982.709097]  schedule+0x3d/0x90
[  982.709108]  taskq_wait_outstanding+0x87/0xd0 [spl]
[  982.709116]  ? wake_bit_function+0x60/0x60
[  982.709160]  zvol_task_cb+0x2b0/0x450 [zfs]
[  982.709168]  taskq_thread+0x254/0x480 [spl]
[  982.709174]  ? wake_up_q+0x80/0x80
[  982.709180]  kthread+0x125/0x140
[  982.709186]  ? taskq_cancel_id+0x140/0x140 [spl]
[  982.709190]  ? kthread_create_on_node+0x70/0x70
[  982.709194]  ret_from_fork+0x2c/0x40
[ 1105.592098] INFO: task z_zvol:6566 blocked for more than 120 seconds.
[ 1105.592186]       Tainted: P           O    4.11.6-3-ARCH #1
[ 1105.592244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1105.592319] z_zvol          D    0  6566      2 0x00000000
[ 1105.592326] Call Trace:
[ 1105.592347]  __schedule+0x22e/0x8e0
[ 1105.592358]  schedule+0x3d/0x90
[ 1105.592372]  taskq_wait_outstanding+0x87/0xd0 [spl]
[ 1105.592381]  ? wake_bit_function+0x60/0x60
[ 1105.592436]  zvol_task_cb+0x2b0/0x450 [zfs]
[ 1105.592448]  taskq_thread+0x254/0x480 [spl]
[ 1105.592456]  ? wake_up_q+0x80/0x80
[ 1105.592461]  kthread+0x125/0x140
[ 1105.592468]  ? taskq_cancel_id+0x140/0x140 [spl]
[ 1105.592472]  ? kthread_create_on_node+0x70/0x70
[ 1105.592476]  ret_from_fork+0x2c/0x40

Lots of available memory:

$ free -h                                      
              total        used        free      shared  buff/cache   available
Mem:            31G         14G         16G        1.2M        191M         16G
Swap:          8.0G          0B        8.0G

Load average is a tad high but not too bad:

load average: 6.46, 5.68, 3.14

ioquatix (Author) commented:

iostat -mx 10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.54    0.00   39.24   15.68    0.00   44.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.10    0.40    1.20     0.00     0.01     8.50     0.00    0.19    0.00    0.25   0.19   0.03
sdb               0.00     0.00  328.40    2.80    40.93     0.02   253.24     0.73    2.20    2.19    3.21   1.00  33.13
sdd               0.00     0.00  318.30    3.20    39.68     0.02   252.88     0.68    2.10    2.10    2.50   0.99  31.83
sdc               0.00     0.00  339.00    3.60    42.25     0.03   252.75     0.73    2.13    2.10    4.81   0.95  32.60
sdf               0.00     0.00  320.50    4.60    39.97     0.03   251.95     0.64    2.01    2.03    0.43   0.95  30.90
sdg               0.00     0.00  302.00    3.90    37.66     0.03   252.32     0.83    2.78    2.78    2.56   1.16  35.60
sde               0.00     0.00  309.40    4.50    38.57     0.03   251.81     0.70    2.28    2.26    3.71   1.03  32.36
sdh               0.00     0.00    0.20  930.60     0.00   115.79   254.76     7.05    7.58   46.50    7.57   1.00  93.50
sdk               0.00     0.00    0.80  987.90     0.00   122.49   253.74     9.73    9.84   15.88    9.84   1.01  99.40
sdj               0.00     0.00    0.20  929.90     0.00   115.70   254.76     9.73   10.46   47.00   10.45   1.06  98.87
sdi               0.00     0.00    0.20  972.80     0.00   120.63   253.90     9.43    9.69   45.00    9.68   1.01  98.47

The last four drives (sdh, sdi, sdj, sdk) are backup; everything else other than sda is tank.

ioquatix (Author) commented:

Okay, so today I copied another partition from tank to backup - around 500 GB. I didn't see any error messages in dmesg despite similar utilisation and performance.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle    
           0.78    0.00   40.24   16.49    0.00   42.48    

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util                                                                                                                     
sda               0.00     0.80    0.00   11.90     0.00     0.06     9.61     0.00    0.34    0.00    0.34   0.08   0.10                                                                                                                     
sdb               0.00     0.00  276.00    6.20    34.35     0.05   249.69     0.74    2.63    2.59    4.19   1.16  32.60                                                                                                                     
sdd               0.00     0.00  288.40    7.40    35.87     0.05   248.69     0.72    2.45    2.46    1.99   1.13  33.47                                                                                                                     
sdc               0.00     0.00  304.50    6.40    37.83     0.05   249.55     0.71    2.29    2.28    2.44   1.07  33.34                                                                                                                     
sdf               0.00     0.00  275.90    5.40    34.22     0.04   249.38     0.69    2.46    2.43    4.07   1.12  31.47                                                                                                                     
sdg               0.00     0.00  300.10    7.00    37.29     0.05   249.07     0.75    2.42    2.42    2.43   1.13  34.67                                                                                                                     
sde               0.00     0.00  286.40    6.40    35.55     0.04   248.90     0.70    2.38    2.37    2.97   1.09  31.83                                                                                                                     
sdh               0.00     0.00    0.00  505.70     0.00    59.95   242.80     9.65   19.08    0.00   19.08   1.95  98.40                                                                                                                     
sdk               0.00     0.00    0.00  988.30     0.00   121.50   251.77     9.98   10.09    0.00   10.09   1.01 100.04                                                                                                                     
sdj               0.00     0.00    0.00  487.30     0.00    59.77   251.19     9.99   20.49    0.00   20.49   2.05 100.00                                                                                                                     
sdi               0.00     0.00    0.00 1008.90     0.00   122.73   249.14     9.49    9.42    0.00    9.42   0.98  99.20    

ioquatix (Author) commented:

Also, after running badblocks, the drive seems to have recovered somewhat:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36739         -
# 2  Short offline       Completed without error       00%     36735         -
# 3  Short offline       Completed: read failure       60%     36668         1565545632
# 4  Extended offline    Completed: read failure       10%     36664         1565545632
# 5  Short offline       Completed: read failure       50%     36650         1565545632


stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 25, 2020
stale bot closed this as completed on Nov 23, 2020