100% CPU Spin on all cores by zfs threads #258
Comments
It goes without saying, I guess (z_wr_iss), that something is being written to disk when this spin happens. I see a large amount of write IO happening when the system goes to lunch. Also, a bunch of space is freed, as shown by 'df', after every spin. Seems like the FS is reclaiming disk blocks in a tight loop.
I forgot to mention that a lot of RAM (about 1.2G) is freed before and after this hang, if that matters for analysis. I am going to get another 1TB drive and get rid of dedup. It's more trouble than it's worth for the 30% (300GB in my case = $30) savings.
OK, that's a good test case. Does this only happen when the pool is at 90% of capacity, or does the 100% CPU spin occur even when you're doing an overwrite and the pool is much emptier?
There is close to 900GB of data in there, so it's hard to test with more empty space. I don't know if it will reproduce with more empty space. I just realized that I quoted the processes wrong: the tasks which spin are z_wr_int.
PREEMPT_NONE is killing the system during heavy IO. Everything locks up for several minutes.
It's conceivable this is related to issue #167. Apparently the default setting for CONFIG_DEBUG_MUTEX under ArchLinux results in some unfortunate mutex problems, like potential spinning in the z_wr_int threads. The short-term workaround is to disable it, which I believe is the default for most other non-debug distribution kernels. That said, this of course still needs to be fixed.
Doesn't apply to my system:

# zgrep CONFIG_DEBUG_MUTEX /proc/config.gz
# CONFIG_DEBUG_MUTEXES is not set
arc_reclaim continues to hog CPU once in a while. Note that the system was completely idle at this time. I see a lot of memory movement (free and used memory changing) all the time although the system is idle. Also, note the CPU time it has accumulated, far greater than anything else in the system.

top - 17:11:06 up 2 days, 1:15, 28 users, load average: 0.07, 0.03, 0.05
Tasks: 588 total, 3 running, 585 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12298180k total, 10590668k used, 1707512k free, 3004812k buffers
Swap: 12582908k total, 0k used, 12582908k free, 1842640k cached
PID to kill:
  PID USER PR  NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
 2426 root  0 -20    0   0   0 R   20  0.0 11:14.57 arc_reclaim
 4946 root 39  19    0   0   0 R    0  0.0  0:00.43 z_null_iss/0
The last thing about arc_reclaim actually has nothing to do with this bug. This bug is about a total 100% CPU spin by many, many threads at the same time, not just high CPU usage by one process while idle.
It seems like arc_reclaim is popping up every 10-20 seconds and spinning for a bit.

 2426 root  0 -20     0    0   0 R 22 0.0 11:19.00 arc_reclaim
22601 root 20   0 19748 1732 984 R  2 0.0  0:01.11 top
It seems like txg_sync is popping up and the system hangs.

 1495 root  0 -20     0    0   0 D 22 0.0 00:19.00 txg_sync
I may have also run across this issue. I can reliably cause zfs to choke, with txg_sync and z_fr_iss threads spinning and nothing apparently going on, on two different boxes. The filesystem is fresh, nothing on it, no dedup or compression. Kernel is 2.6.39.3 and everything is at default settings. /proc/meminfo shows that VmallocUsed does not change at all during this time.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000
[root@osd0 ~]# zpool iostat 2
pool0   125G  50.6T      0  2.19K     30   272M

Then a second time... it takes forever so I ctrl+c:

dd if=/dev/zero of=/pool0/file bs=1M count=100000
...

Then a third time, this time with new files (dd is CPU bound so we start two):

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000
[root@osd0 ~]# zpool iostat 2
pool0  55.0G  50.7T     10  1.89K  20.8K   234M

Then a fourth time... rewrite those files:

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000
^C
849+0 records in
fg
...
top - 13:39:52 up 36 min, 2 users, load average: 82.47, 47.54, 33.11
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
[root@osd0 ~]# zpool status
Changed the config a bit to try to duplicate. This time txg_sync goes crazy during the first run, but data is being written fine. The main difference is no raidz, just striped across disks. So at least in this scenario there is no immediate problem apparent unless you're looking at top. Rewrite still chokes.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &

pool    alloc   free   read  write   read  write
pool0   44.2G  48.9T      1  2.69K  41.9K   327M

top - 13:51:57 up 48 min, 2 users, load average: 5.85, 18.11, 29.82
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

But again on rewrite:

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &
1127+0 records in
8359+0 records in
...
top - 13:54:13 up 51 min, 2 users, load average: 63.20, 29.43, 31.94
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
[root@osd0 ~]# zpool status
errors: No known data errors
Profiling the system during metadata-intensive workloads, such as creating/removing millions of files, revealed that the system was CPU bound. A large fraction of that CPU time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size: for large objects virtual memory is usually preferable, and for small objects physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object, and the following caches turned out to be just over the line:

* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache

Because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized, we can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage, which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance, since unrelated processes could otherwise block on this same global lock. Finally, we may see less CPU time being burned in the arc_reclaim and txg_sync threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #258
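To make the mechanism concrete, here is a minimal sketch (not the actual ZFS change) of forcing a cache onto physically backed slabs with the SPL's Solaris-style kmem_cache_create() and the KMC_KMEM flag described above; my_obj_t, my_obj_cache, and the object size are made up for illustration:

#include <sys/kmem.h>

typedef struct my_obj {
        char pad[1024];                 /* an object size in the "gray area" */
} my_obj_t;

static kmem_cache_t *my_obj_cache;

void
my_obj_cache_init(void)
{
        /* KMC_KMEM asks the SPL to back this cache's slabs with physical
         * (kmalloc-style) memory instead of vmalloc, so allocations no
         * longer serialize on the global virtual address space lock. */
        my_obj_cache = kmem_cache_create("my_obj_cache", sizeof (my_obj_t),
            0, NULL, NULL, NULL, NULL, NULL, KMC_KMEM);
}

Without the flag, the SPL picks KMEM or VMEM backing from the object size, which is exactly the heuristic these caches were falling on the wrong side of.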
Is anyone still able to reproduce this issue with the latest master source? Commit ae6ba3d plus some earlier changes should have improved things considerably.
Will give it a go. You suspect this will fix the txg_sync high CPU on large deletes?
It may help in the txg_sync case, but I expect this patch will primarily address the high CPU usage by the arc_reclaim thread. The large deletes, on the other hand, are probably a different issue which we'll look at shortly.
I have a raidz1 pool with 10 * 1.5 TB hard disks on Ubuntu 11.10 with the zfs/spl version 41/2011-12-15. After deleting a 55 GB file in a zfs filesystem without dedup or compression, the system becomes unresponsive. Top shows txg_sync at about 100% CPU usage. Need to hard reset the box. Can't write anymore to the filesystem after reboot because of high CPU usage again; reset again. I moved my files to a new zfs filesystem and deleted the old one. After that it works for some days. The pool is filled up to about 50%, no limit on the zfs filesystem. No chance to catch anything in dmesg. It is annoying and makes me think about alternatives to ZoL. Please look at it, I'm ready to test any new versions.
It occurs to me this issue may have the same root cause as the slow unlinks reported elsewhere, such as issue #458. Ned Bass has started to dig into the issue. If it is the same issue, there's a good chance you can wait out the spinning CPUs; they are making headway.
When a single work item is dispatched we only need to wake up one thread to service it. The current taskq implementation uses non-exclusive processes in the work wait queue, so every thread in the queue gets woken up when __taskq_dispatch() calls wake_up(). This overhead can become non-negligible as the number of threads in the queue increases, particularly when a large number of short work items are being dispatched. For example, this scenario appears to have led to poor unlink performance in ZFS as described in Issues openzfs/zfs#458 and openzfs/zfs#258. To optimize this case, use add_wait_queue_exclusive() to add threads to the work queue so only one is woken up at a time.
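For reference, a rough sketch of what the worker-side wait loop looks like with exclusive waiters, written against the 2.6-era Linux wait queue API (wait_queue_t, add_wait_queue_exclusive(), wake_up()); work_waitq and work_pending() are placeholders rather than the actual SPL taskq fields:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(work_waitq);     /* hypothetical idle-worker queue */

static void worker_wait_for_work(int (*work_pending)(void))
{
        DECLARE_WAITQUEUE(wait, current);

        /* Exclusive waiters are queued at the tail and wake_up() stops
         * after waking one of them, so dispatching a single work item no
         * longer wakes every idle worker thread. */
        add_wait_queue_exclusive(&work_waitq, &wait);
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (work_pending())
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&work_waitq, &wait);
}

The dispatcher still calls plain wake_up(&work_waitq); the behavioral change comes entirely from how the waiters register themselves.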
As described in Issue openzfs#458 and openzfs#258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per CPU significantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround.
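And the thread-count half of the workaround, sketched with the SPL's Solaris-style taskq_create() interface; the taskq name, the minalloc/maxalloc values (50, INT_MAX), and my_zio_free_taskq are illustrative, not the exact values used in zio.c:

#include <sys/taskq.h>
#include <sys/sysmacros.h>

static taskq_t *my_zio_free_taskq;

void
my_zio_free_taskq_init(void)
{
        /* One worker per CPU instead of a fixed 100: fewer threads pile
         * into the work wait queue, so wake_up() has less to walk while
         * holding tq->tq_lock. */
        my_zio_free_taskq = taskq_create("my_z_fr_iss", max_ncpus,
            maxclsyspri, 50, INT_MAX, TASKQ_PREPOPULATE);
}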
@pyavdr, if you have a chance please give the above two patches a try. Thanks.
As far as for now: no changes... still spinning CPUs at 100% until the system becomes unresponsive and all programs freeze/die. Needed to hard reset the system. What I did: took an already installed openSUSE 12.1 on VMware (8 CPUs, 12 GB RAM, 5-drive raidz1) with spl & zfs at level 0.6.44, applied your patches to the two source files, did make clean; ./configure; make; make install for spl and zfs, then rebooted. By the way, zfs does not automount on SUSE 12.1 (systemd??), so zfs mount -a. Emptied my test zfs (500 GB). Started copying my test files (10 files, each about 50 GB) from remote Windows via smb to the zfs on SUSE. Started another copy with more files (source Windows). While filling the zfs with test files from remote, I started to copy the files locally from the SUSE shell (cp zfs1/* zfs2) to another zfs with dedup enabled. After the last command the system died instantly as described: the smb connections to Windows closed, "top" froze, no keystrokes. I'm looking for more useful information, but there is no logging or debug info available; it looks like the system doesn't write any messages in this state. Don't know if it will recover after a while; can't say which process was on top in this state.
That is different from the reproducer you originally described (removing a 55GB file with no dedup). Has that case gotten better? Deduplication may need much more memory than the 12 GB you have in your VM.
You are right, I'm mixing up different issues; I just was too enthusiastic about your patches. Again: I checked the copying/removal of my big files without dedup, testing for about 3 hours... it runs smooth and fine after the patches, no more freezes. Thank you, I think this basic operation without dedup is working now. Is it possible to apply these patches to the daily PPA?
Yes, once Brian inspects, tests, and merges them, they'll make their way into the PPA.
As described in Issue #458 and #258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per CPU significantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #258
Issue #458
Merged into master. I'm going to close this bug; if you're still having issues with dedup, please open a new issue with a clear description. We'll track the remaining spl taskq optimization work under openzfs/spl#32.
Signed-off-by: Pawan <[email protected]>
It happened when a large file (20GB) was being overwritten on a pool which has dedup and compression on. The system becomes completely unresponsive during this and stays unresponsive for up to many minutes. I was recording the CPU usage of various processes and the z_wr_iss tasks were all spinning with 100% CPU usage. There were more such tasks than the number of cores I have in my system. And of course, kernel threads did not let anything in user space run (non-PREEMPT setup).
It happened multiple times during the overwrite of the file which took a LONNNGGG time.
The way to reproduce this would be:
time dd if=/dev/urandom of=/var/tmp/tempfile count=20000 bs=1M
The pool I have is about 90% full, so this should come close to reproducing it.
Also, I am using pool version 26 (to still have the possibility of recovery from zfs-fuse or the OS install I have).