
100% CPU Spin on all cores by zfs threads #258

Closed · devsk opened this issue May 29, 2011 · 24 comments

devsk commented May 29, 2011

It happened when a large file (20GB) was being overwritten on a pool which has dedup and compression on. The system becomes completely unresponsive during this and stays unresponsive for up to several minutes. I was recording the CPU usage of various processes, and the z_wr_iss tasks were all spinning at 100% CPU usage. There were more such tasks than the number of cores in my system, and of course the kernel threads did not let anything in user space run (non-PREEMPT setup).

It happened multiple times during the overwrite of the file, which took a very long time.

The way to reproduce this would be (a consolidated sketch follows below):

  1. Create a pool of size 22GB with dedup and compression on.
  2. Create a 20GB file of random data on another FS:

     time dd if=/dev/urandom of=/var/tmp/tempfile count=20000 bs=1M

  3. Copy it into the pool.
  4. Create another 20GB file of random data on another FS with the same name.
  5. Overwrite the file in the pool with the file from step 4.

The pool I have is about 90% full, so this should come close to reproducing it.

Also, I am using pool version 26 (to keep the possibility of recovery from zfs-fuse or the OS install I have).
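A consolidated sketch of the steps above, assuming a spare ~22GB disk or partition; the /dev/sdX placeholder and the testpool name are hypothetical, not taken from the report:

# 1. Create a ~22GB pool and enable dedup and compression.
zpool create testpool /dev/sdX
zfs set dedup=on testpool
zfs set compression=on testpool

# 2. Create a 20GB file of random data on another filesystem.
time dd if=/dev/urandom of=/var/tmp/tempfile count=20000 bs=1M

# 3. Copy it into the pool.
cp /var/tmp/tempfile /testpool/tempfile

# 4. Recreate the 20GB file with fresh random data, same name.
time dd if=/dev/urandom of=/var/tmp/tempfile count=20000 bs=1M

# 5. Overwrite the copy in the pool while watching the z_wr_* threads in top.
cp /var/tmp/tempfile /testpool/tempfile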

devsk commented May 29, 2011

It goes without saying, I guess (z_wr_iss), that something is being written to disk when this spin happens. I see a large amount of write I/O happening when the system goes out to lunch. Also, a bunch of space is freed after every spin, as shown by 'df'. It seems like the FS is reclaiming disk blocks in a tight loop.

devsk commented May 29, 2011

I forgot to mention that a lot of RAM (about 1.2G) is freed before and after this hang, in case that matters for the analysis.

I am going to get another 1TB drive and get rid of dedup. It's more trouble than it's worth for the 30% savings (300GB in my case = $30).

@behlendorf (Contributor)

OK, that's a good test case. Does this only happen when the pool is at 90% of capacity, or does the 100% CPU spin occur even when you're doing an overwrite and the pool is much emptier?

devsk commented May 29, 2011

There is close to 900GB of data in there, so it's hard to test with more empty space. I don't know if it will reproduce with more empty space.

I just realized that I quoted the wrong processes. The tasks which spin are z_wr_int.

devsk commented May 30, 2011

PREEMPT_NONE is killing the system during heavy I/O. Everything locks up for several minutes.

@behlendorf (Contributor)

It's conceivable this is related to issue #167. Apparently the default setting for CONFIG_DEBUG_MUTEXES under Arch Linux results in some unfortunate mutex problems, like potential spinning in the z_wr_int threads. The short-term workaround is to disable it, which I believe is already the default for most other non-debug distribution kernels. That said, this of course still needs to be fixed.

devsk commented Jun 1, 2011

That doesn't apply to my system:

# zgrep CONFIG_DEBUG_MUTEX /proc/config.gz 
# CONFIG_DEBUG_MUTEXES is not set
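For reference, on kernels that don't expose /proc/config.gz the same check can usually be made against the installed boot config (path is distro-dependent):

grep CONFIG_DEBUG_MUTEXES /boot/config-$(uname -r)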

devsk commented Jun 8, 2011

arc_reclaim continues to hog CPU once in a while. Note that the system was completely idle at this time. I see a lot of memory movement (free and used memory changing) all the time, even though the system is idle.

Also note the CPU time it has accumulated, far greater than anything else on the system.

top - 17:11:06 up 2 days,  1:15, 28 users,  load average: 0.07, 0.03, 0.05
Tasks: 588 total,   3 running, 585 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.6%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12298180k total, 10590668k used,  1707512k free,  3004812k buffers
Swap: 12582908k total,        0k used, 12582908k free,  1842640k cached
PID to kill: 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                            
 2426 root       0 -20     0    0    0 R   20  0.0  11:14.57 arc_reclaim                                                        
 4946 root      39  19     0    0    0 R    0  0.0   0:00.43 z_null_iss/0
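One way to correlate the arc_reclaim CPU time with the memory movement described above is to sample the ARC size alongside the thread's CPU usage. A rough sketch, assuming the kstats are exposed at /proc/spl/kstat/zfs/arcstats (the usual location for ZFS on Linux):

# Sample ARC size and arc_reclaim CPU every few seconds on an idle system.
while true; do
    date
    awk '$1 == "size" || $1 == "c" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats
    ps -o pid,pcpu,time,comm -C arc_reclaim
    sleep 5
done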

devsk commented Jun 8, 2011

The last thing about arc_reclaim actually has nothing to do with this bug. This bug is about a total 100% CPU spin by many threads at the same time, not just high CPU usage by one process while idle.

devsk commented Jun 8, 2011

It seems like arc_reclaim is popping up every 10-20 seconds and spinning for a bit.

 2426 root       0 -20     0    0    0 R   22  0.0  11:19.00 arc_reclaim                                                        
22601 root      20   0 19748 1732  984 R    2  0.0   0:01.11 top         

gig-tmb commented Jun 28, 2011

It seems like txg_sync is popping up and the system hangs:

1495 root 0 -20 0 0 0 D 22 0.0 00:19.00 txg_sync

@mlsorensen

I may have also run across this issue. I can reliably cause ZFS to choke, with txg_sync and z_fr_iss threads spinning and nothing apparently going on, on two different boxes. The filesystem is fresh, nothing on it, no dedup or compression. The kernel is 2.6.39.3 and everything is at default settings. /proc/meminfo shows that VmallocUsed does not change at all during this time.
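To double-check the VmallocUsed observation while the test runs, a loop like the following can be left running in another terminal (a sketch; the field name comes straight from /proc/meminfo, and the thread-name pattern just matches the ZFS worker threads seen in top):

# Watch vmalloc usage and the busiest ZFS kernel threads while dd runs.
while true; do
    grep VmallocUsed /proc/meminfo
    ps -eo pcpu,comm --sort=-pcpu | grep -E 'txg_sync|z_(wr|fr)_' | head -5
    sleep 2
done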

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000
99999+1 records in
99999+1 records out
104857317376 bytes (105 GB) copied, 153.905 s, 681 MB/s

[root@osd0 ~]# zpool iostat 2
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool0 125G 50.6T 0 2.19K 30 272M
pool0 128G 50.6T 0 8.58K 0 1.04G
pool0 131G 50.6T 0 9.41K 0 1.14G
pool0 134G 50.6T 0 9.23K 0 1.12G
pool0 137G 50.6T 0 8.48K 0 1.03G
pool0 141G 50.6T 0 8.63K 0 1.05G
...

Then a second time... it takes forever, so I Ctrl+C:

dd if=/dev/zero of=/pool0/file bs=1M count=100000
^C8141+0 records in
8141+0 records out
8536457216 bytes (8.5 GB) copied, 328.335 s, 26.0 MB/s

...
pool0 261G 50.5T 0 0 0 0
pool0 261G 50.5T 1 0 11.7K 0
pool0 261G 50.5T 2.19K 20 4.20M 161K
pool0 261G 50.5T 2.08K 0 4.02M 0
pool0 261G 50.5T 1.24K 0 2.22M 0
pool0 261G 50.5T 0 0 0 0
pool0 261G 50.5T 0 0 0 0
...

Then a third time, this time with new files (dd is CPU-bound, so we start two):

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 181.895 s, 576 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 183.822 s, 570 MB/s

[root@osd0 ~]# zpool iostat 2
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool0 55.0G 50.7T 10 1.89K 20.8K 234M
pool0 59.3G 50.7T 0 8.94K 0 1.07G
pool0 61.8G 50.7T 0 8.66K 0 1.04G
pool0 64.6G 50.7T 0 10.1K 0 1.22G
pool0 67.3G 50.7T 0 9.71K 0 1.17G
pool0 71.2G 50.7T 0 9.29K 0 1.12G

Then a fourth time... rewrite those files.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000

^C

849+0 records in
849+0 records out
890241024 bytes (890 MB) copied, 72.4994 s, 12.3 MB/s

fg
^C
1405+0 records in
1405+0 records out
1473249280 bytes (1.5 GB) copied, 319.138 s, 4.6 MB/s

...
pool0 280G 50.5T 0 0 0 0
pool0 280G 50.5T 0 0 0 0
pool0 280G 50.5T 3.78K 3 7.41M 71.4K
pool0 280G 50.5T 2.21K 17 4.19M 89.7K
pool0 280G 50.5T 0 0 0 0
pool0 280G 50.5T 0 0 0 0
pool0 280G 50.5T 0 0 0 0
pool0 280G 50.5T 0 0 0 0
...

top - 13:39:52 up 36 min, 2 users, load average: 82.47, 47.54, 33.11
Tasks: 411 total, 89 running, 322 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 99.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33010732k total, 5229556k used, 27781176k free, 17004k buffers
Swap: 8388604k total, 0k used, 8388604k free, 1408720k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
2380 root 0 -20 0 0 0 R 97.8 0.0 8:11.04 - txg_sync
2264 root 39 19 0 0 0 R 26.7 0.0 1:46.96 - z_fr_iss/5
2268 root 39 19 0 0 0 R 26.7 0.0 1:46.98 - z_fr_iss/9
2277 root 39 19 0 0 0 S 26.7 0.0 1:46.74 - z_fr_iss/18
2287 root 39 19 0 0 0 R 26.7 0.0 1:47.00 - z_fr_iss/28
2291 root 39 19 0 0 0 R 26.7 0.0 1:47.02 taskq_thr z_fr_iss/32
2300 root 39 19 0 0 0 R 26.7 0.0 1:47.01 taskq_thr z_fr_iss/41
2303 root 39 19 0 0 0 R 26.7 0.0 1:46.84 - z_fr_iss/44
2305 root 39 19 0 0 0 R 26.7 0.0 1:46.92 - z_fr_iss/46
2319 root 39 19 0 0 0 R 26.7 0.0 1:46.74 - z_fr_iss/60
2321 root 39 19 0 0 0 R 26.7 0.0 1:46.77 - z_fr_iss/62
2350 root 39 19 0 0 0 R 26.7 0.0 1:46.81 - z_fr_iss/91
2351 root 39 19 0 0 0 R 26.7 0.0 1:46.81 - z_fr_iss/92

[root@osd0 ~]# zpool status
pool: pool0
state: ONLINE
scan: none requested
config:

NAME                                STATE     READ WRITE CKSUM
pool0                               ONLINE       0     0     0
  raidz1-0                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:0:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:1:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:2:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:3:0   ONLINE       0     0     0
  raidz1-1                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:4:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:5:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:6:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:7:0   ONLINE       0     0     0
  raidz1-2                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:8:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:9:0   ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:10:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:11:0  ONLINE       0     0     0
  raidz1-3                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:12:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:13:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:14:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:15:0  ONLINE       0     0     0
  raidz1-4                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:16:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:17:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:18:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:19:0  ONLINE       0     0     0
  raidz1-5                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:20:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:21:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:22:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:23:0  ONLINE       0     0     0
  raidz1-6                          ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:24:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:25:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:26:0  ONLINE       0     0     0
    pci-0000:03:00.0-scsi-0:2:27:0  ONLINE       0     0     0

@mlsorensen

I changed the config a bit to try to duplicate this. This time txg_sync goes crazy even during the first run, but data is being written fine. The main difference is no raidz, just striping across disks. So at least in this scenario there is no immediate problem apparent unless you're looking at top. The rewrite still chokes.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &
[1] 5179
[2] 5180
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 237.073 s, 442 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 240.795 s, 435 MB/s

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool0 44.2G 48.9T 1 2.69K 41.9K 327M
pool0 45.0G 48.9T 0 6.71K 0 825M
pool0 45.5G 48.9T 0 6.70K 0 824M
pool0 46.4G 48.9T 0 6.67K 0 820M
pool0 47.2G 48.9T 0 6.84K 0 842M
pool0 47.7G 48.9T 0 6.66K 0 818M
pool0 48.5G 48.9T 0 6.71K 0 826M
pool0 49.9G 48.9T 0 6.72K 0 809M
pool0 50.7G 48.9T 0 6.56K 0 807M
pool0 51.6G 48.9T 0 6.70K 0 826M

top - 13:51:57 up 48 min, 2 users, load average: 5.85, 18.11, 29.82
Tasks: 439 total, 23 running, 416 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 73.9%sy, 0.0%ni, 24.0%id, 0.0%wa, 0.0%hi, 2.0%si, 0.0%st
Mem: 33010732k total, 16579784k used, 16430948k free, 15644k buffers
Swap: 8388604k total, 0k used, 8388604k free, 1408788k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5144 root 0 -20 0 0 0 R 94.7 0.0 3:24.45 txg_sync
5179 root 20 0 103m 1668 560 R 78.4 0.0 1:31.99 dd
5180 root 20 0 103m 1668 560 R 78.4 0.0 1:31.97 dd
4861 root 39 19 0 0 0 R 73.0 0.0 2:40.13 z_wr_iss/18
4840 root 39 19 0 0 0 S 70.3 0.0 2:25.96 z_wr_iss/0
4847 root 39 19 0 0 0 R 70.3 0.0 2:30.31 z_wr_iss/6
4848 root 39 19 0 0 0 R 67.6 0.0 2:22.34 z_wr_iss/7
4849 root 39 19 0 0 0 R 67.6 0.0 2:25.00 z_wr_iss/8
4851 root 39 19 0 0 0 R 67.6 0.0 2:25.22 z_wr_iss/9
4853 root 39 19 0 0 0 S 67.6 0.0 2:25.97 z_wr_iss/11
4854 root 39 19 0 0 0 R 67.6 0.0 2:00.84 z_wr_iss/12
4862 root 39 19 0 0 0 R 67.6 0.0 2:25.09 z_wr_iss/19
4863 root 39 19 0 0 0 R 67.6 0.0 2:25.59 z_wr_iss/20

But again on rewrite:

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &
[1] 5200
[2] 5201

1127+0 records in
1127+0 records out
1181745152 bytes (1.2 GB) copied, 314.275 s, 3.8 MB/s

8359+0 records in
8359+0 records out
8765046784 bytes (8.8 GB) copied, 323.541 s, 27.1 MB/s

...
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 128K 0
pool0 195G 48.7T 1.09K 0 66.1M 0
pool0 195G 48.7T 2.11K 0 132M 0
pool0 195G 48.7T 2.01K 0 127M 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 2.50K 0 316M
pool0 195G 48.7T 0 1.02K 0 128M
pool0 195G 48.7T 0 1.74K 0 216M
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
pool0 195G 48.7T 0 0 0 0
...

top - 13:54:13 up 51 min, 2 users, load average: 63.20, 29.43, 31.94
Tasks: 412 total, 94 running, 318 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us,100.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33010732k total, 4820136k used, 28190596k free, 15676k buffers
Swap: 8388604k total, 0k used, 8388604k free, 1408788k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
5144 root 0 -20 0 0 0 R 100.0 0.0 4:57.12 - txg_sync
4999 root 39 19 0 0 0 S 27.3 0.0 0:15.15 taskq_thr z_fr_iss/82
4913 root 39 19 0 0 0 R 25.0 0.0 0:15.07 - z_fr_iss/4
4914 root 39 19 0 0 0 R 25.0 0.0 0:15.10 - z_fr_iss/5
4915 root 39 19 0 0 0 R 25.0 0.0 0:15.12 - z_fr_iss/6
4916 root 39 19 0 0 0 R 25.0 0.0 0:15.12 - z_fr_iss/7
4918 root 39 19 0 0 0 R 25.0 0.0 0:15.15 - z_fr_iss/9
4919 root 39 19 0 0 0 R 25.0 0.0 0:15.14 - z_fr_iss/10
4920 root 39 19 0 0 0 R 25.0 0.0 0:15.13 taskq_thr z_fr_iss/11
4921 root 39 19 0 0 0 R 25.0 0.0 0:15.11 - z_fr_iss/12
4923 root 39 19 0 0 0 R 25.0 0.0 0:15.14 - z_fr_iss/14
4924 root 39 19 0 0 0 R 25.0 0.0 0:15.14 - z_fr_iss/15
4925 root 39 19 0 0 0 R 25.0 0.0 0:15.13 - z_fr_iss/16
4926 root 39 19 0 0 0 R 25.0 0.0 0:15.15 taskq_thr z_fr_iss/17
4927 root 39 19 0 0 0 R 25.0 0.0 0:15.09 - z_fr_iss/18
4928 root 39 19 0 0 0 R 25.0 0.0 0:15.14 - z_fr_iss/19

[root@osd0 ~]# zpool status
pool: pool0
state: ONLINE
scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
pool0       ONLINE       0     0     0
  sdb       ONLINE       0     0     0
  sdc       ONLINE       0     0     0
  sdd       ONLINE       0     0     0
  sde       ONLINE       0     0     0
  sdf       ONLINE       0     0     0
  sdg       ONLINE       0     0     0
  sdh       ONLINE       0     0     0
  sdi       ONLINE       0     0     0
  sdj       ONLINE       0     0     0
  sdk       ONLINE       0     0     0
  sdl       ONLINE       0     0     0
  sdm       ONLINE       0     0     0
  sdn       ONLINE       0     0     0
  sdo       ONLINE       0     0     0
  sdp       ONLINE       0     0     0
  sdq       ONLINE       0     0     0
  sdr       ONLINE       0     0     0
  sds       ONLINE       0     0     0
  sdt       ONLINE       0     0     0
  sdu       ONLINE       0     0     0
  sdv       ONLINE       0     0     0
  sdw       ONLINE       0     0     0
  sdx       ONLINE       0     0     0
  sdy       ONLINE       0     0     0
  sdz       ONLINE       0     0     0
  sdab      ONLINE       0     0     0
  sdac      ONLINE       0     0     0

errors: No known data errors

behlendorf added a commit that referenced this issue Nov 3, 2011
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound.  A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.

It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory.  By default a kmem_cache will
dynamically determine the type of memory used based on the object
size.  For large objects virtual memory is usually preferable
and for small objects physical memory is a better choice.  See
the spl_slab_alloc() function for a longer discussion on this.

However, there is a certain amount of gray area when defining a
'large' object.  For the following caches it turns out they were
just over the line:

  * dnode_cache
  * zio_cache
  * zio_link_cache
  * zio_buf_512_cache
  * zfs_data_buf_512_cache

Now, because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized, we
can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses.  This entirely avoids the
need to serialize on the virtual address space lock.

As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block on this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #258
@behlendorf (Contributor)

Is anyone still able to reproduce this issue with the latest master source? Commit ae6ba3d plus some earlier changes should have improved things considerably.
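For anyone wanting to retest against master, a rough build sketch (run as root; the repository URLs and the --with-spl flag reflect the zfsonlinux build layout of this era and are stated here as assumptions, not quoted from the thread):

git clone https://github.com/zfsonlinux/spl.git
git clone https://github.com/zfsonlinux/zfs.git
# Build and install the SPL first, then point the ZFS build at it.
(cd spl && ./autogen.sh && ./configure && make && make install)
(cd zfs && ./autogen.sh && ./configure --with-spl="$PWD/../spl" && make && make install)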

@mlsorensen

Will give it a go. You suspect this will fix the txg_sync high CPU on rewrites and large deletes?

@behlendorf (Contributor)

It may help in the txg_sync case, but I expect this patch will primarily address the high CPU usage by the arc_reclaim thread. The large deletes, on the other hand, are probably a different issue which we'll look at shortly.

pyavdr commented Jan 11, 2012

I have a raidz1 pool of 10 × 1.5 TB hard disks on Ubuntu 11.10 with the zfs/spl version 41/2011-12-15. After deleting a 55 GB file on a ZFS filesystem without dedup or compression, the system becomes unresponsive. Top shows txg_sync at about 100% CPU usage, and I need to hard reset the box. I can't write to the filesystem after reboot because of high CPU usage again, so I reset again. I moved my files to a new ZFS filesystem and deleted the old one; after that it works for some days. The pool is filled to about 50%, with no quota on the ZFS filesystem. There was no chance to capture dmesg. It is annoying and makes me think about alternatives to ZoL. Please look at it; I'm ready to test any new versions.

@behlendorf (Contributor)

It occurs to me this issue may have the same root cause as the slow unlinks reported elsewhere, such as issue #458. Ned Bass has started to dig in to the issue. If it is the same issue, there's a good chance you can wait out the spinning CPUs; they are making headway.

ghost assigned nedbass Jan 11, 2012
nedbass added a commit to nedbass/spl that referenced this issue Jan 14, 2012
When a single work item is dispatched we only need to wake up one
thread to service it.  The current taskq implementation uses
non-exclusive processes in the work wait queue, so every thread in the
queue gets woken up when __taskq_dispatch() calls wake_up().  This
overhead can become non-negligible as the number of threads in the
queue increases, particularly when a large number of short work items
are being dispatched.  For example, this scenario appears to have led
to poor unlink performance in ZFS as described in Issues
openzfs/zfs#458 and openzfs/zfs#258.  To optimize this case, use
add_wait_queue_exclusive() to add threads to the work queue so only
one is woken up at a time.
nedbass added a commit to nedbass/zfs that referenced this issue Jan 14, 2012
As described in Issues openzfs#458 and openzfs#258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu significantly reduces contention on the taskq spinlock and
improves throughput.

Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue.  It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue.  When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it.  They then all pile into the work wait queue to wait for
new work items.  So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.

Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem.  It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread.  In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling.  So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.
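After applying the patch, the reduced thread count can be sanity-checked from user space. A quick hedged check, just counting the kernel threads by name and comparing against the CPU count:

# With the patch, the number of z_fr_iss threads should be roughly one
# per CPU instead of the old fixed value of 100.
nproc
ps -eo comm | grep -c '^z_fr_iss'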
nedbass commented Jan 15, 2012

@pyavdr if you have a chance please give the above two patches a try. Thanks

pyavdr commented Jan 15, 2012

So far: no change... CPUs are already spinning at 100% until the system becomes unresponsive and all programs freeze/die. I needed to hard reset the system.

What I did: took an already-installed openSUSE 12.1 on VMware (8 CPUs, 12 GB RAM, 5-drive raidz1) with spl/zfs at 0.6.44, applied your patches to the two source files, and did make clean; ./configure; make; make install for spl and zfs. Then rebooted. By the way, ZFS does not automount on SUSE 12.1 (systemd?), so I had to run zfs mount -a...

I emptied my test ZFS (500 GB) and started copying my test files (10 files, each about 50 GB) from remote Windows via SMB to the ZFS box, then started another copy with more files (source Windows). While filling the ZFS with test files from remote, I started to copy the files locally from the SUSE shell (cp zfs1/* zfs2) to another ZFS filesystem with dedup enabled. After the last command the system died instantly as described: the SMB connections to Windows closed, "top" froze, no keystrokes. I'm looking for more useful information, but there is no logging or debug info available; it looks like the system doesn't write any messages in this state... I don't know if it will recover after a while, and I can't say which process was on top at that point.

nedbass commented Jan 15, 2012

That is different from the reproducer you originally described (removing a 55GB file with no dedup). Has that case gotten better? Deduplication may need much more memory than the 12 GB you have in your VM.
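As a rough sanity check before relying on dedup, the size of the dedup table (and hence a lower bound on the RAM it wants resident) can be inspected with the standard tools; the pool name below is a placeholder:

# Print dedup table (DDT) statistics for a pool.
zpool status -D tank
# More detailed DDT histogram.
zdb -DD tank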

pyavdr commented Jan 16, 2012

You are right, I'm mixing up different issues; I was just too enthusiastic about your patches.

Again: I checked the copying/removal of my big files without dedup, testing for about 3 hours... it runs smooth and fine after the patches, no more freezes. Thank you, I think this basic operation without dedup is working now.

Is it possible to apply these patches to the daily PPA?

nedbass commented Jan 16, 2012

Yes, once Brian inspects, tests, and merges them, they'll make their way into the PPA.

behlendorf pushed a commit that referenced this issue Jan 17, 2012
@behlendorf (Contributor)

Merged into master. I'm going to close this bug; if you're still having issues with dedup, please move a clear description into a new issue. We'll track the remaining SPL taskq optimization work under openzfs/spl#32.
