100% CPU Spin on all cores by zfs threads #258
Comments
It goes without saying, I guess (z_wr_iss), that something is being written to disk when this spin happens. I see a large amount of write IO happening when the system goes to lunch. Also, a bunch of space is freed, as shown by 'df', after every spin. Seems like the FS is reclaiming disk blocks in a tight loop.
I forgot to mention that a lot of RAM (about 1.2G) is freed before and after this hang, if that matters for analysis. I am going to get another 1TB drive and get rid of dedup. It's more trouble than it's worth for the 30% (300GB in my case = $30) savings.
OK, that's a good test case. Does this only happen when the pool is at 90% of capacity, or does the 100% CPU spin occur even when you're doing an overwrite and the pool is much emptier?
There is close to 900GB of data in there, so it's hard to test with more empty space. I don't know if it will reproduce with more empty space. I just realized that I quoted the processes wrong: the tasks which spin are z_wr_int.
PREEMPT_NONE is killing the system during heavy IO. Everything locks up for several minutes.
It's conceivable this is related to issue #167. Apparently the default setting for CONFIG_DEBUG_MUTEX under ArchLinux results in some unfortunate mutex problems, like potential spinning in the z_wr_int threads. The short-term workaround is to disable it, which I believe is the default for most other non-debug distribution kernels. That said, this of course still needs to be fixed.
Doesn't apply to my system:

# zgrep CONFIG_DEBUG_MUTEX /proc/config.gz
# CONFIG_DEBUG_MUTEXES is not set
arc_reclaim continues to hog CPU once in a while. Note that the system was completely idle at this time. I see a lot of memory movement (free and used memory changing) all the time although the system is idle. Also, note the CPU time it has accumulated, far greater than anything else in the system.

top - 17:11:06 up 2 days, 1:15, 28 users, load average: 0.07, 0.03, 0.05
Tasks: 588 total, 3 running, 585 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12298180k total, 10590668k used, 1707512k free, 3004812k buffers
Swap: 12582908k total, 0k used, 12582908k free, 1842640k cached
PID to kill:
  PID USER PR  NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
 2426 root  0 -20    0   0   0 R   20  0.0 11:14.57 arc_reclaim
 4946 root 39  19    0   0   0 R    0  0.0  0:00.43 z_null_iss/0
The last thing about arc_reclaim actually has nothing to do with this bug. This bug is about a total 100% CPU spin by many, many threads at the same time, not just high CPU usage by one process while idle.
It seems like arc_reclaim is popping up every 10-20 seconds and spinning for a bit.

 2426 root  0 -20     0    0   0 R 22 0.0 11:19.00 arc_reclaim
22601 root 20   0 19748 1732 984 R  2 0.0  0:01.11 top
It seems like txg_sync is popping up and the system hangs.

 1495 root  0 -20     0    0   0 D 22 0.0 00:19.00 txg_sync
I may have also run across this issue. I can reliably cause zfs to choke, with txg_sync and z_fr_iss threads spinning and nothing apparently going on, on two different boxes. The filesystem is fresh, nothing on it, no dedup or compression. Kernel is 2.6.39.3 and everything is at default settings. /proc/meminfo shows that VmallocUsed does not change at all during this time.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000
[root@osd0 ~]# zpool iostat 2
pool0   125G  50.6T      0  2.19K     30   272M

Then a second time... it takes forever so I ctrl+c:

dd if=/dev/zero of=/pool0/file bs=1M count=100000
...

Then a third time, this time with new files (dd is CPU bound so we start two):

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000
[root@osd0 ~]# zpool iostat 2
pool0  55.0G  50.7T     10  1.89K  20.8K   234M

Then a fourth time... rewrite those files:

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file4 bs=1M count=100000 & dd if=/dev/zero of=/pool0/file5 bs=1M count=100000
^C
849+0 records in
fg
...
top - 13:39:52 up 36 min, 2 users, load average: 82.47, 47.54, 33.11
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
[root@osd0 ~]# zpool status
Changed the config a bit to try to duplicate. This time txg_sync goes crazy during the first run, but data is being written fine. The main difference is no raidz, just striped across disks. So at least in this scenario there is no immediate problem apparent unless you're looking at top. Rewrite still chokes.

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &

pool    alloc   free   read  write   read  write
pool0   44.2G  48.9T      1  2.69K  41.9K   327M

top - 13:51:57 up 48 min, 2 users, load average: 5.85, 18.11, 29.82
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

But again on rewrite:

[root@osd0 ~]# dd if=/dev/zero of=/pool0/file bs=1M count=100000 & dd if=/dev/zero of=/pool0/file2 bs=1M count=100000 &
1127+0 records in
8359+0 records in
...
top - 13:54:13 up 51 min, 2 users, load average: 63.20, 29.43, 31.94
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ WCHAN COMMAND
[root@osd0 ~]# zpool status
errors: No known data errors
Profiling the system during metadata-intensive workloads, such as creating/removing millions of files, revealed that the system was CPU bound. A large fraction of that CPU time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size: for large objects virtual memory is usually preferable, and for small objects physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object, and the following caches turned out to be just over the line:

* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache

Because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized, we can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage, which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance, since unrelated processes could otherwise block on this same global lock. Finally, we may see less CPU time being burned in the arc_reclaim and txg_sync threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #258
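To make the mechanism concrete, here is a minimal sketch (not the actual ZFS change) of forcing a cache onto physically backed slabs with the SPL's Solaris-style kmem_cache_create() and the KMC_KMEM flag described above; my_obj_t, my_obj_cache, and the object size are made up for illustration:

#include <sys/kmem.h>

typedef struct my_obj {
        char pad[1024];                 /* an object size in the "gray area" */
} my_obj_t;

static kmem_cache_t *my_obj_cache;

void
my_obj_cache_init(void)
{
        /* KMC_KMEM asks the SPL to back this cache's slabs with physical
         * (kmalloc-style) memory instead of vmalloc, so allocations no
         * longer serialize on the global virtual address space lock. */
        my_obj_cache = kmem_cache_create("my_obj_cache", sizeof (my_obj_t),
            0, NULL, NULL, NULL, NULL, NULL, KMC_KMEM);
}

Without the flag, the SPL picks KMEM or VMEM backing from the object size, which is exactly the heuristic these caches were falling on the wrong side of.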
Is anyone still able to reproduce this issue with the latest master source? Commit ae6ba3d plus some earlier changes should have improved things considerably.
Will give it a go. You suspect this will fix the txg_sync high CPU on large deletes?
It may help in the txg_sync case, but I expect this patch will primarily address the high CPU usage by the arc_reclaim thread. The large deletes, on the other hand, are probably a different issue which we'll look at shortly.
I have a raidz1 pool with 10 * 1.5 TB hard disks on Ubuntu 11.10 with the zfs/spl version 41/2011-12-15. After deleting a 55 GB file in a zfs filesystem without dedup or compression, the system becomes unresponsive. Top shows txg_sync at about 100% CPU usage. Need to hard reset the box. Can't write anymore to the filesystem after reboot because of high CPU usage again; reset again. I moved my files to a new zfs filesystem and deleted the old one. After that it works for some days. The pool is filled up to about 50%, no limit on the zfs filesystem. No chance to catch anything in dmesg. It is annoying and makes me think about alternatives to ZoL. Please look at it, I'm ready to test any new versions.
It occurs to me this issue may have the same root cause as the slow unlinks reported elsewhere, such as issue #458. Ned Bass has started to dig into the issue. If it is the same issue, there's a good chance you can wait out the spinning CPUs; they are making headway.
When a single work item is dispatched we only need to wake up one thread to service it. The current taskq implementation uses non-exclusive processes in the work wait queue, so every thread in the queue gets woken up when __taskq_dispatch() calls wake_up(). This overhead can become non-negligible as the number of threads in the queue increases, particularly when a large number of short work items are being dispatched. For example, this scenario appears to have led to poor unlink performance in ZFS as described in Issues openzfs/zfs#458 and openzfs/zfs#258. To optimize this case, use add_wait_queue_exclusive() to add threads to the work queue so only one is woken up at a time.
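For reference, a rough sketch of what the worker-side wait loop looks like with exclusive waiters, written against the 2.6-era Linux wait queue API (wait_queue_t, add_wait_queue_exclusive(), wake_up()); work_waitq and work_pending() are placeholders rather than the actual SPL taskq fields:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(work_waitq);     /* hypothetical idle-worker queue */

static void worker_wait_for_work(int (*work_pending)(void))
{
        DECLARE_WAITQUEUE(wait, current);

        /* Exclusive waiters are queued at the tail and wake_up() stops
         * after waking one of them, so dispatching a single work item no
         * longer wakes every idle worker thread. */
        add_wait_queue_exclusive(&work_waitq, &wait);
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (work_pending())
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&work_waitq, &wait);
}

The dispatcher still calls plain wake_up(&work_waitq); the behavioral change comes entirely from how the waiters register themselves.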
As described in Issue openzfs#458 and openzfs#258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per CPU significantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround.
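And the thread-count half of the workaround, sketched with the SPL's Solaris-style taskq_create() interface; the taskq name, the minalloc/maxalloc values (50, INT_MAX), and my_zio_free_taskq are illustrative, not the exact values used in zio.c:

#include <sys/taskq.h>
#include <sys/sysmacros.h>

static taskq_t *my_zio_free_taskq;

void
my_zio_free_taskq_init(void)
{
        /* One worker per CPU instead of a fixed 100: fewer threads pile
         * into the work wait queue, so wake_up() has less to walk while
         * holding tq->tq_lock. */
        my_zio_free_taskq = taskq_create("my_z_fr_iss", max_ncpus,
            maxclsyspri, 50, INT_MAX, TASKQ_PREPOPULATE);
}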
@pyavdr, if you have a chance please give the above two patches a try. Thanks.
As far as for now: no changes... still spinning CPUs at 100% until the system becomes unresponsive and all programs freeze/die. Needed to hard reset the system. What I did: took an already installed openSUSE 12.1 on VMware (8 CPUs, 12 GB RAM, 5-drive raidz1) with spl & zfs at level 0.6.44, applied your patches to the two source files, did make clean; ./configure; make; make install for spl and zfs, then rebooted. By the way, zfs does not automount on SUSE 12.1 (systemd??), so zfs mount -a. Emptied my test zfs (500 GB). Started copying my test files (10 files, each about 50 GB) from remote Windows via smb to the zfs on SUSE. Started another copy with more files (source Windows). While filling the zfs with test files from remote, I started to copy the files locally from the SUSE shell (cp zfs1/* zfs2) to another zfs with dedup enabled. After the last command the system died instantly as described: the smb connections to Windows closed, "top" froze, no keystrokes. I'm looking for more useful information, but there is no logging or debug info available; it looks like the system doesn't write any messages in this state. Don't know if it will recover after a while; can't say which process was on top in this state.
That is different from the reproducer you originally described (removing a 55GB file with no dedup). Has that case gotten better? Deduplication may need much more memory than the 12 GB you have in your VM.
You are right, I'm mixing up different issues; I just was too enthusiastic about your patches. Again: I checked the copying/removal of my big files without dedup, testing for about 3 hours... it runs smooth and fine after the patches, no more freezes. Thank you, I think this basic operation without dedup is working now. Is it possible to apply these patches to the daily PPA?
Yes, once Brian inspects, tests, and merges them, they'll make their way into the PPA.
As described in Issue #458 and #258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per CPU significantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #258
Issue #458
Merged into master. I'm going to close this bug; if you're still having issues with dedup, please open a new issue with a clear description. We'll track the remaining spl taskq optimization work under openzfs/spl#32.
Signed-off-by: Pawan <[email protected]>
It happened when a large file (20GB) was being overwritten on a pool which has dedup and compression on. The system becomes completely unresponsive during this and stays unresponsive for up to many minutes. I was recording the CPU usage of various processes and the z_wr_iss tasks were all spinning with 100% CPU usage. There were more such tasks than the number of cores I have in my system. And of course, kernel threads did not let anything in user space run (non-PREEMPT setup).
It happened multiple times during the overwrite of the file which took a LONNNGGG time.
The way to reproduce this would be:
time dd if=/dev/urandom of=/var/tmp/tempfile count=20000 bs=1M
The pool I have is about 90% full, so this should come close to reproducing it.
Also, I am using pool version 26 (to still have the possibility of recovery from zfs-fuse or the OS install I have).