0.6.3 zfs hang, known issue? #3532
Comments
I strongly recommend upgrading to v0.6.4.1 - it may include fixes for your issue.
Better yet, hold off for just a little bit and jump to 0.6.4.2, which has several additional fixes in this area.
Oh nice, when is 0.6.4.2 expected to be released? @dasjoe yeah, that should be useful. Basically, what I am seeing is that the rsync jobs read in the gazillion files and write them to another path. I see my ARC (which is now set to 64GB) grow very, very quickly. Looking at arcstat I can see mdmiss% at 100%; my guess is that after a reboot the first scan misses on everything, and that is what is showing up. In situations like this I wonder whether primarycache=metadata might be better suited to this workload.
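To check whether a 100% metadata miss rate is just the cold cache after a reboot, the raw ARC counters can be sampled twice and compared. The following is a minimal sketch, assuming the counters live in `/proc/spl/kstat/zfs/arcstats` in the usual `name type data` layout and that the metadata miss rate is derived from the `demand_metadata_*` and `prefetch_metadata_*` counters. The function names are illustrative and not part of any ZFS tool.

```python
def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields with a numeric last field;
        # the two kstat header lines do not match this shape.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def metadata_counts(stats):
    """Return (hits, misses) for metadata, demand plus prefetch."""
    hits = stats.get("demand_metadata_hits", 0) + stats.get("prefetch_metadata_hits", 0)
    misses = stats.get("demand_metadata_misses", 0) + stats.get("prefetch_metadata_misses", 0)
    return hits, misses


def interval_miss_pct(prev, curr):
    """Metadata miss percentage between two snapshots.

    The counters are cumulative since module load, so only the
    difference between two samples describes current behaviour.
    """
    prev_hits, prev_misses = metadata_counts(prev)
    curr_hits, curr_misses = metadata_counts(curr)
    hits = curr_hits - prev_hits
    misses = curr_misses - prev_misses
    total = hits + misses
    return 100.0 * misses / total if total else 0.0

# Example use on a live system (assumed path):
#   a = parse_arcstats(open("/proc/spl/kstat/zfs/arcstats").read())
#   ... wait a few seconds ...
#   b = parse_arcstats(open("/proc/spl/kstat/zfs/arcstats").read())
#   print(interval_miss_pct(a, b))
```

If the percentage stays near 100 well after the first full pass over the files, the working set of metadata probably does not fit in the ARC, which is the case where `primarycache=metadata` would be worth testing.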
0.6.4.2 was released yesterday :)
Oh cool I'll check that out in my RnD lab!
I got a similar problem in 0.6.4.2. The problem happens after running rsync. The machine is sort of usable but needs a reboot to make zfs work again. The metaslab_group threads look suspicious, as zfs has been reentered via spl and kernel eviction.
It happened again with similar stack traces. Load is increasing but the machine is idle, with no significant CPU usage, i.e. a deadlock rather than an infinite loop.
@wellhardh Good catch. It seems we need to lock down the metaslab preload threads. In the meantime, you can likely work around it by setting
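Rising load with an idle CPU is consistent with blocked tasks, because the Linux load average counts tasks in uninterruptible sleep (state `D`) as well as runnable ones. As a rough sketch, assuming a standard `/proc` layout, the blocked processes and kernel threads can be listed like this. The helper names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked

# Example use on a live system:
#   for pid, comm in uninterruptible_tasks():
#       print(pid, comm)
```

A list that keeps growing and never drains points to a deadlock. A busy loop would instead show up as tasks in state `R` with high CPU usage.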
Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Fixes openzfs#3532.
@wellhardh Please try #3557.
Can this parameter be changed at any time, even under high load, without trouble (meaning there will be no locking when applying metaslab_preload_enabled=0)?
@dweeezil Disabling metaslab_preload with
solved the problem. I managed to complete the rsync run, and the script has been running for some days without problems. I managed to compile your fix and it is running now. I enabled metaslab_preload again. The fix looks to be active:
I will return with the result when the tests are done.
@odoucet yes, it can be changed safely at run time. @wellhardh the proposed fix has been merged to master if you'd rather run with that code.
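For reference, a run-time change like the one discussed here can be sketched as follows. It assumes the ZFS module exposes its parameters under the standard Linux location `/sys/module/zfs/parameters/` and that the script runs as root. The helper names are illustrative only.

```python
from pathlib import Path

# Standard sysfs location for module parameters (assumed to apply here).
ZFS_PARAMS = Path("/sys/module/zfs/parameters")


def read_param(name, base=ZFS_PARAMS):
    """Return the current value of a module parameter as a string."""
    return (Path(base) / name).read_text().strip()


def write_param(name, value, base=ZFS_PARAMS):
    """Write a new value and return what the kernel reports afterwards."""
    path = Path(base) / name
    if not path.exists():
        raise FileNotFoundError(f"no such module parameter: {path}")
    path.write_text(f"{value}\n")
    return read_param(name, base)

# Example use on a live system (as root):
#   write_param("metaslab_preload_enabled", 0)   # work around the hang
#   write_param("metaslab_preload_enabled", 1)   # re-enable once the fix is in
```

A value written this way does not survive a module reload or reboot, so it suits a temporary workaround like this one.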
Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#3532.
I have a dozen or so production 0.6.3 ZFS servers that export NFS with billions of small files. We used to get hangs similar to this when we left the ARC size at its default value. We recently reduced it to around 16GB (on a 512GB host), and we are starting to see the hangs pop up again. The host gets into this state, which eventually ends in a deadlock. Freeing the page cache does nothing, and the only resort is to reboot. I bumped the ARC up to 64GB this morning hoping it would help. I can easily reproduce the hang if someone does an rsync of the data (either locally or over NFS).
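The post does not say how the ARC limit was set. One common way is the `zfs_arc_max` module parameter, which takes a size in bytes. The sketch below assumes the sizes above mean binary gigabytes (GiB) and builds the matching line for a modprobe configuration file such as `/etc/modprobe.d/zfs.conf`. Both the unit interpretation and the helper names are assumptions.

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_bytes(gib):
    """Convert a size in GiB to the byte count zfs_arc_max expects."""
    return int(gib * GIB)


def modprobe_option(gib):
    """Build a modprobe.d options line that caps the ARC at `gib` GiB."""
    return f"options zfs zfs_arc_max={arc_max_bytes(gib)}"


print(modprobe_option(16))  # options zfs zfs_arc_max=17179869184
print(modprobe_option(64))  # options zfs zfs_arc_max=68719476736
```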