Infinite loop in metaslab_group_alloc #1720
If it is possible for us to spin in metaslab_group_alloc(), that would go a long way towards explaining some of the reported hangs in this area of the code. It would have to be a tight loop; otherwise we'd be making calls to the scheduler and you wouldn't see the stall. This might be possible if a very large value for 'd' was passed in. If you can use gdb to resolve the exact line for …
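As a rough illustration of the tight-loop point, a retry loop shaped something like the following would spin without ever entering the scheduler. This is only a hypothetical sketch, not the actual metaslab.c code; find_candidate_metaslab() is an invented helper.

```c
/*
 * Hypothetical sketch, not the actual metaslab.c code; the helper
 * find_candidate_metaslab() is invented for the example.  A retry loop
 * shaped like this never blocks, never sleeps, and never yields the
 * CPU, so if the search can never succeed (for instance because 'd'
 * is far larger than expected) the thread spins and only the RCU
 * stall detector will notice.
 */
for (;;) {
	metaslab_t *msp = find_candidate_metaslab(mg, dva, d);

	if (msp != NULL)
		break;		/* found a usable metaslab, stop retrying */

	/* No blocking call on this path, so we retry immediately. */
}
```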
I seem to have built this kernel with a generic configuration by mistake. Unfortunately, the generic configuration omits debug symbols.
I might have found a possible cause. e51be06 was supposed to have ported illumos/illumos-gate@03f8c36, but it really did not. I discovered this when comparing our metaslab.c to the illumos metaslab.c.
It seems that e51be06 did properly port the 3552 patch, but it introduced a race condition regression, which was later fixed by illumos/illumos-gate@03f8c36. That patch cited Illumos issue 3552, but lacked a number of its own, so all of us mistook it for the Illumos 3552 fix that we already had.
@ryao Good catch! Definitely something to keep in mind in the future.
…read (fix race condition)

Ported-by: Richard Yao <[email protected]>

References:
  https://www.illumos.org/issues/3552
  illumos/illumos-gate@03f8c36

Porting notes:
  This fixes an upstream regression that was introduced in openzfs/zfs@e51be06, which ported the Illumos 3552 changes. This fix was added to upstream rather quickly, but at the time of the port, no one spotted it and the race was rare enough that it passed our regression test. I discovered this when comparing our metaslab.c to the illumos metaslab.c.

Closes: openzfs#1687 openzfs#1720 openzfs#1747
@ryao Nice catch, this does appear to cleanly explain the recently introduced metaslab issues. I just wish upstream would have included some comment about the symptoms or the exact race in the commit comment.
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread (fix race condition)

References:
  https://www.illumos.org/issues/3552
  illumos/illumos-gate@03f8c36

Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

Porting notes:
  This fixes an upstream regression that was introduced in commit openzfs/zfs@e51be06, which ported the Illumos 3552 changes. This fix was added to upstream rather quickly, but at the time of the port, no one spotted it and the race was rare enough that it passed our regression tests. I discovered this when comparing our metaslab.c to the illumos metaslab.c.

  Without this change it is possible for metaslab_group_alloc() to consume a large amount of cpu time. Since this occurs under a mutex in a rcu critical section the kernel will log this to the console as a self-detected cpu stall as follows:

    INFO: rcu_sched self-detected stall on CPU { 0} (t=60000 jiffies g=11431890 c=11431889 q=18271)

Closes openzfs#1687
Closes openzfs#1720
Closes openzfs#1731
Closes openzfs#1747
My system deadlocked on me. It could just be that the hardware is going bad as the system is dying, but I managed to get a backtrace from a stall that RCU detected. The stall detection uses a rather simple mechanism: the thread scheduler updates a per-CPU counter each time a new thread is scheduled on a CPU, and RCU checks that counter each time it runs. RCU runs at natural preemption points provided by things such as mutexes, which is why mutex operations show up in the backtrace whenever RCU detects a stall. This does not mean that the mutex operation itself is stuck, only that the thread is not being rescheduled (usually because something is wrong).
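For anyone unfamiliar with the mechanism, here is a minimal user-space sketch of that detection idea. It only illustrates the concept described above; it is not the actual Linux RCU code, and all of the names (note_context_switch(), check_for_stalls(), STALL_TIMEOUT) are made up for the example.

```c
#include <stdio.h>

#define NR_CPUS        4
#define STALL_TIMEOUT  60000UL  /* jiffies; roughly 60 seconds at HZ=1000 */

static unsigned long cpu_switch_count[NR_CPUS]; /* bumped on every context switch */
static unsigned long cpu_seen_count[NR_CPUS];   /* value last observed by the checker */
static unsigned long cpu_seen_time[NR_CPUS];    /* time (jiffies) when that value last changed */

/* Called by the scheduler whenever a CPU switches to a new thread. */
void note_context_switch(int cpu)
{
	cpu_switch_count[cpu]++;
}

/* Called periodically; warns about CPUs whose counter has stopped moving. */
void check_for_stalls(unsigned long now)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_switch_count[cpu] != cpu_seen_count[cpu]) {
			cpu_seen_count[cpu] = cpu_switch_count[cpu];
			cpu_seen_time[cpu] = now;
		} else if (now - cpu_seen_time[cpu] > STALL_TIMEOUT) {
			printf("INFO: self-detected stall on CPU %d (t=%lu jiffies)\n",
			    cpu, now - cpu_seen_time[cpu]);
		}
	}
}
```

A thread stuck in a tight loop never triggers a context switch on its CPU, so the counter stops moving and the checker eventually reports a stall, which matches what showed up on my console.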
With that in mind, it appears that the code was looping infinitely in metaslab_group_alloc(), which did not appear in the backtrace because it is a static function. We probably should add a loop counter to detect excessive iterations, print useful diagnostic information, and panic the system, along the lines of the sketch below.
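Something like this is what I have in mind. It is only a hypothetical sketch rather than a real patch; METASLAB_ALLOC_MAX_ITERS and the surrounding loop structure are assumptions made up for the example.

```c
/*
 * Hypothetical sketch, not a real patch: an iteration guard inside the
 * metaslab_group_alloc() retry loop.  METASLAB_ALLOC_MAX_ITERS and the
 * surrounding loop structure are assumptions made up for the example.
 */
#define	METASLAB_ALLOC_MAX_ITERS	1000000ULL	/* arbitrary sanity limit */

	uint64_t iters = 0;

	for (;;) {
		if (++iters > METASLAB_ALLOC_MAX_ITERS) {
			/*
			 * Dump enough state to debug the hang, then panic
			 * instead of spinning silently forever.
			 */
			cmn_err(CE_PANIC, "metaslab_group_alloc() looping: "
			    "asize=%llu txg=%llu d=%d iters=%llu",
			    (u_longlong_t)asize, (u_longlong_t)txg, d,
			    (u_longlong_t)iters);
		}

		/* ... existing metaslab selection and allocation attempt ... */
	}
```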
That is all that was printed on the console before I forced a hard reboot. I obtained the output after the fact from my laptop, which is connected to my desktop via a serial line.