Handle kthread_create() failures properly in taskq_create(). #340
Thanks. This looks good. |
Hi this won't solve the issue. |
I would suggest that whoever is seeing this problem add a In either case, the somewhat larger problem is that there are places throughout the ZFS code base that don't properly handle a failure of It sounds like additional debugging is in order from someone who can reproduce the problem. I suspect we'd like to see the return value from |
As a short-term hack, we could probably temporarily mask SIGKILL in the pending signals and retry. |
I was finally able to create a test rig to reliably reproduce the interrupted |
Thanks for looking at this, @tuxoko. I just pushed dweeezil/spl@5ac44ed and, as I mentioned, will try to test some more of these cases. I'm sure you've noticed by now that openzfs/zfs#2243 is likely caused by this as well. I'm afraid we're going to get a flood of these now that the big distros are starting to roll out 3.13. |
I wrote a reliable script to test SIGKILL on import. Note that the sleep time may not apply to everyone.
@dweeezil Your patch seems OK to me. Anyway, I just want to ask if we should re-enable the signal or not. |
@tuxoko I was wondering the same thing, but noticed that without re-enabling it, my test program (zpool import) was, in fact, terminated. FYI, my testing rig uses an IOPS-limited KVM guest to give me more windows in which to send signals. The system was otherwise too fast and/or my test pools too simple for a script like the one you outlined to work. |
@dweeezil @tuxoko Thank you for working on this. I suspect more people are starting to see this, so I really appreciate you guys running it to ground. The proposed approach to catch and handle the signal looks reasonable to me. But I think we should move this logic into a
For example (untested, please rework as needed),
|
@behlendorf Sounds like a good idea. I'll re-work it as a |
I just pushed dweeezil/spl@f3dee6f which provides the The other consumers of EDIT: Tweaked the comments. |
This looks good to me. I've queued it up for testing. |
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create() to provide pre-3.13 semantics. Re-try if the call is interrupted or if it would have returned -ENOMEM. Otherwise return NULL.
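The retry policy described in the commit message above can be sketched in plain userspace C. This is only an illustration and not the code from the patch: `scripted_create()`, `create_with_retry()`, `struct script`, and the lowercase `err_ptr()`/`ptr_err()`/`is_err()` helpers are hypothetical stand-ins for `kthread_create()`, the wrapper, and the kernel's `ERR_PTR()` family. In particular, treating an interrupted call as `-EINTR` is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of the kernel's error-pointer encoding. */
#define MAX_ERRNO 4095

static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical stand-in for kthread_create(): fails with each errno
 * listed in the script, in order, and succeeds after that.
 */
struct script {
	const int *errs;	/* errno values to fail with, in order */
	size_t n;		/* number of scripted failures */
	size_t calls;		/* how many times create was called */
};

static int dummy_task;		/* plays the role of a task_struct */

static void *scripted_create(struct script *s)
{
	size_t i = s->calls++;

	if (i < s->n)
		return err_ptr(-s->errs[i]);
	return &dummy_task;
}

/*
 * Retry while the failure is "interrupted" (assumed -EINTR here) or
 * -ENOMEM; return NULL for any other error, so callers only need a
 * NULL check.
 */
static void *create_with_retry(struct script *s)
{
	assert(s != NULL);

	for (;;) {
		void *tsk = scripted_create(s);

		if (!is_err(tsk))
			return tsk;
		if (ptr_err(tsk) == -EINTR || ptr_err(tsk) == -ENOMEM)
			continue;
		return NULL;
	}
}
```

The point of the design is that callers written against the older behaviour keep working with a simple NULL check, while transient failures are absorbed inside the wrapper.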
@tuxoko Thanks for looking at this. I've re-pushed dweeezil/spl@4a0984b which uses |
@dweezil Hi Tim, could you please give me a high-level idea of what symptoms this should address? I'm attempting to run on a box that will have sustained CPU and I/O load; is that going to be safe/wise? |
@gitboy This patch should cause I have also created fixes for most of the places within the main ZFS code where these failures aren't handled and will be working up a separate patch for those, however, with this fix, they should be no more likely to occur than they did pre-3.13. |
When will the patch be merged? I want to solve my issue openzfs/zfs#2238 |
Closing issue, the patch has been merged to master. |
The return value should be checked with IS_ERR() to deal with
possible -ENOMEM or other errors.
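To illustrate why the check matters: `kthread_create()` reports failure as an error-encoded pointer rather than NULL, so a plain NULL test lets the failure through. The following userspace sketch imitates that encoding. The helpers are simplified re-implementations of what `linux/err.h` provides, and `fake_thread_create()` and `start_worker()` are hypothetical names used only for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace versions of the kernel's linux/err.h helpers. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int worker_task;		/* plays the role of a task_struct */

/*
 * Hypothetical creator: returns an error pointer when asked to fail
 * with a given errno (fail_with > 0), a valid pointer otherwise.
 */
static void *fake_thread_create(int fail_with)
{
	if (fail_with > 0)
		return ERR_PTR(-fail_with);
	return &worker_task;
}

/* Returns 0 on success or a negative errno taken from the pointer. */
static int start_worker(int fail_with)
{
	void *tsk = fake_thread_create(fail_with);

	if (IS_ERR(tsk))
		return (int)PTR_ERR(tsk);
	/* A real caller would wake the thread up here. */
	return 0;
}
```

Note that `fake_thread_create(ENOMEM)` returns a non-NULL pointer, which is exactly the case a `tsk == NULL` test would miss.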