userquota_updates_task NULL deref #7147
As I said in #7059:
If you hit this issue again after that, let me know and I'll investigate further.
I suppose you've realized by now that I'm the same guy in all of the threads 😄. The situation is as follows. I have two pools, call them
I've first seen the
This morning I started a scrub on
Now I'm no longer able to boot from
Additionally, I got the error in this issue at boot:
Both pools are online right now,
EDIT: As mentioned below, I tried to start a scrub on the backup pool, then one on
The plot thickens... I started a scrub on the back-up pool, then tried to start one on
while
Since the root fs on
While you have the system in this state, can you run
If your kernel has crashed, there is no need to do this; the crash is almost certainly causing the problem.
Some things are still working, but it's mostly crashed, I suppose. I was able to export the back-up pool.
OK. Sorry to ask this, but humor me for a second. You have opened several issues, and at this point I'm having a hard time keeping it all straight. Could you quickly go through the timeline of what you have, what you did, and when the issues you've encountered happened? I'm going to guess that they are all probably related, so having the full timeline for all of them will probably help a lot.
That would be #7147 (comment) above, I suppose.
Just so I have it right, here is my summary of that (along with the questions I still have). Can you confirm this and answer the questions in parentheses as best you can? I'm also not sure at what point you started encountering the
I saw the error message, but there was no apparent crash, and everything seemed to work afterwards. I didn't reboot. I don't know for which pool it was.
I started a scrub on
Due to a
Right. The
Before the first crash, sending
No crash, same
Alright, I think I have a grasp on everything now and I'll try to take a look today.
My apologies. I have run out of time for today and will need to look into this more on Monday.
@lnicola Would you mind running:
This should help me figure out where the initial crash happened.
Sorry, I updated my kernel and ZFS version in the meantime,
EDIT: The file system in question got rolled back somehow, so I'm no longer having any issues. If you feel like closing this, that's fine.
@lnicola Would you mind trying this again with the most recent code from master? We recently fixed a related issue, and I think the problem might be solved.
I'll try it on the next distro package update, but the only issue I still have is the slow
OK. That problem should mostly be alleviated by #7197, so hopefully all your problems will be fixed soon.
@lnicola Any update here?
As I mentioned before, the file system or pool got rolled back at the time. I lost about 9 hours of changes, and I didn't encounter any issues afterwards, even with that same ZFS version. So I don't know if it's fixed, but I no longer have a way to reproduce these issues.
I just got a new one, in
The application (a
EDIT: Unlike the last time, the pool came back fine (I think) after the reboot.
EDIT 2: Happened again during a
Call traces of processes in
@lnicola That is expected. Once the GPF happens, that thread will stop running, causing any threads that rely on it to also become stuck. Unfortunately, we need to find a way to reproduce the issue at least somewhat consistently if we are going to have a chance to fix it.
Well, it seems I can still reproduce this, so tell me if there's anything I can do...
@lnicola Can you provide steps that I can use to reproduce this? As of yet I have not been able to do so.
The PANIC actually shows up after the list_del corruption, though.
Here's the log.
It should be large dnodes; I'll check once I reboot. Looking at that function, I'm not sure how that VERIFY fails, though: it checks that the ref count is not zero, then adds 1 and checks that it's greater than 1, and that fails.
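As a concrete illustration of that check (a minimal standalone sketch, not the actual ZFS refcount code, which uses its own VERIFY macros and locking), the pattern looks roughly like this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only: assert that the count is nonzero before the
 * increment and strictly greater than one afterwards. The second
 * assertion corresponds to the failing VERIFY described above.
 */
static uint64_t
sketch_refcount_add(uint64_t *count)
{
	assert(*count != 0);	/* a reference must already be held */
	*count += 1;
	assert(*count > 1);	/* the check reported to fail in this issue */
	return (*count);
}
```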
Yes, large_dnodes is active.
In dnode_move, don't we have to worry about dn_dirty_link in addition to dn_link?
@nivedita76 If you are able to reproduce the problem consistently, would it be possible to provide a stack trace of the
How should I do that? The last log I attached was what got dumped into dmesg.
Sorry. I missed it (needed to scroll down a bit farther). What you have provided is fine.
Might have an idea of what's causing this... I'll get back soon if I figure anything out.
I just had a list_del corruption even after configuring zfs_multilist_num_sublists to 1. Unfortunately, this time it didn't save anything to the log before crashing.
@nivedita76 You can try:
modprobe netconsole
dmesg -n 8
cd /sys/kernel/config/netconsole
mkdir -p target1
cd target1
echo XXXX > local_ip
echo XXXX > remote_ip
echo enXXXX > dev_name
echo XX:XX:XX:XX:XX:XX > remote_mac
echo 1 > enabled
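On the receiving end, any UDP listener on the machine whose address was written to remote_ip should capture the stream; netconsole sends to UDP port 6666 by default unless remote_port is set, so a plain netcat or socat listener bound to that port is usually enough.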
@nivedita76 Try applying this patch:
This needs some work before I merge it, but in our testing it seemed to fix the problem.
Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable. This patch adds a check for whether dn_dirty_link is active to determine if we are in this state. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t.

Fixes: openzfs#7147
Signed-off-by: Tom Caputi <[email protected]>
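As a rough illustration of the race the patch message describes, here is a hedged, self-contained sketch (simplified, with invented field names, not the actual OpenZFS code): a dnode whose type has already been set to DMU_OT_NONE must still be treated as busy while any of its per-txg dirty links are active.

```c
#include <stdbool.h>

#define TXG_SIZE	4	/* one dirty list per in-flight txg, simplified */
#define DMU_OT_NONE	0

/* Simplified stand-in for the real dnode; field names are illustrative. */
typedef struct sketch_dnode {
	int	dn_type;			/* DMU_OT_NONE once freed */
	bool	dn_dirty_linked[TXG_SIZE];	/* "is dn_dirty_link active?" */
} sketch_dnode_t;

/*
 * Sketch of the tightened reclaimability check: dn_type alone is not
 * enough, because the type is cleared before the user accounting pass
 * runs, so also require that no dirty link is still active.
 */
static bool
sketch_dnode_slot_is_free(const sketch_dnode_t *dn)
{
	if (dn->dn_type != DMU_OT_NONE)
		return (false);

	for (int i = 0; i < TXG_SIZE; i++) {
		if (dn->dn_dirty_linked[i])
			return (false);	/* still on a dirty list */
	}
	return (true);
}
```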
@nivedita76 If you get a chance, try the patch from #7388 and see if the problem is fixed.
Got an oops.
[ 62.868131] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable, leading to crashes. This patch adds the ability for dnodes to track which txg they were last dirtied in and adds a check for this before performing the reclaim. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #7147
Closes #7388
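For comparison with the earlier dn_dirty_link check, here is a hedged sketch of the approach the final commit message describes: remember the txg in which the dnode was last dirtied and refuse to treat the slot as free until that txg has fully synced. Names are invented for illustration and do not match the OpenZFS source.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMU_OT_NONE	0

typedef struct sketch_dnode {
	int		dn_type;	/* DMU_OT_NONE once freed */
	uint64_t	dn_dirty_txg;	/* txg in which the dnode was last dirtied */
} sketch_dnode_t;

/* Record the dirtying txg every time the dnode is dirtied. */
static void
sketch_dnode_set_dirty(sketch_dnode_t *dn, uint64_t txg)
{
	dn->dn_dirty_txg = txg;
}

/*
 * The slot is reclaimable only if the dnode is free *and* the txg that
 * last dirtied it has already finished syncing, which closes the window
 * between dnode_free_sync() and the user accounting pass.
 */
static bool
sketch_dnode_slot_is_free(const sketch_dnode_t *dn, uint64_t last_synced_txg)
{
	return (dn->dn_type == DMU_OT_NONE &&
	    dn->dn_dirty_txg <= last_synced_txg);
}
```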